Iterative File-Based Item:Item Similarity Computation

This is a system for computing item:item similarity for use with the Project Aura datastore.

Project Aura is an open source project from Sun Microsystems that aims to provide an extensible recommendation engine. The project has already produced some innovations, such as the application of TF-IDF to the textual data surrounding songs, and this part of the system adds a more traditional method of predicting similarity: item:item comparison.

Item:Item Similarity

Item:item similarity characterizes each item by the users who have interacted with it, then compares items to one another: two artists are similar when largely the same users listen to both. In this system, each artist is represented as a vector of per-user listen counts, and the similarity between two artists is the cosine of the angle between their vectors. A user can then be recommended items similar to the ones they already like.

Computation

One of the primary advantages of item:item similarity is that the differences between items should change relatively little over time. This characteristic means that even though item:item similarity is more complex than some user-based metrics, it can be computed occasionally with a relatively small effect on the results for a particular user. Aura, therefore, uses the traditional method of periodic batch updating of item:item similarities.

Project Aura is specifically designed to operate on the Project Caroline utility computing system. Caroline, as a grid, gains much of its power from distributing processes across multiple machines and then coordinating them through network connections and shared network storage. The Aura datastore accomplishes this with a distributed binary tree.

Step #1: Get the Data Out of the Datastore

All data requests are managed through a central DataStore interface. Requests are filtered down a few levels until they are farmed out to a PartitionCluster.

The first step in generating the similarity is to start a ReplicantProcessor for each of the PartitionClusters. Each processor simply iterates over all the items in its partition and generates a dump file in which each line is of the form:

(Artist, User, Listen Count)
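The dump step can be sketched as a short routine. This is an illustrative sketch only: a plain dict stands in for a PartitionCluster, and none of the actual Aura classes appear here.

```python
# Illustrative sketch (not the Aura API): dump a partition's play data
# to a text stream, one "Artist, User, Listen Count" tuple per line.
import csv
import io

def dump_partition(partition, out):
    """partition: {(artist, user): listen_count}; writes one CSV row per entry,
    sorted for deterministic output."""
    writer = csv.writer(out)
    for (artist, user), count in sorted(partition.items()):
        writer.writerow([artist, user, count])

partition = {("miles davis", "alice"): 42, ("miles davis", "bob"): 7}
buf = io.StringIO()
dump_partition(partition, buf)
print(buf.getvalue())
```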

The Cosine Distance

The numerator of the cosine distance is the dot product and the denominator is the product of the lengths of the vectors.

a · b = ∑ a_i b_i = |a| |b| cos θ

cos θ = (a · b) / (|a| |b|) = ∑ a_i b_i / (√(∑ a_i²) · √(∑ b_i²))
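As a sanity check, the cosine formula can be computed directly on two small listen-count vectors. The sparse-dict representation here is an illustrative assumption, not the Aura data model.

```python
import math

def cosine(a, b):
    """Cosine of the angle between sparse vectors a, b: {user: listen_count}."""
    dot = sum(a[u] * b[u] for u in a.keys() & b.keys())
    return dot / (math.sqrt(sum(v * v for v in a.values()))
                  * math.sqrt(sum(v * v for v in b.values())))

a = {"alice": 3, "bob": 4}
b = {"alice": 3, "bob": 4}
print(cosine(a, b))  # identical vectors -> 1.0
```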

Step #2-1: Generate Dot Product Components

The n partition dump files can be merged with each of the other partition dumps (including themselves) using the username as the merge key. For each element in the Cartesian product, produce a new value:

(Artist Dual = Artist_a.Artist_b, Listen Product = Listen Count_a * Listen Count_b)
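A minimal sketch of this merge-and-multiply step, assuming each dump has been read into a list of (artist, user, count) tuples rather than streamed from a file:

```python
from collections import defaultdict

def dot_product_components(dump_a, dump_b):
    """Merge two partition dumps [(artist, user, count)] on the user key and
    emit (artist_a.artist_b, count_a * count_b) for the Cartesian product of
    each user's entries."""
    by_user = defaultdict(list)
    for artist, user, count in dump_b:
        by_user[user].append((artist, count))
    for artist_a, user, count_a in dump_a:
        for artist_b, count_b in by_user[user]:
            yield (artist_a + "." + artist_b, count_a * count_b)

dump = [("miles", "alice", 2), ("monk", "alice", 3)]
print(list(dot_product_components(dump, dump)))
```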

Step #3-1: Partition Dot Product Components

In order for a simple processor to generate a dot product, all of its components need to be in the same file. There is a relatively simple method for grouping unique items across multiple files: each element's key is hashed, and the element is written to the file numbered with the hash modulo the number of files. A file locking system has to be used to prevent overwrites, but the locks are only held briefly and there is no possibility of deadlock.
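The hash-partitioning scheme might look like the following sketch. hashlib.md5 is used here only because it is stable across processes and runs, which Python's built-in hash() is not; the choice of hash function is an assumption.

```python
import hashlib

def file_index(key, num_files):
    """Map an element key to an output file number: a stable hash of the key,
    modulo the number of files. Every occurrence of the same key lands in the
    same numbered file, so each dot product's components end up together."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_files

print(file_index("miles.monk", 16))
```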

Step #4-1: Sum Dot Product Components

Each of the m collected dot product files now contains all of the components for each of its dot products. A processor can be run on each one of the files to sum all the elements with the same key and output a single element whose value is the sum of all the elements with that key in the file.

Because the elements in the dot product component files are unsorted, the files can't simply be processed one line at a time. In this particular implementation, the entire file is loaded into an in-memory heap so that elements with the same key can be processed in groups before being output.

(Artist Dual, Dot Product = ∑ Listen Product_i)
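The heap-then-group pass described above can be sketched as follows. The in-memory list of pairs is an assumption standing in for the component file.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def sum_components(pairs):
    """Heap the unsorted (key, value) pairs so that draining the heap yields
    equal keys adjacently, then emit one (key, sum of values) per key."""
    heap = list(pairs)
    heapq.heapify(heap)
    ordered = [heapq.heappop(heap) for _ in range(len(heap))]
    return [(key, sum(value for _, value in group))
            for key, group in groupby(ordered, key=itemgetter(0))]

print(sum_components([("miles.monk", 6), ("miles.miles", 4), ("miles.monk", 6)]))
```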

Step #2-2: Vector Lengths

This step is also numbered two because its only prerequisite is the partition dump files. This process can be completed in parallel with the generation of the dot products.

The partition dump files are merged on the artist name as the key. Because the PartitionClusters are keyed on the artist id, all the components for each vector are already in a single file. In fact, they are already grouped, so the processing can be performed while reading only a single value at a time.

(Artist, Vector Length = √(∑ Listen Count_i²))
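Because the dump is already grouped by artist, a single streaming pass suffices. A sketch, again assuming in-memory tuples in place of a file:

```python
import math
from itertools import groupby
from operator import itemgetter

def vector_lengths(dump):
    """dump: [(artist, user, count)] already grouped by artist.
    Emits (artist, sqrt of the sum of squared listen counts)."""
    for artist, rows in groupby(dump, key=itemgetter(0)):
        yield (artist, math.sqrt(sum(count * count for _, _, count in rows)))

dump = [("miles", "alice", 3), ("miles", "bob", 4), ("monk", "alice", 5)]
print(list(vector_lengths(dump)))  # [('miles', 5.0), ('monk', 5.0)]
```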

Step #3-2: Vector Length Combinations

As with the dot product components, the next step is to take the Cartesian product, generating all possible combinations of vector lengths.

(Artist Dual = Artist_a.Artist_b, Vector Product = Vector Length_a * Vector Length_b)
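This is the same pairing pattern as Step #2-1, applied to the vector lengths; a sketch under the same in-memory assumption:

```python
def vector_length_combinations(lengths):
    """lengths: [(artist, vector_length)]. Emits every ordered pair,
    keyed by the artist dual, with the product of the two lengths."""
    return [(a + "." + b, length_a * length_b)
            for a, length_a in lengths
            for b, length_b in lengths]

print(vector_length_combinations([("miles", 5.0), ("monk", 5.0)]))
```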

Step #4-2: Partition Vector Length Combinations

The n² vector combination files need to be combined with the m dot product files. There is no natural alignment between the two sets of files, and a simple way to create one is to partition the vector combinations into m files using the same method as the dot product partitioning.

Step #5: Compute Cosine Distances

Each of the m file pairs can now be merged on the Artist.Artist keys. Each key appears at most once in each file, so the resulting output will contain each key only once.

(Artist Dual, Cosine Distance = Dot Product / Vector Product)
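The final merge reduces to a keyed division. This sketch assumes the two inputs have already been read into maps keyed by the artist dual:

```python
def cosine_distances(dot_products, vector_products):
    """Merge summed dot products and vector-length products on the
    artist-dual key, dividing to produce the cosine for each pair."""
    return {dual: dot / vector_products[dual]
            for dual, dot in dot_products.items()
            if dual in vector_products and vector_products[dual] != 0}

print(cosine_distances({"miles.monk": 20.0}, {"miles.monk": 25.0}))  # {'miles.monk': 0.8}
```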

Implementation

This design is a fairly straightforward adaptation of ideas from MapReduce, with some additional operations lifted from relational algebra.

Getting Data From Files

A more complete system would be built on richer relational concepts, but since all of the joins in this system are on a single key, it is possible to use a simpler record-based model. A RecordSet is simply a group of Maps identified by String keys. Each Map + key combination is a Record.
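A minimal RecordSet along these lines might look like the following; the method names here are assumptions for illustration, not the actual Aura API.

```python
class RecordSet:
    """Minimal sketch of the record model described above: a group of
    dict "Records", each identified by a String key."""

    def __init__(self):
        self.records = {}

    def put(self, key, record):
        # Copy defensively so later mutation of the caller's dict
        # doesn't change the stored Record.
        self.records[key] = dict(record)

    def get(self, key):
        return self.records.get(key)

rs = RecordSet()
rs.put("miles", {"alice": 3, "bob": 4})
print(rs.get("miles"))  # {'alice': 3, 'bob': 4}
```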

The RecordSet has two primary functions: