This is a system for computing item:item similarity for use with the Project Aura datastore.
Project Aura is an open source project from Sun Microsystems that aims to provide an extensible recommendation engine. The project has produced some innovations, such as the application of TF-IDF to the textual data surrounding songs; this part of the system adds a more traditional method of predicting similarity: item:item comparison.
In item:item similarity, rather than matching a user against other similar users, each item is compared directly against every other item based on the users who interact with it. Here, two artists are considered similar when the same users listen to both of them in similar proportions, and recommendations for a user are drawn from items similar to the ones the user already likes.
One of the primary advantages of item:item similarity is that the differences between items should change relatively little over time. This characteristic means that even though item:item similarity is more complex than some user-based metrics, it can be computed occasionally with a relatively small effect on the results for a particular user. Aura, therefore, uses the traditional method of periodic batch updating of item:item similarities.
Project Aura is specifically designed to operate on the Project Caroline utility computing system. As a grid, Caroline gains much of its power from distributing processes across multiple machines and then coordinating them through network connections and shared network storage. The Aura datastore takes advantage of this with a distributed binary tree.
All data requests are managed through a central DataStore interface. Requests are filtered down a few levels until they are farmed out to a PartitionCluster.
The first step in generating the similarities is to start a ReplicantProcessor for each of the PartitionClusters. This processor simply iterates over all the items in its partition and generates a dump file, one per PartitionCluster, where each line is of the form:
(Artist, User, Listen Count)
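A minimal sketch of what that dump pass might look like, assuming the records have already been read out of a partition; the Play type and the tab-separated layout are illustrative stand-ins, not the actual Aura API:

```java
import java.io.PrintWriter;
import java.util.List;

// Hypothetical sketch: write every (artist, user, listen count) triple
// in one partition to a line-per-record dump file.
public class PartitionDumper {
    public record Play(String artist, String user, long listenCount) {}

    public static void dump(List<Play> partition, PrintWriter out) {
        for (Play p : partition) {
            // One record per line: (Artist, User, Listen Count)
            out.printf("%s\t%s\t%d%n", p.artist(), p.user(), p.listenCount());
        }
        out.flush();
    }
}
```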
The similarity measure is the cosine distance between two artists' listen-count vectors: the numerator is the dot product of the vectors, and the denominator is the product of the lengths of the vectors.
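Written out, with each artist treated as a vector of per-user listen counts, the quantity being computed is:

```latex
\mathrm{sim}(a, b)
  = \frac{\vec{a} \cdot \vec{b}}{\lVert \vec{a} \rVert \, \lVert \vec{b} \rVert}
  = \frac{\sum_{u} a_u b_u}{\sqrt{\sum_{u} a_u^2} \, \sqrt{\sum_{u} b_u^2}}
```

where a_u is the number of times user u listened to artist a.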
Each of the n partition dump files can be merged with each of the other partition dumps (including itself) using the username as the merge key. For each element in the Cartesian product, produce a new value:
(Artist Dual = Artist_a.Artist_b, Listen Product = Listen Count_a * Listen Count_b)
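A sketch of the pairwise step for a single user who appears in both dump files; the record types here are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

public class ListenProductJoin {
    public record Play(String artist, long listenCount) {}
    public record Product(String artistDual, long listenProduct) {}

    // For one user present in both dump files, emit the Cartesian
    // product of that user's plays: one Listen Product per artist pair.
    public static List<Product> join(List<Play> fromA, List<Play> fromB) {
        List<Product> out = new ArrayList<>();
        for (Play a : fromA) {
            for (Play b : fromB) {
                out.add(new Product(a.artist() + "." + b.artist(),
                                    a.listenCount() * b.listenCount()));
            }
        }
        return out;
    }
}
```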
In order for a simple processor to generate a dot product, all of its components need to be in the same file. There is a relatively simple method for grouping unique items across multiple files: each element's key is hashed, and the element is written to the file numbered with the hash modulo the number of files. A file-locking system has to be used to prevent overwrites, but the locks are only held briefly and there is no possibility of deadlock.
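The routing can be sketched as follows; a synchronized block stands in for the file lock the text describes, and the key point is that the target file depends only on the key, so every component of a given dot product lands in the same file no matter which process writes it:

```java
import java.io.PrintWriter;

public class HashPartitioner {
    private final PrintWriter[] files; // the m collector files, opened elsewhere

    public HashPartitioner(PrintWriter[] files) {
        this.files = files;
    }

    // Route a record to the file chosen by its key's hash modulo m.
    public void write(String key, String line) {
        int bucket = Math.floorMod(key.hashCode(), files.length);
        synchronized (files[bucket]) { // stand-in for the brief file lock
            files[bucket].println(line);
        }
    }
}
```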
After this partitioning, each of the m collected files contains all of the components of any given dot product. A processor can be run on each of the files to sum all the elements sharing a key and output a single element whose value is the sum of all the elements with that key in the file.
Because the elements in the dot product component files are unsorted, the files can't simply be processed a line at a time. For this particular implementation, the entire file is heaped in memory so that elements can be processed in groups before being output.
(Artist Dual, Dot Product = ∑ Listen Product_i)
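A sketch of that summing pass, following the description above: all elements are loaded into an in-memory heap ordered by key, then drained so that equal keys come out together and can be summed into a single output element. The Component type and the emit callback are illustrative:

```java
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.BiConsumer;

public class DotProductSummer {
    public record Component(String artistDual, long listenProduct) {}

    // Heap the whole file in memory, then pop runs of equal keys,
    // emitting one (Artist Dual, Dot Product) element per key.
    public static void sum(List<Component> file,
                           BiConsumer<String, Long> emit) {
        PriorityQueue<Component> heap = new PriorityQueue<>(
                (a, b) -> a.artistDual().compareTo(b.artistDual()));
        heap.addAll(file);
        while (!heap.isEmpty()) {
            String key = heap.peek().artistDual();
            long dot = 0;
            while (!heap.isEmpty() && heap.peek().artistDual().equals(key)) {
                dot += heap.poll().listenProduct();
            }
            emit.accept(key, dot);
        }
    }
}
```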
This step is also numbered two because its only prerequisite is the partition dump files: it can be carried out in parallel with the generation of the dot products.
The partition dump files are merged using the artist name as the key. Because the PartitionClusters are keyed on the artist id, all the components for each vector are already in a single file. In fact they are already grouped, so the processing can be performed while reading only a single value at a time.
(Artist, Vector Length = sqrt(∑ Listen Count_i^2))
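The length computation itself is just the square root of the sum of squared listen counts; for example:

```java
public class VectorLengthPass {
    // Components for one artist arrive consecutively, so a running
    // sum of squares is enough; take the square root at the end.
    public static double vectorLength(long[] listenCounts) {
        double sumSquares = 0;
        for (long c : listenCounts) {
            sumSquares += (double) c * c;
        }
        return Math.sqrt(sumSquares);
    }
}
```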
As with the dot product components, the next step is to take the Cartesian product, generating all possible combinations of vector lengths:
(Artist Dual = Artist_a.Artist_b, Vector Product = Vector Length_a * Vector Length_b)
The n² vector combination files need to be combined with the m dot product files. There is no natural alignment between the files, and a simple way to create one is to partition the vector combinations into m files using the same method as the dot product partitioning.
Each of the m file pairs can now be merged on the Artist.Artist keys. Each file contains a given key only once, so the resultant output will contain each key exactly once:
(Artist Dual, Cosine Distance = Dot Product / Vector Product)
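Assuming the matched files have been parsed into maps keyed by the Artist.Artist dual, the final merge is one lookup and one division per key:

```java
import java.util.HashMap;
import java.util.Map;

public class CosineMerge {
    // Each matched file pair contains every Artist.Artist key exactly
    // once, so the merge is a straight lookup-and-divide.
    public static Map<String, Double> merge(Map<String, Long> dotProducts,
                                            Map<String, Double> vectorProducts) {
        Map<String, Double> cosine = new HashMap<>();
        for (Map.Entry<String, Long> e : dotProducts.entrySet()) {
            double vp = vectorProducts.get(e.getKey());
            cosine.put(e.getKey(), e.getValue() / vp);
        }
        return cosine;
    }
}
```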
This design is a pretty straightforward copy of some ideas from map-reduce with some additional operations lifted from relational algebra.
A more complete system would be built on richer relational operations, but since all joins in this system are on a single key, it is possible to use a simpler record-based model. A RecordSet is simply a group of Maps identified with String keys. Each Map + key combination is a Record.
The RecordSet has two primary functions: reading Records from a file, and sorting Records into groups when required. If Records are grouped then they don't necessarily come in any particular order, but all Records with a particular key in the set will be sequential, and the order needs to be consistent across RecordSets.
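As a rough sketch of that model (the names here are guesses at the intent, not the actual Aura classes):

```java
import java.util.Iterator;
import java.util.Map;

// Sketch of the record model: a Record is a String key plus a Map of
// fields, and a RecordSet streams Records from a file, optionally
// grouped so that equal keys come out consecutively in an order that
// is consistent across RecordSets.
public interface RecordSet extends Iterable<RecordSet.Record> {

    record Record(String key, Map<String, Object> fields) {}

    // Iterate the Records read from the backing file. If grouped() is
    // true, all Records sharing a key appear consecutively.
    @Override
    Iterator<Record> iterator();

    boolean grouped();
}
```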