Project Aura has been exploring whether a method for comparing documents can be used to make music recommendations. I have been tasked with comparing its results against those of more traditional collaborative filtering algorithms, specifically Amazon's item-to-item algorithm.
This document is divided into the following sections:
Term Frequency-Inverse Document Frequency (tf-idf) is a method for comparing the similarity of documents within a corpus.
TF-IDF is a "bag of words" approach, meaning that the positions of the words relative to each other are not important, only how frequently they appear in the document. For the sake of comparison, a master set of terms is defined in which each term is present only once.
Consider a simple two-sentence corpus (some of the methods described will not have productive results because the corpus and its documents are too short to show statistical trends in word distribution):

- "You will know the truth and the truth will make you free."
- "The truth will set you free, but first it will piss you off."
The corpus term tuple for this corpus would be:

(you, will, know, the, truth, and, make, free, set, but, first, it, piss, off)
A convenient metaphor for conceptualizing documents is then to think of the corpus term tuple as defining an n-dimensional space. For the example, one axis is the prevalence of the word "you," another axis is "will," and another "know." This allows similarity comparisons using concepts from geometry and physics. Documents are generally defined as vectors in $\mathbb{R}^n$.
A simple method for characterizing a document $d_i$ is to simply make component $j$ of its vector the percentage of words in $d_i$ that are term $t_j$. This is called the "term frequence":

$$T_{i,j} = \frac{c_{i,j}}{\sum_k c_{i,k}}$$

where $c_{i,j}$ is the number of times term $t_j$ occurs in document $d_i$.
Minion does not divide through by the number of words in the document. Since the proportions of the elements in the vector don't change by dividing through by a constant, this doesn't affect the angle between the vectors.
For the previous corpus, the term frequence vectors would be:
| document | you | will | know | the | truth | and | make | free | set | but | first | it | piss | off |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (you, will, know, the, truth, and, the, truth, will, make, you, free) | 2/12 | 2/12 | 1/12 | 2/12 | 2/12 | 1/12 | 1/12 | 1/12 | 0 | 0 | 0 | 0 | 0 | 0 |
| (the, truth, will, set, you, free, but, first, it, will, piss, you, off) | 2/13 | 2/13 | 0 | 1/13 | 1/13 | 0 | 0 | 1/13 | 1/13 | 1/13 | 1/13 | 1/13 | 1/13 | 1/13 |
An issue with this approach is that these vectors tend to be dominated by general-purpose, frequently used terms like "the" and "an." TF-IDF weeds out commonly used words by using the "document frequence" of each element of the corpus term tuple, where the document frequence is the proportion of documents in which the term appears:

$$F_j = \frac{|\{d_i : t_j \in d_i\}|}{N}$$

where $N$ is the total number of documents in the corpus.
For the example corpus, the document frequence vector is:
| you | will | know | the | truth | and | make | free | set | but | first | it | piss | off |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1/2 | 1 | 1 | 1/2 | 1/2 | 1 | 1/2 | 1/2 | 1/2 | 1/2 | 1/2 | 1/2 |
The document frequence can then be used to adjust the weight of the term frequencies. The exact way these quantities are combined varies, but a common method is:

$$w_{i,j} = T_{i,j} \log\frac{1}{F_j}$$
Other weighting methods are also common. The method used in Aura is:
Note that as $F_j \to 1$, $\log(F_j) \to 0$. So a term that is present in all documents is removed from consideration entirely.
The final result for the first example document in the corpus then is:
(you, will, know, the, truth, and, the, truth, will, make, you, free)

| | you | will | know | the | truth | and | make | free | set | but | first | it | piss | off |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| term frequence | 2/12 | 2/12 | 1/12 | 2/12 | 2/12 | 1/12 | 1/12 | 1/12 | 0 | 0 | 0 | 0 | 0 | 0 |
| tf-idf ($T_{1,j} \log\frac{1}{F_j}$) | 0 | 0 | (1/12) log 2 | 0 | 0 | (1/12) log 2 | (1/12) log 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
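For concreteness, here is a short Python sketch that reproduces these numbers for the toy corpus using the $T_{i,j}\log(1/F_j)$ weighting described above. The variable names and structure are mine, not Minion's:

```python
import math
from collections import Counter

corpus = [
    "you will know the truth and the truth will make you free",
    "the truth will set you free but first it will piss you off",
]
docs = [Counter(d.split()) for d in corpus]

# The corpus term tuple: every distinct term, in order of first appearance.
terms = []
for d in corpus:
    for word in d.split():
        if word not in terms:
            terms.append(word)

# Term frequence: the proportion of each document's words that are the term.
tf = [[d[t] / sum(d.values()) for t in terms] for d in docs]

# Document frequence: the proportion of documents containing the term.
df = [sum(1 for d in docs if t in d) / len(docs) for t in terms]

# tf-idf: terms present in every document (F = 1) drop to zero.
tfidf = [[t * math.log(1 / f) for t, f in zip(row, df)] for row in tf]

for term, weight in zip(terms, tfidf[0]):
    print(f"{term:6s} {weight:.4f}")
```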
Note again that this was an unnaturally small corpus with an intentionally homogeneous vocabulary, so the patterns in how the terms combined are not generalizable.
The reason for mapping documents into a vector space is that there are established methods for comparing the "similarity" of its elements. There are two commonly used methods: measuring the distance between the vectors' endpoints and measuring the angle between the vectors.
Both of these concepts seem a bit strange for n-dimensional geometries, but for any two non-collinear vectors there is a single 2D plane that contains both of them. For visualizing these properties, it is simpler to think of them as taking place in that plane.
Each method has advantages and disadvantages; the method that is generally used with TF-IDF is the cosine similarity. Two vectors have a similarity of 1 if the angle between them is 0° and -1 if the angle is 180° ($\pi$ radians). The transition isn't linear, but it is monotonically decreasing.
The cosine of the angle between two vectors can be derived from the dot product:

$$\cos\theta = \frac{\vec{A} \cdot \vec{B}}{\|\vec{A}\|\,\|\vec{B}\|}$$
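As a sketch, the same formula in Python over plain (dense) vectors:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Identical vectors give 1.0; orthogonal vectors give 0.0.
print(cosine_similarity([3, 4, 0], [3, 4, 0]))  # 1.0
print(cosine_similarity([1, 0, 0], [0, 1, 0]))  # 0.0
```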
The basic idea of TF-IDF is to represent documents as vectors, but the vectors tend to be very sparse (have lots of zeroes). That is particularly true for the type of recommendations we are doing based on rank or attention data. Out of the millions of songs in the world, most people have listened to very few. A more efficient way of conceptualizing the data is as tuples.
In all these definitions, the following symbols are used:

- $d_i$: document $i$ of the $N$ documents in the corpus
- $t_j$: term $j$ of the corpus term tuple
- $c_{i,j}$: the number of times term $t_j$ occurs in document $d_i$

Because we are unconcerned with the order of terms in a document, the corpus may be represented as a set of tuples $C = \{(i, j, c_{i,j})\}$, where tuples with a count of zero are simply omitted.

Term frequence then is $T_{i,j} = c_{i,j} / \sum_k c_{i,k}$, where the sum runs over every term in document $d_i$.

Document frequence is $F_j = N_j / N$, where $N_j$ is the number of documents in which term $t_j$ appears.

The TF-IDF vector components then can be computed as $w_{i,j} = T_{i,j} \log(1/F_j)$, where the logarithm drops any term that appears in every document.

Finally, the cosine similarity between documents $d_a$ and $d_b$ can be represented as:

$$\cos\theta_{a,b} = \frac{\sum_j w_{a,j}\, w_{b,j}}{\sqrt{\sum_j w_{a,j}^2}\, \sqrt{\sum_j w_{b,j}^2}}$$

where the sums need only run over tuples that are actually present, since the missing components are zero.
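To make the tuple formulation concrete, here is a minimal Python sketch that stores the example corpus as (document, term, count) tuples and computes sparse tf-idf vectors and their cosine similarity. The weighting is the common $T\log(1/F)$ form from above, and the names are mine rather than anything from Minion or Aura:

```python
import math
from collections import defaultdict

# The corpus as (document, term, count) tuples; zero counts are simply absent.
triples = [
    ("d1", "you", 2), ("d1", "will", 2), ("d1", "know", 1), ("d1", "the", 2),
    ("d1", "truth", 2), ("d1", "and", 1), ("d1", "make", 1), ("d1", "free", 1),
    ("d2", "you", 2), ("d2", "will", 2), ("d2", "the", 1), ("d2", "truth", 1),
    ("d2", "free", 1), ("d2", "set", 1), ("d2", "but", 1), ("d2", "first", 1),
    ("d2", "it", 1), ("d2", "piss", 1), ("d2", "off", 1),
]

docs = defaultdict(dict)            # document -> {term: count}
for doc, term, count in triples:
    docs[doc][term] = count

doc_freq = defaultdict(int)         # term -> number of documents containing it
for counts in docs.values():
    for term in counts:
        doc_freq[term] += 1
n_docs = len(docs)

def tfidf(doc):
    """Sparse tf-idf vector for a document: {term: T * log(1/F)}."""
    counts = docs[doc]
    total = sum(counts.values())
    return {t: (c / total) * math.log(n_docs / doc_freq[t])
            for t, c in counts.items()}

def cosine(doc_a, doc_b):
    """Cosine similarity between two sparse vectors."""
    a, b = tfidf(doc_a), tfidf(doc_b)
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# For this tiny corpus every shared term appears in both documents, so its
# weight is zero and the tf-idf similarity comes out as 0.0.
print(cosine("d1", "d2"))
```

The computation is the same as the vector form; the tuples simply let it skip the zeroes.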
One of the possibilities that Aura is exploring is to leverage an existing TF-IDF implementation in Minion. There are a couple of different mappings under consideration, but the one considered here is:
| TF-IDF Term | Aura Mapping |
|---|---|
| term ($t_j$) | listener |
| document ($d_i$) | artist |
| corpus ($C$) | all artists |
| term frequence ($T_{i,j}$) | listener fanaticism |
| document frequence ($F_j$) | listener promiscuity |
| tf-idf ($w_{i,j}$) | dedicated fanaticism |
The basic reasoning here is that a listener who is a big fan of a limited number of artists is more likely to be focused within a group of similar artists.
The cosine similarity computation is identical, since both methods produce tuples (or vectors) with the same positional meanings. The question is how similar the results will be.
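Under that mapping, the weights could be computed from (user, artist, play count) tuples roughly as follows. This is only my reading of the mapping table, not the actual Aura code, and the listen data is invented for illustration:

```python
import math
from collections import defaultdict

# (user, artist, play count) tuples, e.g. rows from the listen table.
listens = [
    ("alice", "radiohead", 120), ("alice", "portishead", 40),
    ("bob",   "radiohead",  10), ("bob",   "slayer",     90),
    ("carol", "portishead", 55), ("carol", "radiohead",   5),
]

plays_per_artist = defaultdict(int)   # artist -> total plays ("document" length)
artists_per_user = defaultdict(set)   # user -> artists they listen to
for user, artist, count in listens:
    plays_per_artist[artist] += count
    artists_per_user[user].add(artist)
n_artists = len(plays_per_artist)

def fanaticism(user, artist, count):
    """Term frequence analogue: this listener's share of the artist's plays."""
    return count / plays_per_artist[artist]

def promiscuity(user):
    """Document frequence analogue: fraction of all artists this listener plays."""
    return len(artists_per_user[user]) / n_artists

def dedicated_fanaticism(user, artist, count):
    """tf-idf analogue: fanaticism discounted by the listener's promiscuity."""
    return fanaticism(user, artist, count) * math.log(1 / promiscuity(user))

for user, artist, count in listens:
    print(user, artist, round(dedicated_fanaticism(user, artist, count), 3))
```

A listener who has listened to every artist contributes nothing, just as a term that appears in every document is dropped.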
One issue with a system like this is scalability. Consider our example, where each artist has a vector in which the number of positions is the number of listeners. In a system with, say, 40 million users and 750,000 artists, the obvious m×n table for storing the data would have $4 \times 10^7 \cdot 7.5 \times 10^5 = 3 \times 10^{13}$ positions. At a byte per position, that's 30 terabytes of data. Even though modern drives can certainly provide that much storage, there are significant overhead and data-access issues. It is also unnecessary, because the bulk of the positions in the matrix are empty. Most listeners will have listened to a few hundred artists, and even a listener who has listened to 7,500 different artists will only have 1% of their row filled.
This is why tuples end up being a more efficient method of storage. Not only does access to a 30 TB indexed block of memory not have to be managed, but even if converting to triples triples the storage per entry, dropping the ~99.5% of the table that is empty still leaves better than an order of magnitude of improvement.
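A quick back-of-the-envelope check of those numbers (the bytes-per-entry figures are assumptions for illustration only):

```python
users = 40_000_000
artists = 750_000

# Dense m x n table at one byte per position.
dense_bytes = users * artists                 # 3e13 bytes, about 30 TB

# Triples at roughly triple the storage per stored entry, but keeping only
# the ~0.5% of positions that are actually filled.
filled = int(users * artists * 0.005)
sparse_bytes = filled * 3

print(dense_bytes / 1e12, "TB dense")         # 30.0
print(sparse_bytes / 1e12, "TB as triples")   # 0.45
```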
Tuples as a computing method have been around for a while. Prolog is an old-school example, but tuples are also present at the cutting edge of data representation. RDF, a data model commonly serialized as XML, is making significant headway in the semantic web movement. Map-reduce is an algorithm pioneered by Google that leverages the parallelizability of tuple-based operations. Hadoop, an open-source map-reduce implementation, recently set a new speed record by sorting a terabyte of data in 3.5 minutes. The tuple technology I'm interested in, however, is perhaps the best established: SQL.
Relational database systems are generally taught as connecting sets of tables, but that's not really where their ideological roots lie. The literature and research into how these systems operate is almost exclusively in terms of relational algebra on sets of tuples. So, I am running my tests in SQL.
Running on our eight-core 2 GHz machine with 32 GB of RAM, this script took 0:27:43 to run. (Though it used an average of 5% of the processor time and 600 MB of RAM, so I may be able to speed that up significantly.) That's to compute the similarity for:
The final analysis produces 9,071,840 cosine similarities. If there were a similarity for every artist/artist pair there would be 41,332² = 1,708,334,224 pairs. Because the data is very sparse, similarity can only be computed for 0.5% of the artist pairs. This makes sense, since of the original 495,364,020 potential meaningful datapoints (user/artist pairs), only 598,034 (~0.1%) were present.
Item-to-item similarity is a method popularized by Amazon for computing the similarity of items in its catalog. The reasoning is that item similarities are more static than user similarities, so in situations where finding the similarity requires extensive computation, they are more robust to infrequent updates. In the paper Amazon.com Recommendations: Item-to-Item Collaborative Filtering the algorithm is applied to purchases, though I think it conceptually makes more sense to consider it in terms of pageviews. I'm not likely to purchase a bunch of similar items, but I am likely to look at a bunch of them consecutively while shopping.
The algorithm as described in the paper is:

For each item in product catalog, I1
  For each customer C who purchased I1
    For each item I2 purchased by customer C
      Record that a customer purchased I1 and I2
  For each item I2
    Compute the similarity between I1 and I2
Converting this to artists and listeners is simple transliteration:

For each artist in the collection, A1
  For each listener L who listened to A1
    For each artist A2 listened to by listener L
      Record that a listener listened to A1 and A2
  For each artist A2
    Compute the similarity between A1 and A2
Since cosine similarity is being used, it is easiest to think of this initially as being about the construction of vectors. The algorithm could be rewritten as: "For each artist, construct a vector where each position in the vector represents a particular user. The similarity between any two artists is the cosine of the angle between those vectors."
In Amazon's system either an item has been purchased or it hasn't, so the vectors are boolean: each component is either one or zero. For listen or pageview data, however, the vector components can take more complex values. The question then becomes whether and how those values should be altered to give the best results.
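A sketch of that vector construction in Python, with a flag to treat the data either as boolean "has listened" values or as raw listen counts. The data and names are illustrative; this is not Amazon's implementation:

```python
import math
from collections import defaultdict

# (user, artist, listen count) tuples.
listens = [
    ("alice", "radiohead", 120), ("alice", "portishead", 40),
    ("bob",   "radiohead",  10), ("bob",   "portishead",  3),
    ("carol", "radiohead",   5), ("carol", "slayer",     90),
]

def artist_vectors(boolean=False):
    """artist -> {user: value}; one vector component per listener."""
    vectors = defaultdict(dict)
    for user, artist, count in listens:
        vectors[artist][user] = 1 if boolean else count
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(v * b.get(u, 0) for u, v in a.items())
    return dot / (math.sqrt(sum(v * v for v in a.values())) *
                  math.sqrt(sum(v * v for v in b.values())))

for boolean in (True, False):
    vecs = artist_vectors(boolean)
    print("boolean" if boolean else "counts",
          round(cosine(vecs["radiohead"], vecs["portishead"]), 3))
```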
It is worth noting that TF-IDF as described here is simply one method for weighting the elements of those vectors. Analyzing the effectiveness of TF-IDF can be done by comparing the results of TF-IDF to the output of other methods. Even without a definitive ground truth as to which artists are more or less similar, it is possible to say that two systems of computing similarity are similar or dissimilar.
The simplest option is to just create the artist vectors from the raw listen counts without normalizing them at all. The SQL to generate that comparison took 0:24:07 to run.
To test these ideas I loaded a dataset scraped from LastFM of users' top 50 most-listened-to artists along with listen counts. I used a simple Python script employing the python-mysqldb package on Ubuntu.
The tables used are: (I list the SQL because the language occasionally eludes me and the queries behind these graphs might be flawed in some way.)
CREATE TABLE user (id INT AUTO_INCREMENT PRIMARY KEY, name VARCHAR(32) UNIQUE)
CREATE TABLE artist (id INT AUTO_INCREMENT PRIMARY KEY, name TEXT, mbid VARCHAR(48) UNIQUE)
CREATE TABLE listen (user_id INT NOT NULL, artist_id INT NOT NULL, count INT NOT NULL DEFAULT 1, FOREIGN KEY (user_id) REFERENCES user(id), FOREIGN KEY (artist_id) REFERENCES artist(id))
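The loading script itself isn't reproduced here, but its general shape with python-mysqldb would be roughly the following. The connection parameters, input values, and helper function are placeholders, not the actual script:

```python
import MySQLdb

# Placeholder connection parameters.
db = MySQLdb.connect(host="localhost", user="aura", passwd="secret", db="lastfm")
cursor = db.cursor()

def load_listen(user_name, artist_name, mbid, count):
    """Insert one (user, artist, count) row, creating the user and artist as needed."""
    cursor.execute("INSERT IGNORE INTO user (name) VALUES (%s)", (user_name,))
    cursor.execute("INSERT IGNORE INTO artist (name, mbid) VALUES (%s, %s)",
                   (artist_name, mbid))
    cursor.execute("SELECT id FROM user WHERE name = %s", (user_name,))
    user_id = cursor.fetchone()[0]
    cursor.execute("SELECT id FROM artist WHERE mbid = %s", (mbid,))
    artist_id = cursor.fetchone()[0]
    cursor.execute(
        "INSERT INTO listen (user_id, artist_id, count) VALUES (%s, %s, %s)",
        (user_id, artist_id, count))

load_listen("some_user", "Some Artist", "mbid-placeholder", 42)
db.commit()
```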
A rough look at the data is:
Truncating the data at 50 artists per user will likely affect the operation of the algorithm. Consider this plot of the number of listeners per artist (generated with R and compressed with Python):
SELECT COUNT(user_id) AS user_count FROM listen GROUP BY artist_id ORDER BY user_count DESC
It is pretty much impossible to see any differentiation in this plot. Consider the same data plotted with a log scale.
Note that over half (~60%) of the artists have only a single listener. The question then is how the truncation affects these artists. Truncating artists that are popular will cloud the results a bit, but truncating an artist for which a user is the only listener means that artist is removed from the system entirely. The question then becomes whether less popular artists are also listened to less frequently. This graph of the number of listens per number of listeners shows a mostly random distribution with a slight trend:
SELECT user_count, avg(count) as average FROM (SELECT COUNT(user_id) AS user_count, avg(count) AS count FROM listen GROUP BY artist_id) AS t1 GROUP BY user_count
Consider, however, only artists with 10 or fewer listeners (which is ~87% of artists):
SELECT user_count, avg(count) as average FROM (SELECT COUNT(user_id) AS user_count, avg(count) AS count FROM listen GROUP BY artist_id) AS t1 WHERE user_count <= 10 GROUP BY user_count
This means that there is likely a large percentage of lower-popularity artists that were truncated from the system entirely.
The one possibility that would avoid this is if there is a set of indie-focused listeners who are all listening to eclectic music and don't share any artists with anyone else. This would mean that unique artists (those with only a single listener) are not being pushed out by popular artists. The number of unique artists per user should help show whether this is the case:
SELECT COUNT(user_id) AS user_count FROM (SELECT COUNT(user_id) as user_count, user_id FROM listen GROUP BY artist_id HAVING user_count = 1) as t1 GROUP BY user_id ORDER BY user_count
So, the majority of users have one unique artist or none, and there is only one person with 30.