Content-Based Roles

18 June 2014

I'm working through the following paper. I want to tease out some of the math and step through it to understand it better.

The idea is that instead of looking at the graph of social connections to evaluate someone, you describe the content of their interactions and then cluster it.

The axes along which content is described are:

• Personality: PE = (number of friends, number of personal posts, [0 = public | 1 = private | 2 = secret])
• Behavior: BE = (number of public posts, number of comments, number of likes)
• Action Sequence: AS = ([s = status | l = link | p = photo | v = video]*) where each entry in the list describes the type of a post
• Affectivity: AF = (positive score, negative score) based on emotional weightings assigned to words
• Recognition: RE = ($$\frac{\mbox{number of comments from other users}}{\mbox{number of posts}}$$, $$\frac{\mbox{number of posts shared by other users}}{\mbox{number of posts}}$$, $$\frac{\mbox{number of likes}}{\mbox{number of posts}}$$)
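As a small illustration of the Recognition feature, here is a minimal sketch of the three ratios. The function name, the argument names, and the handling of a user with no posts are my own assumptions, not something taken from the paper.

```python
def recognition(num_posts, comments_from_others, shares_by_others, likes):
    """Return RE = (comments / posts, shares / posts, likes / posts)."""
    if num_posts == 0:
        # Assumption: a user with no posts gets zero recognition
        # rather than a division-by-zero error.
        return (0.0, 0.0, 0.0)
    return (
        comments_from_others / num_posts,
        shares_by_others / num_posts,
        likes / num_posts,
    )
```

For example, a user with 10 posts, 5 comments from others, 2 shares, and 20 likes gets RE = (0.5, 0.2, 2.0).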

The vectors that result from concatenating these features are then clustered using fuzzy c-means clustering. "Fuzzy" means that each vector belongs to every cluster to a varying extent.

A set of feature vectors for $$n$$ users, $$X = \{ x_1, x_2, \ldots, x_n \}$$ with each $$x_j = \{ PE, BE, AS, AF, RE \}$$, is grouped into $$c$$ roles, $$\tilde{F} = \{ \tilde{F_1}, \tilde{F_2}, \ldots, \tilde{F_c} \}$$, by minimizing:

$$J_m =\sum\limits_{j=1}^{n} \sum\limits_{i=1}^{c} (\mu_{ij})^m D(x_j, c_i)$$

Where:

• $$\mu_{ij} = \left[ \sum\limits_{k=1}^{c} \left( \frac{D(x_j, c_i)}{D(x_j, c_k)} \right)^\frac{1}{m-1} \right]^{-1}$$
• $$\sum_{i=1}^{c} \mu_{ij} = 1$$
• $$D(a, b) = \begin{cases} D_c = 1 - \frac{a \cdot b}{\lVert a \rVert \lVert b \rVert};& PE, AF\\ D_e = \sqrt{\sum_i (a_i - b_i)^2};&\\ D_c D_e;& BE, RE\\ \mbox{edit distance};& AS \end{cases}$$
• $$c_i^\prime = \frac{\sum_{j=1}^{n} (\mu_{ij})^m x_j}{\sum_{j=1}^{n} (\mu_{ij})^m}$$
• $$1 < m = \mbox{"fuzziness"}$$ (strictly greater than 1, since the exponent $$\frac{1}{m-1}$$ is undefined at $$m = 1$$)
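To make the definitions above concrete, here is a minimal sketch of the cosine and Euclidean distances and of the membership formula for a single user, with $$i$$ indexing clusters and $$j$$ indexing users as in $$J_m$$. The function names are my own, the edit distance for AS is left out, and the handling of a user sitting exactly on a center (distance zero) is an assumption the formula itself does not cover.

```python
import math


def cosine_distance(a, b):
    """D_c = 1 - (a . b) / (|a| |b|); assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def euclidean_distance(a, b):
    """D_e = sqrt(sum_i (a_i - b_i)^2)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def memberships(dists, m=2.0):
    """Given dists[i] = D(x_j, c_i) for one user j, return [mu_1j, ..., mu_cj]."""
    if m <= 1:
        raise ValueError("m must be greater than 1")
    # Assumption: a user exactly on one or more centers belongs only to
    # those centers, shared equally, to avoid dividing by zero.
    zeros = [i for i, d in enumerate(dists) if d == 0]
    if zeros:
        return [1.0 / len(zeros) if d == 0 else 0.0 for d in dists]
    p = 1.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** p for d_k in dists) for d_i in dists]
```

With $$m = 2$$ and distances of 1 and 3 to two centers, `memberships([1.0, 3.0])` gives 0.75 and 0.25: the closer center gets the larger share, and the shares sum to 1.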

The process is:

1. Pick $$c$$ random cluster centers: $$C$$
2. Calculate the membership of each user vector, $$j$$, in each cluster, $$i$$: $$\mu_{ij}$$
3. Find the centroids of the clusters, $$C^\prime$$
4. Compare the objective before and after the update, $$J_m$$ and $$J_m^\prime$$. If the difference is greater than $$\epsilon$$, repeat from step 2 using $$C^\prime$$ as the centers
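The four steps can be sketched as a short loop. This is a simplified illustration, not the paper's implementation: it uses plain Euclidean distance on numeric tuples instead of the mixed distance $$D$$, picks the initial centers from the data points themselves, and caps the number of iterations. All function and parameter names are my own.

```python
import math
import random


def dist(a, b):
    """Euclidean distance, standing in for the mixed distance D."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def membership_row(x, centers, m):
    """Memberships of one user vector x in every cluster."""
    d = [dist(x, c) for c in centers]
    # Assumption: a vector exactly on a center belongs only to that center.
    zeros = [k for k, v in enumerate(d) if v == 0]
    if zeros:
        return [1.0 / len(zeros) if v == 0 else 0.0 for v in d]
    p = 1.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** p for d_k in d) for d_i in d]


def objective(X, centers, U, m):
    """J_m = sum_j sum_i (mu_ij)^m D(x_j, c_i), with U[j][i] = mu_ij."""
    return sum(U[j][i] ** m * dist(X[j], centers[i])
               for j in range(len(X)) for i in range(len(centers)))


def fuzzy_c_means(X, c, m=2.0, eps=1e-6, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(X, c)                        # step 1
    U = [membership_row(x, centers, m) for x in X]    # step 2
    J = objective(X, centers, U, m)
    for _ in range(max_iter):
        new_centers = []                              # step 3
        for i in range(c):
            w = [U[j][i] ** m for j in range(len(X))]
            total = sum(w)
            new_centers.append(tuple(
                sum(w_j * x[k] for w_j, x in zip(w, X)) / total
                for k in range(len(X[0]))))
        centers = new_centers
        U = [membership_row(x, centers, m) for x in X]
        new_J = objective(X, centers, U, m)
        if abs(J - new_J) <= eps:                     # step 4
            break
        J = new_J
    return centers, U
```

The function returns the final centers and, for each user, a list of memberships that sums to 1. Taking the largest entry in each list gives a hard assignment if one is needed.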

The input set of messages is partitioned according to time, and for each partition a set of membership vectors can be generated, $$m_j = \{ \mu_{1j}, \mu_{2j}, \ldots, \mu_{cj} \}$$. These are then quantized to vectors $$\{ (r_1,q(\mu_{1j})), (r_2,q(\mu_{2j})), \ldots, (r_c,q(\mu_{cj})) \}$$ according to the following rule:

$$q(x) = \begin{cases} L;& 0 \leq x < 0.25\\ M;& 0.25 \leq x < 0.75\\ H;& 0.75 \leq x \leq 1 \end{cases}$$
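The quantization step is simple enough to sketch directly. The names `q` and `quantize` and the role labels `r1`, `r2`, and so on are my own, and I am assuming that a value exactly on a boundary (0.25 or 0.75) goes to the higher level.

```python
def q(mu):
    """Map a membership in [0, 1] to a level: L (low), M (medium) or H (high)."""
    if not 0.0 <= mu <= 1.0:
        raise ValueError("membership must be between 0 and 1")
    if mu < 0.25:
        return "L"
    if mu < 0.75:
        return "M"
    return "H"


def quantize(memberships):
    """Turn [mu_1j, ..., mu_cj] into [(r_1, q(mu_1j)), ..., (r_c, q(mu_cj))]."""
    return [(f"r{i}", q(mu)) for i, mu in enumerate(memberships, start=1)]
```

For example, `quantize([0.1, 0.6, 0.3])` gives `[("r1", "L"), ("r2", "M"), ("r3", "M")]`.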