Thursday, March 12, 2015

Chinese Whispers - an Efficient Graph Clustering Algorithm

     Clustering is defined as the task of grouping together objects in such a way that those objects that are similar to each other comes in the same group. The Chinese Whispers algorithm provides a basic yet very effective way to partition the nodes of a graph.

Some to the notations used are as follows:

Let G=(V,E) be a weighted undirected graph with nodes (vi)∈V and weighted edges (vi, vj, wij) ∈E with weight wij.The degree of a node is the number of edges a node takes part in.The neighborhood of a node v consists of all the nodes that are connected to v.

The Adjacency matrix AG of a graph G with n nodes is an n×n matrix where the entry aij denotes the weight of the edge between vi and vj . If no edge exists this weight will be zero.

The Class matrix DG of a Graph G with n nodes is an n×n matrix where rows represent nodes and columns represent classes (ci)∈C. The value of dij (row i and column j) represents how much vi belongs to class cj.

Chinese Whispers Algorithm

The Chinese Whispers Algorithm tries to find out the groups of nodes that broadcast the same message to their neighbors. The algorithm is as follows:


Initialize:
     forall vi in V: 

          class(vi)=i;
while changes:
     forall v in V, randomized order:
          class(v)=highest ranked class in neighborhood of v;




The algorithm works as follows in a bottom-up fashion.
Initially all the nodes are assigned to different classes. Then the nodes are processed for a small number of iterations. In each iteration the nodes inherit the strongest class in the local neighborhood. This is the class whose sum of edge weight is maximum to the current node. If there are multiples Strongest class available, one of them is randomly chosen. One important thing to note here is that the classes are updated immediately. This means a node can obtain classes that were introduced in the same iteration. 

                                   Fig 1. Clustering of 11 nodes with CW
                                                        for 2 iterations.


 Formally, the algorithm does not converge. But usually the class do not change after few iterations. The number of iterations depends on the diameter of the graph. Larger the diameter more number of iterations will be required.

Chinese Whispers as Matrix Operations 

     The Chinese Whispers is a special case of Markov-Chain-Clustering(MCL). MCL is the parallel simulation of all possible random walks up to a finite length on a graph G. The idea is that random walkers are more likely to end up in the same cluster where they started than walking across clusters. MCL simulates flow on a graph by repeatedly updating transition probabilities between all nodes, eventually converging to a transition matrix after k steps that can be interpreted as a clustering of G. This is achieved by alternating an expansion step and an inflation step. 
The expansion step is a matrix multiplication of MG with the current transition matrix. The inflation step is a column-wise non-linear operator that increases the contrast between small and large transition probabilities and normalizes the column- wise sums to 1. The k matrix multiplications of the expansion step of MCL lead to its time-complexity of O(k*n2).



The CW process keeps one single largest entry per row.This leads to drastic optimization possibilities. Let maxrow(.) be an operator that operates row-wise on a matrix and sets all entries of a row to zero except the largest entry, which is set to 1. The algorithm is as follows :

D0 = In
for t=1 to iterations
     Dt-1 = maxrow(Dt-1)
     Dt = Dt-1AG

 


 The Time complexity is O(k.|E|). There is a chance the classes might not converge and just oscillate.Fig2 shows an example.



                                    Fig2 : Oscillating States in matrix CW

This is caused due to stepwise update of the class matrix. To avoid this the following measures can be taken:

  • Random mutation: with some probability, the maxrow-operator places the 1 for an otherwise unused class.
  • Keep class: with some probability, the row is copied from Dt-1 to Dt.
  • Continuous update: the classes get updated immediately.


Experiments

Bi-partite Clique

    A n-bipartite-clique is a graph of two n-cliques, which are connected such that each node has exactly one edge going to the clique it, does not belong to.

                                         Fig3 : A 10-bipartite clique

  We expect CW algorithm the cut the 2 cliques apart. So the algorithm has 2 choices. Either it can separate it into 2 clusters or combine them into 1 cluster. This is largely dependent on the initial iterations.

                          Fig4 : Percentage of obtaining two clusters when
                                    applying CW on n-bipartite cliques

 From Fig4, We can see that CW algorithm fails on graphs of small size.But this problem does not occur if the graph is large in size.



Conclusion

The Chinese Whispers Algorithm is an efficient Clustering algorithm. This technique seems not to work for graphs of small size. But the main power of CW algorithm lies in Clustering of large graphs in reasonable time. The applications of CW algorithm are enormous. It can be primarily used in NLP.

References

1. C. Biemann. 2006. Chinese whispers - an efficient graph clustering algorithm and its application to natural language processing problems. In proceedings of TextGraphs, 73–80, New York, USA.
2. http://en.wikipedia.org/wiki/Cluster_analysis.

No comments:

Post a Comment