[Update 1] Louvain-Performance and Self-Loops

January 6, 2016 · by Christina Tong · in News, Questions, Reflections

For background, please see my earlier post.

Recall that we sometimes want the presence of self-loops to be stronger than inter-community links. That is, despite the weight of some inter-community edge (e_1, e_2), the presence of heavy self-loops (e_1, e_1) and (e_2, e_2) may indicate that not placing e_1 and e_2 into the same community is the best move to get closer to the intuitive clustering.

My first thought was to make a modified version of performance that accounts for weighted self-loops. My initial plans for a modified performance metric were as follows:

We draw upon how modularity deals with self-loops, and how performance deals with weighted non-self-loop edges to design this modified algorithm.

Recall that modularity compares the number of intra-community and inter-community edges to a random graph with similar characteristics. It counts self-loop edge weights as part of the intra-community edge score. (This is obviously true, as a node is always in the same community as itself.) Modularity’s normalization factor relies on the sum of all edge weights. Naturally, this sum includes the weights of self-loops. So, self-loops will be included in the normalization (random graph), and in the intra-community edge score (of the graph to be analyzed).

As discussed in an earlier blog post, there are versions of performance that deal with weighted edges. These equations rely on a meaningful maximum M, dependent on the data set, that meaningfully upper bounds the edge weights.

Self-loops are but sums of existent edge weights. Given this, they may be many times larger than the M given for the original graph. To account for this, we have two choices:

1) Algorithmically determine M for later passes of the Louvain algorithm. For example, set M to the max weight of all edges (including self loops), or set M to be the statistical definition of the upper outer fence for extreme outliers. We are in effect creating a new graph on every pass, so keeping the old value of M doesn’t make sense.

2) Introduce an M_s, maintained separately from M, that is the meaningful bound for self-loop weights. It doesn’t make sense to increase M to bound the self-loop weights, as the weight of self-loops are conceptually/meaningfully distinct from the weights of edges in the original graph. Additionally, increasing M would drastically affect (reduce) the influence of non-self-loop edges. So we introduce an algorithmically-determined M_s that changes with every pass, and is used to normalize only for the self-loops. Meanwhile, M doesn’t change, and is used to normalize the non-self-loop edges.

It’s certainly possible to modify the metric, as vaguely outlined above. (This would generally have the effect of decreasing the performance score for any given graph.) However, unlike modularity, which compares the current clusterings to a random clusterings, modified performance doesn’t have an external benchmark with which to measure self-loops. The self-loop information is there, but it’s not useful.

In fact, because self-loops are always counted in the intra-community score, the contribution of self-loops toward the performance score is unaffected by moving nodes into different communities. Performance with self-loops wouldn’t influence Louvain’s clustering decisions. Modifying performance to account for self-loops wouldn’t actually improve Louvain-performance.

Finally, because performance is a measure of correctly-interpreted edges in a clustering, and inherently does not analyze self-loops (is a loop to oneself ever correctly interpreted, if it can’t be influenced by any clustering?), modifying it to include self-loops would corrupt it to as be unrecognizable as performance.

So, the big conclusion is: I decided not to modify performance for self-loops. The good news is that none of the input graphs have self-loops. Only the intermediate graphs, created by Louvain, have self-loops.

In my next post, I try to answer the question: how can I utilize the Louvain method’s intermediate clusterings to glean some useful information from Louvain-performance, even while using a somewhat handicapped metric?

Tags: Louvain, performance, self-loops