Comps readings: community detection

Last set of comps readings, I talked about sense of community:
belonging, having influence, fulfillment of needs, and emotional
support.  Now, let's talk about the physics version of
"community" - cohesive subgroups.  In a graph, these are
groups of nodes in a graph that are more connected to each other than
to other parts of the graph. Clumpy spots.  If you read old href="http://www.worldcat.org/oclc/30594217">Wasserman and
Faust, you'll probably think of cliques, cores, and lambda
sets... some how these didn't do it for me - literally, when I was
trying to href="http://terpconnect.umd.edu/%7Ecpikas/ScienceBlogging/PikasEScience08.pdf">locate
communities in science blog networks, it didn't work..
 If you have a computer science or maybe even sociology
background you'll probably
just look at some sort of clustering (agglomerative or divisive).
 The hot thing for the
past few years comes from physicists and that's what's covered here.
 I did other posts on SNA
articles, so those are mostly
elsewhere. (BTW - if you ever take stats for the social sciences and
can substitute R for stata, do so and take the time to learn it. The
igraph package for R has all of the coolest community detection
thingies in it) (note, too, that these readings are not necessarily for
the dabbler in bibliometrics or social network analysis!)



Newman, M. E. J., & Girvan, M. (2004). Finding and evaluating
community structure in networks. Physical Review E (Statistical,
Nonlinear, and Soft Matter Physics), 69(2), 26113-21.
(just go here)

This article, like the ones from Barabasi, sort of kicked off this
flurry of research.  They use a divisive clustering technique
- so they start with the whole network, and break the connections with
the highest betweeness.  See figure. i-bdc88813f387e20943253874733099f1-bowtie.png

See how if you remove
that one line, how you completely break up the thing? That line has
high betweenness. So they calculate that for all of the lines using
whatever method, then take the line with the highest out, then
re-calculate and remove, and again. They then go on to talk about the
actual algorithm to use to efficiently do all of this betweenness
calculating and give some examples.  There's a lot in this
article, though, because they next talk about how to figure out when
you're done and if you've got decent communities. This measure is
modularity (see the article for the definition), but basically it's 0
if random and 1 is the maximum. If you calculate Q at each step, then
you can stop when it's highest. Note that any given node can only be in
one community, unfortunately. (in real life, people are nearly always
in multiple communities)



Reichardt, J., & Bornholdt, S. (2006). When are networks truly
modular? Physica D, 224(1-2), 20-26. doi: 10.1016/j.physd.2006.09.009
(or look here)

They review Newman and Girvan and suggest a new way that groups
connected nodes and separates non-connected
nodes.  They go through a process and end up
with an equation that's apparently like a Hamiltonian
for a q-state Potts spin glass (dunno, ask a physicist if you need more
info on that!).  This method allows for overlapping
communities because there could be times when you could move a node
from one community to the next without increasing the energy.
 They compared it for some standard graphs and it did better
than N-G. Instead of just stopping by minimizing modularity, they
compare the modularity to a random graph with the same degree
distribution.



Reichardt, J., & Bornholdt, S. (2007). Clustering of sparse
data via network communities-a prototype study of a large online
market. Journal of Statistical Mechanics: Theory and
Experiment, P06016. doi:10.1088/1742-5468/2007/06/P06016

In this one they test the spin glass community detection method against
the German version of ebay to look for market segmentation. The network
has bidders as nodes, and if they bid on the same item there is an
edge.  The spin glass method was successful at pulling out
clusters and using odds ratios, the authors showed that these clusters
corresponded to groupings of subject categories. The Q was much higher
than it would be for a random graph.

Categories

More like this