Connected Action

Sociology and the Internet, Social Media, Networks and Mobile Social Software

Aggregate Overall Metrics Feature: Finding patterns in collections of many networks using NodeXL

December 28th, 2010 by Marc Smith

Once you start creating and collecting network graphs, you may find you have built a significant collection: hundreds, thousands, or tens of thousands of graphs may result from a study or an ongoing monitoring project. In a series of features in the NodeXL project we have enabled a workflow for constructing many social media network graphs using the Network Server component (see: How to schedule the creation of a network with NodeXL and Windows Task Scheduler and: New NodeXL Network Server (v1.0.1.126) – Frequently Asked Questions). This can result in a collection of *many* NodeXL (and GraphML) network files. We then implemented “Automation” features, which apply many NodeXL operations (metrics calculation, autofill columns, layout, and more) to many files without direct human engagement (see: Automatic for the people (who use the latest NodeXL!). Release v. and: Fully automatic: NodeXL can build your network graphs hands free).

A single workbook typically contains data from one NodeXL data collection, run on a particular day and reaching back a few hours or days from that moment (depending on factors like the volume of activity around the selected keyword and the depth of the Twitter search catalog, which is often no more than a week or two long and much shorter for active topics). An example of a single network slice is this recent map of the connections among people who mentioned “microsoft research” in Twitter on a single day (December 18th, 2010):


This is a single slice of the network, a day out of months of activity. A still frame can tell a rich story: this is a picture of a crowd that has gathered to discuss a topic of common interest, “microsoft research”. It illustrates a structure common to many large discussions of popular topics.

At the bottom is a large set of isolates (the rows at the bottom) who were not observed to have a follows, mentions, or replies relationship with anyone else who tweeted the same term. These are casual mentioners of the topic. At the end of these rows are a small number of dyads, triads, and small components of a handful of people who link to one another but not to the largest connected component. These are pairs or small groups discussing the topic among themselves.

Above these rows is the “giant component”: the blob of people who have a connection to someone else who also tweeted a message containing the same term, who in turn has a connection that leads to a large number of others. The giant component is itself composed of several densely connected clusters. At the center of each cluster are the core users, the people who often hold their cluster together. Between these clusters are the bridges, the people who link otherwise disconnected sub-groups. At the edges are the peripheral people who have just taken the first step up from being an isolate: they have formed a single reply, mention, or follows relationship with someone else who also tweeted the search keyword, and that person can bridge them back to the core of the giant component.

This is a large and active network with hybrid qualities. There is a “brand” or broadcast element in it: the yellow cluster is a hub-and-spoke structure centered on the Microsoft Research Twitter account. These people retweet what this account publishes but do not connect to one another. Just a few of them set off second and third waves of retweets. Elsewhere in the graph other network structures are present; for example, the green and blue clusters feature people centered on their own discussions of the term “microsoft research”.
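The isolates, small components, and giant component described above can be separated mechanically once the graph is available as a list of vertices and edges. The following is a minimal sketch in plain Python, not NodeXL's own implementation; the function names and the sample account names are invented for illustration, and edges are treated as undirected.

```python
from collections import defaultdict, deque


def connected_components(vertices, edges):
    """Return the connected components of an undirected graph as a list of sets."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a != b:  # a self-loop does not connect a person to anyone else
            neighbors[a].add(b)
            neighbors[b].add(a)

    seen, components = set(), []
    for start in vertices:
        if start in seen:
            continue
        # Breadth-first search collects everyone reachable from `start`.
        seen.add(start)
        component, queue = {start}, deque([start])
        while queue:
            v = queue.popleft()
            for w in neighbors[v]:
                if w not in seen:
                    seen.add(w)
                    component.add(w)
                    queue.append(w)
        components.append(component)
    return components


def describe_structure(vertices, edges):
    """Split a graph into its largest component, smaller components, and isolates."""
    components = sorted(connected_components(vertices, edges), key=len, reverse=True)
    giant = components[0] if components and len(components[0]) > 1 else set()
    isolates = {next(iter(c)) for c in components if len(c) == 1}
    small = [c for c in components if len(c) > 1 and c is not giant]
    return {"giant": giant, "small_components": small, "isolates": isolates}


# Hypothetical sample data: a hub with three followers, one dyad, two isolates.
sample_vertices = ["msr", "ann", "bob", "cat", "dan", "eve", "fay", "gus"]
sample_edges = [("ann", "msr"), ("bob", "msr"), ("cat", "msr"), ("dan", "eve")]
structure = describe_structure(sample_vertices, sample_edges)
print(len(structure["giant"]), len(structure["small_components"]), len(structure["isolates"]))
```

When two components tie for largest, this sketch simply labels the first one it finds as the giant component.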

If you collect many still frames of network activity, there is great value in exploring the way the network graph changes over time. In the most recent release, NodeXL provides the first step in a series of features related to time and graph comparison. You can now create a workbook that aggregates the overall metrics (edge counts, vertex counts, connected component counts, etc.) for a folder full of NodeXL workbooks. In NodeXL, follow the menu path NodeXL > Analysis > Graph Metrics > Aggregate Overall Metrics to get this:

The result is a workbook with one row of summary data for each workbook in the target folder. Any arbitrary collection of network workbooks can be aggregated, but this is particularly useful when the workbooks are sequential time slices.
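For readers who keep their snapshots as GraphML files rather than NodeXL workbooks, the same "one row per network" idea can be approximated outside NodeXL. The sketch below is an assumption-laden illustration, not the NodeXL feature itself: it uses only the Python standard library, the function names are invented, and "distinct edges" here simply means distinct (source, target) pairs, which may differ from how NodeXL counts unique edges.

```python
import csv
import xml.etree.ElementTree as ET
from pathlib import Path

# Standard GraphML XML namespace, in ElementTree's {uri}tag notation.
GRAPHML_NS = "{http://graphml.graphdrawing.org/xmlns}"


def summarize_graphml(path):
    """Count vertices, edges, and distinct edges in one GraphML file."""
    root = ET.parse(str(path)).getroot()
    vertices = {node.get("id") for node in root.iter(GRAPHML_NS + "node")}
    edge_pairs = []
    for edge in root.iter(GRAPHML_NS + "edge"):
        pair = (edge.get("source"), edge.get("target"))
        vertices.update(pair)  # tolerate edges whose endpoints were not declared
        edge_pairs.append(pair)
    return {
        "vertices": len(vertices),
        "edges": len(edge_pairs),
        "distinct_edges": len(set(edge_pairs)),
    }


def aggregate_folder(folder):
    """Return one summary row per *.graphml file, ordered by file name."""
    rows = []
    for path in sorted(Path(folder).glob("*.graphml")):
        row = {"file": path.name}
        row.update(summarize_graphml(path))
        rows.append(row)
    return rows


def write_rows(rows, out):
    """Write the summary rows as CSV to an open text file or stream."""
    writer = csv.DictWriter(out, fieldnames=["file", "vertices", "edges", "distinct_edges"])
    writer.writeheader()
    writer.writerows(rows)
```

If the files are named so that they sort by date, the resulting rows form a time series that can be charted directly, for example with `write_rows(aggregate_folder("snapshots"), sys.stdout)` after `import sys`, where `snapshots` is a hypothetical folder name.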

An example is the time series plot below tracking the rise and fall of several Twitter volume and network measures for the “microsoft research” search term over several months:

This chart tracks the number of vertices (each vertex in this case is a person our data collector saw tweet about the search term “microsoft research”) in each (almost) daily network snapshot. In addition, the unique edges or connections between these Twitter users are plotted, along with the number of people with no connections (“Single-Vertex Connected Components”). The size of the largest component in the network (“Maximum vertices in a connected component”) is a measure of the changing size of the core community of discussion participants. Measures like the maximum and average “geodesic” distance give a rough indication of whether a network is long and thin (high values) or compact (low values). A “geodesic” is the shortest path between two vertices; the maximum geodesic distance, also called the diameter, is the longest of these shortest paths in the network. Long, skinny networks may indicate the presence of loosely connected smaller groups that have a few people who act as bridges. Low geodesic values suggest dense networks in which people are connected to many others, with few isolates and sub-groups.
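To make the geodesic measures concrete, here is a small sketch that computes the maximum and average geodesic distance with breadth-first search. It is an illustration under stated assumptions rather than NodeXL's algorithm: edges are treated as undirected, only pairs of distinct vertices that can reach each other are averaged, and the function names and example graphs are invented.

```python
from collections import defaultdict, deque


def geodesic_distances(edges):
    """Map each vertex to {reachable vertex: shortest-path length}, using BFS."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    distances = {}
    for source in list(neighbors):
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in neighbors[v]:
                if w not in dist:  # first visit is the shortest route in BFS
                    dist[w] = dist[v] + 1
                    queue.append(w)
        distances[source] = dist
    return distances


def max_and_average_geodesic(edges):
    """Return (maximum, average) shortest-path length over connected vertex pairs."""
    lengths = [
        d
        for source, dist in geodesic_distances(edges).items()
        for target, d in dist.items()
        if target != source
    ]
    if not lengths:
        return 0, 0.0
    return max(lengths), sum(lengths) / len(lengths)


# Hypothetical examples: a chain is "long and thin", a star is compact.
chain = [("a", "b"), ("b", "c"), ("c", "d")]
star = [("hub", "x"), ("hub", "y"), ("hub", "z")]
print(max_and_average_geodesic(chain))
print(max_and_average_geodesic(star))
```

With four vertices each, the chain has the larger maximum and average distance, which matches the intuition that high values point to stretched-out networks held together by a few bridges.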

The peaks are closely associated with major events on the Microsoft Research calendar, like the 2010 Microsoft Research Faculty Summit event I attended in early July.

I find the ratios between the size of the largest network component and the population of isolates to be interesting. As events go on over a period of days, more people connect with others who are talking about the same topic, growing the size of the large connected component. But often the isolate population also grows during this time, as people at the periphery of the topic network catch sight of mentions of the event and tweet about it. I could imagine one goal of social media management to be the conversion of isolates to connected component members. Those who follow, reply to, or mention even a single other person also talking about a topic are more likely to return and engage than those who have zero connections. It is not clear whether more connections provide a linear increase in continued engagement; I suspect that the main effect is at the zero/one divide and drops off after the first dozen or so connections. Encouraging cohesion and network density by replying to isolates, and encouraging others to do so, may help keep a social media population focused and growing.
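The ratios discussed above are simple to derive from an aggregated table of daily metrics. The sketch below assumes each row already carries a vertex count, an isolate count, and the size of the largest component; the field names and the two sample rows are invented for illustration and are not measurements from the “microsoft research” data.

```python
def isolate_ratios(rows):
    """Add isolate and largest-component shares to each daily metrics row."""
    results = []
    for row in rows:
        vertices = row["vertices"]
        isolates = row["isolates"]
        giant = row["largest_component"]
        results.append({
            "day": row["day"],
            # Fraction of all participants who have no connections.
            "isolate_share": isolates / vertices if vertices else 0.0,
            # Fraction of all participants inside the largest component.
            "giant_share": giant / vertices if vertices else 0.0,
            # Isolates per member of the largest component (None if it is empty).
            "isolates_per_giant_member": isolates / giant if giant else None,
        })
    return results


# Hypothetical numbers for two consecutive snapshots.
sample_rows = [
    {"day": "day 1", "vertices": 100, "isolates": 60, "largest_component": 30},
    {"day": "day 2", "vertices": 200, "isolates": 100, "largest_component": 80},
]
for result in isolate_ratios(sample_rows):
    print(result)
```

In this made-up example both populations grow from one day to the next, but the isolate share falls while the largest component's share rises, which is the kind of shift the conversion goal would aim for.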

This feature follows the work done in the ManyNets project at the University of Maryland by Manuel Freire, Catherine Plaisant, Ben Shneiderman, Awalin Sopan, and Miguel Rios. ManyNets also created a framework for managing the metadata about collections of networks. ManyNets provides much richer interactions with, and linkages to, the underlying networks than NodeXL offers so far.

Tags: Network metrics and measures · NodeXL · Performance scale parallel and cloud computing · Social Media · Social network · Visualization