ThreadMill 0.1: Social Accounting for Message Thread Collections

The Social Media Research Foundation is pleased to announce the immediate availability of ThreadMill 0.1.  ThreadMill is a free and open application that consumes message thread data and produces reports about each author, thread, forum, and board along with visualizations of the patterns of connection and activity.  ThreadMill is written in Ruby, and depends on MongoDB, SinatraRB, HAML, and Flash to collect, analyze, and report data about collections of conversation threads.

Threaded conversations are a major form of social media.  Message boards, email and email lists, Twitter, blog comments, text messages, and discussion forums are all social media systems built around the message thread data structure.  As messages are exchanged through these systems, some are sent as a reply to a particular previous message, and chains of messages form.  Message chains come in two major forms: branching and non-branching.  Branching threads allow more than one message to reply to a prior message.  Non-branching threads are single chains, like a string of pearls, that allow only one message to reply to a prior message.  Many web-based message boards are non-branching.  Many email systems and discussion forums are branching.
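The branching distinction can be checked mechanically.  The following is a minimal sketch, not ThreadMill code: it assumes a thread is given as a mapping from each message id to the id of the message it replies to (the function name and the sample ids are hypothetical).

```python
from collections import Counter


def is_branching(reply_to):
    """Return True if any message in the thread has more than one direct reply.

    reply_to maps each message id to the id of the message it replies to,
    or None for the message that starts the thread.
    """
    replies_per_parent = Counter(
        parent for parent in reply_to.values() if parent is not None
    )
    return any(count > 1 for count in replies_per_parent.values())


# Hypothetical threads: a single chain, and one where m1 receives two replies.
chain = {"m1": None, "m2": "m1", "m3": "m2"}
tree = {"m1": None, "m2": "m1", "m3": "m1", "m4": "m2"}

print(is_branching(chain), is_branching(tree))
```

A thread counts as branching here as soon as one message has two or more direct replies; a system that merely allows branching but has not yet produced any would still look like a chain in the data.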

ThreadMill requires a minimal set of data elements to generate its reports.  At a minimum, the data table must have one row per message with columns for the name of the message board, the forum, the thread, and the author, along with a unique identifier for the message and the date and time it was posted.  Optional data elements include the unique identifier of the message being replied to, the URL of the message, and the URL for a profile photo.
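As an illustration of these required and optional elements, here is a small sketch of a message record with a check for missing required fields.  The field names, the `parse_row` helper, and the ISO date format are assumptions made for this example; the actual column names ThreadMill expects are not given here.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Hypothetical column names for the elements listed in the text.
REQUIRED = ("board", "forum", "thread", "author", "message_id", "posted_at")
OPTIONAL = ("reply_to_id", "url", "profile_photo_url")


@dataclass
class Message:
    board: str
    forum: str
    thread: str
    author: str
    message_id: str
    posted_at: datetime
    reply_to_id: Optional[str] = None
    url: Optional[str] = None
    profile_photo_url: Optional[str] = None


def parse_row(row):
    """Build a Message from a dict; raise ValueError if a required field is absent."""
    missing = [name for name in REQUIRED if not row.get(name)]
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    fields = {name: row[name] for name in REQUIRED}
    # Assumes timestamps arrive as ISO 8601 strings.
    fields["posted_at"] = datetime.fromisoformat(row["posted_at"])
    for name in OPTIONAL:
        fields[name] = row.get(name) or None
    return Message(**fields)


# Hypothetical row for illustration only.
row = {
    "board": "Example Board",
    "forum": "General",
    "thread": "Welcome",
    "author": "ann",
    "message_id": "m2",
    "posted_at": "2010-02-03T09:30:00",
    "reply_to_id": "m1",
}
message = parse_row(row)
```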

All forms of threaded message exchange can be measured.  Simple measures like the count of the number of messages or the number of authors are obvious and useful.  Other measures can be created from more sophisticated analysis.  For example, the network of connections that forms as different authors reply to one another can be extracted and analyzed using network analysis methods.  It is possible to calculate metrics from these networks of reply that describe the location of each person in the graph.
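To make the reply network concrete, here is a rough sketch of how reply edges between authors might be extracted and turned into simple in-degree and out-degree counts.  It is an illustration under assumed inputs (tuples of message id, author, and replied-to id), not the method ThreadMill itself uses.

```python
from collections import Counter


def reply_edges(messages):
    """Count directed (replier, replied_to) author pairs.

    messages is a list of (message_id, author, reply_to_id or None).
    Replies to messages that are not in the list are ignored.
    """
    author_of = {message_id: author for message_id, author, _ in messages}
    edges = Counter()
    for _, author, parent in messages:
        if parent in author_of:
            edges[(author, author_of[parent])] += 1
    return edges


def degrees(edges):
    """Return (in_degree, out_degree) counting distinct neighbors per author."""
    in_degree, out_degree = Counter(), Counter()
    for source, target in edges:
        out_degree[source] += 1
        in_degree[target] += 1
    return in_degree, out_degree


# Hypothetical messages for illustration only.
messages = [
    ("m1", "ann", None),
    ("m2", "bob", "m1"),
    ("m3", "cat", "m1"),
    ("m4", "ann", "m2"),
]
edges = reply_edges(messages)
in_degree, out_degree = degrees(edges)
```

In this sample, ann is replied to by two different authors, so her in-degree is 2; richer location metrics such as betweenness would need a graph library or a tool like NodeXL.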

ThreadMill generates several data sets that can be used to create visualizations of the activity and structure of a message board collection.

A Treemap data set can illustrate the hierarchy of encapsulated authors within threads, threads within fora, fora within boards, and boards within collections.  Treemap visualizations of collections of threaded conversations can quickly highlight the most active or populous discussions.
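The nesting behind a treemap can be sketched as a dictionary of dictionaries with message counts at the leaves.  This is only an assumed shape for illustration; the format of ThreadMill's actual Treemap data set is not described here, and the names below are hypothetical.

```python
def build_hierarchy(rows):
    """Nest message counts as board -> forum -> thread -> author.

    rows is an iterable of (board, forum, thread, author), one per message.
    """
    tree = {}
    for board, forum, thread, author in rows:
        authors = (
            tree.setdefault(board, {})
            .setdefault(forum, {})
            .setdefault(thread, {})
        )
        authors[author] = authors.get(author, 0) + 1
    return tree


def total(node):
    """Sum the message counts beneath a node (the area a treemap would use)."""
    if isinstance(node, int):
        return node
    return sum(total(child) for child in node.values())


# Hypothetical rows for illustration only.
rows = [
    ("board1", "forumA", "thread1", "ann"),
    ("board1", "forumA", "thread1", "bob"),
    ("board1", "forumA", "thread1", "ann"),
    ("board1", "forumB", "thread2", "cat"),
]
hierarchy = build_hierarchy(rows)
```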

An AuthorLine visualization takes the form of a double histogram, with bubbles representing each thread active in each time period, sized by the volume of messages the author contributed and sorted by size.  Threads initiated by the author are represented as bubbles stacked above the center line.  Messages that the author contributes to threads started by other authors are represented as bubbles stacked below the center line.  AuthorLines quickly reveal the pattern of activity an author displays and identify which of several types of contributor the author is.
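The data behind an AuthorLine can be approximated as follows.  This sketch makes two assumptions that are not stated above: the time period is a calendar month, and a thread's initiator is the author of its earliest message.  The function and sample data are hypothetical.

```python
from collections import defaultdict
from datetime import date


def author_line(messages, author):
    """Per period, list bubble sizes for threads the author started vs. joined.

    messages is a list of (thread, author, posted_on) with posted_on a date.
    """
    # Assumption: the author of the earliest message started the thread.
    first = {}
    for thread, who, day in messages:
        if thread not in first or day < first[thread][0]:
            first[thread] = (day, who)

    counts = defaultdict(lambda: defaultdict(int))
    for thread, who, day in messages:
        if who == author:
            counts[(day.year, day.month)][thread] += 1

    result = {}
    for period, per_thread in sorted(counts.items()):
        started = [n for t, n in per_thread.items() if first[t][1] == author]
        joined = [n for t, n in per_thread.items() if first[t][1] != author]
        result[period] = {
            "initiated": sorted(started, reverse=True),  # above the center line
            "joined": sorted(joined, reverse=True),      # below the center line
        }
    return result


# Hypothetical messages for illustration only.
messages = [
    ("t1", "ann", date(2010, 1, 4)),
    ("t1", "bob", date(2010, 1, 5)),
    ("t1", "ann", date(2010, 1, 6)),
    ("t2", "bob", date(2010, 1, 7)),
    ("t2", "ann", date(2010, 1, 8)),
    ("t3", "ann", date(2010, 2, 1)),
]
line = author_line(messages, "ann")
```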

A scatter plot visualization represents each author as a bubble in an X-Y space defined by the number of different days the author was active against the average number of messages the author contributes to the threads in which they participate.
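The two coordinates of this scatter plot are simple to derive.  The sketch below assumes messages arrive as (thread, author, day) tuples with the day as a string; the names are hypothetical and this is not ThreadMill's own code.

```python
from collections import defaultdict


def scatter_points(messages):
    """Return {author: (days_active, average_messages_per_thread)}.

    messages is a list of (thread, author, day).
    """
    days = defaultdict(set)
    per_thread = defaultdict(lambda: defaultdict(int))
    for thread, author, day in messages:
        days[author].add(day)
        per_thread[author][thread] += 1
    return {
        author: (
            len(days[author]),
            sum(per_thread[author].values()) / len(per_thread[author]),
        )
        for author in days
    }


# Hypothetical messages for illustration only.
messages = [
    ("t1", "ann", "2010-01-04"),
    ("t1", "bob", "2010-01-05"),
    ("t1", "ann", "2010-01-06"),
    ("t2", "bob", "2010-01-07"),
    ("t2", "ann", "2010-01-08"),
    ("t3", "ann", "2010-02-01"),
]
points = scatter_points(messages)
```

Authors with many active days but a low average per thread tend to spread thin across discussions, while the opposite corner holds authors who post heavily in a few threads over a short time.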

A time series line chart reveals the days of maximum and minimum activity along with trends.
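A daily series of this kind might be built as in the sketch below, which fills in zero-count days so that quiet periods show up in the chart.  The helper names and sample dates are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def daily_counts(days):
    """Return [(day, count)] for every day from the first post to the last.

    days is an iterable of date objects, one per message.
    Days with no messages are included with a count of zero.
    """
    counts = Counter(days)
    day, end = min(counts), max(counts)
    series = []
    while day <= end:
        series.append((day, counts[day]))
        day += timedelta(days=1)
    return series


def busiest_and_quietest(series):
    """Return the (day, count) pairs with the highest and lowest counts."""
    busiest = max(series, key=lambda item: item[1])
    quietest = min(series, key=lambda item: item[1])
    return busiest, quietest


# Hypothetical posting dates for illustration only.
posted = [date(2010, 1, 4), date(2010, 1, 4), date(2010, 1, 6)]
series = daily_counts(posted)
busiest, quietest = busiest_and_quietest(series)
```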

A network diagram reveals the overall structure of the discussion space and the people who occupy strategic locations within the network graph.

ThreadMill has received generous assistance from Morningside Analytics.  Bruce Woodson implemented ThreadMill.

NodeXL update: v.1.0.110 – New histograms of network metrics on Overall Metrics worksheet

In the most recent prior release of NodeXL we added new metrics that describe networks in terms of their number of components and the length of paths in those networks.  In this release we automate the creation of histograms of network metrics.  It is useful to see the distribution of attributes like in-degree or betweenness to get a feel for the nature of a network.  Building a histogram in Excel is easy, but building seven (one for each of the metrics we create: degree, in-degree, out-degree, betweenness, closeness, eigenvector centrality, and clustering coefficient) is a chore.  Doing this repeatedly for several networks is too much work!  Now, when you calculate metrics in NodeXL, we will create these charts for you and place them on the Overall Metrics worksheet.

We will add axis markings and titles soon, making these charts ready to use in a variety of network reports.  These histograms will also appear in the Dynamic Filters dialog to guide users as they select segments of the distribution to include or filter out of the displayed network graph.

Other updates: (2010-02-03)

  • The Overall Metrics worksheet now includes more information about the degree, in-degree, out-degree, betweenness centrality, closeness centrality, eigenvector centrality, and clustering coefficient metrics when those metrics are computed. The additional information includes the minimum, maximum, average, and median metric values, and a histogram showing the metric value distribution.
  • The “Convert Old Workbook” item on the NodeXL, Data, Import menu in the Ribbon is now called “Import from NodeXL Workbook Created on Another Computer.” This menu item can be used to work around the following problem: NodeXL workbooks created on a 64-bit Windows computer cannot be opened directly in Excel on a 32-bit Windows computer, and vice-versa. (If you attempt to do so, you will get an error message whose details include “could not find a part of the path.”)
  • A Clear All Worksheet Columns Now button has been added to the Autofill Columns dialog box (NodeXL, Visual Properties, Autofill). Also, you can now clear an individual worksheet column by clicking a button in the dialog box’s Options column.
  • Bug fix: On large-font machines, the buttons at the bottom of the Autofill Columns dialog box didn’t fit within the dialog box.
  • Bug fix: In some circumstances, vertices were drawn below the bottom of the graph pane and were impossible to see. One such circumstance was when the selection was exported to a new workbook (NodeXL, Data, Export, Selection to New NodeXL Workbook). The graph pane in the new workbook acted as if it were taller than its real height, leading to vertices dropping off the bottom.
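For readers who want to reproduce the summary statistics and histogram from the first item above outside of Excel, here is a minimal sketch.  The equal-width binning and the bin count are assumptions; how NodeXL chooses its bins is not stated here.

```python
from statistics import mean, median


def summarize(values):
    """Return the minimum, maximum, average, and median of a metric."""
    return {
        "minimum": min(values),
        "maximum": max(values),
        "average": mean(values),
        "median": median(values),
    }


def histogram(values, bins=5):
    """Count values in equal-width bins spanning the minimum to the maximum."""
    low, high = min(values), max(values)
    if high == low:
        # All values identical: everything lands in the first bin.
        return [len(values)] + [0] * (bins - 1)
    width = (high - low) / bins
    counts = [0] * bins
    for value in values:
        # The maximum value belongs to the last bin rather than a new one.
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts


# Hypothetical in-degree values for illustration only.
in_degrees = [0, 1, 1, 2, 4, 10]
summary = summarize(in_degrees)
bin_counts = histogram(in_degrees, bins=5)
```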

Path and Component Metrics, new in NodeXL v.

NodeXL has been updated again (v. ) with new network metrics.  The application now calculates path length data for your network, reporting the Maximum Geodesic Distance and the Average Geodesic Distance.  The list of overall metrics NodeXL creates includes: Vertices (the number of nodes in the graph), Unique Edges, Edges With Duplicates, Total Edges, Self-Loops (edges that point back at the node from which they originate), Connected Components (each set of connected nodes that is not connected to another set of nodes), Single-Vertex Connected Components (all the “singletons” of just one node in a component), Maximum Vertices in a Connected Component (the size of the “Giant” component), Maximum Edges in a Connected Component (the density of the “Giant” component), Maximum Geodesic Distance (Diameter) (the longest of the shortest paths between any two connected nodes in the graph), Average Geodesic Distance (the average distance between two nodes in the graph; compare this to the “six degrees” standard), and Graph Density (the density of the complete network).
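To show what several of these numbers mean, here is a small sketch that computes a subset of them for an undirected graph using breadth-first search.  It is one reading of the definitions, not NodeXL's implementation: the treatment of duplicates, the pairs included in the average, and the density formula are all assumptions, and the sample graph is hypothetical.

```python
from collections import Counter, defaultdict, deque


def overall_metrics(vertices, edges):
    """Compute a subset of the overall metrics for an undirected graph."""
    # Assumption: (a, b) and (b, a) are the same undirected edge.
    keyed = Counter(tuple(sorted(edge)) for edge in edges)
    neighbors = defaultdict(set)
    for a, b in keyed:
        if a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)

    def distances_from(start):
        """Breadth-first search: shortest hop count to each reachable vertex."""
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        return dist

    seen, sizes, lengths = set(), [], []
    for vertex in vertices:
        dist = distances_from(vertex)
        lengths.extend(d for d in dist.values() if d > 0)
        if vertex not in seen:
            sizes.append(len(dist))
            seen.update(dist)

    n = len(vertices)
    pairs = sum(1 for a, b in keyed if a != b)
    return {
        "vertices": n,
        "unique_edges": sum(1 for c in keyed.values() if c == 1),
        "edges_with_duplicates": sum(c for c in keyed.values() if c > 1),
        "total_edges": len(edges),
        "self_loops": sum(c for (a, b), c in keyed.items() if a == b),
        "connected_components": len(sizes),
        "single_vertex_components": sum(1 for s in sizes if s == 1),
        "max_vertices_in_component": max(sizes) if sizes else 0,
        "max_geodesic_distance": max(lengths) if lengths else 0,
        # Assumption: averaged over reachable pairs of distinct vertices.
        "average_geodesic_distance": sum(lengths) / len(lengths) if lengths else 0.0,
        # Assumption: connected pairs divided by all possible pairs.
        "graph_density": pairs / (n * (n - 1) / 2) if n > 1 else 0.0,
    }


# Hypothetical graph: a duplicated edge, a self-loop on d, and an isolate e.
vertices = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "a"), ("b", "c"), ("d", "d")]
metrics = overall_metrics(vertices, edges)
```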

More metrics and details on existing metrics are on the way!

What metrics do you need?

Summer 2009 – Stanford Media X Workshop: New Metrics for New Media: Analytics for Social Media and Virtual Worlds

Stanford University - Media X Program

I will lead a workshop with Martha Russell on social network analysis of social media as part of the Stanford Media X Summer Institute on New Metrics for New Media: Analytics for Social Media and Virtual Worlds this summer.  I am looking forward to working with the folks at Media X, which hosts a range of cutting-edge events devoted to exploring the newest trends in technology and society.


It is also worth noting that the traveling exhibit “Places and Spaces” will be displayed through the Media X program at Stanford until December 18th, 2009.  A reception will be held on May 18 from 5 to 6:30pm in Wallenberg Hall on the campus.  The show includes an image I worked on with Danyel Fisher and Tony Capone that represents an overview of Usenet newsgroups.

2005 Usenet Treemap

The show includes a variety of information visualizations and maps that illustrate the utility of graphical representations of complex concepts and terrains.  From the Media X site:

“The Places & Spaces exhibit at Wallenberg Hall has two components. The physical component is available for display and allows for close visual inspection through high-quality prints. It is meant to inspire cross-disciplinary discussion on how best to track and communicate human activity and scientific progress on a global scale. It includes hands-on science maps for children. The online counterpart provides links to a selected series of maps and their makers along with detailed explanations of why these maps work.” [Link]

There will be a reception following the May 18 Seminar that will include Jeff Heer and students, Katy Borner (virtual presence) and other mapmakers of the Places & Spaces exhibit.