This is the collection of keyword pairs that appeared in two clusters of people who tweeted about “Paul Ryan”, the Republican Congressman from Wisconsin who delivered the GOP rebuttal to the 2011 United States State of the Union Address. The network shows how certain word pairs appear only or predominantly in one cluster or the other (colored here red and blue). Terms that appeared in both clusters are colored purple.
Social networks are built from relationships between people. Keyword networks are built from relationships between words and other text strings. When two words appear in the same message, in the same sentence, or next to one another, ties of different strengths are created. The networks that result can illuminate the relationships among topics of importance in a collection of messages.
Markus Strohmaier from the Technical University Graz (TUG) along with Claudia Wagner gave us inspiration in a paper:
C. Wagner, M. Strohmaier, The Wisdom in Tweetonomies: Acquiring Latent Conceptual Structures from Social Awareness Streams, Semantic Search 2010 Workshop (SemSearch2010), in conjunction with the 19th International World Wide Web Conference (WWW2010), Raleigh, NC, USA, April 26-30, ACM, 2010. (pdf)
in which they defined a range of ways two words (technically these are strings, which may not really be words) can be associated with one another. Words can be linked if they appear in the same tweet, next to one another, or in sequence, among other ways to link terms.
Until now, NodeXL has not had any features for exploring the networks in texts. With the addition of a new macro from Scott Golder, it is fairly simple to extract pairs of keywords from a collection of tweets. NodeXL’s Twitter importer can optionally include the content of each tweet that contained the search term, and this column of text can now be processed into a new network based on the ways words appear together in tweets.
This feature builds on the work of several people. Scott Golder from Cornell started the ball rolling with a simple but effective VBA script that allowed others to build and refine the models of what counts as a tie between two words. Vladimir Barash added several refinements including support for stop word lists to remove common terms. Scott then picked up the code again and added a set of features for selecting the nature of the graph and making it easier to select the options needed.
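To make the stop word idea concrete, here is a minimal Python sketch of splitting a tweet into terms and dropping common words. It is not the VBA macro itself: the function name `tokenize`, the punctuation handling, and the tiny stop word list are all assumptions made for illustration.

```python
# Hypothetical stop word list; the macro's actual list is not shown in this post.
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "and", "rt"}

def tokenize(text, stop_words=STOP_WORDS):
    """Lowercase the text, split on whitespace, strip edge punctuation, drop stop words."""
    tokens = []
    for raw in text.lower().split():
        token = raw.strip(".,;:!?\"'()[]")
        if token and token not in stop_words:
            tokens.append(token)
    return tokens

print(tokenize("RT The budget plan of Paul Ryan!"))
# ['budget', 'plan', 'paul', 'ryan']
```

Removing stop words before pairing keeps very frequent terms from dominating the resulting network.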
The code for the Keyword Network macro is below.
The instructions to use it take a few steps to complete:
1. Create a new workbook containing, e.g., a list of tweets or an import from a Twitter search, or use an existing NodeXL workbook. Save it as .xlsm; the “m” (macro-enabled) is important.
2. Go to Developer -> Macros. Make up a name; it doesn’t matter because it’ll get overwritten. Then press Create. The VBA window will open.
3. In the big text area that says “Sub whatever() End Sub”, select all that text and delete it. Paste in the contents of the text file below.
4. Go to Tools->References. Check the checkboxes for “Microsoft Scripting Runtime” and “Microsoft VBScript Regular Expressions 5.5”. Press OK. Save the file (File->Save) then exit (“Close and return to Microsoft Excel”).
5. Now go to Developer -> Macros. Choose CreateWordNet and press the Run button.
6. It’ll ask you for a worksheet name, a column and a start-row. Then it’ll create a new worksheet with the edgelist in it.
The edge list is not directed (there isn’t really a concept of direction in “co-occurs”) but is weighted. Each pair is weighted by the number of times it appears.
This version also includes options for edge creation.
First, it is now possible to suppress edges of weight=1, which is helpful in getting rid of a lot of garbage.
Second, it is now possible to define edges by adjacency or co-tweeting. Given a tweet of words “w1 w2 w3”, adjacency will give edges w1-w2 and w2-w3, while co-tweeting will give edges w1-w2, w1-w3, w2-w3.
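The two edge definitions can be sketched in a few lines of Python. This is only an illustration of the pairing rules described above, not the macro's VBA code; the function names `adjacency_pairs` and `cotweet_pairs` are made up for this example.

```python
from itertools import combinations

def adjacency_pairs(words):
    """Pair each word with the word that immediately follows it."""
    return list(zip(words, words[1:]))

def cotweet_pairs(words):
    """Pair every word with every later word in the same tweet."""
    return list(combinations(words, 2))

words = "w1 w2 w3".split()
print(adjacency_pairs(words))  # [('w1', 'w2'), ('w2', 'w3')]
print(cotweet_pairs(words))    # [('w1', 'w2'), ('w1', 'w3'), ('w2', 'w3')]
```

Adjacency produces n-1 pairs for a tweet of n words, while co-tweeting produces n(n-1)/2, so co-tweeting networks grow much faster with tweet length.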
For edges defined by adjacency, you may choose directed or undirected edges. A tweet of “Marc Smith Marc” (for example) would generate the weighted directed edges Marc,Smith,1 and Smith,Marc,1, while the sole undirected edge would be Marc,Smith,2. That is, for undirected edges (where ordering doesn’t matter) the two words in each pair are alphabetized, so both orderings collapse into a single edge and their counts are added together.
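The “Marc Smith Marc” example can be reproduced with a short Python sketch. It assumes, as described above, that undirected pairs are sorted alphabetically before counting; the function name `adjacency_edge_weights` is hypothetical and this is not the macro's own code.

```python
from collections import Counter

def adjacency_edge_weights(words, directed):
    """Count adjacent word pairs; undirected pairs are alphabetized before counting."""
    weights = Counter()
    for first, second in zip(words, words[1:]):
        if directed:
            key = (first, second)
        else:
            key = tuple(sorted((first, second)))
        weights[key] += 1
    return dict(weights)

words = "Marc Smith Marc".split()
print(adjacency_edge_weights(words, directed=True))
# {('Marc', 'Smith'): 1, ('Smith', 'Marc'): 1}
print(adjacency_edge_weights(words, directed=False))
# {('Marc', 'Smith'): 2}
```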
Start with a NodeXL workbook with a column of text for either Vertices or Edges (or any column of text). Here we have the tweet text of a recent Twitter Search Term network query.
Select “Developer” from the Excel menu and create a new Macro. I take the text of Scott’s macro and paste it here, replacing everything else in the code buffer.
Note the references that must be selected under Tools > References to run this macro: Microsoft Scripting Runtime and Microsoft VBScript Regular Expressions 5.5.
Running the Macro:
Scott’s macro presents a series of dialogs to the user (I believe we could do this in a single dialog when we revise):
First we specify the worksheet in the workbook containing the text column to process:
Next we specify the column containing the text to process:
Next we specify the row in which the text starts in that column:
The macro will copy an edge attribute forward if specified (note, I think the *last* attribute for any AB pair is what is reported).
The user is asked if the results should omit the singleton edges (pairs that occur only once, i.e. weight = 1), which can be useful for removing noise.
Edges can be defined as co-sequential or co-cell: i.e., ABCD can generate AB, BC, CD (co-sequential) or AB, AC, AD, etc. (co-cell).
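Omitting singleton edges amounts to a simple weight filter. A rough Python sketch, assuming the edge list is held as a mapping from word pairs to counts (the function name `drop_singletons` and the sample weights are invented for illustration, not taken from the macro or from real data):

```python
def drop_singletons(edge_weights, min_weight=2):
    """Keep only the pairs that occur at least min_weight times."""
    return {pair: w for pair, w in edge_weights.items() if w >= min_weight}

# Made-up weights, purely for illustration.
edges = {("budget", "ryan"): 5, ("paul", "ryan"): 12, ("lunch", "ryan"): 1}
print(drop_singletons(edges))
# {('budget', 'ryan'): 5, ('paul', 'ryan'): 12}
```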
Users select if they want the edges to include their reciprocal (i.e. generate a “BA” edge for each “AB” edge).
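One way to picture the reciprocal option is the Python sketch below. It assumes the “BA” edge simply copies the weight of its “AB” edge and that an existing “BA” edge is left untouched; how the macro actually assigns weights to reciprocal edges is not stated here, so treat both choices, and the name `with_reciprocals`, as assumptions.

```python
def with_reciprocals(edge_weights):
    """Add a (b, a) edge for every (a, b) edge, copying its weight if (b, a) is absent."""
    result = dict(edge_weights)
    for (a, b), weight in edge_weights.items():
        result.setdefault((b, a), weight)
    return result

print(with_reciprocals({("Marc", "Smith"): 2}))
# {('Marc', 'Smith'): 2, ('Smith', 'Marc'): 2}
```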
The result is a worksheet with word pair edges and the weights of their frequency of occurrence.
This worksheet can then be imported into a separate NodeXL template using the Import from Open Workbook feature: