This is the collection of keyword pairs that appeared in two clusters of people who tweeted about “Paul Ryan”, the Republican Congressman from Wisconsin who delivered the GOP rebuttal to the 2011 United States State of the Union Address. This network illustrates how certain word pairs appear only or predominantly in one cluster or the other (colored here red and blue). Terms that appeared in both clusters are shown in purple.
Social networks are built from relationships between people. Keyword networks are built from relationships between words and other text strings. When two words appear in the same message or sentence, or alongside one another, ties of different strengths are created. The networks that result can illuminate the relationships among topics of importance in a collection of messages.
Markus Strohmaier from the Technical University Graz (TUG) along with Claudia Wagner gave us inspiration in a paper:
C. Wagner, M. Strohmaier, The Wisdom in Tweetonomies: Acquiring Latent Conceptual Structures from Social Awareness Streams, Semantic Search 2010 Workshop (SemSearch2010), in conjunction with the 19th International World Wide Web Conference (WWW2010), Raleigh, NC, USA, April 26-30, ACM, 2010. (pdf)
in which they defined a range of ways two words (technically these are strings, which may not really be words) can be associated with one another. Words could be linked if they appear in the same tweet, next to one another, or in sequence, among other ways to link terms.
Until now, NodeXL has not had any features for exploring the networks in texts. With the addition of a new macro from Scott Golder, it is fairly simple to extract pairs of keywords from a collection of tweets. NodeXL’s Twitter importer can optionally include the content of each tweet that contained the search term, and this column of text can now be processed into a new network based on the ways words appear together in tweets.
This feature builds on the work of several people. Scott Golder from Cornell started the ball rolling with a simple but effective VBA script that allowed others to build and refine the models of what counts as a tie between two words. Vladimir Barash added several refinements including support for stop word lists to remove common terms. Scott then picked up the code again and added a set of features for selecting the nature of the graph and making it easier to select the options needed.
The code for the Keyword Network macro is below.
The instructions to use it take a few steps to complete:
1. Create a new workbook containing, e.g., a list of tweets or an import from a Twitter search. Save it as .xlsm; the m is important. This can also be an existing NodeXL workbook.
2. Go to Developer -> Macros. Make up a name; it doesn’t matter because it’ll get overwritten. Then press Create. The VBA window will open.
3. In the big text area that says “Sub whatever() End Sub”, select all that text and delete it. Paste in the contents of the text file below.
4. Go to Tools->References. Check the checkboxes for “Microsoft Scripting Runtime” and “Microsoft VBScript Regular Expressions 5.5”. Press OK. Save the file (File->Save) then exit (“Close and return to Microsoft Excel”).
5. Now go to Developer -> Macros. Choose CreateWordNet and press the Run button.
6. It’ll ask you for a worksheet name, a column and a start-row. Then it’ll create a new worksheet with the edgelist in it.
The edge list is not directed (there isn’t really a concept of direction in “co-occurs”) but is weighted. Each pair is weighted by the number of times it appears.
This version also includes options for edge creation.
First, it is now possible to suppress edges of weight=1, which is helpful in getting rid of a lot of garbage.
Second, it is now possible to define edges by adjacency or co-tweeting. Given a tweet of words “w1 w2 w3”, adjacency will give edges w1-w2 and w2-w3, while co-tweeting will give edges w1-w2, w1-w3, and w2-w3.
For edges defined by adjacency, you may choose directed or undirected edges. So a tweet of “Marc Smith Marc” (for example) would generate the weighted directed edges Marc,Smith,1 and Smith,Marc,1 while the sole undirected edge would be Marc,Smith,2. That is, for undirected edges (where ordering doesn’t matter) the words are alphabetized.
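The edge options above can be illustrated with a short Python sketch. This is not Scott’s VBA macro, only a minimal restatement of the rules described here. The function name `word_edges`, the whitespace tokenization, and the decision to skip a word paired with itself are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def word_edges(tweets, mode="adjacency", directed=False):
    """Count weighted word-pair edges across a list of tweet strings.

    mode="adjacency": link each word to the word that follows it.
    mode="cotweet":   link every pair of words in the same tweet.
    Undirected pairs are stored alphabetized so A-B and B-A merge.
    """
    weights = Counter()
    for tweet in tweets:
        words = tweet.split()  # assumption: plain whitespace tokenization
        if mode == "adjacency":
            pairs = zip(words, words[1:])
        else:
            pairs = combinations(words, 2)
        for a, b in pairs:
            if a == b:
                continue  # assumption: a word is never linked to itself
            if mode == "cotweet" or not directed:
                a, b = sorted((a, b))  # alphabetize undirected pairs
            weights[(a, b)] += 1
    return dict(weights)


print(word_edges(["Marc Smith Marc"], directed=True))
# {('Marc', 'Smith'): 1, ('Smith', 'Marc'): 1}
print(word_edges(["Marc Smith Marc"], directed=False))
# {('Marc', 'Smith'): 2}
print(word_edges(["w1 w2 w3"], mode="cotweet"))
# {('w1', 'w2'): 1, ('w1', 'w3'): 1, ('w2', 'w3'): 1}
```

Alphabetizing the undirected pairs is what lets “Marc Smith” and “Smith Marc” accumulate into a single edge of weight 2.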
Start with a NodeXL workbook with a column of text for either Vertices or Edges (or any column of text). Here we have the tweet text of a recent Twitter Search Term network query.
Select “Developer” from the Excel menu and create a new Macro. I take the text of Scott’s macro and paste it here, replacing everything else in the code buffer.
Note the Tools > References selections needed to run this macro! Select Microsoft Scripting Runtime and Microsoft VBScript Regular Expressions 5.5.
Running the Macro:
Scott’s macro presents a series of dialogs to the user (I believe we could do this in a single dialog when we revise):
First we specify the worksheet in the workbook containing the text column to process:
Next we specify the column containing the text to process:
Next we specify the row in which the text starts in that column:
The macro will copy an edge attribute forward if specified (note, I think the *last* attribute for any AB pair is what is reported).
The user is asked if the results should omit the singleton edges, which can be useful.
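As a rough illustration of what omitting singleton edges means, here is a small Python sketch (not the macro’s own code). The function name `drop_singletons`, the `min_weight` parameter, and the sample word pairs are hypothetical.

```python
def drop_singletons(weighted_edges, min_weight=2):
    """Keep only word pairs that occur at least min_weight times."""
    return {pair: w for pair, w in weighted_edges.items() if w >= min_weight}


# Hypothetical edge weights, for illustration only.
edges = {("budget", "ryan"): 5, ("ryan", "typo"): 1, ("gop", "rebuttal"): 2}
print(drop_singletons(edges))
# {('budget', 'ryan'): 5, ('gop', 'rebuttal'): 2}
```

Pairs that appear only once are often noise, so dropping them tends to leave a smaller and more readable graph.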
Edges can be defined as co-sequential or co-cell: i.e., ABCD can generate AB, BC, CD (co-sequential) or AB, AC, AD, etc. (co-cell).
Users select if they want the edges to include their reciprocal (i.e. generate a “BA” edge for each “AB” edge).
The result is a worksheet with word pair edges and the weights of their frequency of occurrence.
This worksheet can then be imported into a separate NodeXL template using the Import from Open Workbook feature:
Create Word Network VB Macro Vdb5
7 thoughts on “Keyword Networks: create word association networks from text with NodeXL (with a macro)”
I have been looking for a way to do something like this but with a large number of PDF documents to create a word cloud. Do you know if that is possible?
This extension to NodeXL is not what you want (it does not work on pdfs).
Perhaps Automap is what you need? See: http://www.casos.cs.cmu.edu/projects/automap/downloads.php
I’m trying to run the macro. It tries to read a stopwords file (common-english-words.txt). I’m not sure of the format of the file it’s trying to read. Can you describe it?
Thank you for the interest in the Keyword Network macro.
Sorry to have omitted that important fact!
A sample stop words file is located at:
The format is a list of words, comma delimited, with no spaces.
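For readers who want to check a stop words file before running the macro, here is a minimal Python sketch of how a file in that format could be read and applied. It is not the macro’s VBA code. The function names and the sample words are hypothetical, and the lowercasing is an assumption.

```python
def parse_stop_words(text):
    """Parse a comma-delimited stop word list such as 'a,able,about'."""
    return {w.strip().lower() for w in text.split(",") if w.strip()}


def remove_stop_words(words, stop_words):
    """Drop any word found in the stop word set (case-insensitive)."""
    return [w for w in words if w.lower() not in stop_words]


# Hypothetical file contents, for illustration only.
stop_words = parse_stop_words("a,able,about,the,to")
print(remove_stop_words("Ryan delivered the rebuttal to the address".split(), stop_words))
# ['Ryan', 'delivered', 'rebuttal', 'address']
```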
I am not having much luck with the downloaded VBA code in the text file. I wonder if you could send me an Excel file with the working code included?
The text analysis macro has now been replaced with an integrated set of content analysis features in the main NodeXL installation. Please install a more recent version of NodeXL (v.219) and follow the instructions in this post: