My talk this year will focus on collecting and analyzing the connections among people and their digital objects, and the insights these tools make possible.
Abstract: While digital content is archived in various ways, the “arcs” or links among people and their digital objects are not systematically saved. Efforts to store social media often overlook the data about collections of connections. The Social Media Research Foundation is dedicated to open tools, open data, and open scholarship related to social media. It is producing tools that can collect, analyze and upload social media data, including the arcs that link people and objects. Using the free and open NodeXL application, users can collect, analyze and visualize complex networks and then upload the data to a growing archive on the web at NodeXLGraphGallery.org. As the group of researchers grows, an archive is being assembled to provide researchers around the world with the data about social media needed to understand the ways computer-mediated communication tools shape society.
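To make the idea of a “collection of connections” concrete, here is a minimal sketch of the kind of edge list NodeXL collects, and one simple thing a researcher can compute from it. The names and arcs are invented for illustration; this is not NodeXL output or its API.

```python
from collections import defaultdict

# Each arc is a directed (source, target) pair, such as a reply or mention:
# the person-to-person links that often go unarchived. All data here is made up.
arcs = [("alice", "bob"), ("alice", "carol"), ("bob", "carol"),
        ("dave", "carol"), ("carol", "alice")]

in_degree = defaultdict(int)
out_degree = defaultdict(int)
for source, target in arcs:
    out_degree[source] += 1
    in_degree[target] += 1

# In-degree is a simple indicator of who attracts connections in the network.
print(max(in_degree, key=in_degree.get))  # carol receives the most arcs
```

Even this trivial computation is impossible if only the content, and not the arcs, has been saved.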
From family photographs and personal papers to health and financial information, vital personal records are becoming digital. At the same time, creation and capture of new digital information has become a part of the daily routine for hundreds of millions of people. But what are the long term prospects for this data?
The combination of new capture devices (more than 1 billion camera phones will be sold in 2010) with the move away from older forms of media is reshaping both our personal and collective memories. The size and complexity of personal collections are growing, these collections are spread across different media (including film and paper!), and the lines between personal and professional, published and unpublished are being redrawn.
Whether these issues are described as personal archiving, lifestreams, personal digital heritage, preserving digital lives, scrapbooking, or managing intellectual estates, they present major challenges for both individuals and institutions: data loss is a nearly universal experience, whether it is due to hardware failure, obsolescence, user error, lack of institutional support, or any one of many other reasons. Some of these losses may not matter, but the early work of the Nobel Prize winners of the 2030s is likely to be digital today, and therefore at risk in ways that previous scientific and literary creations were not. And it isn’t just Nobel winners that matter: the lives of all of us will be preserved in ways not previously possible.
On Tuesday, February 16, the Internet Archive will host a small conference for practitioners in personal digital archiving.
A new book, E-Research: Transformation in Scholarly Practice, edited by Nicholas W. Jankowski, on the ways social science research is being changed by the rise of social media, has just been released by Routledge. My colleagues and I contributed a chapter on the ways that information visualization of social media is a useful technique for identifying research questions and discovering answers about the nature of human association when mediated by computation. The volume contains work from an all-star line-up of researchers who address the opportunities and challenges of performing research with computer-mediated data about social life.
The blurb about the book describes it as:
“No less than a revolutionary transformation of the research enterprise is underway. This transformation extends beyond the natural sciences, where ‘e-research’ has become the modus operandi, and is penetrating the social sciences and humanities, sometimes with differences in accent and label. Many suggest that the very essence of scholarship in these areas is changing. The everyday procedures and practices of traditional forms of scholarship are affected by these and other features of e-research. This volume, which features renowned scholars from across the globe who are active in the social sciences and humanities, provides critical reflection on the overall emergence of e-research, particularly on its adoption and adaptation by the social sciences and humanities.”
Our chapter is “A Picture is Worth a Thousand Questions: Visualization Techniques for Social Science Discovery in Computational Spaces”, co-authored by Howard T. Welser, Thomas Lento, Marc Smith, Eric Gleave and Itai Himelboim. In it, we describe the ways that using information visualizations of social media data sets is a useful way of discovering insights, patterns, and clusters. We illustrate the paper with several examples of social media information visualizations that display the range of behavior among contributors to social media spaces.
Talks from SenseNetworks and MIT made the vision of a continuous “trail” document, assembled by location and biological sensors from every human on earth, seem not so outlandish. Samuel Madden from MIT spoke about opportunistic mobile WiFi connectivity in moving vehicles. MIT rebuilt the WiFi stack to enable 13ms associations with an access point instead of 13-second associations. The result is that a car with such a WiFi card can drive along Boston city streets and exchange about 200KB a minute with open, unsecured access points along the way. Free bandwidth in the city. What do they do with it? They stream live telemetry from a fleet of cabs. The cabs carry accelerometers and GPS units whose readings are reported in near real time back to a server. Along with the engine computer’s data, they collect a ton of data about traffic and road surface quality. They can see changing patterns in the activity levels of the cabs and infer changing activity at businesses.
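The road-surface inference could work along these lines: group accelerometer readings by road segment and treat higher variance in vertical acceleration as a rougher surface. This is a hypothetical sketch of the idea, not the MIT system; the segment names and readings are invented.

```python
from collections import defaultdict
from statistics import pvariance

def roughness_by_segment(samples):
    """Estimate relative road-surface roughness per segment from cab telemetry.

    samples: list of (segment_id, vertical_accel) readings, a stand-in for a
    GPS-matched accelerometer stream. Bumpier roads shake the cab more, so the
    variance of vertical acceleration is higher on rough segments.
    """
    by_segment = defaultdict(list)
    for segment, accel in samples:
        by_segment[segment].append(accel)
    # Need at least two readings to compute a variance for a segment.
    return {seg: pvariance(vals) for seg, vals in by_segment.items()
            if len(vals) > 1}

samples = [("elm_st", 0.1), ("elm_st", -0.2), ("elm_st", 0.3),
           ("main_st", 0.02), ("main_st", -0.01), ("main_st", 0.03)]
scores = roughness_by_segment(samples)
print(scores["elm_st"] > scores["main_st"])  # True: elm_st is rougher
```

With a whole fleet reporting, the same aggregation over many cabs and days would smooth out individual vehicles and drivers.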
A major theme of several presentations was crowdsourcing for science, with talks about eBird.org and Galaxy Zoo highlighting a distinction between sites that enable a group to collect data (eBird, with the associated issues of data validity) and sites that enable a group to annotate data that has already been expertly collected (Galaxy Zoo).
Matthew Salganik: Community-Generated and Community-Sorted Information. In his presentation, Matt made a remarkable connection between deliberative democracy and the cat-comparison site Kitten Wars. His talk introduced a model for a kind of “Am I Hot or Not” for political discussions. His group built a web site that helped the student community at Princeton set its priorities for student government. The work has significant implications for deliberation tools in organizations and enterprises. Unlike systems that simply encourage users to contribute ideas to a potentially long and never-acted-upon list, this system poses a comparison task that can be performed in one click but demands implicit contrasts and estimations of value. The almost addictive “hot or not” style interface (or, more accurately, Kitten Wars style) lets users decide between, for example, longer hours for the student cafeteria or expanded video rental services, see their choice in the context of others’ choices, and then choose between two things again. After a population has run through a set of pair-wise contrasts, a broader sense of the community’s priorities can be calculated.
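The priority calculation at the end could be sketched as follows. This is a deliberately simple win-rate model with smoothing, not Salganik’s actual algorithm, and the idea names are invented examples.

```python
from collections import defaultdict

def rank_ideas(votes):
    """Rank ideas from one-click pairwise votes.

    votes: list of (winner, loser) tuples, one per comparison a user made.
    Scores each idea by a smoothed win rate, (wins + 1) / (appearances + 2),
    so ideas shown only a few times are not ranked too confidently.
    Returns ideas sorted from highest to lowest score.
    """
    wins = defaultdict(int)
    appearances = defaultdict(int)
    for winner, loser in votes:
        wins[winner] += 1
        appearances[winner] += 1
        appearances[loser] += 1
    score = {idea: (wins[idea] + 1) / (appearances[idea] + 2)
             for idea in appearances}
    return sorted(score, key=score.get, reverse=True)

votes = [
    ("longer cafeteria hours", "expanded video rentals"),
    ("longer cafeteria hours", "more bike racks"),
    ("more bike racks", "expanded video rentals"),
]
print(rank_ideas(votes))
# "longer cafeteria hours" tops the list: 2 wins in 2 appearances
```

Each vote costs the user a single click, yet across a population the pairwise contrasts add up to a ranked priority list no individual had to write.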
In my talk, I focused on the idea that information wants neither to be free nor expensive; rather, information wants to be copied. Like DNA, any string of bits aims to make duplicate copies of itself. Several technical realities mean that while information may exist on a spectrum from private to public, it only moves in one direction (toward public) and almost never back. Once made public on the Internet, even if only for a moment, a photo, document, or other digital object has almost certainly been copied, indexed, backed up, or replicated. Any effort to delete a digital object once it is widely distributed is like trying to take wine out of water. This is because all cryptography becomes brittle over time, most bits end up exposed after they get distributed, and more events trigger widespread distribution of bits than expected (for example, linking a photo and a location to a tweet that gets copied to LinkedIn and Facebook, then appears in an RSS feed and is copied from there to FriendFeed). As it travels, information loses more of the access controls that initially made it relatively private, until it is effectively public.
Sadly, no picture for Luis von Ahn’s talk; however, the presentation was a fascinating review of the CAPTCHA and reCAPTCHA services and his new direction of providing translation services as language-learning games. Luis von Ahn invented the CAPTCHA, felt bad about the cumulative human time wasted filling out those squiggle-word puzzles to get onto web sites, and decided to harness CAPTCHAs to a useful task: text recognition for books. To recover words from bad scans of books that OCR software fails to recognize correctly, the garbled images are presented to humans, who, collectively, have transcribed millions of previously unintelligible words. His new project is to expand beyond the small population of bi- or multilingual speakers who can translate between languages. The approach applies the “Mechanical Turk” “human intelligence task” concept to language translation: his service presents foreign-language sentences to users with dictionary translations of each word listed below, and users click the best word beneath each foreign word. The surprising results: pretty good translations AND users start learning a foreign language!
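The collective transcription step can be approximated by a simple agreement rule: accept a word only once enough independent users have typed the same answer. This is a hedged sketch of the general crowdsourcing pattern; the thresholds and function name are my own, not reCAPTCHA’s actual policy.

```python
from collections import Counter

def aggregate_transcriptions(guesses, min_agreement=0.75, min_votes=3):
    """Accept a human transcription of an unreadable scanned word only when
    enough independent users agree on the same normalized answer.

    guesses: list of strings typed by different users for the same image.
    Returns the agreed word, or None if there are too few votes or too
    little consensus.
    """
    if len(guesses) < min_votes:
        return None
    # Normalize trivially different answers before counting.
    counts = Counter(g.strip().lower() for g in guesses)
    word, count = counts.most_common(1)[0]
    return word if count / len(guesses) >= min_agreement else None

print(aggregate_transcriptions(["Boston", "boston", "Bosten", "boston"]))  # boston
print(aggregate_transcriptions(["cat", "cot"]))  # None: too few votes
```

The same accept-on-agreement idea carries over to the translation game: a word choice that many independent learners converge on is probably a good translation.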