We will be liveblogging (when possible) from ICWSM 2010, going on now!
Keynote: Bob Kraut, CMU
implications for community design
-offline theories of socialization helpful, not definitive
-online communities can build in good socialization practice-e.g. WP welcoming committee
Two Types of Commitments to Groups
-identity based groups
-bond based groups
Added Identity & Bond Features to MovieLens
Introduced Subgroups into MovieLens
Identity features that focus on subgroups
bond-based design: +11% logins
identity-based design: +44% logins
interventions based on theory increased commitment
why stronger effect of identity?
-time course: social identity can form instantaneously, bonds take time
new approaches to translate theory to design
-ABMs (agent-based models)
test identity vs. bond design via abm
incorporate design into abm
what are the consequences of discussion moderation
-what type of moderation should be imposed? when?
results: indiv moderation helps logins
-personalized mod improves info and social benefit
-comm level mod improves info benefit only in homogeneous communities
***Influence and Composition in Social Networks***
Measuring Influence of Neighbors in Social Networks (Cosley, Huttenlocher, Kleinberg, Lan, Suri)
Measuring Influence in 2 types of socnet data
-Snapshots
–provides state of network at specific time
–often have more than one
-Complete time series
–often from database dump
–provides timestamped history of all events
-This talk: How do measurements of social influence in these two settings compare?
p(k) = Pr(individual adopts new behavior | k neighbors have)
Analyzing Influence in Wikipedia
-Wikipedia as a social network
–registered users (500K) represented by nodes
–communicate through user talk pages
—Definition: Link between u and v forms at time t if one edited the other’s talk page at time t
-Definition: A community is a set of users who edited an article
-Analyze: Pr(user joins a community | k of his friends have)
Ordinal Time Method for Measuring Influence:
-p_o(k) = # instances (over all C) where u had k neighbors in C and joined/ # instances (over all C) where u had k neighbors in C
-advantage: very easy to interpret
-disadvantage: requires complete time series
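A rough sketch of the ordinal-time count, as I understood it from the talk (not the authors' code). The event format, the function name, and the choice to track only k >= 1 are my assumptions. Every time a non-member reaches k friends inside a community, that counts as one instance for k; if the user joins while at k, it also counts in the numerator.

```python
from collections import defaultdict

def ordinal_time_influence(events):
    """events: (t, kind, a, b) tuples.
    kind == "link": users a and b become neighbors at time t.
    kind == "join": user a joins community b at time t.
    Returns {k: estimated p_o(k)} for k >= 1 (assumption: k = 0 is ignored)."""
    friends = defaultdict(set)    # user -> neighbors
    joined = defaultdict(set)     # user -> communities already joined
    exposure = defaultdict(int)   # (user, community) -> current k for non-members
    num = defaultdict(int)        # joins that happened at exactly k neighbors
    den = defaultdict(int)        # times a non-member reached k neighbors

    def expose(user, community):
        exposure[(user, community)] += 1
        den[exposure[(user, community)]] += 1

    for t, kind, a, b in sorted(events, key=lambda e: e[0]):
        if kind == "link":
            if a == b or b in friends[a]:
                continue
            friends[a].add(b)
            friends[b].add(a)
            for u, v in ((a, b), (b, a)):
                for c in joined[v]:
                    if c not in joined[u]:
                        expose(u, c)
        elif kind == "join":
            if b in joined[a]:
                continue
            k = exposure.pop((a, b), 0)
            if k >= 1:
                num[k] += 1
            joined[a].add(b)
            for w in friends[a]:
                if b not in joined[w]:
                    expose(w, b)
    return {k: num[k] / den[k] for k in sorted(den)}
```

Note that a user who joins at k = 3 has already been counted in the denominators for k = 1 and k = 2, which is the kind of accumulation the talk comes back to later.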
Snapshot Time Method for Measuring Influence:
-2 snapshots of the network taken at times t1 and t2
-p_s(k) = # instances (over all C) where u had k neighbors in C at t1 and joined before t2 / # instances (over all C) where u had k neighbors in C at t1 and had not joined before t1
-advantages: requires only a snapshot
-disadvantage: coarse-grained, don’t know changes in friends between t1 and t2
-results: p_o(k) behaves very differently from p_s(k) on Wikipedia
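For comparison, a minimal sketch of the snapshot estimate, assuming we only have the friendship graph and memberships at t1 and the memberships at t2. The function name and input layout are mine, and as with the ordinal sketch I only look at k >= 1.

```python
from collections import defaultdict

def snapshot_influence(friends_t1, members_t1, members_t2):
    """friends_t1: user -> set of neighbors at t1.
    members_t1 / members_t2: community -> set of members at t1 / t2.
    Returns {k: estimated p_s(k)} for k >= 1."""
    num = defaultdict(int)
    den = defaultdict(int)
    for community, mem1 in members_t1.items():
        mem2 = members_t2.get(community, set())
        for user, nbrs in friends_t1.items():
            if user in mem1:
                continue                 # already a member at t1
            k = len(nbrs & mem1)         # neighbors inside the community at t1
            if k == 0:
                continue
            den[k] += 1
            if user in mem2:             # joined some time between t1 and t2
                num[k] += 1
    return {k: num[k] / den[k] for k in sorted(den)}
```

Each (user, community) pair is counted once, at its t1 value of k, so any friends gained between t1 and t2 are invisible here.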
How are p_o(k) and p_s(k) different?
-accumulation effect in p_o(k) where community joining events can contribute to p_o(1)…p_o(k)
Approximating Ordinal Time from Snapshots
-At each snapshot know:
–which users joined which communities
–how many friends each user had in each community
-Don’t know exact number of friends user had when joined
-Assume a constant rate of getting friends
Ethnicity in Social Networks (Facebook Data Team)
-Questions about how ethnicities engage online are prevalent
-Our goal is to better understand ethnicity on social networks
–How do we estimate ethnic distributions on social networks?
–Many services (e.g. Facebook) do not ask their users about their ethnicity
–Our approach infers ethnicities from surnames (based on previous approaches)
Topic modeling approach
-words are names, topics are ethnicities
-Generative process: Draw ethnic breakdown of aggregate population w/Dirichlet distribution
–For each person draw ethnicity of individual z_n ~ Multinomial
–Draw surname of individual based on ethnicity ~ Multinomial conditioned on ethnicity
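To make the generative story above concrete, here is a small forward-sampling sketch. It only simulates the process (no inference), and the group labels, surname tables and alpha value are made-up placeholders, not anything from the talk.

```python
import random

def sample_population(n_people, alpha, surname_dists, rng):
    """surname_dists: group label -> {surname: probability} (hypothetical tables).
    Returns (theta, people), where theta is the sampled population breakdown
    and people is a list of (group, surname) pairs."""
    groups = list(surname_dists)
    # Symmetric Dirichlet draw via normalized Gamma variates.
    gammas = [rng.gammavariate(alpha, 1.0) for _ in groups]
    total = sum(gammas)
    theta = [g / total for g in gammas]
    people = []
    for _ in range(n_people):
        z = rng.choices(groups, weights=theta)[0]            # z_n ~ Multinomial(theta)
        names, probs = zip(*surname_dists[z].items())
        surname = rng.choices(names, weights=probs)[0]       # surname | z_n
        people.append((z, surname))
    return theta, people

# Placeholder data, purely illustrative.
dists = {
    "group_a": {"name_1": 0.7, "name_2": 0.3},
    "group_b": {"name_2": 0.2, "name_3": 0.8},
}
theta, people = sample_population(50, 1.0, dists, random.Random(1))
```

The inference problem in the talk runs the other way: only the surnames are observed, and the breakdown and the per-person assignments are what get estimated.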
Because there is no FB ground truth, we test our method by scraping 10k users from Myspace
Results: Model does a lot better than naive guessing based on census, internet usage statistics
Also tested on FB data (no ground truth but can have qual results):
1. In 2006, AsAm and White overrepresented on FB, by 2008 ethnicities much closer to mark
2. Drew heatmaps of US, showed that Whites concentrated in the northeast, Asians on the coasts, etc.
3. Correlated with politics: AsAm more likely to be liberal, Whites more likely to be conservative/libertarian
-Strong racial homophily (strongest in Hispanics and Blacks, weakest in Whites)
Homophily by ordinal friendships: racial homophily is strongest for the first few friendships a person makes
Friend you communicate with the most is most likely to be homophilous friend
Who makes friends in social media and why? Rich get Richer vs Seek and Ye Shall Find (Zeynep)
Importance of friendship and social networks vs. debate about social isolation
Architecture of our Commons
-Sociality does not happen in a vacuum
-Mediated Relationships (through digital means)
–Are these relationships: Superficial, weak, fake? Supplementary to offline sociality? As real or more?
–Why does this debate refuse to die?
–Internet communication: Anonymous, text-based, fleeting
–Lack of visual cues
but today’s Internet:
–social media: non-transitory interactions, lots of visual cues
Two dominant theories: Rich get Richer vs. Seek and Ye Shall Find
-very diverse school
Results: feeling about whether online friendships are possible vs. not possible – split down the middle
Qualitative Component: why do respondents feel online friendships possible / not possible
*Not Possible reasons:
-body language and mannerisms
-shared emotions and experiences
-deeper connections – easier (“it may even be easier online as it is all dialogue and no physical characteristics involved”)
-deeper connections – judgment
-bonding is possible
-conversation as key
***Diffusion and Dynamics in Networks***
Social Dynamics of Activity in a Virtual World (Bakshy et al.)
Second Life paper
Second Life: persistent, immersive virtual world
-User driven objects, economy, society
RQs: 1. How much economic activity occurs in virtual world?
2. Role of groups?
3. How much virtual interaction?
a) 65 mln user-user transactions (virtual goods)
b) Buddy graph – 4.2 million users, 43 million relationships, focus on strong edges: reciprocal and permissive (can see each other’s online status)
c) chat between 14 million pairs of users
d) 520k groups, 23 million user/group memberships
Economic Activity in Second Life
-Same “sectors” as realworld: retail, real estate, entertainment
-29 million free transactions, 36 million paid, power-law distribution of exchange amounts
Seller in detail: shows that interaction does not necessarily correspond to exchange
Analysis hints at Seller roles? Profiles?
-show how friends of friends of initial buyers go on to buy as well. Is it influence or homophily? Not sure.
Role of Social ties
-39% of free transactions (but only 7% of paid) were between friends
-40% of users that chat exchange free items, 12% of users that chat engage in paid transactions
Free transactions may be more representative of social activity than paid transactions
Role of Groups
-Groups indicate aspect of user interest
-Sellers more likely to be connected to buyers through co-group than through friendship
Long discussion of why groups are proxy for connections
Explaining Seller Success
Traditional success measures: revenue, repeat business. Regression analysis
-Amount made by friends (homophily? business partnerships?)
-Less chatting with customers
-The younger crowd (newer to 2nd Life)
For Repeat Business:
–Sharing a group, chatting
-More established crowd
-More diverse buyer base
Your Brain on Facebook (Fisher and Counts)
EEG: social vs. traditional media
Why do this? Possible input/feedback to social interaction systems. Short term: Inform design through better understanding of automatic info processing
EEG very good at detecting semantic mismatch
RQs: is Myspace associated with frivolity? Is connecting with friends on Facebook connected with feelings of intimacy?
Media and concepts:
-Media: TV, Books, Social, News
-Concepts: addictive, story, interesting, frivolous, personal, useful
-Showed both front page of FB and personal FB page (asked study participants to briefly friend study on Facebook)
Measurement and Method:
1. Timed, binary decision (low conscious reasoning)
2. Likert scale untimed survey (high conscious reasoning)
-Computer-based task: 24 combinations x 22 trials = 528 trials
Results (survey): FB more addictive, but less useful than news. Tells less of a story than books or TV. More frivolous than books, news. More personal than all other forms of media.
Results (decision task): FB very addictive, tells less of a story than books or TV, a lot more personal than other forms of media. Overall, quite similar to questionnaire responses.
Viz: heatmap of head. Time chart. In time chart, look for potential drop ~400 ms after stimulus. The more similar the media to the concept, the bigger the drop.
Results (eeg): FB equal to other media in terms of addictive, interesting, useful, frivolous. Tells less of a story than other media. Is less personal than other media.
Implications, suggestions, limitations
Media parity: FB as interesting, useful, addictive and not frivolous as other media
Personalization and self-identification: Purposeful connection building; easy switching to close friends, family; identity vs. bond attachment
Form of media found most personal by eeg? Books!
Telling stories – status updates on storylines?
Making it tangible: What’s the ratty, marked-up favorite book equivalent in social media? Photo album equivalent?
Limitations: only one form of social networking / media, limited subject population, can we believe EEG results?
Opportunities: compare online media, other subject populations, corroborate with other physiology, expand into real-time capture of physiology
Social Causality and Analysis of Interpersonal Relationships in Online Blogs and Forums (Roxana Girju)
Social causality: causal reasoning used by intelligent agents in a social environment
Modeling social causality – can guide conversation strategies, facilitate modeling and understanding of social emotions, bring new insights
Our focus: social causality as captured through analysis of interpersonal relations in social media
-Pervasive set of English reciprocal textual contexts encoding interpersonal relationships
-data: 11K reciprocal relationship contexts coded
Reciprocity in Language: “The Golden Rule”
Properties of interpersonal verbs and reciprocal instances: Symmetry, Affective Value (4-state HMM), Intentionality of Actions
-Intentionality and affective values of interpersonal verbs highly correlated with blame and responsibility
Analyzing Social Interactions:
-Overall 54.34% of dataset was encoded by ambiguous symmetric patterns
-top frequent verb pairs: need-need, love-love
-followed by: hate-hate, miss-miss…
-In general we love people who love/understand/care/need us
-Gender analysis: men initiate more often than females, retaliate more often than women, are more violent and aggressive (whereas women are more forgiving), but this depends on class of verb
-Men and women generally mutually respectful, it is only when respect is broken that responses may differ (e.g. women: cheat -> hate, despise, sue. men: cheat -> dump, divorce)
-Intentionality of actions: actions much more often perceived as intentional in bad-bad exchanges than in good-good exchanges
-Reciprocity Chains: Dyadic (formed between 2 people of the form A v B -> B v A -> A v B). Very useful in micro-level social interaction analysis. General (between multiple people)
In generic chains: retaliation with increased magnitude chains, good for good chains (short), good for bad chains (turn the other cheek)
How does data sampling strategy impact discovery of information diffusion in social media? (Munmun)
How can we sample social web?
Many different modes of social interaction
Scale of interest – viral marketing, ad campaigns
Is there more in social media than just scale?
Social media can have enormous power, e.g. for diffusion
However, inference of such processes is based on the quality of data
Current methods: random walk, snowball – captures structure but not content or context
RQ1: what is role of context in sampling social phenomena, RQ2: how much should we sample to capture the process
Data: Twitter. Look at diffusion via RT feature, shared URL, same hashtag
Model: Diffusion series. Has slots of individuals involved in diffusion process, links between individuals based on relationships
Sampling strategies: given N, the number of nodes to pick, topic T, social graph G
Ignoring social graph: 1. Random sampling from seeds, 2. Attribute / context sampling.
Using social graph: Forest fire (again, random or attribute)
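For readers who have not seen forest fire sampling, a simplified sketch follows. This is my own stripped-down variant, not the sampler used in the paper: it assumes an undirected adjacency dict, a forward burning probability below 1, and seeds that are either random nodes or nodes picked by some attribute (location, activity and so on).

```python
import random
from collections import deque

def forest_fire_sample(graph, n, seeds, p_forward=0.7, rng=None):
    """graph: node -> iterable of neighbors. seeds: preferred starting nodes.
    Returns a set of up to n sampled nodes. Assumes 0 <= p_forward < 1."""
    rng = rng or random.Random()
    n = min(n, len(graph))
    sampled = set()
    while len(sampled) < n:
        # Start a new fire at an unsampled seed, or anywhere if seeds run out.
        candidates = [s for s in seeds if s not in sampled]
        if not candidates:
            candidates = [v for v in graph if v not in sampled]
        start = rng.choice(candidates)
        sampled.add(start)
        queue = deque([start])
        while queue and len(sampled) < n:
            node = queue.popleft()
            unburned = [v for v in graph.get(node, ()) if v not in sampled]
            rng.shuffle(unburned)
            # Geometric number of neighbors to burn from this node.
            burn = 0
            while rng.random() < p_forward:
                burn += 1
            for v in unburned[:burn]:
                if len(sampled) >= n:
                    break
                sampled.add(v)
                queue.append(v)
    return sampled
```

Passing `seeds=[u for u in graph if attrs[u] == "NYC"]` (with a hypothetical `attrs` lookup) would give the attribute-seeded flavor; passing a random subset of nodes gives the random flavor.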
Diffusion Saturation Metrics: user-based (volume, participation, dissemination). Topology-based: (Reach, spread, cascade instances, collection size), Time-based: rate
Our sample S distorts some metric M
Diffusion response metrics: correlate diffusion on twitter to external behavior (search and news trends)
Reference Set: ~465K users, 836K edges, 30M tweets. 125 randomly chosen “trending topics” from Twitter between Oct and Nov 2009
Trending topic – theme association
Results: bias due to sampling consistent, best results come out of forest fire. For search trends, forest fire + location performs best, for news trends, forest fire + activity performs best
Larger-scale analysis: look at topic distribution by sampling strategy
Inferences about social data affected by sampling strategy. Topic + topology + seed attribute makes a difference to sampling.
Photo Tagging over Time: A Longitudinal Study of the Role of Attention, Network Density and Motivations (Paul Russo)
Tagging over time
Overview: what factors influence users to tag?
1. How do individual motivations affect tagging,
2. what effect does receiving attention from users have on tagging tenure
Data: Flickr – focused on established users
-People tag for themselves (archiving, retrieval) and for others (describing)
-Many ways to receive attention from others
-Huberman, Romero, Wu demonstrated that on YouTube people whose work received more views tended to post more videos, leading to a submission cycle
-Lento et al. photo tagging behavior
New: look at network structure (clustering coefficient)
H1: attention, enjoyment, commitment should increase tagging. Density should decrease tagging. ((blogger comment: what about social grooming?))
Method: 90 days of tagging on Flickr, used only “pro” users w/3 months tenure
-Combines user-reported (survey) data and system data: what people say and what they do
-Attention: comments, Density = network, motivation = survey, measured on Likert scale.
-DV = #tags/photo over 90 days
-Results: generally bear out hypotheses.
-Interesting: low density net individuals tag more than high density for same level of attention, but higher levels of attention lead to more tagging regardless of density. On the other hand, density has positive slope w.r.t. commitment and tagging only for low density networks; for high density networks, higher levels of attention lead to less tagging.
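Since the density measure here is the clustering coefficient, a quick sketch of the usual local version may help: the fraction of a user's contact pairs that are themselves connected. I am assuming an undirected contact graph; the talk did not say exactly how the Flickr contact network was treated.

```python
from itertools import combinations

def local_clustering(graph, node):
    """graph: node -> set of neighbors (assumed symmetric).
    Returns the fraction of neighbor pairs of `node` that are linked."""
    nbrs = graph[node]
    degree = len(nbrs)
    if degree < 2:
        return 0.0   # no neighbor pairs to check
    links = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
    return 2 * links / (degree * (degree - 1))

contacts = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
```

In this toy graph "b" sits in a fully closed neighborhood (coefficient 1.0), while "a" has three contacts of which only one pair knows each other.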
Microblogging Inside and Outside the Workplace (Ehrlich and Shami)
Method: Twitter vs. Bluetwit (internal tool)
Data: 34 users from 15 countries and 8 business units that used both Twitter and Bluetwit and 20 posts in each over 4 month period
-Twitter much more active than Bluetwit
Dataset: data collected for BlueTwit and Twitter over 4 month period. 19K posts in two tools. Extracted 4 weeks of tweets.
Manually coded 5k microblogging posts
codes: status, providing information, retweet, ask question, directed, directed q
Categories of Posts: Most frequent use – provide information, directed posts
Internal use: ask questions, directed posts, style is work-oriented
External use: provide information, style is more “social”
Public vs. Private: clear sense of what is appropriate for an internal-only audience
Reputation Management: Internal: importance of giving back
External: Publicity and Promotion
Fostering Connections: developing better awareness of professional connections in advance of a future planned or unplanned meeting
Consumption of microblogs:
-Microblogs provide early access to human selected information
-How is microblogging useful in a work context?
–Form of crowd-sourcing – asking questions and providing answers
-What differs between public and private?
–Confidentiality (very clear)
–Style of writing
–Awareness of audience knowledge and interests
–Some erosion of boundaries – despite difference lots overlap
-Why are people consuming microblogs?
–Motivation for posting – Building reputation, awareness
–Motivation for reading – early, quality news
Measuring Influence in Twitter: the Million Follower Fallacy (Cha et al.)
Goal: characterize influence in social media and study its dynamics
1. How can we measure influence of a single user?
2. Does influence of user hold across topics?
3. What behaviors make ordinary users influential?
Why Twitter? One of most popular social media, social links are the primary way information flows, traditional media sources and word-of-mouth coexist in this environment
-54m users, 2B follow links, 1.7B tweets
–8.5% of profiles private
–95% users belong to the giant component
–low reciprocity (10%)
–Power law degree distribution w/extremely large hubs (500 users have more than 100K followers)
–Low tweeting activity in general (only 11% of all users posted at least 10 tweets)
Three measures of influence: indegree (followers), retweets, mentions
Are these three measures related? Compared relative ranks of users across 3 measures using Spearman’s rank correlation
Correlated for full population, but not for top 10% or top 1% (but retweets corr mention remains high)
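As a reminder of what the rank comparison does, here is a small pure-Python Spearman correlation (average ranks for ties, then Pearson on the ranks). The follower and retweet numbers in the usage lines are invented for illustration.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for m in range(i, j + 1):
            ranks[order[m]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

followers = [120, 5000, 40, 900000]   # hypothetical users
retweets = [3, 80, 1, 10]
rho = spearman(followers, retweets)
```

Because only ranks matter, one account with an enormous follower count does not dominate the statistic the way it would with a plain Pearson correlation.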
top list of indegree = mix of news outlets and public figures, top retweets = celebrities, etc.
Million follower fallacy – Britney Spears has millions of followers but she doesn’t show up on top retweeted list
–Find users engage in multiple topics. Picked 3 popular topics in 2009 over 2 month period, iran election, death of MJ, swine flu
–Focused on 13k people who talked about all 3 topics
Conclusion: just because you have a lot of followers doesn’t mean they retweet you
Along a similar idea, Tweets vs. followers in NodeXL:
Directed Closure Process in Hybrid Social-Information Networks w/Focus on Link Formation on Twitter (Romero and Kleinberg)
How do directed closed triads form?
Triadic closure vs. directed closure:
-Triadic: an edge connects 2 nodes who already have a common neighbor
-Directed: a node A links to node C to which it already has a 2-step path (through node B)
-An edge in directed graph exhibits closure if it completes a 2-step path
-Closure ratio of node C is the fraction of C’s incoming edges that exhibit closure
-could indicate how many nodes discovered C by following nodes that follow C
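The definitions above translate fairly directly into code. A minimal sketch, assuming we know the order in which follow edges were created (the function name and input format are mine):

```python
from collections import defaultdict

def closure_ratios(edges):
    """edges: (follower, followee) pairs in order of creation.
    An edge A -> C exhibits closure if, when it is created, some B already
    satisfies A -> B and B -> C. Returns {node: closure ratio}."""
    follows = defaultdict(set)     # node -> nodes it follows
    followers = defaultdict(set)   # node -> nodes following it
    closed = defaultdict(int)
    total = defaultdict(int)
    for a, c in edges:
        if a == c or c in follows[a]:
            continue
        total[c] += 1
        if follows[a] & followers[c]:   # an existing 2-step path A -> B -> C
            closed[c] += 1
        follows[a].add(c)
        followers[c].add(a)
    return {c: closed[c] / total[c] for c in total}

ratios = closure_ratios([("b", "c"), ("a", "b"), ("a", "c"), ("d", "c")])
```

In the toy example only a's edge to c completes a 2-step path (through b), so c's closure ratio is one third.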
Data: random sample of 18 Twitter micro-celebrities: Users with between 10K and 50K followers
Notation: user A is k-linked to C if A follows C and also follows k followers of C. Let s_k(C) denote set of followers k-linked to C. f(s_k(C)) = fraction of set whose edge to C exhibits closure.
Q1: Is directed closure a significant process? Randomization test
A: up to large k, f(s_k(C)) is significantly bigger than one would expect under random ordering
Observation: f(s_k(C)) increases with k, but flattens out. Why??
Properties observed: closure ratio saturates to a positive constant f, constant f is different for different micro-celebrities, constant f not closely related to total in-degree of micro-celebrity
Heuristic calculation suggests that the sum of in-degrees of incoming nodes closely predicts closure ratio
Improved Model: predict closure ratio not just by in-degree but by sum of in-degree of incoming nodes. Also sum of in-degrees of incoming nodes from same community predicts closure ratio even better!
Conclusion: definition and methodology for directed closure, evidence for directed closure on Twitter, evidence that sum of in-degrees of incoming nodes & nodes from same community predicts closure ratio on Twitter
Characterizing Microblogs with Topic Models (Ramage, Dumais and Liebling)
Do people like the posts they see?
43 users at MSR looked at 60 posts, judged as: “not really worth reading <-> maybe worth time spent reading <-> worth the time spent reading.”
the average judgment was maybe or worse, nobody on average judged posts as worth time
Fundamental problem: people followed <> Tweets worth reading
What factors go into deciding if a user is worth following?
Method: Structured interviews with heavy users, Broader survey of 56 Twitter users
Kinds of Topics for Tweets. Hobbies, professions, news, products, events = Substance. Updates about meals, travel, hygiene = status. Making plans, networking, staying in touch = Social. Humor, wit, whininess, diction, worldview = Style. Different topics liked by different users.
Shows that content is important.
Content modeling: 8.2m “Spritzer” Tweets from 2008
-Surface word features = tf idf cosine similarity, etc.
-won’t look at deeper features (e.g. parsing) – not very appropriate for Tweets
-Desparsify: Topic Models, LDA.
-Want some labels (hashtags, emoticons, questions, etc.) = Naive Bayes, SVM, etc.
-Combined surface word features, LDA, labels = labeled LDA
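For the surface word features, a bare-bones tf-idf plus cosine similarity sketch is below. It assumes whitespace tokenization and the plain log(N/df) weighting, which is one common variant and not necessarily the one used in the paper; the example posts are invented.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    df = Counter(t for toks in tokenized for t in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors (0.0 if either is all zeros)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

posts = ["coffee this morning", "coffee and more coffee", "election results tonight"]
v0, v1, v2 = tfidf_vectors(posts)
```

Short posts share very few words, so most pairs come out with similarity zero; that sparsity is the motivation for the topic-model step in the list above.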
Content modeling with labeled LDA:
1. Discover unlabeled topics w/ k=200 latent topic dimensions (e.g. politics, sleep)
2. model common labels = 500-1000 dimensions for hashtags, emoticons, etc.
Twitter content by category: manually aggregate topics into one or more of 45 categories.
Results: 38% style, 23% social, 27% substance, 12% status
Filtering: Tweet stream re-ranking
Split rater’s posts into train 70% and test 30%
re-rank test set by distance to positive examples
consider judgment of maybe or worth time as “positive”
Mean reciprocal Rank @ 1 Relevant: best performance = Labeled LDA + tf-idf, .75
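Quick sketch of the metric: for each rater, take the re-ranked test list, find the position of the first post judged positive, and average 1/position over raters. The post ids in the example are made up.

```python
def reciprocal_rank(ranked_items, relevant):
    """1 / position of the first relevant item, or 0.0 if none is ranked."""
    for position, item in enumerate(ranked_items, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(rankings):
    """rankings: list of (ranked_items, relevant_set) pairs, one per rater."""
    return sum(reciprocal_rank(r, rel) for r, rel in rankings) / len(rankings)

mrr = mean_reciprocal_rank([
    (["p1", "p2", "p3"], {"p1"}),          # first hit at rank 1
    (["p4", "p5", "p6"], {"p6"}),          # first hit at rank 3
])
```

A score of 1.0 would mean every rater's top re-ranked post was one they judged positive.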
Finding: User recommendation task
Rater’s followed user: train 6/7 followers and test 1/7. Find the test user among 8 other non-followed users. Ranking task: score by reciprocal rank of test user. Performance > .9
Next steps: better interfaces for finding and filtering and models that account for temporal dynamics