In recent years, the mass availability of social network data, most notably through the twitter API, has opened at least two new research paths.
The ‘big data’ approach allows scientists to record huge datasets using APIs such as the twitter streaming API, and then process them using data mining and network analysis techniques requiring accordingly huge computational power (see for example the work of Jure Leskovec and the Stanford Network Analysis Project). To them, the twitter API appears to offer the best of two worlds: data on the scale of web analytics, yet as clean and qualified as the data that traditional techniques, such as interviews, would provide. This approach has already been able to verify sociological principles on a larger scale than ever before.
The ‘just-in-time’ approach, on the other hand, leverages the ubiquity of social network data in a different way: it intends to “analyze social phenomenon as they unfold” using the real-time data provided, again, by APIs such as the twitter streaming API. The just-in-time approach has notably proven its worth during recent social events that relied heavily on social networks, such as the Occupy Wall Street movement. Such data would have taken months to collect and process manually.
But what happens when one is too late for real-time? What remains of the real-time data after real-time is over? A recent study suggests that web resources meant for immediate consumption decay extremely quickly. And though it remains available, twitter data becomes difficult to search and access after a few days, so it cannot be relied on exclusively.
As part of my PhD, I am trying to map the path of an erroneous quote that circulated on social networks following the death of Osama Bin Laden on May 2, 2011. The corpus contains web articles, pages, blog posts and tweets. These research notes will focus on the collection and initial processing of the twitter data.
We chose this example because of several interesting properties:
– The inception point of the cascade was known precisely, which is unusual, especially when it is located on Facebook.
– We wanted our example to spread across a wide range of sites and services. The different components of the cascade allow us to study various phenomena while still discussing the same example: tracking a quote, a link, a tag, a conversation.
– There were investigative accounts of the incident written by professional journalists. These articles can be used as secondary sources and reference points.
On May 2, 2011, US President Barack Obama appeared on television to announce the death of Osama bin Laden. The news had already leaked before the official announcement and crowds had begun to gather in symbolic places such as outside the White House or New York’s Times Square. After the announcement, many Americans simply went outside to rejoice, often waving flags or chanting “U-S-A!”. Meanwhile, a sense of unease began to grow among other Americans. Jessica Dovey, an American working as a school teacher in Japan, posted the following message on her Facebook wall:
At some undetermined point in time, an altered and misattributed version of the quote began to circulate on twitter, resulting in a large and multi-layered information cascade:
“I mourn the loss of thousands of precious lives, but I will not rejoice in the death of one, not even an enemy.”
Martin Luther King, Jr
In a matter of hours, reports that the quote had been wrongly attributed to Martin Luther King began to surface, which gave rise to a sub-cascade that ran parallel to the previous one.
This incident attracted a fair amount of media coverage, so it is already well documented. The purpose of our work is to determine what remains of it a year later, using only resources freely available on the web. As noted in the introduction, these research notes will focus on the twitter portion of the cascade.
In all experiments, we focused on the time period ranging from midnight on May 2, 2011 to midnight on May 8, 2011. All Python scripts used are available upon request.
twitter was designed first and foremost to be used in real-time. The limitations of the twitter API reflect this design choice. It only allows searching the last 6-9 days of tweets, which makes it well suited to finding out “what’s happening right now”. Searching for archived tweets is clearly not an intended use.
Several services offer partial twitter archives. Among those, Topsy was the only one to provide free access, to support complex queries (with time boundaries, crucially), and to offer an API. Topsy also has the interesting property of keeping a record of tweets and twitter users even after they were deleted from twitter.
However, Topsy has its own limitations. It records only a subset of the complete twitter output, and does not record the friends and followers of twitter users. Topsy also archives content beyond twitter such as links and images.
The ‘Quote’ corpus
Our first task was to collect the various tweets containing the altered Martin Luther King quote. The ‘quote’ corpus was collected using otter, the Topsy API. We wanted to identify as many mentions of the fake Martin Luther King quote as possible, even those containing abbreviations or spelling errors. We queried otter for the most distinctive words, “mourn” and “enemy”, whose combination seems unusual enough within 140 characters to provide reasonably precise results.
A quick survey of the Topsy website shows the heuristic to be effective: the only results in the targeted time frame (the first week of May 2011) show a variant of the fake quote.
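The keyword heuristic and the time window can also be re-applied locally to any set of collected records. The following is a minimal sketch, not the script used for this work: the record layout, the sample texts and the choice of UTC are assumptions made for illustration.

```python
from datetime import datetime, timezone

# Study window from the text: midnight May 2, 2011 to midnight May 8, 2011.
# Using UTC is an assumption; the notes do not state a time zone.
WINDOW_START = datetime(2011, 5, 2, tzinfo=timezone.utc)
WINDOW_END = datetime(2011, 5, 8, tzinfo=timezone.utc)
KEYWORDS = ("mourn", "enemy")


def matches_heuristic(text, posted_at):
    """True if the text contains every keyword and falls inside the window."""
    lowered = text.lower()
    has_keywords = all(word in lowered for word in KEYWORDS)
    return has_keywords and WINDOW_START <= posted_at < WINDOW_END


# Hypothetical records, for illustration only.
sample = [
    ("I mourn the loss of thousands of precious lives, but I will not "
     "rejoice in the death of one, not even an enemy. MLK",
     datetime(2011, 5, 2, 18, 0, tzinfo=timezone.utc)),
    ("Love your enemy.", datetime(2011, 5, 3, 9, 0, tzinfo=timezone.utc)),
    ("We mourn; the enemy is gone.", datetime(2011, 6, 1, tzinfo=timezone.utc)),
]
kept = [text for text, posted_at in sample if matches_heuristic(text, posted_at)]
```

Because the test is a substring match, it also accepts inflected forms such as “mourning”, which helps with spelling variants at the cost of some precision.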
The API query brought back 2742 pieces of content archived by Topsy, including 2657 tweets and totaling 2615 authors. The various types of content archived break down as follows:
video : 1
tweet : 2657
link : 80
image : 4
The links contain mostly Facebook posts, blog posts expanding on the quote, and services such as TweetLonger. The video shows the famous “I have a dream” speech. The erroneous quote was used in the video description, presumably to draw more traffic.
We are only interested in the tweets, so all other types of content are discarded.
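The breakdown by content type and the filtering step can be sketched as follows. The records and the `type` field name are hypothetical; they stand in for whatever structure the archive actually returns.

```python
from collections import Counter

# Hypothetical records; the "type" field name is an assumption.
records = [
    {"type": "tweet", "id": 1},
    {"type": "link", "id": 2},
    {"type": "tweet", "id": 3},
    {"type": "image", "id": 4},
]

# Count each content type, then keep only the tweets.
type_counts = Counter(record["type"] for record in records)
tweets_only = [record for record in records if record["type"] == "tweet"]
```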
The curve starts relatively flat before showing an inflection: the quote went viral after being posted by several high-profile twitter users, notably Penn Jillette.
This observation is consistent with the findings of Cha et al.: on twitter, celebrities have disproportionately high influence on topics outside of their area of expertise.
The retweet (RT) functionality of twitter allows for the quick dissemination of information, creating so-called information cascades.
The Topsy API indicates that a tweet is a RT, but since the introduction of the official RT feature in 2009, RTs do not keep track of the path of a tweet between its author and the user who retweeted it. When user ‘a’ retweets a tweet from user ‘b’, we can infer that a path exists between ‘a’ and ‘b’, but we know neither its length nor its topology.
To extract cascades from the ‘quote’ corpus, we grouped the retweets of the same tweet and sorted them by date. We identified 339 micro-cascades ranging from 2 to 97 nodes.
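The grouping step described above can be sketched as follows. This is an illustration under stated assumptions: each retweet is represented as a (source tweet ID, retweeting user, timestamp) tuple, and `min_size` is a hypothetical threshold, since the notes do not say how nodes were counted.

```python
from collections import defaultdict


def build_cascades(retweets, min_size=2):
    """Group retweets by the tweet they repeat, ordered by date.

    Each retweet is a (source_tweet_id, user, timestamp) tuple; this
    layout is an assumption made for the sketch.
    """
    groups = defaultdict(list)
    for source_id, user, timestamp in retweets:
        groups[source_id].append((timestamp, user))
    # Keep only groups large enough to count as a cascade, sorted by time.
    return {
        source_id: sorted(members)
        for source_id, members in groups.items()
        if len(members) >= min_size
    }


# Hypothetical data: three retweets of "t1", one of "t2".
retweets = [
    ("t1", "carol", 30),
    ("t1", "bob", 10),
    ("t2", "dave", 5),
    ("t1", "erin", 20),
]
cascades = build_cascades(retweets)
```

The result only orders the retweets in time; as noted above, it says nothing about the path each retweet actually followed.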
The tweets contain few hashtags, which is not surprising. The quote is already 110 characters long, adding up to 137 with punctuation and attribution.
#mlk : 21
#obl : 10
#martin : 5
#a : 4
#fb : 3
#osama : 2
#things : 2
#quote : 2
#binladen : 2
#sillyusa : 1
#viral : 1
#fail : 1
#powerful : 1
#dangersofsocialmedia : 1
#yokosonews : 1
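A tally like the one above can be produced with a simple regular expression. The sketch below is illustrative only: the pattern is a rough approximation of twitter's hashtag rules, and the sample texts are made up.

```python
import re
from collections import Counter

# Rough approximation of a hashtag: '#' followed by word characters.
HASHTAG = re.compile(r"#\w+")


def count_hashtags(texts):
    """Count hashtags across tweets, ignoring case."""
    counts = Counter()
    for text in texts:
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts


# Hypothetical tweets, for illustration only.
sample = ["Not even an enemy. #MLK #obl", "So true #mlk", "no tags here"]
hashtag_counts = count_hashtags(sample)
```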
On a side note, after it was revealed that the Martin Luther King quote was in fact a fake, the hashtag #fakeMLKquotes enjoyed a brief spike in popularity. This kind of meta-commentary by twitter users on a twitter-centric phenomenon concludes the cycle.
The first tweet identifying the Martin Luther King quote as a fake that we found was this one:
This tweet went essentially unnoticed: Topsy recorded only one RT, while twitter shows two. With 15000 followers, “political math” certainly has a wider than average audience, but he is nowhere near the almost two million followers of Penn Jillette (there are conflicting reports on the average number of followers of a twitter account, but most figures have 3 digits. A recent and well documented blog post gives the figure of an average of 235 followers for active users).
Moreover, the work of Hsu et al. on Occupy Wall Street has shown that dissenting voices have a hard time being heard on twitter. Since he neither provided a source to substantiate his claim nor was a celebrity, “political math” had virtually no chance of being heard.
Nevertheless, the fact that the quote had been misattributed did spread on twitter, not from a celebrity account but from outside, through the other most influential force on twitter: the traditional media.
The ‘fake’ corpus was also collected from Topsy but in a less straightforward way. Unlike with the ‘quote’ corpus, we had no way of knowing in advance the content of the tweets we were looking for.
We used an iterative approach. First, a series of three two-word seed queries were used:
mlk+quote : 2903 results
fake+mlk : 1247 results
fake+quote : 278 results
After removing duplicates, we are left with 2921 tweets, which seems low given the amount of media coverage surrounding the controversy. Another issue is that our queries are rather vague and could potentially bring back noisy results. We need a way to boost both recall and precision.
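Merging the seed queries and removing duplicates can be sketched as follows. The record layout and the `id` field are assumptions; any stable tweet identifier would serve as the key.

```python
def merge_results(*result_sets):
    """Merge result lists, keeping the first occurrence of each tweet ID."""
    merged = {}
    for results in result_sets:
        for tweet in results:
            merged.setdefault(tweet["id"], tweet)
    return list(merged.values())


# Hypothetical results for the three seed queries.
mlk_quote = [{"id": 1, "text": "that mlk quote..."}, {"id": 2, "text": "mlk quote?"}]
fake_mlk = [{"id": 2, "text": "mlk quote?"}, {"id": 3, "text": "fake mlk"}]
fake_quote = [{"id": 3, "text": "fake mlk"}]

fake_corpus = merge_results(mlk_quote, fake_mlk, fake_quote)
```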
The backlash phase originated in websites outside of twitter, which resulted in many tweets including a URL pointing to a source. The importance of URLs in information contagion is well studied, for example by Galuba et al. in 2010. Recently, Myers et al. showed that URL mentions on twitter can be the result of both internal and external influences, and found politics and world news to be the most externally-driven topics.
To extract the URLs, we searched the tweets of the ‘fake’ corpus using regular expressions. twitter users make use of various URL-shortening services to accommodate the 140-character limit of tweets. We had to expand the collected URLs to obtain a list of the web resources designated by our short URLs.
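The extraction and expansion steps can be sketched as follows. This is not the script used for this work: the pattern is deliberately simple, the example URLs are hypothetical, and `expand_url` needs network access and assumes the shortening service answers HEAD requests with standard HTTP redirects.

```python
import re
import urllib.request

# Deliberately simple pattern: a scheme followed by non-space characters.
URL_PATTERN = re.compile(r"https?://[^\s]+")


def extract_urls(text):
    """Return the URLs found in a tweet, trimming trailing punctuation."""
    return [url.rstrip(".,;:!?)\"'") for url in URL_PATTERN.findall(text)]


def expand_url(short_url, timeout=10):
    """Follow redirects and return the landing URL (needs network access)."""
    request = urllib.request.Request(short_url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.geturl()


# Hypothetical tweet text, for illustration only.
urls = extract_urls(
    "That MLK quote is fake: http://bit.ly/abc123, see also http://example.com/post."
)
```

Short links that have since expired raise an error in `expand_url`, so a real run would need to catch and log those failures.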
With our newly expanded link list, we query Topsy again, this time looking for any tweet that mentions one of the landing pages. This technique expands our results and helps correct for the vagueness of the initial queries, as a tweet pointing to one of the URLs has a higher chance of being relevant. We are left with 4721 tweets in the ‘links’ corpus.
While this technique enhances precision and recall, it should be noted that the very first tweet identifying the quote as a fake does not contain a URL and is thus absent from the ‘links’ corpus.
Short URLs are an interesting artifact to track information circulation. URL-shortening services create a unique short URL for each user submitting a regular URL. Sometimes the shortened URL can circulate from user to user, indicating an unacknowledged contagion.
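One way to look for such traces is to list the short URLs posted by more than one user. The sketch below assumes each tweet is a (timestamp, user, short URL) tuple and uses made-up data; it is only meaningful on tweets that are not already marked as RTs, since a retweet naturally carries the same short URL.

```python
from collections import defaultdict


def shared_short_urls(tweets):
    """Map each short URL posted by several users to those users, in posting order."""
    users_by_url = defaultdict(list)
    for timestamp, user, short_url in sorted(tweets):
        if user not in users_by_url[short_url]:
            users_by_url[short_url].append(user)
    return {url: users for url, users in users_by_url.items() if len(users) > 1}


# Hypothetical tweets: (timestamp, user, short URL).
tweets = [
    (3, "carol", "http://bit.ly/x1"),
    (1, "alice", "http://bit.ly/x1"),
    (2, "bob", "http://bit.ly/y2"),
]
shared = shared_short_urls(tweets)
```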
Identifying the underlying following / followers network proved to be the most daunting task of all. Topsy does not record this data so we had no choice but to query twitter directly. The limitations of the twitter API are even more stringent than those of otter, with only 350 queries per hour allowed.
Moreover, in about 18 months, this network has probably changed completely. Of the 2615 accounts in the ‘quote’ corpus, more than 15 percent were inaccessible at the time of our work:
Closed : 296
Hidden : 94
Suspended : 4
Even among those whose account remained public and active, the friends and followers have inevitably changed during the months that elapsed between the events and the time of this experiment.
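The rate limit mentioned above gives a rough lower bound on collection time. The estimate below assumes one friends call and one followers call per account, which ignores pagination for accounts with long lists, so the real figure can only be higher.

```python
import math

ACCOUNTS = 2615          # authors in the 'quote' corpus (from the text)
QUERIES_PER_HOUR = 350   # twitter API limit (from the text)
CALLS_PER_ACCOUNT = 2    # assumption: one friends call and one followers call

total_calls = ACCOUNTS * CALLS_PER_ACCOUNT
hours_needed = math.ceil(total_calls / QUERIES_PER_HOUR)
```

Under these assumptions the collection requires 5230 calls, or about 15 hours of continuous querying.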
In conclusion of a recent ‘big data’ article on information propagation on twitter, Romero et al. call for “more fine-grained analyses as well, understanding how patterns of variation at the level of individuals contribute to the overall effects that we observe.” This is the kind of contribution we hope to offer.
There are still two aspects left unexplored in the twitter data we collected.
The first one is the textual content of the tweets. Even in the ‘quote’ corpus, where all tweets are supposed to contain the exact same sentence, small variations exist, e.g. in quotation marks used or in the way ‘Martin Luther King’ is written.
The second one we would like to investigate is the @replies, as they could provide valuable data on relationships between users. The issue is that there is no straightforward way to access all the replies to a given tweet using the twitter API, and we could not find another service offering this data. One could probably build a scraping script to access the data displayed by the twitter web client, but that would violate the twitter TOS. Any change to the twitter web client would render such a script useless.
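For the first aspect, the textual variations, a possible starting point is to normalize the tweets and count the distinct forms that remain. The normalization rules and sample texts below are assumptions chosen for illustration, not an analysis we have carried out.

```python
import re
from collections import Counter

# Map typographic quotation marks to their plain ASCII equivalents.
QUOTE_MARKS = str.maketrans({"“": '"', "”": '"', "‘": "'", "’": "'"})


def normalize(text):
    """Lowercase, unify quotation marks and collapse whitespace."""
    text = text.translate(QUOTE_MARKS).lower()
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical tweet texts, for illustration only.
sample = [
    "“I mourn the loss” MLK",
    '"I mourn the  loss" mlk',
    '"I mourn the loss" Martin Luther King',
]
variant_counts = Counter(normalize(text) for text in sample)
```

Variants that survive normalization, such as the different spellings of the attribution, are the ones worth examining by hand.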
Evaluation of our results
The remanence and decay rate of real-time data is difficult to evaluate. The data we collected is not a proper sample, since our gathering process is obviously influenced by the services we used, mainly Topsy. In other words, an evaluation of the results we presented amounts to an evaluation of Topsy’s coverage.
Topsy only records a subset of the 400 million tweets published each day: tweets that were retweeted or contain an external URL, but there is no way to know exactly how much is discarded.
It is worth noting that the new terms of service of the twitter API, which have been generally regarded as unfavorable by developers since they were announced last summer, are actually welcomed by Topsy. Topsy’s new product, Pro Analytics, now claims to provide “exact counts for any term, any date range, instantly.” When describing its social analytics technology, Topsy states that it provides exact values: “No estimates, no sub-samples. Comprehensive.” This comes in contrast to Otter, the free Topsy API we used for this paper: Otter’s histogram query, for example, is only “accurate up to 3 months for most terms and up to 1 month for very popular terms (like iPhone and Justin Bieber)”.
In a broader perspective, empirical observations suggest that a large part of the cascade we are studying happened outside of the reach of twitter. As Myers et al. recently noted, cascades on twitter appear to jump from place to place, suggesting a larger interaction network of which twitter is only a part. The Martin Luther King quote itself was born on Facebook, to which we have only very partial and indirect access. The second cascade, debunking the fake quote, is fueled by articles published by several blogs and traditional news organizations. Other tools and techniques will be needed to analyze these other components in order to complete the map of the global cascade.