R | Just-In-Time Sociology

In this contribution, we present the R packages twitteR and igraph, R being an open-source environment for statistical computing. These two volunteer-developed, open-source toolboxes permit, respectively, the extraction and the analysis of content from the micro-blogging website Twitter, on which users exchange messages (“tweets”) that do not exceed 140 characters. This website (or online social network) lets a user subscribe to another user, and offers ways to interact with other users via public mentions and “retweets” (the diffusion of someone else’s tweet to one’s own subscribers, also called “followers”). These processes are public in the majority of cases, which produces a large amount of data. To extract some of it (for scientific or private purposes), Twitter gives access to an efficient API, unlike Facebook, which is mostly closed. From the data obtained via the Twitter API, it is possible to extract different structures (which we will later consider as social networks) based on interactions (“who mentions/retweets whom?”) or subscriptions (“who follows whom?”).

The twitteR package is an interface between Twitter and R that makes it possible to extract data and save it as data frames in R, data frames being tables with observations in rows (tweets) and variables in columns (author, date created, geolocation, text, etc.). The igraph package is a fundamental tool for the analysis and visualization of complex and social networks: it contains many functions to manipulate networks and compute structural measures such as centrality, algorithms for community detection, and many layout algorithms to produce quality visual outputs. R oversees the whole and provides efficient statistical tools that help to analyze degree distributions (Pareto’s law, scale-free networks) or to estimate distances in the network (small-world property), as well as regression methods to test the effect of structural elements in a social process.

After installing R, then the two packages, and finally loading them in R, we can download data from Twitter (tweets and user information) via different methods. The one we present here focuses on keywords, and in this special case on hashtags, a way for Twitter users to organize themselves within the chaos of published tweets (thousands every second). The function we use is called searchTwitter(…) and can take as arguments the search string that identifies the sample under study, the number of tweets to return, the date, a geocode, etc. (Example)
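A hashtag search of this kind could be sketched as follows. This is only an illustration under stated assumptions: the wrapper fetch_hashtag_tweets() and its default values are hypothetical names chosen here, not part of twitteR; only searchTwitter() and twListToDF() come from the package. The sketch also assumes that the R session has already been authenticated with Twitter and that the API is reachable.

```r
# Hypothetical helper (not part of twitteR): runs a hashtag search and
# returns the result as a data frame with one row per tweet.
fetch_hashtag_tweets <- function(hashtag, n = 500, since = NULL, until = NULL) {
  # searchTwitter() queries the search API and returns a list of tweets.
  # 'since' and 'until' are optional dates given as "YYYY-MM-DD" strings.
  statuses <- twitteR::searchTwitter(hashtag, n = n, since = since, until = until)

  # twListToDF() flattens that list into a data frame
  # (columns such as text, screenName, created).
  twitteR::twListToDF(statuses)
}

# Possible call, left commented out because it needs credentials and
# network access:
# tweets <- fetch_hashtag_tweets("#EnLD", n = 1000)
```

Defining the helper does not contact Twitter; the request is only sent when the function is called.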

This function has some limitations, such as the limited window of availability of the data (seven to eight days) and the maximum number of results (around 1500). There are ways around them. Firstly, twitteR comes with a package called ROAuth, which lets the researcher authenticate as a “developer” and gain access to a larger number of requests and results. Secondly, it is also possible to call the Twitter API directly via PHP scripts, which demands more programming skill than working in R, but also gives access to a much larger amount of data.

At this point of the presentation, we focus on the structure of interactions through public messages (tweets and retweets). For that, we need to circumscribe the population. A search for a hashtag (like “#EnLD” for the Swiss radio show organizing debates every evening) returns messages written in the context under study, and thereby automatically restricts the sample to the participants of the debate.

Then, some R manipulations via regular expressions make it possible to extract who mentions whom and to build a list of arcs (directed edges), a format that igraph can translate into a network. It is generally recommended to do that with two data frames, one for the arcs and one for the vertices, which yields a network object containing all the available attributes of actors and relations (via the function graph.data.frame(…)). The possibilities of analysis are then varied: global measures (diameter, degree distribution, centralization, etc.) and local measures (centrality, clustering, transitivity, etc.) based on the structure of the community under study, a textual analysis of the tweets (via the tm package, for example) compared with positions in the network, or an exponential random graph model of the network to study the influence of micro-processes on the network structure. [Example of dynamic visualization.]
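The steps above can be sketched on a toy example. The four tweets and the account names are invented for illustration, and the regular expression is one simple assumption about what counts as a username. The sketch calls graph_from_data_frame(), the name under which recent igraph versions provide graph.data.frame(…). Retweets of the form "RT @user" are counted here as mentions of that user.

```r
library(igraph)

# Toy stand-in for the data frame returned by a hashtag search
# (invented accounts and texts, for illustration only).
tweets <- data.frame(
  screenName = c("alice", "bob", "carol", "alice"),
  text = c("@bob good point on #EnLD",
           "RT @alice: the debate starts at 8pm #EnLD",
           "listening to #EnLD tonight",
           "@carol @bob do you agree? #EnLD"),
  stringsAsFactors = FALSE
)

# Find every @username in each tweet. The result is a list with one
# character vector per tweet (empty when the tweet mentions nobody).
mentions <- regmatches(tweets$text, gregexpr("@[A-Za-z0-9_]+", tweets$text))

# Data frame of arcs: one row per (author, mentioned user) pair.
arcs <- data.frame(
  from = tolower(rep(tweets$screenName, lengths(mentions))),
  to   = tolower(sub("^@", "", unlist(mentions))),
  stringsAsFactors = FALSE
)

# Data frame of vertices, with the number of tweets as an actor attribute.
actors <- data.frame(
  name = sort(unique(c(arcs$from, arcs$to))),
  stringsAsFactors = FALSE
)
actors$n_tweets <- as.integer(table(factor(tweets$screenName, levels = actors$name)))

# Directed network carrying the vertex attributes.
g <- graph_from_data_frame(arcs, directed = TRUE, vertices = actors)

vcount(g)                # 3 actors
ecount(g)                # 4 arcs
degree(g, mode = "in")   # mentions received: alice 1, bob 2, carol 1
degree(g, mode = "out")  # mentions sent:     alice 3, bob 1, carol 0
diameter(g)              # longest shortest path, a global measure
```

Repeated mentions between the same two users are kept as parallel arcs; whether to collapse them into a single weighted arc depends on the research question.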