[Tool presentations] R, twitteR and igraph

In this contribution, we present the R packages twitteR and igraph; R itself is an open-source statistical software environment. These two volunteer-developed, open-source toolboxes allow, respectively, the extraction and the analysis of content from the micro-blogging website Twitter, on which users exchange messages (“tweets”) of no more than 140 characters. This website (or online social network) lets a user subscribe to another user, and offers ways to interact with other users via public mentions and “retweets” (the diffusion of someone else’s tweet to one’s own subscribers, also called “followers”). These processes are public in the majority of cases, which produces a large amount of data. To extract some of it, for scientific or private purposes, Twitter gives access to a capable API, unlike Facebook, which is mostly closed. From the data obtained via the Twitter API, it is possible to extract different structures (which we will later consider as social networks) based on interactions (“who mentions/retweets whom?”) or subscriptions (“who follows whom?”).

The twitteR package is an interface between Twitter and R that extracts data and saves them as R data frames, i.e. tables with observations (tweets) in rows and variables (author, creation date, geolocation, text, etc.) in columns. The igraph package is a fundamental tool for the analysis and visualization of complex and social networks: it contains many functions to manipulate networks, compute structural measures such as centrality, detect communities, and lay out networks to produce quality visual output. R oversees the whole and provides efficient statistical tools that help analyze degree distributions (Pareto’s law, scale-free networks) or estimate distances in the network (small-world property), as well as regression methods to test the effect of structural elements on a social process.

After installing R, installing the two packages, and loading them in R, we can download data from Twitter (tweets and user information) via different methods. The one we present here focuses on keywords, and in this particular case on hashtags, a way for Twitter users to organize themselves within the chaos of published tweets (thousands every second). The function we use is called searchTwitter(…) and can take as arguments the search string defining the sample under study, the number of tweets to return, the date, a geocode, etc. (Example)
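
As a minimal sketch of such a call (assuming twitteR is installed, an authenticated session already exists, and the Twitter API version is one the package still supports), the search and the conversion to a data frame could be wrapped as follows. The helper name fetch_hashtag and the value n = 500 are illustrative and not part of the package.

```r
# Hypothetical helper: fetch_hashtag() is not part of twitteR.
# Assumes an authenticated Twitter session has already been set up.
fetch_hashtag <- function(tag, n = 500) {
  # searchTwitter() returns a list of status objects matching the search string
  tweets <- twitteR::searchTwitter(tag, n = n)
  # twListToDF() flattens that list into a data frame (one row per tweet)
  twitteR::twListToDF(tweets)
}

# Usage (requires network access and credentials):
# enld <- fetch_hashtag("#EnLD", n = 500)
# head(enld[, c("screenName", "created", "text")])
```

Defining the helper does not contact Twitter; only the commented-out usage lines would.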

This function has some limitations, such as the limited window of data availability (seven to eight days) and the maximum number of results (around 1500). There are ways around them. Firstly, twitteR works with a package called ROAuth, which lets the researcher authenticate as a “developer” and obtain access to a larger number of requests and results. Secondly, it is also possible to call the Twitter API directly via PHP scripts, which demands better programming skills than R but also gives access to a much larger amount of data.

At this point of the presentation, we focus on the structure of interactions through public messages (tweets and retweets). For that, we need to circumscribe the population. A search for a hashtag (like “#EnLD”, for a Swiss radio show organizing debates every evening) returns messages written about the context under study, and automatically restricts the sample to the participants of the debate.

Then, some R manipulations via regular expressions make it possible to extract who mentions whom and to build a list of arcs (directed edges), a format that igraph can translate into a network. It is generally recommended to do that with two data frames, one for the arcs and one for the vertices, thereby obtaining a network object containing all the available attributes of actors and relations (via the function graph.data.frame(…)). The possibilities of analysis are then varied: global (diameter, degree distribution, centralization, etc.) and local (centrality, clustering, transitivity, etc.) measures based on the structure of the community under study; a textual analysis of the tweets (via the tm package, for example) compared with positions in the network; or an exponential random graph model of the network, to study the influence of micro-processes on the network structure. [Example of dynamic visualization.]
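
The steps above can be sketched on a small mock data set. The four tweets below are invented for illustration; the column names screenName and text follow what twListToDF() returns. The regular expression is a simplification of Twitter's handle rules, and graph_from_data_frame() is the current igraph name for graph.data.frame() (older igraph versions only provide the latter).

```r
library(igraph)

# Mock tweets (invented); columns named as in twListToDF() output
tweets <- data.frame(
  screenName = c("alice", "bob", "carol", "alice"),
  text = c("@bob good point #EnLD",
           "RT @alice: opening statement #EnLD",
           "no mention here #EnLD",
           "@bob @carol do you agree? #EnLD"),
  stringsAsFactors = FALSE
)

# Extract every @handle from each tweet (simplified pattern)
mentions <- regmatches(tweets$text, gregexpr("@[A-Za-z0-9_]+", tweets$text))

# Arc list: one row per (author, mentioned user) pair
arcs <- data.frame(
  from = rep(tweets$screenName, lengths(mentions)),
  to = sub("^@", "", unlist(mentions)),
  weight = 1,
  stringsAsFactors = FALSE
)

# Collapse repeated pairs; weight becomes the number of mentions
arcs <- aggregate(weight ~ from + to, data = arcs, FUN = sum)

# Vertex data frame: first column must hold every name used in the arcs
vertices <- data.frame(
  name = unique(c(tweets$screenName, arcs$to)),
  stringsAsFactors = FALSE
)

# Build the directed network from the two data frames
g <- graph_from_data_frame(arcs, directed = TRUE, vertices = vertices)

# A few of the global and local measures mentioned in the text
in_degree <- degree(g, mode = "in")
diam <- diameter(g)
trans <- transitivity(g)
```

Further vertex attributes (for example a user's follower count) can be added as extra columns of the vertices data frame; they are carried over to the network object.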

7 thoughts on “[Tool presentations] R, twitteR and igraph”

  1. This contribution presents an interesting overview of the statistical open-source software R and two of its packages, twitteR and igraph (the latter also available for C, Python, Ruby), for retrieving data from Twitter and analyzing/visualizing social networks. The draft is clearly written and presents its findings in a comprehensible way. The presented software R is a free, standard statistical tool, and the two packages have been released very recently (twitteR) or are under continuous development (igraph), therefore they represent state-of-the-art tools for social data analysis and are highly relevant for Just-in-time Sociology.

    Besides all the presented features, it would be interesting to see some of the following additions to the draft:

    – Comparison/”Benchmarks”: Are there comparable tools, and how do the presented tools perform in relation to those? What are the distinguishing features of the presented tools, what are their strong and weak points? For example, there are a number of network analysis packages for Python, such as NetworkX or zen, but it is hard to see at first glance what their pros and cons are. Criteria could be anything from speed to “completeness” (i.e. does it have functions for measuring a specific network property or for generating a network from a specific model etc.)

    – Error handling: How does twitteR deal with issues such as downtimes, data inconsistencies, etc., or how should the user expect/deal with those?

    – Store to DB: Using twitteR and R, how easy is it to store tweets in a database? Is it feasible to run twitteR for collecting a stream of tweets or is it better to use alternatives (and which?) for this?

    – Are there any “success stories” apart from the one linked to, i.e. scientific work that has used the presented tools successfully?

    – The dynamic visualization in the end looks very nice. What is it about? Has igraph a built-in function to create animations, or was it done manually from stills?

    • Thanks for your comments.

      – Comparison/”Benchmarks”. To retrieve Twitter data, there exist many tools from different origins (e.g. 3rd party applications via OAuth, scripts for Yahoo pipes/Google spreadsheets, scripts on IFTTT, network analysis software plugins https://gephi.org/plugins/retweet-monitor/), but since they all rely on the Twitter API, the API itself is the definitive and most complete tool. For network analysis, there is a huge choice of software and packages (http://en.wikipedia.org/wiki/Social_network_analysis_software), aiming at visualization (Gephi, Visone), analysis (statnet, NetworkX), large data (NodeXL, Pajek), dynamics (RSiena), and most often at several of these aspects together.

      It would be difficult to discuss all the weak and strong points of all these tools. The twitteR package has the advantage of dealing easily with the JSON format (via the rjson package) and producing data in R format (data frames). The data retrieved contain almost all the attributes the Twitter API can provide, and search handles timestamps, geocodes and radius, and IDs (in the case of searches with potentially too many results).

      The igraph package is, as you say, under continuous development. In my opinion, this is one of the very best tools for network analysis, and it deals with concepts from physics to social sciences. The ability to write one’s own scripts (in R or C++), together with the functions from R and its packages, opens up many possibilities (and it has many generating models implemented).

      – Error handling. Since twitteR doesn’t include streaming possibilities, the package won’t fail when downtimes happen (there will just be no results, since nothing is posted). It must be noted that the package may stop working (completely or partially) at each update of the Twitter API (http://lists.hexdump.org/pipermail/twitter-users-hexdump.org/2012-October/000121.html).

      – Store to DB. Very easy, with the function twListToDF (= twitter list to data frame). To my knowledge, there is no streaming function in twitteR. A solution could be to run a loop with a delaying function (e.g. Sys.sleep), but it looks easier and more reliable to use the Twitter REST API (https://dev.twitter.com/docs/api/1.1).

      – Success stories. Two examples about transportation : http://www.sciencedirect.com/science/article/pii/S1877042812027917# and http://jeffreybreen.wordpress.com/2011/07/04/twitter-text-mining-r-slides/ (As the author of the package – Jeff Gentry – says : “Now I’ve got something cool to point to when people ask for an example of it being useful !”)

      In my opinion, there are many other possible areas, like psychology, social sciences, linguistics, etc. Scientific work in these areas should appear any day now.

      – Dynamic visualization. The nodes are Twitter users, arcs are mentions (with the width of each arc proportional to the number of mentions). Arcs are drawn in red when a mention was made in the last minutes. Actors of this network were live-tweeting the French presidential electoral debate in the frame of a Swiss radio show. The video was produced by plotting the network every minute; a script then used mencoder to compile the frames into a video. R packages to create videos or gifs exist (e.g. “animation” http://cran.r-project.org/web/packages/animation/index.html).

  2. The contribution is well written and very interesting. The contribution can certainly have its place in the jitso conference for various reasons:
    – The extraction and analysis toolkit presented is based on the Twitter API, which is generally considered one of the best real-time datasets of online discourse.
    – The final dynamic visualization proves that the techniques could be used for real-time analysis.
    – The fact that R and the packages described are both open-source and widely used is certainly a plus.

    In order to improve the presentation, I would recommend:
    – To insist more on the methodology connected to the use of these tools. The current proposal is focused on the technical aspects and gives few suggestions on how the tools can be used in a sociological investigation. Though this is not difficult to imagine, some ideas or examples from the authors would be welcome.
    – Less important, though the dynamic visualization presented in the end looks certainly interesting, it is a pity that the position of the nodes is not dynamically spatialized. It would be interesting to know if this is due to a technical problem or to a design decision.

  3. The objective of the article is to present a “ready-made” environment, based on the free software R and igraph, for studying Twitter. The “wrapper” twitteR is a recent addition to the R ecosystem that encapsulates the functionality of the Twitter API in R. This module makes the Twitter stream accessible both for quantitative analysis (R) and for the analysis of relationships (igraph).
    R is a powerful, well-tested tool for data analysis and statistics, especially among the Humanities and Social Sciences communities. For these users, the solution follows a logic of continuity, extending the use of a well-known tool to the field of social network analysis. igraph completes the setup by offering both graph analysis and visualization functions. Thus, the proposed environment combines the levels of functionality necessary for capturing the data stream, statistical processing, and the analysis and visualization of graphs. This set of functionalities covers only part of the needs. Other modules mentioned in the text, such as tm (NLP), are fundamental for reaching content analysis and going beyond the event itself.
    The article clearly presents the technical proposal, namely the pairing of twitteR and R, extended with the igraph module. The simplified technical discourse makes the interest of such a solution easier to understand for a non-technical audience. The effort of clarification is justified and appreciated. The conciseness required by the article format makes it harder to convey the technical details that such an experience warrants; we would have liked more depth, especially a comparison with other available solutions (like Gephi, etc.).
    From a methodological point of view, the proposed instrumentation seems (but is it really the case?) to allow “real-time” observation of interactions and of the circulation of messages in the Twittersphere. The proposed example associated with the hashtag #EnLD raises numerous questions, especially about the hypotheses that allow relational structures derived from Twitter’s internal publication mechanisms to be transposed into a system of interpersonal relationships.
    The innovative nature of the article lies, beyond the tool, in the evolution (even the renewal) of investigation methods and instrumental practices in empirical Social Science research. It appears more and more important for our communities to report on these practices and experiences in order to feed methodological and epistemological thinking. It would be appreciated if the author provided some elements of this kind.

    • Thanks for your comment.

      We have identified 3 main remarks :

      – Comparison. Solutions like Gephi or Pajek offer, to my knowledge, great and various ways (maybe top of the genre in the case of Gephi) to visualize networks and obtain new intuitions. Gephi is a complementary solution to R. Since data formats are compatible (e.g. edge list or adjacency matrix), when exploring the data, Gephi may be used first. But the network analysis still must be made by igraph or a similar tool.

      – Real-time. Sadly, this is not a twitteR feature. This could be done via a loop, but R is not optimized for such a method. It is best to use the Twitter REST API and then import the data into igraph. We can also imagine an R script automatically loading the new elements from the output file.

      – Investigation methods and instrumental practices. Our words were intended to be imprecise on this aspect of our work, since we need more elements in our research to be able to assert that this online relational structure can be understood as a system of interpersonal relationships (hence “social network” in italic). We will be happy to discuss this point further during the workshop.
