Visualising PolySocial Reality (revised)

Sally Applin, Michael Fischer (Centre for Social Anthropology and Computing, University of Kent, UK)

Kevin Walker (Information Experience Design Programme, Royal College of Art, London)




Physiologically, all humans sense, move and communicate in similar ways and have the same sensors, but different experiences (Applin and Fischer, 2011). Furthermore, as humans socialise within different cultures, they interpret signals and interact with each other in vastly different ways. This heterogeneity of human culture happens in both analog and digital communications. Turkle (2011) has used this distinction to differentiate the virtual from real-world experience; however, there is some debate as to whether or not a clear distinction of only two worlds can be drawn in such a dualist manner (e.g., Jurgenson, 2011). The Internet as a human construction has enabled new capabilities for humans (Applin and Fischer, 2011), who are now able to spread awareness of individual cultural practices, while simultaneously creating and facilitating new behaviours that cut across cultures. When we remotely access data and interact with others, while simultaneously moving and interacting within our immediate locale, our individual experiences become multi-threaded, or ‘poly-social.’ As a result, our very conceptions of time and space have become personalised and mostly asynchronous, with the mobile phone often acting as a tool for applying a ‘just-in-time’ model for social planning.

PolySocial Reality (PoSR) describes the multiple, sometimes overlapping, network transaction spaces that people traverse synchronously and asynchronously with others to maintain and use social relationships, and has been developed as a theoretical framework for the global interaction context within which people experience the mobile social web and other forms of communication and social interactions, whether co-located or mediated by technology. We expect new analytics and visualisation tools to enable us to depict and extract new meaning from PolySocial Reality, and our findings in turn can contribute to the development of new and existing tools and technologies, specifically through the integration of real-time datastreams and the interoperability between physical and digital contexts. Thus, the tools can inform social theory as an iterative, ongoing, process. Our data collection methods include a novel form of ‘citizen social science.’

PolySocial Reality as a Theoretical Framework

PoSR is a theory proposed by Applin and Fischer (2012; 2011) that defines relations across the aggregate of all the experienced locations and communications of and between all individual people in multiple networks and/or locales at the same or different times. A simple example of a PoSR context occurs when two people are walking together down a street, while simultaneously texting or communicating through both digital and analog channels that are partially or wholly interleaved, or that replace face-to-face interactions with interactions in a single, dual or multiplexed fashion. PoSR is based upon the core concept that dynamic relational structures emerge from the aggregate of multiplexed asynchronous or synchronous interactions of all individuals within the domain of networked, non-networked, and/or local experiences. We intend to use PoSR to describe and analyse principles underlying instantiations of the emergent ‘network’ comprised by the union of all individual networks, such that patterns in an overall graph representing these can be identified, node-centric projections examined, and sub-graphs compared. Simply put, PoSR helps identify the extent and impact of shared and unshared experience when people are interacting in social networks (both analog and digital).

Primarily we are concerned with whether or not these asynchronous and synchronous multiplexed and/or individual messages are received and acted upon in a way that facilitates human cooperation. If too many messages become overwhelming, meaning can be lost, and lost meaning can manifest, among other things, as misunderstandings that have consequences going well beyond whether or not someone got the message to pick up milk at the store. In the United States, the transportation industry has been transformed by new regulations, as automobile drivers, train engineers, barge captains and airline pilots have all missed messages, with consequences ranging from the fatal, as in a barge disaster in Washington D.C., to the humorous, as in the case of airline pilots missing their airport by 300 miles while being distracted by an iPad.

The practical applications of understanding the global and individual impact of the complex system of interactions represented by PoSR are potentially great, both with respect to improving users’ quality of experience, and the capacity of people to collaborate and collectively contribute to meeting the challenges and opportunities arising from social media.

To go further in understanding the complex relationships PoSR represents requires data. There is a raft of problems in acquiring, analysing and representing this data. However, by looking at local graphs representing the activity of interacting individuals who are partly connected through single and multiple social networks, we can identify some properties of local projections of PoSR to inform hypotheses about individual impacts and the aggregate of these across PoSR more generally. Empirically populating a complete PoSR structure with data is not possible. However, a sample can be drawn from sufficiently large datasets, and examined with tools for analysis and visualisation. Thus our next step is to use publicly available online datasets and analytics tools to test and further develop the PoSR framework.

Researching PolySocial Reality

We are initially addressing three basic questions:

  • Which data sources and behaviours best inform the theory of PolySocial Reality?
  • How do these data sources reflect PolySocial networks?
  • How could pathways through complex interactions—PolySocial trails—be utilised for data analysis and understanding of formative principles underlying PoSR?

To address our first question, about the types of data needed to test our hypotheses about PoSR, we are designing a series of brief case studies to investigate different types of data to inform PoSR. These will comprise, first, an analysis of quantitative data including location, search content and trend data, which are then analysed individually in subsequent self-contained scenarios with a voluntary sample of users, in order to collect qualitative data. Data sources initially include sites with public data APIs such as Twitter, Flickr, Foursquare and Ning, with a bespoke project social portal to be added later that aggregates participant activity on agreed channels for our voluntary participants.

We view the qualitative data collection phase as a form of participatory or ‘citizen science’, which has gained prominence in recent years as a means to, for example, classify galaxies, fold proteins, find new planets, or identify museum artifacts or historical sources; by using human capabilities for pattern recognition, this complements computational approaches to analysing quantitative data. Previous research undertaken by members of our team has investigated the value and dynamics of participatory e-science (see Smith et al, 2009). Our current research extends this work, and is the first we are aware of in the area of citizen social science: utilising voluntary users to contribute qualitative data to inform social theory. For us, such data is intended to illuminate particular micro-level behaviours; at a certain scale it will enable us to extrapolate macro-level social and cultural practices. Together these are intended to build up a rich picture of PoSR.

Data Representation, Visualisation and Analysis

To empirically investigate PoSR we need a means of sampling user activity and flexibly representing relationships – both co-located and digitally networked. We conceptualise the structure of social networks as dynamic graphs. Simple social graphs have been used to study the structure of online social networks (e.g. Mislove et al, 2007), as depicted in Figure 1.


Figure 1. Simple social graph depicting various unidirectional and bidirectional links between individuals as nodes.

However, this representation says nothing about the form or content of these exchanges. When multiple channels or media through which people communicate (including face-to-face) are considered, this can result in a more complex or ‘multiplexed’ network of exchanges. Additional parameters include the strength and quality of ties, the types of exchanges, frequency and duration of contact, and degree of relatedness (see Haythornthwaite, 2005). Gjoka et al (2011) address this through multigraph sampling. Multigraphs – graphs whose nodes may be connected by more than one line – support one of the aspects of PoSR: the multiplicity of different relations that may underlie multiple intersecting social networks. Because of the multiplicity of connections, each type representing a different context for social relations, a projected multigraph is likely to have a higher degree of connectivity. For example, the simple scenario described above, of two co-located individuals, can be represented with multiple connections to each other; they may be connected as friends as well as being co-located in a particular event, as well as through a social networking service, which connects them as well to others (Figure 2).


Figure 2. Multigraph showing multiple connections between two individuals, and to their online social networks.
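As a minimal sketch of the multigraph idea, the two co-located individuals can be stored as a pair of nodes carrying several parallel edges, one per relation type. The node names, relation labels and helper functions below are hypothetical and chosen only for illustration; they are not taken from the project's data or tools.

```python
from collections import defaultdict

# Hypothetical example data: each edge is (person_a, person_b, relation_type).
# Names and relation labels are illustrative only, not drawn from any study data.
edges = [
    ("A", "B", "friendship"),
    ("A", "B", "co-located"),
    ("A", "B", "social_network_service"),
    ("A", "C", "social_network_service"),
    ("B", "D", "social_network_service"),
]


def build_multigraph(edge_list):
    """Group parallel edges by the unordered pair of nodes they connect."""
    graph = defaultdict(list)
    for a, b, relation in edge_list:
        graph[frozenset((a, b))].append(relation)
    return graph


def multiplexity(graph, a, b):
    """Count the distinct relation types that link a and b."""
    return len(set(graph.get(frozenset((a, b)), [])))


multigraph = build_multigraph(edges)
print(multiplexity(multigraph, "A", "B"))  # 3
print(multiplexity(multigraph, "A", "C"))  # 1
```

Keying the edges by an unordered pair keeps every relation between the same two people together, so the multiplexity of a tie can be read off directly.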

Figure 2 shows how a simple instance of PoSR can get complex very quickly. However, combining multigraphs with metagraphs (Basu and Blanning, 1992) appears a reasonable initial mathematical representation for an exploration of PoSR. A metagraph is a graph where each node is a set, and edges thus connect sets (see Fig. 3). A multigraph that includes metagraphs permits us, at least, to represent the data in a form that is interoperable and can be converted into different forms such as matrices, XML or relational data suitable for online analytic tools, for which a range of analytical algorithms has been established.


Figure 3. Metagraph with sets of individuals as nodes.
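A metagraph of this kind can be sketched with sets as nodes and edges that connect one set to another. The set names and the projection rule used here (every member of one set is linked to every member of the other) are assumptions made for illustration, not a definition taken from Basu and Blanning.

```python
# Hypothetical sets of individuals; labels are illustrative only.
street = frozenset({"A", "B"})          # two co-located people
service_x = frozenset({"A", "C", "E"})  # members of one online service
service_y = frozenset({"B", "D"})       # members of another online service

# Each metagraph edge links one set of individuals to another set.
meta_edges = [
    (street, service_x),
    (street, service_y),
]


def shared_members(set_a, set_b):
    """Individuals who belong to both sets and so bridge the two contexts."""
    return set_a & set_b


def flatten(meta_edge_list):
    """Project set-to-set edges onto unordered pairs of individuals.

    Assumption: every member of the source set is linked to every
    member of the target set; self-pairs are skipped.
    """
    pairs = set()
    for source, target in meta_edge_list:
        for u in source:
            for v in target:
                if u != v:
                    pairs.add(frozenset((u, v)))
    return pairs


print(sorted(shared_members(street, service_x)))  # ['A']
print(len(flatten(meta_edges)))                   # 7
```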

Our proposed meta-multigraph can be easily translated into a wide range of more conventional projections of the data, and a range of existing open source tools for visualisation and analysis can be employed, such as Cytoscape (Smoot et al. 2011), GraphViz (Ellson et al. 2003), Jung (O’Madadhain et al. 2003) and R (R Development Core Team 2008).
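One such conventional projection is an adjacency matrix in which each cell counts the parallel edges between two individuals. The sketch below uses plain Python lists and hypothetical edge data; it does not call any of the tools named above.

```python
# Hypothetical edge list: (person_a, person_b, relation_type).
edges = [
    ("A", "B", "friendship"),
    ("A", "B", "co-located"),
    ("A", "B", "social_network_service"),
    ("A", "C", "social_network_service"),
    ("B", "D", "social_network_service"),
]


def adjacency_matrix(edge_list):
    """Return (sorted node labels, symmetric matrix of parallel-edge counts)."""
    nodes = sorted({n for a, b, _ in edge_list for n in (a, b)})
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for a, b, _ in edge_list:
        i, j = index[a], index[b]
        matrix[i][j] += 1
        matrix[j][i] += 1
    return nodes, matrix


nodes, matrix = adjacency_matrix(edges)
print(nodes)
for row in matrix:
    print(row)
```

A matrix of this form discards the relation labels but can be written out as rows of numbers for import into matrix-based analysis tools.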

One way of exploring this data representation is through trails and aggregations of trails. The concept of trails through informational ecologies was initially proposed by Bush (1945) and developed by Peterson and Levene (2003) as a form of navigational or ‘ampliative’ learning. Originally conceptualised for representing individuals moving through an informational space, trail visualisation was developed by Schoonenboom (2007) as multigraphs as well as activity diagrams and interaction maps. Walker (2012) has investigated trails using mobile technologies, in which individuals are conceptualised as situated in overlapping personal, social and physical contexts (see Fig. 4).


Figure 4. Individual situated in personal, social and physical contexts.

Applying this to our original model, this enables us to regard two individuals as nodes in a multigraph, situated in particular contexts, with their communications mediated by available tools and resources. This approach thus opens the possibility to include contextual data such as location in our analysis (Fig. 5).


Figure 5. Meta-multigraph with individual nodes situated in contextual sets.

We aim to investigate trails – essentially linear paths through nonlinear graphs – through meta-multigraphs representing ecologies of networked individuals. PoSR is a model that includes multiplexity as a basic network property. We are looking, however, at the properties of subnetworks whose nodes have differential distributions of multiplexity structure. In particular, we are interested in the relative density and distribution of information between nodes in a PoSR fragment based on the extent of common shared nodes in a multigraph, and in the impact of this on mechanisms for mobilising the unshared information of others in the network.
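As a rough sketch of these two ideas, a trail can be treated as an ordered list of nodes in which each consecutive pair is linked, and the 'extent of common shared nodes' can be approximated by the neighbours two individuals have in common. The graph, function names and this neighbour-based measure are assumptions made for illustration only.

```python
# Hypothetical undirected multigraph: unordered node pair -> relation types.
links = {
    frozenset(("A", "B")): ["friendship", "co-located"],
    frozenset(("B", "D")): ["social_network_service"],
    frozenset(("A", "C")): ["social_network_service"],
    frozenset(("C", "D")): ["social_network_service"],
}


def is_valid_path(path, graph):
    """True if every consecutive pair of nodes in the path is linked."""
    return all(frozenset((u, v)) in graph for u, v in zip(path, path[1:]))


def neighbours(node, graph):
    """All nodes that share at least one edge with the given node."""
    return {other for pair in graph if node in pair
            for other in pair if other != node}


def shared_neighbours(a, b, graph):
    """Nodes adjacent to both a and b."""
    return neighbours(a, graph) & neighbours(b, graph)


print(is_valid_path(["A", "B", "D"], links))       # True
print(is_valid_path(["A", "D"], links))            # False
print(sorted(shared_neighbours("A", "D", links)))  # ['B', 'C']
```

Here A and D are not directly linked, yet two separate paths connect them through shared neighbours, which is the kind of local structure the analysis would compare across subnetworks.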

We are aiming our tools and framework at both social scientists and the developer community: methods for using existing online data analysis and visualisation tools, and new tools and visualisations developed as a result of merging PoSR, trails, multigraphs and metagraphs. Such findings and tools will be made available online and in open source repositories.


Applin, S.A. and Fischer, M.D. (2012). PolySocial Reality: prospects for extending user capabilities beyond mixed, dual and blended reality. Workshop on Location-Based Services in Smart Environments (LAMDa’12), in Proceedings of the 17th International conference on Intelligent user interfaces (Lisbon, Portugal, February 14-17, 2012) IUI ’12. ACM, New York, NY, 393-396.

Applin, S.A. and Fischer, M.D. (2011). A cultural perspective on mixed, dual and blended Reality. Workshop on Location-Based Services in Smart Environments (LAMDa’11), in Proceedings of the 16th international conference on Intelligent user interfaces (Palo Alto, CA, February 13-16, 2011) IUI ’11. ACM, New York, NY, 477-478.

Basu and Blanning (1992) Enterprise Modeling Using Metagraphs. In T. Jelassi, M. R. Klein, and W. M. Mayon-White (Eds.), Decision Support Systems: Experiences and Expectations. Amsterdam: North.

Brown, M. F. (2008) Cultural Relativism 2.0. Current Anthropology 49:3, pp. 363-383.

Bush, V. (1945) As we may think. Atlantic Monthly, July 1945.

Ellson, J., E. R. Gansner , E. Koutsofios , S. C. North , G. Woodhull (2003). Graphviz and dynagraph – static and dynamic graph drawing tools. URL, accessed 8-10-2012.

Gjoka, M., C. T. Butts, M. Kurant, A. Markopoulou (2011). Multigraph Sampling of Online Social Networks. arXiv:1008.2565v2 [cs.NI]. (accessed 20 Sept 2012).

Jurgenson, N. (2011). Digital Dualism vs Augmented Reality. The Society Pages. Cyborgology Blog. (February 24, 2011.) Retrieved December 2, 2012 from

Lee, S. H., P.-J. Kim, and H. Jeong (2006) Statistical properties of sampled networks. Physical Review E, vol. 73, p. 16102.

Mislove, A., M. Marcon, K. Gummadi, P. Druschel, and B. Bhattacharjee (2007) Measurement and analysis of online social networks. In Proc. 7th ACM SIGCOMM Conf. on Internet measurement, San Diego, CA, pp. 29–42.

O’Madadhain, J., D. Fisher, S. White, and Y. Boey (2003). The JUNG (Java Universal Network/Graph) Framework. Technical Report UCI-ICS 03-17. School of Information and Computer Science University of California, Irvine. URL, accessed 8-10-2012.

Peterson, D. and Levene, M. (2003) Trail records and navigational learning. London Review of Education 1(3).

R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL, accessed 8-10-2012.

Rasti, A. H., M. Torkjazi, R. Rejaie, and D. Stutzbach (2008) Evaluating Sampling Techniques for Large Dynamic Graphs. Univ. Oregon, Tech. Rep. CIS-TR-08-01.

Smith, H., J. Underwood, G. Fitzpatrick, K. Walker, J. Good, R. Luckin, D. Rowland and S. Benford (2009) Sustainability requirements for online science communities and resources. CAL’09 conference, 23-25 Mar 2009. …/4b1d37ee-7ead-432a-b636-c9201da52bce (accessed 30 Sept 2012).

Spiro, M. E. (1986). Cultural Relativism and the Future of Anthropology. Cultural Anthropology 1:3, pp 259–286.

Smoot, M., K. Ono, J. Ruscheinski, P. Wang, T. Ideker (2011). Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics 27(3): 431–432.

Turkle, S. (2011). Alone together: Why we expect more from technology and less from each other. New York: Basic Books.


Removal of temporarily published articles

As decided during the JITSO 2012 programme discussion, eight contributions have been accepted (cf. the schedule of JITSO 2012). We have unpublished contributions which have not been accepted. Where an accepted contribution has been revised, we have also unpublished the un-revised versions of the contributions.

[Research notes] How to use on-line data on September 11th?

(Application of P. F. Lazarsfeld’s ‘elaboration model’ in just-in-time research.)
Hynek Jerabek
Charles University, Prague, Czech Republic


  1. Can we find some general, culturally independent, repeated spontaneous reactions to the exceptional events of September 11th?
  2. Can we validate some culturally independent (repeated in different parts of the world) pattern of communication behaviour in the hours after the September 11th events?

We have voluntary responses from a non-representative population – 2578 on-line questionnaires, filled in mostly by frequent Internet users, the highly educated, younger and most active members of their societies (2090 Czechs and 488 other nationalities around Europe and the world).
The just-in-time research on September 11th had to be carried out within a very short period of time. It was essential for the respondents to be able to recall as precisely as possible what they did right after the event, their opinions, and their spontaneous reactions at that time. We wanted to determine what their opinions and reactions were like before they were influenced by the mass media, which understandably were commenting on the event in the hours and days after it occurred and that had a certain influence on the views of the population. The 41 questions were concerned mostly with the initial reactions of people and their communication behaviour in the hours after the September 11th events.

The speed with which the research had to be done influenced its design. There was no time to organise a mass collection of data using the traditional methods of F2F or CAPI research, and no financial resources had been set aside for such data collection. It follows that there was neither the time nor the money to select a representative sample of respondents.

On September 11th we designed a questionnaire for on-line international communications research. Just two days after the September 11th events, the Network Media Service (Eva Veisová and her colleagues) began an on-line collection of data in three languages: English, German and Czech. Respondents were recruited through electronic newspapers, the press and a snowballing e-mail distribution of the web addresses of the questionnaires.

The original aim of the researchers was a very ambitious one. We wanted to address ‘people around the world’, including respondents from Asia, Africa, and the Middle East. This is apparent from the identification questions in the questionnaire, which were formulated with this aim in mind, for example the way in which the level of education is coded as ‘education in years’, the characteristics ascertaining religious affiliation (Christian, Muslim, Jew, Buddhist, Atheist, other), language, and continent. It is also apparent from the closed question about the emotions the respondents felt, where we asked, particularly with a view to respondents in the Middle East or other countries, which emotions were the most appropriate. Alongside the options ‘I was surprised’, ‘I was astonished’, ‘I was shocked’, ‘I was afraid’ and ‘I was angry’, we also included the options ‘I was happy’, ‘I was satisfied’, ‘I was glad’, ‘I was pleased’, which we assumed would certainly not be selected by inhabitants of western Europe or the United States. They were clearly intended for future comparative analyses among various cultures and civilisations on our planet. Unfortunately, we received no responses from countries and continents outside Europe and North America. We recommend that future researchers who find themselves in a similar situation prepare contacts in advance with websites in distant countries that can be asked to post the questionnaires. Unfortunately, we did not have these contacts.

The research design as a whole had to work with almost no budget. The costs of the research covered only the work of a team of enthusiastic researchers, several translations of a short questionnaire, and the fees for posting the questionnaire on websites. The speed required in order to capture people’s immediate reactions was also reflected in the data collection. Most data was collected in the first 14 days. The results of the just-in-time research were presented internationally very quickly. The first paper on the results of our research on September 11th was presented at a special roundtable on ‘September 11th’ at the WAPOR Annual Conference in Rome, Italy, on September 21, 2001, i.e. 10 days after the event. The first publication (in English and in Czech) appeared in November 2001 in Prague (Jerabek-Veisova 2001). As a result the research as a whole was not just very quick but also very inexpensive.

However, it was necessary to think about a method of analysing the data and a model for presenting the results that would guarantee the validity of the scientific findings. The sample of respondents in the research was certainly not representative.

The final on-line data file contains 2578 completed questionnaires used in the analysis. The respondents included 2090 Czechs and 488 foreign nationals from literally all over the world: 88 Slovaks, 143 Germans, 104 inhabitants of other European countries, 81 citizens of the United States, Canada, Great Britain and Australia (53 of them from US & CAN), and 72 people from Latin America, Asia, and the Middle East. 59% of the respondents were men and 41% were women. The gender composition is balanced in all language and nationality groups.

How can these data be used to provide some more generalizable statements about the first reactions and communication behaviour of Czech, European, and American populations?
We are aware of the problems related to Internet surveys, as pointed out, for example, by Janet Hoek, P. Gendall, and B. Healey. In their view, the primary question surrounding Internet surveys is: ‘… the extent to which email or web-based surveys can produce estimates that are generalisable to the wider public…’ [Hoek, Gendall & Healey 2001: p.2.] According to these authors on-line surveys of the general public ‘are likely to result in highly skewed samples with a strong bias towards younger, better educated and higher income males.’ [Hoek, Gendall & Healey 2001: p.6]
George Terhanian and his colleagues from Harris Interactive applied an approach to Internet-based surveys of non-probability samples that combines a large non-probability sample drawn from voluntary responses on the Internet with a substantially smaller probability, and thus representative, sample of responses. [Terhanian-Black 1999, Terhanian et al. 2001] With the use of ‘propensity score adjustment’, the results relating to sub-groups of respondents of Internet audiences are re-weighted on the basis of characteristics that influence the respondent’s probability of being an on-line respondent.
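The general idea of re-weighting by the probability of being an on-line respondent can be sketched as follows. This is a simplified, cell-based illustration and not the procedure used by Terhanian and his colleagues: the cells, the counts and the weighting formula (propensity estimated as the on-line share of the pooled respondents in each cell, weight equal to (1 - p) / p) are all assumptions, and the numbers are invented rather than taken from any survey.

```python
# Hypothetical respondent counts per demographic cell; not survey data.
online_counts = {
    "young_high_edu": 600,
    "young_low_edu": 200,
    "older_high_edu": 150,
    "older_low_edu": 50,
}
reference_counts = {  # small probability (representative) sample
    "young_high_edu": 50,
    "young_low_edu": 100,
    "older_high_edu": 75,
    "older_low_edu": 175,
}


def propensity_weights(online, reference):
    """Weight each cell by the odds of NOT being an on-line respondent."""
    weights = {}
    for cell, n_online in online.items():
        n_ref = reference[cell]
        propensity = n_online / (n_online + n_ref)  # estimated P(online | cell)
        weights[cell] = (1 - propensity) / propensity
    return weights


def weighted_share(cell_shares, online, weights):
    """Weighted proportion of an outcome, given its share within each cell."""
    numerator = sum(cell_shares[c] * online[c] * weights[c] for c in online)
    denominator = sum(online[c] * weights[c] for c in online)
    return numerator / denominator


weights = propensity_weights(online_counts, reference_counts)
for cell, weight in weights.items():
    print(cell, round(weight, 3))
```

Cells that are over-represented on-line receive weights below 1 and under-represented cells receive weights above 1, so the weighted on-line sample mirrors the cell distribution of the reference sample.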
We were faced in our comparisons with an even more difficult task than Terhanian and his colleagues. In ‘just-in-time’ research it was impossible for us to carry out a comparative survey of probability sample respondents. Nevertheless, we were and are still interested in finding a way to use the acquired data for statements whose validity is applicable beyond the ‘Internet population’. Our strategy has been based on the ‘Survey Analysis’ approach.
We strove to apply Paul Lazarsfeld’s methodological principles of elaboration [Lazarsfeld 1955 {orig.1946}, Lazarsfeld – Kendall 1950, Rosenberg 1964, Zeisel 1985, Babbie 1994] in a situation of incomplete data representation. We must defend, explain and validate that our separate non-representative groups of volunteer respondents represent in some respect the other parts of the population.
If the statement is repeatedly valid, under the control of all relevant external variables, for all different cultural settings, we can confirm our statement as universal in relation to our scope of inquiry.
Our analysis was aimed at tracing the universal spontaneous reaction, opinions and models of communication behaviour repeated in many cultural and linguistic environments. Any valid statement was conditioned by repetition in different settings and we strove to formulate it as a finding only in the case of the recurrence of identical statements in sub-populations, which mutually and significantly differ from one another.
If the relationship recurred among people from different countries (Germany, US & Canada, Czech Republic), from different parts of the world (Europe, America), we would have grounds for concluding that the original relationship was a genuine and general one. [Compare: Babbie 1994: p.395]
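The replication criterion described above can be sketched as a check that a percentage difference between two groups points in the same direction, and exceeds a chosen threshold, in every sub-sample. The sub-sample labels, counts, threshold and function names below are hypothetical and are not results from the September 11th survey; a full elaboration analysis would also control for relevant external variables.

```python
# Hypothetical counts per sub-sample, NOT survey results:
# (yes answers in group 1, size of group 1, yes answers in group 2, size of group 2)
subsamples = {
    "sub_sample_a": (300, 1000, 450, 1000),
    "sub_sample_b": (20, 80, 30, 60),
    "sub_sample_c": (10, 40, 24, 40),
}


def percentage_difference(yes1, n1, yes2, n2):
    """Percentage-point difference between group 2 and group 1."""
    return 100 * (yes2 / n2 - yes1 / n1)


def replicates(samples, min_diff=5.0):
    """Check whether the difference recurs, in one direction, in all sub-samples.

    min_diff is an arbitrary illustrative threshold in percentage points.
    """
    diffs = {name: percentage_difference(*counts)
             for name, counts in samples.items()}
    all_positive = all(d >= min_diff for d in diffs.values())
    all_negative = all(d <= -min_diff for d in diffs.values())
    return all_positive or all_negative, diffs


recurs, diffs = replicates(subsamples)
print(recurs)
for name, diff in diffs.items():
    print(name, round(diff, 1))
```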
Altogether 99% of the Internet population discussed the events in the USA with friends or relatives. This figure is valid for all territorial parts of our on-line voluntary sample (98% – 100%). Perhaps only a few of us can recall an event that registered as much attention from people throughout the world. Our main hypothesis was confirmed – the events of September 11th became an important stimulus to conversation for people everywhere.
How long did people continue to discuss these events?
The comparative data from our on-line sample show that we can distinguish two groups of respondents. 90% of Czechs, Europeans and Latin Americans talked with their close ones or with friends for at least one hour. Voluntary on-line respondents from the US and Canada and from Asia and the Near East discussed the event significantly less – only about 75% of them for one hour or longer. The distribution of short (as opposed to time-consuming) discussions probably reveals some cultural differences. In America, the percentage of people among the most active and highly educated population (both men and women) who discussed these events for 10 to 59 minutes was many times higher than in the rest of the world.
In spite of these cultural differences we can conclude that interpersonal communication – conversations about this international event – was thus without a doubt one of the main activities of that day, afternoon and evening, in many families throughout the world.
We used (and are interpreting) in our questionnaire two types of questions: open-ended and closed questions. There are two types of reactions to the events of September 11th. The first group of respondents understand the rational meaning of the question as ‘asking for a description’. The second group understand the emotional meaning of the question as ‘asking for sentiments and feelings’.
The ‘open-ended’ question permitted respondents to use their own words, and to use many more different words. People mostly reacted emotionally, and therefore we found in all sub-samples a predominance of words with an emotional content. In the spontaneous reactions of our on-line respondents we counted, without significant territorial differences, 81.6% of words with emotional content: tragedy, catastrophe, angry, shock, afraid, bestiality, madmen, barbarians, apocalypse, Armageddon, horror, sci-fi, unbelievable. The other words (18.4%) have more of a factual content, such as: ‘terrorism’, ‘war’, ‘attack’, ‘global attack’, ‘revenge’, ‘expectation’; or referred to uncertainty: ‘hard to say’, ‘don’t know’, ‘other’. Generally speaking, in the spontaneous reflection of the September 11th events emotional reactions prevailed in all tested territorial populations.

When we asked respondents to evaluate specific words on a scale between ‘the most appropriate’ and ‘the least appropriate’, we found the word ‘terrorism’ in first place among the ‘most appropriate words’ in all sub-samples (see Table 1). The next words from this group of ‘rational’ meanings (attack, war attack, war) were found in third place. The word ‘tragedy’ was in second place (also in all cases) and the word ‘disaster’ was found in fourth place (for most sub-samples). The word ‘assassination’ was very frequent in Europe, especially in the German sub-sample, where it took third position.
In our on-line research the most frequent answer to the questions: ‘How can you best express the emotions you felt when you first received the information? If we ask you directly now and offer a few options, which would you pick as the most appropriate?’, was ‘I was shocked’. 81% of respondents chose this statement as the most suitable description of their emotional state, and 90% of respondents indicated it as ‘most appropriate’ or ‘appropriate’. Other options frequently indicated were: ‘astonished’ (48% – the most appropriate statement) and ‘surprised’ (44% – the most appropriate statement). All groups of respondents indicated positive emotions as an absolutely inappropriate expression of emotion. No one felt happy, no one was glad, satisfied, or pleased. An almost identical outcome was found in all our sub-samples. We can see the results in Figure 1.

Note: A higher average indicates a more appropriate description of the emotional state; 4 = the most appropriate; 0 = the least appropriate

What could be our methodological conclusion from our ‘Just in time’ communication research? We know very well that our experience with the analysis and interpretation of non-representative data could be an unrepeatable and truly exceptional case. The unique, exceptional situation of the September 11th events provided an extraordinary interplay of circumstances in which social scientists had the opportunity, and also the obligation, to study some universal, global tendencies and some inter-culturally repeated relations and facts, which cannot usually be studied. It was for this reason that we did it. We hope that the uniqueness of the September 11th situation can, in some respect, vindicate the really extraordinary sequence of non-traditional deviations from the mainstream of public opinion methodology that we used.

We looked for recurring patterns of spontaneous reactions to the event of September 11th across the different cultures and civilisations the respondents came from. However, this was possible only to the extent to which the sample of respondents allowed. Naturally we also looked for differences within the studied sample of respondents. We found differences by education. However, we only discovered them after our research was repeated by Robert Chung on a representative sample of 520 CATI respondents from the Cantonese-speaking Chinese population in Hong Kong (Chung, Jerabek & Veisova 2002). Our sample was too homogeneous from the perspective of education. The ‘internet population’ in the Czech Republic and in Europe at that time was characterised by above-average education levels. We did not detect differences by age because our respondents were on the whole just young people. We found differences between men and women. They recurred, sometimes perfectly and sometimes partially, in all the studied sub-samples of compared groups of countries. In the case where fear was acknowledged there was an interesting difference in this respect between inhabitants of the United States and Europeans. In the United States the difference between men and women in the acknowledgement of fear of a ‘real threat’ was smaller than the difference in Europe, whose inhabitants could ‘only imagine’ the threat, which was very far away.
We know very well that the uncontrolled attempts of unskilled scholars to repeat our sociological inquiry, without an awareness of the hazards of possible failure, can lead to nonsense. We are prepared to retest all our findings using any other data on the same or similar issue, situation or topic, and we are also prepared to defend our methodological approach against any objections. We are of course also ready to modify, reinterpret or specify our conclusions if there are arguments and evidence strong enough to change our conviction that we controlled our procedures carefully and entered all relevant explanatory factors into our analyses.
I am sure that Lazarsfeld’s elaboration model is robust enough for the non-standard application we used it for. The methodological approach that we applied in this case was really the only way to analyse and interpret our incomplete but extremely interesting and challenging data. I hope that the scientific discussion this paper inspires will introduce some new theoretical results into our field, and I believe that some of the analytical innovations of our approach could open up a fruitful discussion on methodology.

Babbie,E.(1994): The Practice of Social Research. Chapter 16: The Elaboration Model. Wadsworth Publ., New York -, pp. 388-403
Chung, R., Jerabek, H. & Veisová, Eva (2002): 11th September in Cross-cultural Comparisons. Czech Republic & Hong Kong. (Opinions & Communication Behavior in ON-LINE and CATI Surveys). WAPOR 55th ANNUAL CONFERENCE, St. Petersburg Beach, Florida, USA (May 14-16, 2002), CD-ROM Proceedings, Session 5: Public Opinion and Media Responses to September 11.
Hoek, Janet, Gendall, P. & Healey, B.(2001): Web-Based Polling: An Evaluation of Survey Modes. In: WAPOR 54th ANNUAL CONFERENCE, Rome, Italy (September 20-22, 2001), CD – ROM Proceedings, Session A: Internet and survey research, paper A2
Jeřábek, H. & Veisová, Eva (2001): 11th September. International On-line Communication Research. Edition: Sociological papers 01:11. Institute of Sociology Czech Academy of Sciences.
Lazarsfeld,P.F. (1955): Interpretation of Statistical Relations as a Research Operation. In: Lazarsfeld,P.F. & Rosenberg,M. (eds.): The Language of Social Research. New York, The Free Press, pp. 115-125
Lazarsfeld,P.F. & Kendall, Patricia L.(1950): Problems of Survey Analysis. In: Continuities in Social Research: Studies in the Scope and Method of ‘The American Soldier’. Glencoe, Ill., The Free Press, pp.133-196
Rosenbaum, P. R., and Rubin, D. B. (1983): The Central Role of the Propensity Score in Observational Studies for Causal Effects. Biometrika, 70 (1): 41-55.
Rosenbaum, P. R., and Rubin, D. B. (1984): Reducing Bias in Observational Studies Using Subclassification On The Propensity Score. Journal of the American Statistical Association, 79 (387): 516-524.
Rosenberg, M.(1964): The Logic of Survey Analysis. New York, Basic Books
Terhanian, G., & Black, G. S. (1999): Understanding the on-line population: Lessons from the Harris poll and the Harris Poll On-line. In the Advertising Research Foundation’s Towards Validation: On-line Research. New York, Advertising Research Foundation.
Terhanian,G., Bremer,J., Smith, Renee & Thomas, R.K. (2001): A Multi-Method Survey Design Approach for Reducing Error in Internet-Based Surveys of Non-Probability Samples. In: WAPOR 54th ANNUAL CONFERENCE, Rome, Italy (September 20-22, 2001), CD – ROM Proceedings, Session A: Internet and survey research, paper A4
Zeisel,H.(1985): Say it with Figures. New York, Harper & Row

Appendix 1
Part of the Original On-line QUESTIONNAIRE – English version
Please Fill-in the Questionnaire!
International Public Opinion JUST IN TIME Research
Thank you for your interest in the questionnaire. It is a part of an
international public opinion research, which aims to reflect the events
that happened in the US on September 11th.
The time needed for filling-up the questionnaire is only 5 minute of your
time. The answers in the questionnaire are anonymous.
Thank you for your help!
Doc. PhDr.Hynek Jerabek
Charles University, Prague, Czech Republic
Faculty of Social Sciences, Dept. of Sociology
1. Which word would you use to spontaneously describe what happened on
Sept 11, 2001 in the USA?
2. If I ask you directly now and offer a few options, which of them will
you pick as the most/least appropriate? most suitable least suitable
war attack
26. Did you discuss with your friends or relatives the accident?
yes no
27. If yes, how long?
The whole rest of the day
more than 1 hour
half an hour
more than 10 minutes
less than 10 minutes
28. Did you discuss these events in more discussion groups?
yes no
29. How you can best express your emotions when you receive the first information?
30. If we ask you directly now and offer a few options, which of them will you pick as the most appropriate?
most apropriate least appropiate
I was surprised
I was astonished
I was shocked
I was afraid
I was angry
I was happy
I was satisfied
I was glad
I was pleased
31. Continent? select North America, South America, Africa, Europe, Asia, Australia and Oceania
32. Country?
35. Religion?  select Christian Moslem Jew Budhist Atheist other
36. Age?
37. Education? select
less than 5 years
5 – 9 years
10 – 12 years
more than 12 years
Thank you, submit!

[Research notes] The Wukan’s protests: just-in-time identification of international media events

Severo Marta – Université Lille 3 / Laboratoire GERiiCO
Giraud Timothée – CNRS / UMS RIATE
Douay Nicolas – Université Paris-Diderot / UMR Géographie-Cités 


Recently, the emergence of a huge amount of digital traces concerning social phenomena has deeply impacted the research on such items. Social scientists are trying to manage these new data and to find out how they can intervene in the study of their research objects. One of the most interesting perspectives that digital traces can open is surely the chance to study a “just-in-time” phenomenon, as it unfolds. This paper aims at analysing how digital traces can affect the research on international media events and at questioning whether, thanks to this new kind of data, it is possible to identify them as they unfold.
Within the limits of this short research note, we present the research questions and the preliminary results of a research project, called GEOMEDIA, which intends to build a sensor of international media events based on the just-in-time analysis of newspapers’ RSS feeds. This project is piloted by the International College for Territorial Sciences and financed by the French national research agency ANR.

The identification of international media events

In recent decades, several scholars have worked on the definition and identification of media events (Galtung and Ruge, 1965; McCombs and Shaw, 1972; Dayan and Katz, 1992). Among them, some investigated cross-national media coverage of different types of events (Herkenrath and Knoll, 2011; Koopmans and Vliegenthart, 2011) and focused on mechanisms that may explain the diffusion of media attention. By creating a GEOMEDIA research group that combines specialists in geography, media studies and computer science, the International College for Territorial Sciences hopes to develop fruitful interactions between the disciplines needed to study international events (Wolton, 2003) from a multi-dimensional viewpoint (Steinberger et al. 2005), thanks to a “just-in-time” sensor based on media digital traces. Among the numerous research issues that this project addresses, this paper discusses two:

  1. The data. Where to find and how to collect media data useful for a “just-in-time” analysis?
  2. The spatio-temporal analysis of data. How to study the diffusion of event news, and notably the interactions between the spatial and temporal dimensions?

These two issues will be developed through the presentation of a case study: the analysis of the Wukan protests. Wukan is a small village of 20,000 inhabitants in southern China. On the 23rd of September 2011, several newspapers published the news that villagers had rioted over a land grab (Douay, 2011; Douay et al, 2012). Within a few months, about one thousand articles on the Wukan protests had been published in foreign dailies worldwide.

Media data: from commercial databases to RSS feeds

As is well known (Earl et al, 2004), the use of newspaper data for studying events such as collective actions (as in our case) may attract several criticisms concerning data collection and the selection and description bias related to the articles’ content (McCarthy and McPhail, 1996). Yet one of the reasons that motivated our study was the chance of building a coherent and complete corpus of articles.
Another important issue related to this type of data is that they can be retrieved only from commercial databases such as DowJones Factiva (used in this research), LexisNexis or Europresse. The use of these databases is not only expensive, but also raises several technical problems (e.g. it is not possible to extract more than 100 items simultaneously) and methodological problems (e.g. the lack of transparency concerning keywords and the inhomogeneous coverage of sources). This is why our use of this data has so far been limited to counting articles by period (days, weeks, months or years).
For these reasons, as a first step of our project, we are focusing our efforts on the search for other kinds of media data suitable for building a “just-in-time” sensor of international events. We are testing the value of RSS feeds provided by the online versions of newspapers worldwide. RSS feeds are supposed to have three great advantages: they are free; they may be archived and tagged without limits; and they are generally provided as soon as the news is ready (and consequently as the event unfolds), so they can be suitable for a “just-in-time” analysis. We propose to build a database storing RSS feeds associated with articles published in one hundred newspapers in different parts of the world and to extract two types of information: flows among countries and international events. As a first step of this research, we are carrying out some case studies, such as this one about Wukan, to test the validity of RSS items (compared to the full articles) for identifying events, and the value of combining media and geographical data for studying international events.

A preliminary case study: the Wukan’s protests

To study the Wukan protests we used two types of corpus. We started by analysing a traditional corpus of newspaper articles extracted from Factiva. Then, we compared the results obtained from this corpus with a corpus made up of RSS feed items. Since our project has just started, the RSS database at our disposal is still incomplete and it was not possible to perform the same type of quantitative analysis that we performed on Factiva. Yet it was possible to carry out a qualitative analysis in order to highlight the advantages and drawbacks of using RSS feeds as media sensors for the just-in-time identification of international events.
First of all, using Factiva, we collected 952 articles published in newspapers worldwide that matched the search string “wukan” from August 2011 until May 2012. We focused on the geographical and chronological distribution of the articles that talked about the event. To process this data, we developed some R scripts and packages that are available online (in French) and can easily be reapplied to other datasets.
As a second step, we built a corpus of RSS items from newspaper websites worldwide that included the search string “wukan” in the title or in the description during the same period. To do so, we used the newspapers’ RSS feed archive that we are designing in the context of the ANR Geomedia project, which is still in an alpha test phase. Currently, it archives 132 feeds from 41 countries. While this database has the advantage of providing the researcher with easily accessible, just-in-time and free data, it has the important limitation that we are currently archiving only newspapers in French and English. Our dataset is therefore smaller than that of Factiva, which offers media in all languages. We believe, however, that this data is sufficient to identify international media events. Our RSS corpus about the Wukan protests consists of 128 items.
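The selection step described above (keeping RSS items whose title or description contains the search string) can be illustrated with a minimal Python sketch. It is not the Geomedia implementation: the feed content, the function name `filter_items` and the case-insensitive matching are assumptions made for illustration, and only the Python standard library is used.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 snippet, invented for illustration only.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Daily (hypothetical)</title>
    <item>
      <title>Villagers protest in Wukan</title>
      <description>Placeholder description.</description>
      <pubDate>Wed, 14 Dec 2011 08:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Unrelated headline</title>
      <description>Placeholder text mentioning WUKAN in the body.</description>
      <pubDate>Thu, 15 Dec 2011 09:30:00 GMT</pubDate>
    </item>
    <item>
      <title>Another unrelated headline</title>
      <description>Nothing relevant here.</description>
      <pubDate>Fri, 16 Dec 2011 10:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def filter_items(feed_xml, keyword):
    """Return the RSS items whose title or description contains the keyword.

    Matching is case-insensitive (an assumption of this sketch).
    """
    root = ET.fromstring(feed_xml)
    needle = keyword.lower()
    matches = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        description = item.findtext("description", default="")
        if needle in title.lower() or needle in description.lower():
            matches.append(
                {
                    "title": title,
                    "description": description,
                    "pubDate": item.findtext("pubDate", default=""),
                }
            )
    return matches


wukan_items = filter_items(SAMPLE_FEED, "wukan")
```

Because only the title and description are searched, an article that mentions the keyword solely in its full text would be missed; this is one reason an RSS corpus can be smaller than a full-text corpus.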


Figure 1. Geographical distribution of articles about the Wukan’s protests published between September 2011 and May 2012. Source: Factiva

As regards the geographical distribution (fig. 1), articles concerning the event were published in 41 countries. Most of them were published in the United Kingdom (143), Hong Kong (141) and the United States (78). The interest of the Hong Kong newspapers in the events is unsurprising considering their geographical proximity to Wukan and their special attention to events taking place in Mainland China. Yet it is also worth recalling that several major English-language newspapers, both national (e.g. South China Morning Post) and international (e.g. Wall Street Journal, Asia edition), are located in Hong Kong.
As regards the geographical distribution that emerges from the RSS corpus, most items come from the same countries that we identified with the Factiva corpus, namely the US with 48 items and the UK with 38. Hong Kong items are less numerous because most of them are in Chinese and consequently are not yet included in our database. Moreover, even for Hong Kong English-language newspapers included in our database, such as the SCMP, it was not possible to identify all the occurrences because items about Wukan were included in feeds about national politics that are currently not stored in our database. This makes it clear that, in order to use RSS feeds as media sensors for studying the spatial dimension of international events (that is to say, which countries are talking about an event or, more generally, which country is talking about which other country), it is necessary to have an RSS database including newspapers in all the languages used. Building such a database faces two main obstacles: multi-language text analysis is still an emerging research field; and, more importantly, the RSS feeds currently offered by newspapers do not cover all countries in a valid way. Yet, even considering these critical limits, further research may be done to verify whether an RSS newspaper database can be built to study the spatial dimensions of international events (even without worldwide representativeness).

Figure 2. Chronological distribution of articles about the Wukan’s protests published between August 2011 and May 2012. Source: Factiva.

Figure 3. Chronological distribution of items about the Wukan’s protests published between August 2011 and May 2012. Source: Geomedia RSS database.

Figure 4. Chronological distribution of articles and items about the Wukan’s protests published between August 2011 and May 2012 by The New York Times.

As regards the chronological distribution, what is interesting is that the two corpora have quite similar distributions (fig. 2 and fig. 3). We tested this result by comparing the Factiva and RSS data for a single medium, The New York Times (fig. 4). Even if the number of articles and items is not statistically significant, the clear similarity in the distribution of the two sources encourages us to continue working on the validation of RSS feeds as media sensors, especially for studying the chronological dimension of international events.
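The kind of comparison described above, counting publications per period in each corpus and checking whether the two series move together, can be sketched in Python as follows. This is only an illustration: the note compares the distributions visually, so the monthly binning and the Pearson correlation used here are assumptions of the sketch, and all function names are hypothetical.

```python
from collections import Counter
from datetime import date
from math import sqrt


def month_range(start, end):
    """List (year, month) pairs from start to end, both inclusive."""
    months = []
    year, month = start
    while (year, month) <= end:
        months.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months


def monthly_counts(dates, start, end):
    """Count publication dates per month over a fixed observation window."""
    counts = Counter((d.year, d.month) for d in dates)
    return [counts.get(m, 0) for m in month_range(start, end)]


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (None if one is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


# Observation window used in the case study: August 2011 to May 2012.
WINDOW_START, WINDOW_END = (2011, 8), (2012, 5)

# Hypothetical publication dates, invented for illustration only.
toy_articles = [date(2011, 9, 23), date(2011, 12, 14), date(2011, 12, 21)]
toy_items = [date(2011, 12, 15), date(2011, 12, 22)]

article_series = monthly_counts(toy_articles, WINDOW_START, WINDOW_END)
item_series = monthly_counts(toy_items, WINDOW_START, WINDOW_END)
similarity = pearson(article_series, item_series)
```

With counts as small as those of a single newspaper, such a coefficient should be read as a descriptive aid rather than as a statistical test.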

Through the analysis of both corpora, we may highlight the same moments of the protest:

1. The beginning of the protest. In the first weeks (23rd of September 2011 – 4th of October 2011), the Wukan protests clearly did not set the international media agenda (only 28 articles), yet they drew the attention of some international newspapers that decided to cover the news.

2. Explosion of the protest. In the following weeks, newspapers interrupted their coverage of the Wukan events until the end of November, when they reported an escalation of violence through strikes, demonstrations and riots (Koopmans, 2004). A thousand police laid siege to the village on December 14th, preventing food and goods from entering. On December 21st, after several days of resistance, villagers won their “small victory” (Financial Times, 21/12/2011). It is in these days that we find the main media peak.

3. Elections. Another important peak in the press coverage corresponds to the period of the elections in February. On March 3rd the municipal election designated a seven-member village committee, including a village chief and his two deputies, who would control local finances and the sale and apportioning of collectively owned village land.

4. Punishment of involved officials. The last episode in the Wukan coverage occurred in the last week of April, when the Chinese authorities punished 20 officials and former village leaders of Wukan and expelled them from the Party.

Figure 5. Main events identified in the articles about the Wukan’s protests published between September 2011 and May 2012 (see the dynamic graph)


In this paper, using newspapers’ RSS feeds, we analysed a case study, the protests of Wukan, and compared the results with those obtained from a traditional corpus extracted from Factiva in order to test the validity of RSS for studying international media events. On the one hand, as regards the geographical distribution, the Factiva data clearly highlighted the impressive media coverage of this protest, which transformed it from a local movement into a global media event. The RSS data allowed us to identify the countries that published the most items on the event, but could not show the worldwide distribution. Further research is necessary to verify the possibility of building an RSS database to study the global spatial distribution of an international event. On the other hand, as regards the chronological distribution, the results give more grounds for optimism. The RSS and Factiva data identified similar peaks, and in both corpora we could find and describe the same events. Even at the level of a single newspaper, articles and items are distributed in a similar way over time. Considering all this, we are encouraged to continue our research on RSS feeds as media sensors of international events.


Bandurski, D., 2012, “Chinese-language coverage of Wukan”, China Media Project, url: (retrieved on 8th June 2012)

Centre on Housing Rights and Evictions (COHRE), 2008, One World, Whose Dream? Housing Rights Violations and the Beijing Olympic Games, Geneva, Switzerland.

Dayan D. & Katz E., 1992, Media Events: The Live Broadcasting of History, Cambridge, Harvard University Press.

Douay N., Severo M. & Giraud T., 2012, “La carte du sang de l’immobilier chinois, un cas de cyber-activisme”, L’information géographique, Vol. 76, n. 1, pp. 74-88.

Douay N., 2011,“Urban planning and cyber-citizenry in China How the 2.0 opposition organises itself”, China Perspectives, n˚2011/1, Hong Kong, Centre d’études français sur la Chine contemporaine, pp. 77-79.

Earl, J., Martin, A., McCarthy, J. D., & Soule, S. A., 2004, “The Use of Newspaper Data in the Study of Collective Action”, Annual Review of Sociology, vol. 30, n. 1, pp. 65-80.

Galtung, J. & Ruge, H.M., 1965, “The structure of foreign news”, Journal of Peace Research, Vol. 2, n. 1, pp. 64-91.

Koopmans, R. & Vliegenthart, R., 2011, “Media Attention as the Outcome of a Diffusion Process: A Theoretical Framework and Cross-National Evidence on Earthquake Coverage”, European Sociological Review, Vol. 27, n. 5, pp. 636-653.

McCarthy, J., & McPhail, C., 1996, “Images of Protest : Dimensions of Selection Bias in Media Coverage of Washington, 1982 and 1991”, American sociological review, Vol. 61, n. 3, pp. 478-499.

McCombs, M.E. & Shaw, D.L., 1972, “The Agenda-Setting Function of Mass Media”, The Public Opinion Quarterly, Vol. 36, n. 2, pp. 176-187.

Rawnsley G.D., 2006, The media, internet and governance in China. Url : (retrieved on 8th June 2012)

Steinberger, R., Pouliquen, B. & Ignat, C., 2005, “NewsExplorer: multilingual news analysis with cross-lingual linking”, Proceedings of the 27th International Conference Information Technology Interfaces.

Tong J, 2009, “Press self-censorship in China: a case study in the transformation of discourse”, Discourse & Society, Vol. 20, n. 5,  pp.593-612.

Tong J & Sparks C, 2009, “Investigative journalism in China today”. Journalism Studies, Vol. 10, n. 3, pp. 337-352.

Wolton, D., 2003, L’autre mondialisation, Paris, Flammarion.

[Research notes] Journal of Digital Social Sciences

Not so many years ago, when online publication appeared in the scientific press, it was intended as nothing more than a side service. A scientific review was still a periodical collection of pages compiled, printed and distributed by an editor in the form of a booklet, which was bought, indexed and made available by academic libraries. The physical support was inseparable from its content, to the point that we still refer to a scientific article as a ‘paper’. However, from the day reviews started to be compiled with personal computers, editors realized that they could easily export an electronic copy of the articles they were collecting and post it on their websites. Often editors provided just the title and abstract of the articles online as a teaser for readers. Sometimes they uploaded the full text as an extra service offered to libraries already subscribed to the hardcopy version.

Soon however, librarians started to realize how much simpler and less expensive it was to manage digital files instead of paper booklets. Consequently, they informed scientific editors they would rather buy just the electronic version of their reviews. To meet this demand, some editors set up online archives for their contents. Others, the majority, decided to rely on portals that gathered and standardized contents coming from different scientific editors. In a few years, the majority of scientific press became available through a handful of specialized portals. By this time, scholars had discovered how nice it was to search their bibliographic references online. In a couple of decades the entire chain of scientific publication (from paper submission, to bibliography compiling) has gone online.

The way articles are written, distributed and exploited is radically changing and yet, strangely enough, articles themselves are not: they are still papers. Unaffected by digital revolutions, articles continue to be made of (plain) text and (few) images.

The fact that papers are the only unchanged link in the chain of scientific press is even more remarkable as research practices themselves have been deeply affected by the impact of digital technologies. Disciplines such as physics, chemistry and biology have been thoroughly renovated by electronic computing, and the same renewal is taking place, a few years later, in the human and social sciences (Lazer et al., 2009). Before computers, the only medium to enjoy the same irresistible success in the scientific community was the movable-type press. We know how deeply Gutenberg’s invention affected the birth of modern sciences (Eisenstein, 1979) and there is evidence that digital technologies may play the same role in the next few years. In the social (as well as in the natural) sciences, research practices are mutating under the influence of electronic tools. Not only are existing methodologies facilitated and enhanced by digital technologies, but new digital methods are emerging that were unthinkable just a few years ago (Rogers, 2009).

This amazing effervescence spreading through laboratories all over the world is hardly visible when one looks at the actual products of the social sciences. No matter how innovative the source or the analysis of the data may be, the results of social research are still delivered in the two-hundred-year-old form of a scientific paper. Even if both the chain of scientific production and the chain of scientific dissemination have been completely renovated, the very link between the two remains unchanged. Scientific papers still have the exact same form they used to have when research was made and published on paper. This inertia prevents the scientific press from conveying the full richness of computer-enhanced research and from taking advantage of the potential of online publication. Of course, scholars can publish rich multimedia accounts of the results of their research on their websites; of course, portals on the social sciences can be organized. Yet, when it comes to career evaluation (in particular as far as scientometric indicators are concerned), scientific papers are the only publications that really matter. Papers are the bottleneck of scientific research.

The last sentence of the previous paragraph needs to be qualified: there are excellent reasons why papers remain the bottleneck of scientific research. For one thing, the ‘organized skepticism’ of modern science (Merton, 1973) would be impossible without a standardized way to identify and cite the ideas submitted to the evaluation of the scientific community. To support or criticize the work of their colleagues, scientists need to be able to refer to them unequivocally. This, in fact, is the main function of the system of papers and journals.

Still, an increasing number of scientific practices are emerging that cannot comfortably be squeezed through the bottleneck of paper publication. Online communication, in particular, offers scientific publishing two unprecedented possibilities: multimedia and interactivity.

Multimedia is the capacity of digital technologies to draw together formats originally developed for separate media, such as text, images, video and sound. Multimedia is interesting for scientific publishing because it lets authors profit from new formats of publication (see, for example, how the Journal of Visualized Experiments is experimenting with the use of videos to present scientific protocols) without renouncing the advantages of the old ones. In a multimedia environment, the most diverse elements can find their place within the traditional textual structure of the scientific paper (in the same way that notes, images and tables are currently embedded in text).

Interactivity is the capacity of digital technologies to offer a non-linear exploration of a message. Though traditional formats have always allowed some interaction to their reader (through textual devices such as notes, references, citations and tables), the possibilities opened up by online media are unprecedented. Today, instead of just describing their research protocols, scholars can embed them (or part of them) in the publication.

While multimedia and interactivity may enhance scientific publication, it is crucial that the other features of traditional publishing are not lost in the process. As we said, there are good reasons why scientific papers have so far remained the only accepted format for scientific publications. Papers have four features that made them irreplaceable in modern science: they are citable (it is possible to identify them unequivocally); accessible (they can be accessed by anyone at a reasonable cost); durable (their maintenance is relatively easy) and stable (once published, they cannot be intentionally or unintentionally altered). Without these features, scientific publications would not be able to contribute to the dialogue of the scientific community.


From a technical point of view, there are no major obstacles preventing scientific publishing from exploiting the potential of digital technologies. Web technologies, in particular, have proved to be a perfectly suitable support for scientific communication. Developed in an academic environment and inspired by the practices of academic publishing (Berners-Lee, 1999), web protocols have a natural affinity with the publication principles discussed above, thereby reducing the cost of migrating online.

Citability. According to many observers, the most important building block in the development of the Web was the introduction of Uniform Resource Locators. URLs provide a system of unique addresses allowing any file available online to be reached by any computer connected to the Internet. Embedded in web pages as hyperlinks, URLs offer a convenient and unambiguous system of citation among any type of document.

The platform for digital scientific publishing we are proposing should draw significantly on the URL system to assure the citability of its publications. Each article published on the platform will therefore be assigned a permanent URL (also called a ‘permalink’) that will identify it unequivocally and enduringly. Elements within articles will be identified by anchors or sub-URLs (also permanent), allowing scholars to cite them directly (in the same way that it is possible to cite a specific page or paragraph within a traditional scientific paper).
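The permalink-plus-anchor scheme described above can be sketched with Python’s standard `urllib.parse` module. The domain `journal.example.org`, the `/articles/` path layout and the identifiers are hypothetical choices made for this illustration, not part of the proposed platform.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Hypothetical base address of the publishing platform.
BASE_URL = "https://journal.example.org"


def article_permalink(article_id):
    """Build the permanent URL of an article from its identifier."""
    return f"{BASE_URL}/articles/{quote(article_id, safe='')}"


def element_permalink(article_id, element_id):
    """Build a citable sub-URL: the article permalink plus a fragment (anchor)."""
    parts = urlsplit(article_permalink(article_id))
    return urlunsplit(parts._replace(fragment=quote(element_id, safe="")))


def split_citation(url):
    """Split a cited URL into the article permalink and the cited element."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(fragment="")), parts.fragment


figure_url = element_permalink("2012-001", "fig-2")
article_url, element = split_citation(figure_url)
```

Using the URL fragment for elements means that a citation of a figure or paragraph always resolves to the same article resource, so the article permalink remains the single unit that has to be kept stable.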

Accessibility. Making web documents accessible to a growing number of people in the world required and still requires considerable efforts. In the last few years, accessibility problems seem to be moving toward the ‘last mile’: the actual software that allows documents to be read online. This software, usually called a browser or reader, exists in several proprietary and open solutions, each one with multiple versions released at different times for different devices (computers, mobile phones, tablets…).

To be universally accessible, scientific publication should therefore be delivered in formats that can be properly read by as many browsers and readers as possible. This is why the digital scientific publishing platform we are proposing should rely on the Web standards as defined by the World Wide Web Consortium (W3C) and conform to the guidelines issued by the Web Accessibility Initiative (WAI).

Durability. Besides allowing the largest accessibility, recent web standards have another important advantage: they assure the forward and backward compatibility of the documents they encode.

Relying on web technologies, the scientific publishing platform we are proposing will be durable because the legibility of its content will be assured through time by the forward and backward compatibility of web standards. Though it is well known that digital technologies are affected by a very fast obsolescence, employing compatible standards guarantees that scientific publications will remain readable for a relatively long period of time without requiring conversion or other software maintenance.

Stability. The efforts deployed to assure the durability of web standards may be defeated by the facility with which it is possible to add, remove or modify any online file. The possibility to change online documents at an infinitesimal cost is surely one of the greatest advantages of web technologies, but it does create a problem for scientific publishing.

As far as stability is concerned, the type of publication we are imagining will resemble print more than e-publishing. Although no technical reasons prevent articles from evolving (and from being traced in such evolution), the platform we are proposing will not allow such changes. Once a digital article is reviewed and published, it will not be allowed to change, except as an entirely new submission.

Multimedia and interactivity. The advent of HTML 5 has marked the official acknowledgment of the role of multimedia and interactivity in online communication. By offering better support for handling data, videos, sounds, dynamic images and style sheets, and by making all these elements (and others) easily programmable through JavaScript, HTML 5 is greatly enhancing the multimedia and interactive potential of web standards (thereby bypassing non-standard languages such as Flash ActionScript).

For all these reasons, HTML 5 and related technologies (in particular JavaScript) seem particularly suitable for scientific publishing. Other technologies exist, of course, that are widely used in research and could be transmitted over the Internet (see for instance the work done by the HUBzero project to develop an environment capable of transferring online any experiment of computational science – McLennan & Kennell, 2010). We feel, however, that web standards offer the best cost-benefit compromise. With relatively little effort from publishers and readers, web technologies can open scientific literature to a vast range of new formats and combinations of formats.


Berners-Lee, T. (1999). Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web. New York: Harper and Collins.

Eisenstein, E. (1979). The Printing Press as an Agent of Change. Cambridge: Cambridge University Press.

Ince, D. C., Hatton, L., & Graham-Cumming, J. (2012). The case for open computer programs. Nature, 482(7386), 485–488.

Latour, B. (1995). The “Pédofil” of Boa Vista: a Photo-Philosophical Montage. Common Knowledge.

Latour, B., Jensen, P., Venturini, T., Grauwin, S., & Boullier, D. (2012). “The Whole is Always Smaller Than Its Parts” A Digital Test of Gabriel Tarde’s Monads. British Journal of Sociology, forthcoming.

Lazer, D., Pentland, A., Adamic, L., Aral, S., Barabasi, A.-L., Brewer, D., Christakis, N., et al. (2009). Computational social science. Science (New York, N.Y.), 323(5915), 721–3. doi:10.1126/science.1167742

McLennan, M., & Kennell, R. (2010). HUBzero: A Platform for Dissemination and Collaboration in Computational Science and Engineering. Computing in Science & Engineering, 12(2), 48–53. doi:10.1109/MCSE.2010.41

Merton, R. K. (1973). The Normative Structure of Science. In The Sociology of Science: Theoretical and Empirical Investigations. Chicago: University of Chicago Press.

Raymond, E. S. (2001). Cathedral and the Bazaar. Sebastopol, Ca.: O’Reilly Media.

Rogers, R. (2009). The End of the Virtual: Digital Methods. Amsterdam University Press.

Shapin, S., & Schaffer, S. (1985). Leviathan and the Air-Pump. Hobbes, Boyle and the Experimental Life. Princeton: Princeton University Press.

Zeldman, J. (2010). Designing with Web Standards (3rd edition). Berkeley: New Riders.

Willinsky, J. (2006). The Access Principle: The Case for Open Access to Research and Scholarship. Cambridge Mass: MIT Press.

[Research notes] “Digital Paleontology” – Digging for Ancient Tweets



In recent years, the mass availability of social network data, most notably through the twitter API, has opened at least two new research paths.

The ‘big data’ approach allows scientists to record huge datasets using APIs such as the twitter streaming API, and then process them using data mining and network analysis techniques that require correspondingly huge computational power (see for example the work of Jure Leskovec and the Stanford Network Analysis Project). To them, the twitter API appears to offer the best of both worlds: data on the scale of web analytics, yet as clean and qualified as the data that traditional techniques, such as interviews, would provide. This approach has already been able to verify sociological principles on a larger scale than ever before.

The ‘just-in-time’ approach, on the other hand, leverages the ubiquity of social network data in a different way: it intends to “analyze social phenomenon as they unfold” using the real-time data provided, again, by APIs such as the twitter streaming API. The just-in-time approach has notably proven its worth during recent social events that relied heavily on social networks such as the Occupy Wall Street movement. Such data would have taken months to collect and process manually.

But what happens when one is too late for real-time? What remains of the real-time data after real-time is over? A recent study suggests that the rate of decay of web resources meant for immediate consumption is extremely quick. And though it remains available, twitter data becomes difficult to search and access after a few days so it cannot be relied on exclusively.

As part of my PhD, I am trying to map the path of an erroneous quote that circulated on social networks following the death of Osama Bin Laden on May 2, 2011. The corpus contains web articles, pages, blog posts and tweets. These research notes will focus on the collection and initial processing of the twitter data.

We chose this example because of several interesting properties:
– The inception point of the cascade was known precisely, which is unusual, especially when it is located on Facebook.
– We wanted our example to spread across a wide range of sites and services. The different components of the cascade allow us to study various phenomena while still discussing the same example: tracking a quote, a link, a tag, a conversation.
– There were investigative accounts of the incident written by professional journalists. These articles can be used as secondary sources and reference points.

On May 2, 2011, US President Barack Obama appeared on television to announce the death of Osama bin Laden. The news had already leaked before the official announcement and crowds had begun to gather in symbolic places such as outside the White House or New York’s Times Square. After the announcement, many Americans simply went outside their homes to rejoice, often waving flags or chanting “U-S-A!”. Meanwhile, a sense of unease began to grow among other Americans. Jessica Dovey, an American working as a school teacher in Japan, posted the following message on her Facebook wall:

Jessica Dovey Facebook Post

At some undetermined point in time, an altered and misattributed version of the quote began to circulate on twitter, resulting in a large and multi-layered information cascade:

“I mourn the loss of thousands of precious lives, but I will not rejoice in the death of one, not even an enemy.”

Martin Luther King, Jr

In a matter of hours, reports that the quote had been wrongly attributed to Martin Luther King began to surface, which gave rise to a sub-cascade that ran parallel to the previous one.

This incident attracted a fair amount of media coverage, so it is already well documented. The purpose of our work is to determine what remains of it a year later, using only resources freely available on the web. As noted in the introduction, these research notes will focus on the twitter portion of the cascade.

In all experiments, we focused on the time period ranging from midnight on May 2, 2011 to midnight on May 8, 2011. All Python scripts used are available upon request.

twitter was designed first and foremost to be used in real-time. The limitations of the twitter API reflect this design choice. It only allows searching the last 6-9 days of tweets, which makes it optimal for knowing “what’s happening right now”. Searching for archived tweets is clearly not an intended use.

Several services offer partial twitter archives. Among those, Topsy was the only one to provide free access, to support complex queries (with time boundaries, crucially), and to offer an API. Topsy also has the interesting property of keeping a record of tweets and twitter users even after they were deleted from twitter.
However, Topsy has its own limitations. It records only a subset of the complete twitter output, and does not record the friends and followers of twitter users. Topsy also archives content beyond twitter, such as links and images.

The ‘Quote’ corpus
Our first task was to collect the various tweets containing the altered Martin Luther King quote. The ‘quote’ corpus was collected using otter, the Topsy API. We wanted to identify as many mentions of the fake Martin Luther King quote as possible, even those containing abbreviations or spelling errors. We queried otter for the most distinctive words, “mourn” and “enemy”, whose combination seems unusual enough within 140 characters to provide reasonably precise results.
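The keyword heuristic can also be re-applied locally to whatever the archive returns. The following is a minimal sketch, not the script used for these notes: the function name `matches_quote` is hypothetical, and treating the study window as UTC is an assumption, since the notes do not state a time zone.

```python
from datetime import datetime, timezone

# Study window from the text: midnight May 2, 2011 to midnight May 8, 2011.
# UTC is an assumption; the notes do not state a time zone.
WINDOW_START = datetime(2011, 5, 2, tzinfo=timezone.utc)
WINDOW_END = datetime(2011, 5, 8, tzinfo=timezone.utc)

# The two distinctive words used in the query.
KEYWORDS = ("mourn", "enemy")


def matches_quote(text, posted_at):
    """Return True if a tweet contains both keywords and falls inside the window."""
    lowered = text.lower()
    in_window = WINDOW_START <= posted_at < WINDOW_END
    return in_window and all(word in lowered for word in KEYWORDS)
```

Matching on lowercase substrings keeps variants with odd capitalisation, but a misspelling of either keyword itself would still be missed.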

A quick survey of the Topsy website shows the heuristic to be effective: the only results in the targeted time frame (the first week of May 2011) show a variant of the fake quote.

The API query brought back 2742 pieces of content archived by Topsy, including 2657 tweets and totaling 2615 authors. The various types of content archived break down as follows:

video : 1
tweet : 2657
link : 80
image : 4

The links contain mostly Facebook posts, blog posts expanding on the quote, and services such as TweetLonger. The video shows the famous “I have a dream” speech. The erroneous quote was used in the video description, presumably to draw more traffic.

We are only interested in the tweets so all other types of content are discarded.
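As an illustration of this filtering step, here is a small sketch that tallies archived items by type and keeps only the tweets. The record layout (a dict with a `type` key) and the function name `split_by_type` are assumptions made for the example, not the format returned by Topsy.

```python
from collections import Counter

# Counts reported in the text for the 'quote' query.
REPORTED_COUNTS = {"video": 1, "tweet": 2657, "link": 80, "image": 4}


def split_by_type(items, keep="tweet"):
    """Return (counts per content type, items of the kept type)."""
    counts = Counter(item["type"] for item in items)
    kept = [item for item in items if item["type"] == keep]
    return counts, kept


# Sanity check: the four categories add up to the 2742 items reported.
total_reported = sum(REPORTED_COUNTS.values())
```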

The curve starts relatively flat before showing an inflection: the quote went viral after being posted by several high-profile twitter users, notably Penn Jillette.

This observation is consistent with the findings of Cha et al.: on twitter, celebrities have disproportionately high influence on topics outside of their area of expertise.

The retweet (RT) functionality of twitter allows for the quick dissemination of information, creating so-called information cascades.
The Topsy API indicates whether a tweet is a RT, but since the introduction of the official RT feature in 2009, retweets do not keep track of the path of a tweet between its author and the user who retweeted it. When user ‘a’ retweets a tweet from user ‘b’, we can infer that a path exists between ‘a’ and ‘b’, but we know neither its length nor its topology.

To extract cascades from the ‘quote’ corpus, we grouped the retweets of the same tweet and sorted them by date. We identified 339 micro-cascades ranging from 2 to 97 nodes.
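The grouping step can be sketched as follows. This is an illustrative reconstruction rather than the original script: the field names `source_id` and `date` are hypothetical, and counting the source tweet as one node of the cascade is an assumption about how cascade size was measured.

```python
from collections import defaultdict


def build_cascades(retweets, min_size=2):
    """Group retweets by the tweet they repeat; sort each group by date.

    Cascade size is taken to be the number of retweets plus one for the
    source tweet (an assumption, see the lead-in).
    """
    groups = defaultdict(list)
    for retweet in retweets:
        groups[retweet["source_id"]].append(retweet)

    cascades = {}
    for source_id, members in groups.items():
        if len(members) + 1 >= min_size:
            cascades[source_id] = sorted(members, key=lambda rt: rt["date"])
    return cascades
```

Because the retweet path is unknown, each cascade is kept as a flat, time-ordered list rather than as a tree.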

The tweets contain few hashtags, which is not surprising. The quote is already 110 characters long, adding up to 137 with punctuation and attribution.

#mlk : 21
#obl : 10
#martin : 5
#a : 4
#fb : 3
#osama : 2
#things : 2
#quote : 2
#binladen : 2
#sillyusa : 1
#viral : 1
#fail : 1
#powerful : 1
#dangersofsocialmedia : 1
#yokosonews : 1
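A tally like the one above can be produced with a short script. The sketch below is one plausible way to do it; the regular expression is an assumption about what counts as a hashtag and may not match the tokenisation used for the figures above.

```python
import re
from collections import Counter

# A '#' followed by letters, digits or underscores (assumed definition).
HASHTAG_RE = re.compile(r"#(\w+)")


def count_hashtags(texts):
    """Count hashtags across tweet texts, ignoring case."""
    counts = Counter()
    for text in texts:
        counts.update(tag.lower() for tag in HASHTAG_RE.findall(text))
    return counts
```

Calling `count_hashtags(texts).most_common()` returns the tags sorted by decreasing frequency.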

On a side note, after it was revealed that the Martin Luther King quote was in fact a fake, the hashtag #fakeMLKquotes enjoyed a brief spike in popularity. This kind of meta-commentary by twitter users on a twitter-centric phenomenon concludes the cycle.

‘Fake’ Corpus
The first tweet identifying the Martin Luther King quote as a fake that we found was this one:

This tweet went essentially unnoticed: Topsy recorded only one RT, while twitter shows two. With 15000 followers, “political math” certainly has a wider than average audience, but he is nowhere near the almost two million followers of Penn Jillette. (There are conflicting reports on the average number of followers of a twitter account, but most figures have three digits; a recent and well documented blog post gives an average of 235 followers for active users.)
Moreover, the work of Hsu et al. on Occupy Wall Street has shown that dissenting voices have a hard time being heard on twitter. Neither providing a source to substantiate his claim, nor being a celebrity, “political math” had virtually no chance of being heard.

Nevertheless, the fact that the quote had been misattributed did spread on twitter, not from a celebrity account but from outside, through the other most influential force on twitter: the traditional media.

The ‘fake’ corpus was also collected from Topsy but in a less straightforward way. Unlike with the ‘quote’ corpus, we had no way of knowing in advance the content of the tweets we were looking for.

We used an iterative approach. First, a series of three two-word seed queries was used:

mlk+quote : 2903 results
fake+mlk : 1247 results
fake+quote : 278 results

After removing duplicates, we are left with 2921 tweets, which seems low given the amount of media coverage surrounding the controversy. Another issue is that our queries are rather vague and could potentially bring back noisy results. We need a way to boost both recall and precision.
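The three result sets add up to 4428 raw results (2903 + 1247 + 278), so deduplication removed 1507 overlapping entries. A minimal sketch of the merge is given below; it assumes each result carries a unique `id` field, which is a hypothetical name for whatever identifier the archive provides.

```python
def merge_results(*result_sets):
    """Merge query results, keeping the first copy of each tweet id."""
    merged = {}
    for results in result_sets:
        for tweet in results:
            merged.setdefault(tweet["id"], tweet)
    return list(merged.values())


# Arithmetic from the figures reported in the text.
raw_total = 2903 + 1247 + 278
duplicates_removed = raw_total - 2921
```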

‘Links’ Corpus
The backlash phase originated in websites outside of twitter, which resulted in many tweets including a URL pointing to a source. The importance of URLs in information contagion is well studied, for example by Galuba et al. in 2010. Recently, Myers et al. showed that URL mentions on twitter can be the result of both internal and external influences, and found politics and world news to be the most externally-driven topics.

To extract the URLs, we searched the tweets of the ‘fake’ corpus using regular expressions. twitter users make use of various URL-shortening services to accommodate the 140-character limit of tweets. We had to expand the collected URLs to obtain a list of the web resources designated by our short URLs.
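The extraction and expansion steps can be sketched as below. This is an illustration under stated assumptions, not the script used here: the regular expression is a deliberately simple pattern, the function names are hypothetical, and expansion is done by following HTTP redirects with the standard library, which requires network access and will fail for short links that have since expired.

```python
import re
import urllib.request

# Simple pattern: scheme, then everything up to whitespace or a quote/bracket.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def extract_urls(text):
    """Return URLs found in a tweet, with trailing punctuation stripped."""
    return [url.rstrip(".,;:!?)") for url in URL_RE.findall(text)]


def resolve_redirects(url, timeout=10):
    """Follow HTTP redirects and return the landing URL (needs network access)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.geturl()


def expand_all(urls, resolver=resolve_redirects):
    """Map each distinct short URL to its landing page; None marks a dead link."""
    landing = {}
    for url in set(urls):
        try:
            landing[url] = resolver(url)
        except OSError:  # urllib's URLError and HTTPError derive from OSError
            landing[url] = None
    return landing
```

Passing the resolver as a parameter makes it possible to test the pipeline offline or to substitute a cached lookup table.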

With our newly expanded link list, we query Topsy again, this time looking for any tweet that mentions one of the landing pages. This technique expands our results and helps correct for the vagueness of the initial queries, as a tweet pointing to one of the URLs has a higher chance of being relevant. We are left with 4721 tweets in the ‘links’ corpus.

While this technique enhances precision and recall, it should be noted that the very first tweet identifying the quote as a fake does not contain a URL and is thus absent from the ‘links’ corpus.

Short URLs are an interesting artifact to track information circulation. URL-shortening services create a unique short URL for each user submitting a regular URL. Sometimes the shortened URL can circulate from user to user, indicating an unacknowledged contagion.
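One way to surface such cases is to look for short URLs that were posted by more than one user. The sketch below assumes each tweet record has `user`, `date` and `short_urls` fields; these names are hypothetical and chosen only for the example.

```python
from collections import defaultdict


def shared_short_urls(tweets):
    """Return short URLs posted by several users, with users in posting order."""
    users_by_url = defaultdict(list)
    for tweet in sorted(tweets, key=lambda t: t["date"]):
        for url in tweet["short_urls"]:
            if tweet["user"] not in users_by_url[url]:
                users_by_url[url].append(tweet["user"])
    return {url: users for url, users in users_by_url.items() if len(users) > 1}
```

The first user in each list is the earliest poster of that short URL in the corpus, which is only a candidate for the person who created it.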

Identifying the underlying following / followers network proved to be the most daunting task of all. Topsy does not record this data so we had no choice but to query twitter directly. The limitations of the twitter API are even more stringent than those of otter, with only 350 queries per hour allowed.
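To give a sense of what this rate limit implies, the following back-of-the-envelope sketch estimates the minimum collection time. It assumes two calls per account (one for friends, one for followers) and no pagination, so the result is a lower bound; accounts with very large follower lists would need many more calls.

```python
import math


def hours_needed(accounts, calls_per_account=2, calls_per_hour=350):
    """Lower bound on collection time, in whole hours, under a fixed rate limit."""
    return math.ceil(accounts * calls_per_account / calls_per_hour)


# The 2615 authors of the 'quote' corpus: 5230 calls at 350 per hour.
quote_corpus_hours = hours_needed(2615)
```

Under these assumptions the ‘quote’ corpus alone needs at least 15 hours of continuous querying.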

Moreover, in about 18 months, this network has probably changed completely. Of the 2615 accounts in the ‘quote’ corpus, more than 15 percent were inaccessible at the time of our work:

Closed : 296
Hidden : 94
Suspended : 4
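The share quoted above follows directly from these counts: 394 of the 2615 accounts, or about 15.1 percent. The short check below only restates that arithmetic; the variable and function names are illustrative.

```python
# Account statuses reported in the text.
INACCESSIBLE = {"closed": 296, "hidden": 94, "suspended": 4}
TOTAL_ACCOUNTS = 2615


def inaccessible_share(counts=INACCESSIBLE, total=TOTAL_ACCOUNTS):
    """Fraction of accounts that could not be queried."""
    return sum(counts.values()) / total
```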

Even among those whose accounts remained public and active, the friends and followers have inevitably changed during the months that elapsed between the events and the time of this experiment.

Future work

In the conclusion of a recent ‘big data’ article on information propagation on twitter, Romero et al. call for “more fine-grained analyses as well, understanding how patterns of variation at the level of individuals contribute to the overall effects that we observe.” This is the kind of contribution we hope to offer.

There are still two aspects left unexplored in the twitter data we collected.
The first one is the textual content of the tweets. Even in the ‘quote’ corpus, where all tweets are supposed to contain the exact same sentence, small variations exist, e.g. in the quotation marks used or in the way ‘Martin Luther King’ is written.
The second one we would like to investigate is the @replies, as they could provide valuable data on relationships between users. The issue is that there is no straightforward way to access all the replies to a given tweet using the twitter API, and we could not find another service offering this data. One could probably build a scraping script to access the data displayed by the twitter web client, but that would violate the twitter TOS. Any change to the twitter web client would render such a script useless.
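For the first of these aspects, a possible starting point is to normalise the surface variations before comparing tweets. The sketch below is exploratory and not part of the work reported here: the list of quotation characters and the pattern for the attribution are assumptions and would need to be extended after inspecting the corpus.

```python
import re
import unicodedata

# Straight and typographic quotation marks to strip (assumed list).
QUOTE_CHARS = "\"'\u2018\u2019\u201c\u201d\u201e\u00ab\u00bb"
STRIP_QUOTES = {ord(char): None for char in QUOTE_CHARS}

# 'Martin Luther King' or 'MLK', optionally followed by 'Jr' or ', Jr.'
MLK_RE = re.compile(r"\b(?:martin luther king|mlk)(?:,?\s*jr\b\.?)?", re.IGNORECASE)


def normalise(text):
    """Reduce a tweet to a canonical form for comparing quote variants."""
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(STRIP_QUOTES)
    text = re.sub(r"\s+", " ", text).strip()
    text = MLK_RE.sub("MLK", text)
    return text.lower()
```

Two tweets whose normalised forms are equal differ only in quoting, spacing, case or the spelling of the attribution; the remaining differences are the ones worth studying.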

Evaluation of our results

The remanence and decay rate of real-time data are difficult to evaluate. The data we collected is not a proper sample, since our gathering process is obviously influenced by the services we used, mainly Topsy. In other words, an evaluation of the results we presented amounts to an evaluation of Topsy’s coverage.

Topsy only records a subset of the 400 million tweets published each day, namely tweets that were retweeted or contain an external URL, but there is no way to know exactly how much is discarded.
It is worth noting that the new terms of service of the twitter API, which have been generally regarded as unfavorable by developers since they were announced last summer, are actually welcomed by Topsy. Topsy’s new product, Pro Analytics, now claims to provide “exact counts for any term, any date range, instantly.” When describing its social analytics technology, Topsy states that it provides exact values: “No estimates, no sub-samples. Comprehensive.” This stands in contrast to Otter, the free Topsy API we used for this paper: Otter’s histogram query, for example, is only “accurate up to 3 months for most terms and up to 1 month for very popular terms (like iPhone and Justin Bieber)”.

In a broader perspective, empirical observations suggest that a large part of the cascade we are studying happened outside of the reach of twitter. As Myers et al. recently noted, cascades on twitter appear to jump from place to place, suggesting a larger interaction network of which twitter is only a part. The Martin Luther King quote itself was born on Facebook to which we only have a very partial and indirect access. The second cascade debunking the fake quote is fueled by articles published by several blogs and traditional news organizations. Other tools and techniques will be needed to analyze these other components in order to complete the map of the global cascade.