Blog 11: Statistical description of the corpus #RStats

January 16, 2022


Research Notebook

'Mixology' is an open research project that aims to extract opinions in times of crisis, here from a corpus collected via the Twitter API between December 12 and 31, 2021.

The first part of the corpus of tweets in English has 311,882 entries after cleaning. In a Word document, that’s over 11,000 pages.

A statistical approach to a corpus presupposes describing it well. Although the relevant functions are scattered across several R packages, the aim was to combine them in a single function and add text readability metrics. This function extends a tool developed for my dissertation in Information and Communication Sciences and Technologies, dedicated to the possibilities and limits of automatic text generation in a journalistic context. The tool was also used in my thesis, which analysed the conditions of human-machine association in the development of automation within newsrooms.

Quality indicators in computational linguistics: https://ohmybox.info/linguistics/ (this tool can be used for small corpora)

Four readability metrics are used in the mixology_statistics function (the idea is to develop, later, a package with the main original functions designed within the framework of this research).

The Flesch Reading Ease score (the original Flesch formula, long used for literary texts) measures the readability of a text. It usually ranges from 0 to 100, with higher scores indicating easier reading.

The Coleman-Liau Index is a readability test designed by Meri Coleman and T. L. Liau to assess the intelligibility of a text. Like a grade level, the higher the score, the less readable the text.

In the American education system, the Flesch-Kincaid Grade Level is equivalent to the number of years of education a person needs to understand the text. A score of 8 generally means that the text is readable for the general public; a score of 10-12 corresponds to a higher level of education.

The Automated Readability Index (ARI) is another formula designed to assess the level of education needed to understand a text. A score of 8 corresponds to the reading ability of a 14-year-old teenager, and a score of 12 to that of a 17-year-old.
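For reference, the published formulas behind these four metrics can be sketched in a few lines of base R. This is an illustration, not the internals of mixology_statistics (which are not shown in this post); in particular, the syllable counter below is a crude vowel-group heuristic, whereas production readability tools use dictionaries or better heuristics.

```r
# Crude syllable estimate: count groups of consecutive vowels per word.
count_syllables <- function(words) {
  sapply(gregexpr("[aeiouy]+", tolower(words)),
         function(m) max(1, sum(m > 0)))
}

readability <- function(text, n_sentences) {
  words <- unlist(strsplit(text, "\\s+"))
  words <- words[nchar(words) > 0]
  n_words     <- length(words)
  n_letters   <- sum(nchar(gsub("[^A-Za-z]", "", words)))
  n_chars     <- sum(nchar(words))
  n_syllables <- sum(count_syllables(words))

  list(
    # Flesch Reading Ease: 0-100, higher = easier
    flesch_reading_ease = 206.835 - 1.015 * (n_words / n_sentences) -
                          84.6 * (n_syllables / n_words),
    # Flesch-Kincaid Grade Level: US school grade
    fk_grade = 0.39 * (n_words / n_sentences) +
               11.8 * (n_syllables / n_words) - 15.59,
    # Coleman-Liau Index: letters and sentences per 100 words
    coleman_liau = 0.0588 * (n_letters / n_words * 100) -
                   0.296 * (n_sentences / n_words * 100) - 15.8,
    # Automated Readability Index: characters per word, words per sentence
    ari = 4.71 * (n_chars / n_words) +
          0.5 * (n_words / n_sentences) - 21.43
  )
}
```

On a trivial input such as readability("The cat sat on the mat.", 1), the Reading Ease comes out very high and the three grade-level indices very low, as expected for such a simple sentence.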

Note that the R function mixology_statistics treats the number of lines as equivalent to the number of sentences, given the short tweet format. This assumption was confirmed with the quanteda package's nsentence function, which, on this corpus, does nothing more than count the rows (probably because the corpus only contains short tweets).
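The one-row-one-sentence assumption can be illustrated without quanteda: on short, mostly punctuation-free tweets, a naive sentence tokenizer returns the same count as simply counting rows. A minimal base-R sketch (the toy tweets here are invented for the example):

```r
# Toy corpus: one tweet per row, as in the cleaned data set.
tweets <- c("Boosters are now available everywhere",
            "New restrictions announced today",
            "Why is nobody talking about this")

# Assumption used by mixology_statistics: one row = one sentence.
n_sentences_rows <- length(tweets)

# Naive tokenizer: split on sentence-ending punctuation, count chunks.
n_sentences_naive <- sum(sapply(
  strsplit(tweets, "[.!?]+"),
  function(x) max(1, length(trimws(x)[nchar(trimws(x)) > 0]))
))

n_sentences_rows == n_sentences_naive  # both give 3 for these short tweets
```

The two counts would diverge on longer, multi-sentence texts, which is why this shortcut is only defensible for a tweet corpus.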

The statistical analysis of this first part of the corpus in English emphasizes the short format of the tweets (in practice, they are limited to 280 characters). This format does not seem to help readability: the Flesch Reading Ease is relatively low, meaning the texts are not very readable, but this is not fully corroborated by the Coleman-Liau Index, which does not yield such a poor result (although it is not particularly high); the Flesch-Kincaid Grade Level corresponds to a higher education level (a clue about the profile of Twitter users); and the ARI confirms that the authors of the tweets are generally adults.

See this page for the formulas used to perform these operations: https://ohmybox.info/linguistics/formulae.php



Flesch, R. (1948). A new readability yardstick. Journal of Applied Psychology, 32(3), 221.

Kincaid, J. P., Fishburne Jr., R. P., Rogers, R. L., & Chissom, B. S. (1975). Derivation of new readability formulas (Automated Readability Index, Fog Count and Flesch Reading Ease formula) for Navy enlisted personnel. Naval Technical Training Command Millington TN Research Branch.

Coleman, M., & Liau, T. L. (1975). A computer readability formula designed for machine scoring. Journal of Applied Psychology, 60(2), 283.

Kincaid, J. P., & Delionbach, L. J. (1973). Validation of the Automated Readability Index: A follow-up. Human Factors, 15(1), 17-20.

Senter, R. J., & Smith, E. A. (1967). Automated readability index. Cincinnati University, Ohio.


# # #
