From data quality to language quality: the challenges of the fitness for use principle

10 June 2022



In data science, as in journalism, it is commonly accepted that poor quality data lead to poor quality information. Moreover, the quality of the language of automated content is strongly related to the field of application: one cannot write about atmospheric pollutants in the same way as about stock markets. These two aspects are strongly linked and should not be considered separately. Although automated news productions can be delivered as they are to audiences, they are also used to provide automatic drafts that journalists will enrich with background information or expertise. This second option is explored here through the lens of the users’ needs, considering that if the drafts do not correspond to the requirements or expectations of the journalists, they will not use them. I am referring here to the ISO 9000 standard related to quality management, according to which quality consists of the ability of a product or service to fit the implicit or explicit needs of its users (Boydens and van Hooland, 2011).


In data science, the concept of data quality can be approached through several dimensions that characterise the formal aspects of the data. On this topic, the principles of consistency, uniqueness, timeliness, accuracy, validity, and completeness are the ones most often quoted by both scholars and practitioners. However, it is not sufficient to think about data quality in terms of a relation between the characteristics of the triple of entity, attribute, and value. Nor is it relevant any longer to approach data quality from a deterministic point of view according to which a value is true because it is stored in a database, although this is commonly admitted by journalists who play with data (Anderson, 2018). The definition of data quality is much more complex to tackle, because empirical data, which come from observations of the real world, only reflect a specific state at a moment T. Data values are likely to evolve, as are their application domains. The users’ needs might also evolve, making it complex to define data quality (see the research works of Isabelle Boydens; Cappiello et al., 2004).


In computational linguistics, the concept of quality is also multidimensional. It refers to the lexical, syntactic, and semantic qualities of a text related to a given application domain, with its own concepts that encompass particular words and reference expressions. In natural language processing, several metrics are commonly used to assess the quality of a text. Still, not all of these metrics fit natural language generation (NLG). Metrics such as BLEU, ROUGE, METEOR or NIST, whose computations rely mostly on n-grams, were developed to evaluate machine translation or text summaries. In NLG, text variability, understood as the system’s capacity to provide different texts based on the same data or topics, may also be considered. Research on the linguistic qualities of automated content has focused less on the intrinsic characteristics of the texts than on the readers’ perception, to find out whether they find the texts pleasant to read, well written or objective. The question of whether human judges can tell a text written by a human from a text written by a machine was often placed at the centre of these studies, which also focused on the texts’ credibility (Clerwall, 2014; Kraemer & Van Der Kaa, 2014; Graefe et al., 2015; Haim & Graefe, 2017; Melin et al., 2018; Wölker and Powell, 2018).


The very notion of quality is more difficult to define in journalism, where it refers less to the data than to how they are used or perceived by the readers. According to Clerwall (2014), the quality of journalistic information is a multifaceted concept that can refer « to a general, but a somewhat vague notion of ‘this was a good story’ ». It can also depend on the socio-cultural context of the recipient. For instance, a literature teacher may appreciate the structure of a journalistic piece while overlooking its use of poor sources. For Sundar (1998), the quality of journalistic information can be understood as the degree or level of excellence in the information. In the context of news automation, Diakopoulos (2019: 24) approached quality through the dimensions of accuracy, comprehensibility, timeliness, reliability, and validity. These dimensions are strongly connected to the data quality dimensions, showing common concerns between data science and journalism.


Together, these three research fields fed several assessment methods developed in research that explored the human-machine relationship when automated news production is approached as a tool that aims to free journalists from repetitive and time-consuming tasks or to provide them with an additional source of information. The scope of this research was therefore the development of new professional practices. It consists of two case studies conducted in French-speaking Belgium, where automated content was approached as preliminary drafts for journalists. The first project aimed to provide automatic monitoring of air quality in Brussels to feed into a broader investigation of the causes and consequences of air pollutants. The second sought to automate stock market data to support time-consuming daily routines, allowing journalists to provide more insights into the broader socio-political context on which stock markets depend.


Although these two case studies took place in different socio-professional contexts, they were considered through their complementarity. Indeed, the quality issues were not the same from one experience to another. They depended as much on the data source as on the ability of the system to meet the users’ requirements. The assessment methods were designed to address quality challenges in order to prevent or correct problems. The main differences between the two observed newsrooms were not related to the number of journalists actively involved in the experiments. They lay in the size of the newsrooms, from small to large, in the journalists’ habits of dealing with data, in the size of the audiences and, of course, in the editorial approach favoured in each newsroom. The first experience covered a period of one year and was not funded, while the second covered two years and was funded throughout. In both cases, the overall research approach was framed by an action-research strategy in which the researcher was actively involved, which implies a tension between commitment to and distance from the research objects (Dierickx, 2020).


In the first case study, the identification of the users’ needs included participating in newsroom meetings, interviews, and email exchanges. As the journalists had only a vague idea of what they expected, the analysis of their needs considered both implicit and explicit requirements. However, it was agreed that the information system should provide real-time information about the air quality situation in Brussels, with texts, charts and maps, and a summary of the observations throughout the experience. This kind of exercise requires the journalists to project themselves into their end uses, which was particularly difficult because they were not used to working with data. They expected a reliable and accurate system to help them set angles in a broader investigation of air quality. In addition, data do not explain the causes and consequences of air pollution, which led the journalists to consider a functional approach to the tool: discovering in which areas of Brussels the air quality was the most problematic, whether this concerned precarious populations, and to what extent European and WHO norms were respected. The one-year observation period was chosen because Brussels must submit an annual report to the European authorities and had already been given formal notice in previous years for having exceeded the acceptable pollution thresholds.


For all these reasons, the focus was placed on quality during the data retrieval process. Considering that the data are publicly available on web pages and that this open format presents potential data quality issues, the first step consisted of monitoring these web pages to observe the evolution of the values before the system was developed. As the data are connected to an application domain that requires a high level of scientific expertise, interviews with the data producers were organised to understand the data lifecycle. The assessment method that I developed aimed to identify potential errors in order to correct them during the data aggregation. It consists of a framework that considers both the technical and the journalistic challenges of automating (Dierickx, 2017).

This conceptual framework relies on an extensive literature review of research works focusing on data quality. It is also grounded in journalism studies, as automation is considered within a journalistic context. This framework examines how the data meet quality indicators through six axes that correspond to the most emphasised data quality dimensions. The limitations of this conceptual model are mostly related to the fact that formal anomalies in a dataset can be subject to interpretation. For example, the NULL value can be interpreted in diverse ways: the information exists but is not known, the information is not relevant for the entity, the information is relevant but does not exist for the entity, or the attribute value is equal to zero (Hainaut, 2012). In addition, poor quality data can coexist with correct data without generating errors (Wang & Strong, 1996).


The technical and journalistic challenges can be addressed through a set of questions related to the source, the accessibility, the documentation, the characteristics that allow automation and the journalistic relevance for automation. This data quality assessment model was consistently applied during this experience to manage and prevent errors. It led to regular adjustments in the database to answer the journalistic requirements of accuracy, immediacy, and verifiability. These human tasks, which were invisible to the journalists, proved time-consuming because of the constant monitoring of the data. However, this « learning by doing » showed that automation is not only a matter of process. It requires human control of the raw material scraped from a moving format, such as a web page. Even so, this control was insufficient to prevent all the errors observed during the experience. Good knowledge of the application domain and the data life cycle also improved the processes at work.


In this experiment, data quality appeared as the cornerstone of the whole project. In practice, the system generated texts, charts, and maps on a web platform. It also published content on Twitter and triggered the sending of a newsletter when an air pollutant exceeded one of the thresholds set by the World Health Organization, which are stricter than European standards. The data structured all this content. The real-time news displayed on the website relied on pre-written strings, a list of synonyms and reference expressions that were defined according to a corpus of reports about air quality. Different rules were applied depending on the values of the data, their possible absence, and their variation from one day to another. This facilitated content control and, in theory, prevented potential errors. In theory, 1,024 different syntagms were possible, but the journalists and the readers did not see many differences due to the repetitive structure of the text. In particular, the journalists found the text explicit and objective, but neither well written nor pleasant to read. This may also be due to the nature of an air pollution report.
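
As a rough illustration of how pre-written strings, synonym lists and rules on data values can combine, here is a minimal sketch in Python. It is not the code of the project: the synonym lists, the threshold value, the wording and the function names are all hypothetical. It only shows that the number of possible variants is the product of the options available in each slot, which is one way (not necessarily the project's) to reach a figure such as 1,024, for instance with five slots of four options each.

```python
import random
from math import prod
from typing import Dict, List, Optional

# Hypothetical synonym lists; the project built its own from a corpus of reports.
SYNONYMS: Dict[str, List[str]] = {
    "verb_up": ["rose", "increased", "climbed", "went up"],
    "verb_down": ["fell", "decreased", "dropped", "went down"],
    "place": ["in Brussels", "in the Brussels region"],
}


def describe(pollutant: str, today: Optional[float], yesterday: Optional[float],
             threshold: float, rng: random.Random) -> str:
    """Build one sentence from rules on the value, its absence and its variation."""
    place = rng.choice(SYNONYMS["place"])
    if today is None:
        # Rule for absent data: say so instead of inventing a value.
        return f"No {pollutant} measurement is available {place} today."
    if yesterday is None or today == yesterday:
        trend = "stood at"
    elif today > yesterday:
        trend = rng.choice(SYNONYMS["verb_up"]) + " to"
    else:
        trend = rng.choice(SYNONYMS["verb_down"]) + " to"
    status = "above" if today > threshold else "within"
    return (f"The {pollutant} concentration {trend} {today:g} µg/m³ {place}, "
            f"{status} the threshold of {threshold:g} µg/m³.")


def variant_count(slots: Dict[str, List[str]]) -> int:
    """Number of combinations if every slot is filled independently."""
    return prod(len(options) for options in slots.values())


if __name__ == "__main__":
    rng = random.Random(0)
    # 25 is a placeholder threshold, not an official WHO or European value.
    print(describe("NO2", 30.0, 20.0, 25.0, rng))
    print(variant_count(SYNONYMS))
```

Because every sentence follows the same skeleton, a large number of variants does not necessarily read as variety, which is consistent with the journalists' remark about the repetitive structure of the texts.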


Regular monitoring of the data was also organised during the whole experience. It was partly automated to ensure the accuracy and precision of the system, for several reasons. The data were not always available as expected and, when they were available, they could present abnormal values, such as negative ones. In addition, the data published on the data provider’s website were likely to be corrected or adjusted several days later, as the measurement stations of the air pollutants could transmit abnormal values to be refined by the experts. As the data were stored in a relational database as part of the project, manual corrections in the database were required to ensure the accuracy and precision of the weekly, monthly, and annual reports relying on these data.
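
The kind of partly automated monitoring described above can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the `Reading` structure, the plausibility bound and the function names are hypothetical. The sketch flags missing, negative or implausibly high values, and lists the stored values that the provider has since revised.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Reading:
    station: str
    pollutant: str
    value: Optional[float]  # None when the web page published no value


def check(reading: Reading, upper_bound: float = 1000.0) -> List[str]:
    """Return the list of issues found for one reading (empty if none).

    upper_bound is an arbitrary plausibility limit chosen for this sketch.
    """
    issues: List[str] = []
    if reading.value is None:
        issues.append("missing")
    elif reading.value < 0:
        issues.append("negative")
    elif reading.value > upper_bound:
        issues.append("implausibly high")
    return issues


def changed_since(stored: Dict[str, float], fresh: Dict[str, float]) -> List[str]:
    """Keys whose freshly published value differs from the stored one.

    These are candidates for a late correction by the data provider.
    """
    return sorted(key for key in fresh if key in stored and stored[key] != fresh[key])


if __name__ == "__main__":
    print(check(Reading("station-A", "NO2", -3.0)))
    print(changed_since({"2019-01-01": 21.0}, {"2019-01-01": 19.5}))
```

Flagging is only the first half of the work: deciding whether a flagged value should be corrected, dropped or kept still required human judgment and knowledge of the data life cycle.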


In addition, unpredictable phenomena occurred during the experience, such as a breakdown of the air pollutant measuring stations, which made it impossible to transmit data since there was none to transmit. There was also a failure of the web server from which the data were retrieved, preventing the transmission of information in real time. To summarise, no data equals no content. Moreover, no accurate and precise data equals no accurate and precise content. The human work behind the system, which was necessary to solve these problems, always remained invisible to the journalists, who believed that automation is something that is acquired once and for all. They also believed that computers do not make mistakes.


Did the journalists use the news automation system? Partially. Several reasons explain their non-use: a lack of interest in the air quality topic, a lack of interest in data-driven journalism, and a fear of a technology that would supplant human journalists. The contents were never used as they were delivered. They remained first drafts that helped define journalistic angles or write papers presenting the results of the experience. These papers also emphasised that the collection and processing of the data were not done by humans but by a « robot ». Paradoxically, these public discourses were relatively positive, showing a newsroom open to innovation and curious about the development of new technological tools. In the sociology of use, this can be considered symbolic adoption rather than appropriation.


In the second case study, the identification of the users’ needs included participating in newsroom meetings, interviews, email exchanges and calls. The stakeholders involved in the project were not all related to the world of journalism, as they included the IT service of the newspaper, in charge of the implementation of the generated contents in the content management system used by the journalists. They also included linguists and computational scientists from a company hired to develop the news automation system. The objective of the project was to deliver first drafts in the context of the live coverage of stock markets. The journalist would decide to publish these contents as they were or to enrich them with contextual information. Reporting on the stock markets is a particularly time-consuming activity, requiring the journalist to juggle between computer screens and spreadsheets to provide information quickly. Therefore, the users’ needs were mainly explicit, defining the stock markets to include in the reports and the type of content they needed, either texts or tables and charts. The journalists also expected a reliable and accurate system to help them speed up the process and free up time for contacting experts or providing more analysis.


Journalists had a clear idea of what they wanted to automate, based on their experience of covering stock markets in real time. They defined the markets to cover, the moments of the trading day to cover (from the opening to the closing bell), and additional information related to over-performances, under-performances, and the evolution of several indices, such as the Bel20 and the DAX. Their idea of what to automate was so clear that they provided the templates to the company hired to develop the system, considering that control over the process should be secured upstream. They worked on about thirty templates, providing several models of text variations and a list of expressions in the context of an application domain with a particular jargon. According to Dale and Reiter (1997), a template-based approach is less about syntactic realisation and process determination: it finds its meaning when the required variability of the generated texts is limited, which was the case here. For the journalists, it implied deconstructing their way of working and writing in order to standardise it. An emotional dimension also appeared when a journalist said that the news automation system would contain a part of him.


Here, the challenge was less related to data quality, as the data that fed the system came from a provider with which the media company had a paid contract. The only issue found concerned the markets included in the agreement with the data provider, which led to a re-examination of which stock markets were covered. It must be said that such a contract costs several thousand euros per year, and it was not possible to change its terms for budgetary reasons. The challenge was strongly connected to the company developing the system, which had to meet the journalistic requirements. This requires going beyond technical knowledge of the language, insofar as the application domain has many specificities. However, the people involved in this company did not share the expertise of the journalists, and the company failed to meet the journalists’ expectations.


Several assessment methods were set up to objectify the correlation between the users’ requirements and the automated content. One hundred and thirty-five pairs of source texts, written by the journalists, and target texts, written by the information system, were compared. The metrics used included the Flesch-Kincaid score (the Flesch reading-ease score), which rates readability on a scale from 0 to 100: the lower the score, the less readable the text. The Coleman-Liau Index also provides a score related to the level of intelligibility of a text, computed from the number of letters and sentences per hundred words. The Flesch-Kincaid Grade relates to the American education system: it estimates the number of years of education a person needs to understand the text. A score of 8 generally means that the text is readable for a general audience. A score of 10-12 corresponds to a higher level of education.

The Automated Readability Index (ARI) is a formula to assess the level of education necessary to understand a text. As an indication, a level of 8 corresponds to the reading ability of a 14-year-old individual and a level of 12 to that of a 17-year-old individual. These longitudinal assessments showed few differences between the scores of source and target texts. Comparing the overall average of the results obtained by the texts written by journalists with those generated automatically, we generally see better performance for the latter, although the differences are not very marked regarding readability (Flesch-Kincaid score) and level of education (ARI). The target texts would be a little more intelligible (Coleman-Liau Index), while the source texts would target a slightly more educated audience (Flesch-Kincaid Grade).
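
For readers unfamiliar with these indices, the sketch below shows the standard published formulas in Python. Several assumptions apply: the constants are the usual English-language ones, and whether the tool used in this research applied them unchanged to French texts is not stated here; the tokenisation is deliberately naive; and the syllable count needed by the two Flesch formulas is passed in as a parameter rather than computed, because syllable counting is language-dependent. The function names are hypothetical.

```python
import re
from typing import Tuple


def text_counts(text: str) -> Tuple[int, int, int]:
    """Return (sentences, words, characters) using deliberately simple rules.

    Caveat: a decimal such as "3.5" is wrongly split as a sentence boundary,
    so real stock market texts would need a proper tokeniser.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    characters = sum(ch.isalnum() for word in words for ch in word)
    return len(sentences), len(words), characters


def ari(sentences: int, words: int, characters: int) -> float:
    """Automated Readability Index (characters per word, words per sentence)."""
    return 4.71 * characters / words + 0.5 * words / sentences - 21.43


def coleman_liau(sentences: int, words: int, letters: int) -> float:
    """Coleman-Liau Index (letters and sentences per 100 words)."""
    letters_per_100 = 100 * letters / words
    sentences_per_100 = 100 * sentences / words
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8


def flesch_reading_ease(sentences: int, words: int, syllables: int) -> float:
    """Flesch reading-ease score: higher means easier to read."""
    return 206.835 - 1.015 * words / sentences - 84.6 * syllables / words


def flesch_kincaid_grade(sentences: int, words: int, syllables: int) -> float:
    """Flesch-Kincaid Grade: approximate US school grade level."""
    return 0.39 * words / sentences + 11.8 * syllables / words - 15.59


if __name__ == "__main__":
    n_sentences, n_words, n_chars = text_counts("The cat sat on the mat. The dog ran.")
    print(ari(n_sentences, n_words, n_chars))
    print(coleman_liau(n_sentences, n_words, n_chars))
```

All four formulas depend only on surface counts (characters, syllables, words, sentences). Two texts with similar sentence and word lengths will therefore score alike whatever their lexical or semantic adequacy, which helps explain why source and target texts obtained close scores.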


Two other metrics were also used to assess the correlation between the users’ requirements and automated content: the Levenshtein distance, or edit distance, which counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another; and the similarity rate, which aims to establish the percentage of similar characters between a source text and a target text. Here also, there was not much difference between the texts. All these results show the limited relevance of using automated metrics to assess the ability of a news automation system to meet the users’ requirements. Indeed, despite relatively good scores, the system was not used by the journalists.
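
Both measures can be sketched in a few lines of Python. The Levenshtein function below is the classic dynamic-programming formulation. How the similarity rate was computed in this research is not specified, so the second function is an assumption: it uses the ratio of the standard library's `difflib.SequenceMatcher`, expressed as a percentage, as a stand-in. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def levenshtein(source: str, target: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn source into target."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def similarity_rate(source: str, target: str) -> float:
    """Percentage of matching characters (stand-in based on difflib)."""
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    return 100 * matcher.ratio()


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(similarity_rate("The Bel20 closed higher.", "The Bel20 closed lower."))
```

Both measures work at the character level, so a target text can be very close to its source while still containing the one wrong figure or awkward term that a journalist would reject.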

The online tool used to compute these different metrics is still available: https://ohmybox.info/linguistics/


The limited value of automatic metrics became apparent when it came to analysing the results of human evaluations. Seven journalists participated in this process, assessing two different versions of the generated content, in March and December 2019. All the journalists said that the information system did not write like them, pointing to too many errors and to texts that were too standardised. The examination of the seven quality indicators shows that the average values were worse in the second evaluation. According to the journalists, the automated productions made less sense and were less readable, less reliable, and less usable.


Three years after the first human evaluations, is the news automation system used in the newsroom? The project manager did not give up but had to face some delay, which was less related to the quality of the news automation system than to other priorities within the newsroom. The Covid-19 pandemic also contributed to shaking up the agenda. In January last year, the project was still in a test phase and mainly generated tables on stock market performances. However, Google News now displays a little more content, published more regularly.


To conclude, the issues related to the data quality and the language quality of automated news content are essential to consider. Still, they are only a part of the problem when it comes to thinking about news automation systems as a tool to support journalistic practices. Meeting the users’ requirements is not sufficient on its own, as uses and non-uses participate in a complex dynamic where the social and the technical interplay, and where the representations of the system also have an important role to play. However, a research approach based on assessment methods makes it possible to detect and prevent errors from a data science point of view and to put human judgments into perspective from a computational linguistics point of view. In addition, data quality remains an important issue to consider, insofar as poor data quality will continue to lead to poor information quality. Expertise in the application domain also remains important, as it conditions the quality of the language on both the lexical and the semantic sides. It also illustrates the need for interdisciplinarity in this research field, where data science, computational linguistics, and journalism studies intertwine.

While rule-based systems have been the most often used for news automation, the development of machine learning applications and stochastic models, such as BERT and GPT-3, should not obscure the fact that data quality remains challenging: these systems require large amounts of data, and the quality of data retrieved from the web or from knowledge bases relying on user-generated content, such as Wikipedia, should not be taken for granted. In addition, systems based on machine learning are also challenging insofar as data quality concerns the whole process, including the data collection, the training datasets, and the test datasets. In this perspective, data quality appears even more challenging to tackle.

See also: The challenges of data quality in machine learning



