In machine learning, data quality directly impacts system performance and robustness. While it’s understood that without quality data, no machine learning model can produce reliable results, large language models (LLMs) introduce additional levels of complexity that make controlling data quality virtually impossible.
The issue of data quality in machine learning is all the more complex because it concerns not only the data that feed the model but also the model’s representation, the metrics used to evaluate its accuracy, and the methods for finding the best model. Quality, therefore, is not confined to a single level but permeates the entire lifecycle of the system. Data quality assessment and validation thus occur both before and after these stages.
Machine learning relies on models trained on empirical data, allowing the production of results based on complex mathematical relationships that are difficult to derive through deductive reasoning or simple statistical analysis. The input data provides a factual basis for the system’s "reasoning," in a process that involves translating it into a broader representation, or abstraction. This abstraction leads to what is called model generalization, that is, the ability to use this representation to produce actionable results in new situations.
The three fundamental components of machine learning algorithms are the model representation (e.g., features, parameters, or representation spaces), the measures for evaluating the model’s accuracy, and the methods for finding the best model in the model space, i.e., optimisation techniques.
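To make this triad concrete, here is a minimal, self-contained sketch (synthetic data and illustrative values only, not a prescription): the representation is a linear model, the evaluation measure is mean squared error, and the optimisation method is plain gradient descent.

```python
import numpy as np

# Illustrative only: the three components named above, made concrete.
# 1. Representation: a linear model y = X @ w (the "model space" is all w).
# 2. Evaluation measure: mean squared error between predictions and targets.
# 3. Optimisation: plain gradient descent searching that space for the best w.

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # synthetic features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)   # noisy synthetic targets

def mse(w):
    return np.mean((X @ w - y) ** 2)

w = np.zeros(3)
lr = 0.1
for _ in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of the MSE
    w -= lr * grad

print("learned weights:", w.round(2), "final MSE:", round(mse(w), 4))
```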
Because these three components are closely linked, assessing data quality for machine learning applications is a complex task. They operate on three datasets: training, validation, and test. Training data is used to fit the model’s parameters, validation data to tune hyperparameters and select among candidate models, and test data to evaluate performance on genuinely new data.
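As an illustration, one common convention (among several) for producing these three datasets is a two-step split; the data here is hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 1,000 examples with 10 features each.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 10))
y = rng.integers(0, 2, size=1000)

# Hold out a test set first, then split the remainder into train/validation.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=42)

# Result: 60% train / 20% validation / 20% test.
print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```

The test set is carved out first and then touched only once, so that no tuning decision leaks information from the data meant to simulate new situations.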
Acquisition and validation
The data acquisition stage is crucial, as quality issues can arise at this point, depending on the nature and reliability of the sources. These problems are often more acute when working with open data, user-generated data, or data from multiple, heterogeneous, and poorly documented sources.
The data cleaning stage is fundamental. It involves not only normalizing and standardizing the data but also addressing classic data quality issues, such as missing data, duplicates, highly correlated variables, excessive numbers of variables, and outliers. While these problems are traditionally assessed before data use, in the context of machine learning, data quality is also evaluated afterward, based on the model’s performance and behaviour.
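A minimal sketch of what such checks can look like in practice, using pandas on a small hypothetical table (column names, values, and the Tukey interquartile-range fence for outliers are all illustrative, not prescriptive):

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing value, one duplicate row, one outlier.
df = pd.DataFrame({
    "age":    [25, 32, 32, np.nan, 47, 51, 390],
    "income": [30e3, 45e3, 45e3, 52e3, 61e3, 64e3, 70e3],
})

print(df.isna().sum())        # missing values per column
print(df.duplicated().sum())  # exact duplicate rows
print(df.corr())              # inspect for highly correlated variable pairs

# Tukey's IQR fence: flag values beyond 1.5 * IQR from the quartiles.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
print(df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)])
```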
Data enrichment refers in particular to data annotation, which serves as the basis for supervised or semi-supervised learning. This is a crucial preparatory phase, as the data is structured and qualified according to the task assigned to the model. Annotation can be performed manually or automatically, with direct implications for the reliability and consistency of the resulting labels.
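One common way to quantify the reliability of manual annotation is inter-annotator agreement. The sketch below uses Cohen's kappa from scikit-learn on hypothetical labels from two annotators:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators for the same ten examples.
annotator_a = ["spam", "ham", "spam", "ham", "ham", "spam", "ham", "ham", "spam", "ham"]
annotator_b = ["spam", "ham", "spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham"]

# Cohen's kappa measures agreement beyond chance; values near 1.0 suggest
# consistent annotation, values near 0 suggest unreliable labels.
print(cohen_kappa_score(annotator_a, annotator_b))
```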
The concept of training refers to the process by which a model is adjusted to the input data it receives. The learned model is thus intrinsically dependent on the data on which it was trained. Preparing for training therefore involves profiling and evaluating the data to verify its suitability for the machine learning task.
The validation of training data includes checking its integrity and distribution. A dataset can produce excellent results in the laboratory but fail when confronted with observable real-world data.
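A simple, illustrative way to detect such a gap is to compare the distribution of a feature in the training data with what the deployed system actually observes, for instance with a two-sample Kolmogorov-Smirnov test (the data below is synthetic, with a deliberately shifted "live" distribution):

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic scenario: a feature's training distribution vs. the shifted
# distribution the deployed system actually observes.
rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)
live_feature = rng.normal(loc=0.4, scale=1.0, size=5000)

# A small p-value indicates that real-world data no longer matches the
# training distribution, a classic laboratory-to-field failure mode.
stat, p_value = ks_2samp(train_feature, live_feature)
print(f"KS statistic={stat:.3f}, p-value={p_value:.3g}")
```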
While data validation is essential for the reliability and quality of machine learning-based software systems, exhaustive validation of all the data feeding these systems is virtually impossible. Another source of complexity lies in the multitude of actors involved: several individuals or teams may be successively engaged in the collection, development, annotation, and maintenance of the various datasets, making it difficult to assign responsibility for data quality.
Machine learning, therefore, requires data with appropriate qualitative characteristics at each stage. This implies detecting errors as early as possible in the process and implementing validation procedures that identify unexpected values, inconsistencies, or deviations in the data.
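As a sketch of what such a validation procedure might look like, the following hand-rolled check flags unexpected values against declared expectations; dedicated tools such as Great Expectations or pandera offer industrial versions of the same idea. All column names and rules here are hypothetical.

```python
import pandas as pd

# Hypothetical expectations: each column maps to a predicate it must satisfy.
EXPECTATIONS = {
    "age":     lambda s: s.notna().all() and s.between(0, 120).all(),
    "country": lambda s: s.isin(["FR", "DE", "US"]).all(),
}

def validate(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable violations, empty if the data passes."""
    errors = []
    for column, check in EXPECTATIONS.items():
        if column not in df.columns:
            errors.append(f"missing column: {column}")
        elif not check(df[column]):
            errors.append(f"unexpected values in column: {column}")
    return errors

df = pd.DataFrame({"age": [34, 29, 430], "country": ["FR", "US", "DE"]})
print(validate(df))  # ['unexpected values in column: age']
```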
The issue of bias and scarcity
While machine learning systems are designed to minimize error rates and maximize accuracy, this ambition encounters a fundamental challenge when discrimination is present in the original data. Training datasets frequently incorporate past human decisions.
Thus, technically sound, carefully prepared data can still carry social or organisational biases. If the training data is biased, the model’s results will also be biased, reproducing and sometimes amplifying these biases.
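One illustrative way to surface such a bias is to compare a model's positive-outcome rates across a sensitive attribute; the so-called "80% rule" is one common heuristic for flagging a disparity. The data and column names below are hypothetical:

```python
import pandas as pd

# Hypothetical model outputs broken down by a sensitive attribute.
df = pd.DataFrame({
    "group":     ["A", "A", "A", "A", "B", "B", "B", "B"],
    "predicted": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Positive-outcome rate per group: 0.75 for A, 0.25 for B.
rates = df.groupby("group")["predicted"].mean()
print(rates)

# The 80% rule heuristic: a ratio below 0.8 between the least- and
# most-favoured groups is often treated as a red flag.
print("disparate impact ratio:", round(rates.min() / rates.max(), 2))
```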
Having high-quality data does not necessarily guarantee high-quality results. One of the paradoxes of machine learning lies in the need for data that is both relevant and available in sufficient quantity. This phenomenon, known as "data sparsity," refers to a situation in which only a small fraction of the data contains genuinely relevant information for the task at hand. This sparsity can originate from missing values, but also from the data generation process itself, particularly when users generate the data.
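To make the notion concrete: in a user-item interaction matrix, a sketch like the following (with invented dimensions and density) shows how little of the data may actually carry signal:

```python
from scipy.sparse import random as sparse_random

# Hypothetical user-item interaction matrix where only 1% of entries are
# observed, a typical shape for user-generated data.
interactions = sparse_random(1000, 500, density=0.01, random_state=0)

observed = interactions.nnz
total = interactions.shape[0] * interactions.shape[1]
print(f"observed entries: {observed} of {total} ({100 * observed / total:.1f}%)")
```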
LLMs: a challenge for data quality
Large language models (LLMs) introduce additional complexity regarding data quality. Data acquisition occurs on a large scale, from widely varied and heterogeneous sources, often without thorough validation or contextualization of initial use. This data is then aggregated and transformed during training, progressively losing its traceability and original semantic grounding.
Finally, the data is stored and represented in abstract mathematical spaces, within which it is no longer possible to directly influence the quality of individual data points, but only the overall behaviour of the model. In this context, data quality emerges statistically from the model, and validation, correction, accountability, and governance rely on indirect, ex post mechanisms that are inherently difficult to control.
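At acquisition scale, even a basic quality step such as deduplication becomes non-trivial. The sketch below shows exact deduplication by content hashing; real LLM pipelines typically add fuzzy matching (for example MinHash) on top, since near-duplicates vastly outnumber exact copies. The documents here are invented.

```python
import hashlib

# Minimal sketch: exact deduplication of scraped documents by content hash.
documents = [
    "The quick brown fox.",
    "An entirely different text.",
    "The quick brown fox.",  # exact duplicate from another source
]

seen: set[str] = set()
unique_docs = []
for doc in documents:
    digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
    if digest not in seen:
        seen.add(digest)
        unique_docs.append(doc)

print(f"kept {len(unique_docs)} of {len(documents)} documents")
```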
Small language models (SLMs), by virtue of their compactness, offer an attractive compromise in terms of data quality: their limited functional scope and specialization in specific tasks or domains enable better control over datasets, both for selection and validation. This reduction in scale facilitates source traceability, bias identification, and the evaluation of model behaviour. However, it does not eliminate the structural challenges related to data quality, representativeness, or drift over time.