Data quality plays a central role in developing machine learning technologies, which rely on families of algorithms designed to formalize and optimize a process fed by large volumes of data. Many of the quality problems involved are similar to those observed in relational databases, where it is commonly accepted that poor-quality data cannot yield quality information.
This paper was written as part of the meeting of the FNRS contact group « Critical analysis and improvement of the quality of digital information » held on 05/18/2022. See also « Data quality challenges in the scope of a fitness for use » (SMALS).
Data quality is often defined as “fitness for use”, that is, the ability of a data collection to meet user requirements. It is also defined through several dimensions: consistency, uniqueness, timeliness, accuracy, validity, and completeness are the ones most often cited by both the academic literature and practitioners. Defining data quality is nevertheless more complex, because empirical data, which consist of observations made in the real world, only reflect a certain state at a certain moment. Data values are likely to evolve over time, as are the application domains to which they refer; users’ needs may also evolve, which makes pinning down what data quality is all the harder.
Machine learning refers to computational models trained on empirical data to mimic human intelligence by transforming inputs into results based on mathematical relationships that are difficult to derive by deductive reasoning or simple statistical analysis (Kläs and Vollmer, 2018). Its purpose is to make sense of complex data. The input data provides a factual basis for « reasoning », a process that involves translating the data into a more meaningful representation, or abstraction, which results in what is called the generalization of the model, i.e. the use of this abstraction in a form that can be used for action (Lantz, 2014).
In this area, data quality problems are all the more complex because they concern the data that feed the model as well as the representation of the model, the measures used to evaluate its accuracy, and the methods for finding the best model. The evaluation of data quality and data validation therefore take place both upstream and downstream of these processes, and can only be envisaged in the context of their field of application. This principle of fitness for use is laid down by the ISO 9000 standard on quality management (Boydens and van Hooland, 2014).
The three components of a machine learning algorithm are the model representation, the metrics used to evaluate model accuracy, and the methods for finding the best model in the model space (i.e., optimizing the model). Because these three components are intertwined, assessing data quality for machine learning applications is complex. The data itself is divided into training data, validation data, and test data: validation data is used to tune model parameters, and test data to assess model performance (Gudivada et al., 2017).
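As an illustration, the three subsets can be obtained with two successive random splits. The sketch below assumes scikit-learn; the 60/20/20 proportions are an arbitrary choice for the example, not a recommendation from the cited authors.

```python
# Illustrative sketch: splitting a dataset into training, validation and test sets
# (60/20/20 split chosen arbitrarily for the example).
from sklearn.model_selection import train_test_split

def three_way_split(X, y, seed=42):
    # First split: hold out 40% of the data for validation + test.
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        X, y, test_size=0.4, random_state=seed)
    # Second split: divide the held-out part equally into validation and test sets.
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=0.5, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```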
The choice of a machine learning algorithm depends on the task to be performed, while taking into account the strengths and weaknesses of the chosen algorithm, including with respect to data quality. Take k-nearest neighbours (KNN), for example: it is a relatively fast and efficient algorithm, but one of its weaknesses is that it does not handle missing data well. This is also one of the weak points of multiple linear regression.
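A minimal sketch of this point, assuming scikit-learn: a KNN classifier cannot be fitted on data containing missing values unless an explicit imputation step is added, and the mean-imputation strategy shown here is only one possible choice.

```python
# KNN does not handle missing values by itself: without the imputer,
# fitting on data containing NaN raises an error.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X = np.array([[1.0, 2.0], [np.nan, 3.0], [2.5, np.nan], [3.0, 1.0]])
y = np.array([0, 0, 1, 1])

model = make_pipeline(
    SimpleImputer(strategy="mean"),   # replace missing values by the column mean
    KNeighborsClassifier(n_neighbors=3),
)
model.fit(X, y)
print(model.predict([[2.0, 2.0]]))
```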
Naive Bayes is also simple, fast and efficient, but it is not ideal for datasets with many numerical features. A significant weakness of decision trees is that a minor change in the training data can lead to substantial changes in the logic of the outcome (Lantz, 2014).
For a linear model, a small number of representative observations suffices to build and test the model; using many more observations to build it may not improve its performance (Gudivada et al., 2017).
Data lifecycle in machine learning
Mapping out the data lifecycle in a machine learning process helps to better understand the quality needs throughout this process. The data collection stage is critical, since quality problems can already arise at this level depending on the data source. These problems can be particularly acute when working with open data, user-generated data, or data from multiple sources (Hair and Sarstedt, 2021).
The cleaning step is also fundamental: it involves normalizing and standardizing the data and dealing with classic problems such as missing data, duplicates, strongly correlated variables, too many variables, or outliers. These quality issues are traditionally assessed before the data is used, but in machine learning, data quality is assessed both upstream and downstream of model building.
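The checks below sketch, with pandas, a few of the classic problems just mentioned (duplicates, missing values, strongly correlated variables, outliers). The thresholds and the report structure are illustrative assumptions, not part of any cited methodology.

```python
import pandas as pd

def basic_cleaning_report(df: pd.DataFrame) -> dict:
    """Profile a dataset for some classic quality issues before modelling."""
    report = {
        "n_duplicates": int(df.duplicated().sum()),          # exact duplicate rows
        "missing_per_column": df.isna().mean().to_dict(),    # share of missing values
    }
    numeric = df.select_dtypes("number")
    # Pairs of variables whose absolute correlation exceeds an arbitrary 0.9 threshold.
    corr = numeric.corr().abs()
    report["highly_correlated"] = [
        (a, b) for a in corr.columns for b in corr.columns
        if a < b and corr.loc[a, b] > 0.9
    ]
    # Simple outlier flag: values further than 3 standard deviations from the mean.
    zscores = (numeric - numeric.mean()) / numeric.std()
    report["n_outlier_rows"] = int((zscores.abs() > 3).any(axis=1).sum())
    return report
```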
Data enrichment refers to the annotation of the data that will serve as the basis for supervised or semi-supervised learning. This represents a great deal of preparation work, since the data is prepared according to the task assigned to the model. Enrichment can be carried out manually or automatically; in both cases, issues are likely to arise concerning the reliability and accuracy of the resulting data. Its quality is essential because it directly impacts the whole process (Ridzuan et al., 2022).
Training data labelling
Large labelled datasets have been critical to the success of supervised machine learning in image classification, sentiment analysis, and audio classification. Yet the processes used to construct these datasets often involve some degree of automatic labelling or crowdsourcing, which are inherently error-prone techniques, even when control procedures are in place to correct them (Northcutt et al., 2021).
In the case of crowdsourcing, i.e. data annotated by users, the quality of labelling comes up against annotators’ lack of expertise in the application domain, of interest or of concentration, or against other human factors relating to their subjectivity and socio-cultural referents (Foidl and Felderer, 2019). This non-expert tagging, facilitated by online outsourcing platforms such as Rent-A-Coder and Amazon’s Mechanical Turk, which assign workers to arbitrary (well-defined) tasks, can result in noisy labels that require substantial correction. Also, while there are tools and techniques to assess data quality for general cleaning and profiling checks, these are not applicable for detecting issues such as noisy labels or overlapping classes (Gupta et al., 2021).
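One common safeguard, although by no means the only one, is to have several annotators label the same items and measure their agreement. The sketch below, assuming scikit-learn, computes Cohen's kappa between two annotators and aggregates the labels by majority vote; the annotators and label values are toy assumptions.

```python
# Illustrative label-quality check for crowdsourced annotations.
from collections import Counter
from sklearn.metrics import cohen_kappa_score

annotator_a = ["cat", "dog", "dog", "cat", "bird"]
annotator_b = ["cat", "dog", "cat", "cat", "bird"]
annotator_c = ["cat", "cat", "dog", "cat", "bird"]

# Agreement between two annotators (1.0 = perfect agreement, 0 = chance level).
print("kappa(a, b):", cohen_kappa_score(annotator_a, annotator_b))

# Majority vote across annotators as a simple label-aggregation strategy.
majority = [
    Counter(votes).most_common(1)[0][0]
    for votes in zip(annotator_a, annotator_b, annotator_c)
]
print("aggregated labels:", majority)
```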
Training refers to fitting the model to the data fed into the system. The model learned by the machine is therefore intrinsically linked to the data it is given. This requires profiling and evaluating the data to understand its suitability for machine learning tasks: failure to do so can lead to inaccurate analyses and unreliable decisions (Gupta et al., 2021).
Machine learning assumes that the training data provided to the model follows the same distribution as the data the model will later be applied to; otherwise, the model’s accuracy will decrease. This also means that it is essential to detect errors as early as possible in the process and to have validation procedures in place that can catch unexpected values or inconsistencies in values. For example, if the country code for the United States is written « US » in capital letters in some records and « us » in lowercase in others, the lowercase value will be treated as the code of a new country (Polyzotis et al., 2018).
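A minimal sketch of the country-code example, assuming pandas: without normalization, a categorical encoding treats « US » and « us » as two distinct countries; the column name is a placeholder for the illustration.

```python
import pandas as pd

df = pd.DataFrame({"country": ["US", "us", "FR", "fr", "US"]})

# Without normalization, one-hot encoding produces separate columns for "US" and "us",
# i.e. the lowercase value is silently treated as a new country.
print(pd.get_dummies(df["country"]).columns.tolist())   # ['FR', 'US', 'fr', 'us']

# Normalizing the case before encoding removes the spurious categories.
df["country"] = df["country"].str.upper()
print(pd.get_dummies(df["country"]).columns.tolist())   # ['FR', 'US']
```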
Data validation and model evaluation
Validating training data ensures that it does not contain errors that could propagate into the model. Added to this is checking its integrity to ensure that it has the expected « shape » before the model is launched, for example with respect to a given characteristic, such as a country code, or to a sufficient number of values (Polyzotis et al., 2018).
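Such an integrity check can be expressed as a small validation function run before training. In the sketch below, the expected columns, the allowed country codes and the minimum row count are illustrative assumptions.

```python
import pandas as pd

EXPECTED_COLUMNS = {"country", "amount", "label"}   # hypothetical schema
ALLOWED_COUNTRIES = {"US", "FR", "BE", "DE"}        # hypothetical reference list
MIN_ROWS = 1_000                                    # arbitrary minimum volume

def validate_training_data(df: pd.DataFrame) -> None:
    """Fail fast if the dataset does not have the expected 'shape'."""
    missing_cols = EXPECTED_COLUMNS - set(df.columns)
    if missing_cols:
        raise ValueError(f"Missing expected columns: {missing_cols}")
    unknown = set(df["country"].dropna().unique()) - ALLOWED_COUNTRIES
    if unknown:
        raise ValueError(f"Unexpected country codes: {unknown}")
    if len(df) < MIN_ROWS:
        raise ValueError(f"Not enough rows to train: {len(df)} < {MIN_ROWS}")
```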
Although data validation has been identified as an essential requirement to ensure the reliability and quality of machine learning-based software systems, complete validation of all the data fed into these systems is practically impossible. Moreover, there is still little scientific discussion of methods that would help the software engineers of such systems determine the level of validation required for each feature (Foidl and Felderer, 2019).
Evaluating the model’s performance consists of measuring its accuracy. Bias refers to systematic error: the difference between the expected value and the predicted value. Variance refers to the model’s instability under small fluctuations in the training data set. Since these two sources of error cannot be minimized simultaneously, we speak of a bias-variance trade-off, which applies to all types of supervised learning.
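As a reminder, the standard textbook decomposition behind this trade-off, stated here for squared-error loss (a general result, not specific to the sources cited above), is:

```latex
% Expected squared error of a learned predictor \hat{f} at a point x,
% decomposed into squared bias, variance and irreducible noise \sigma^2.
\mathbb{E}\!\left[(y - \hat{f}(x))^2\right]
  = \underbrace{\bigl(\mathbb{E}[\hat{f}(x)] - f(x)\bigr)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[\bigl(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\bigr)^2\right]}_{\text{variance}}
  + \sigma^2
```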
Other indicators and procedures are used to assess the model’s performance, such as entropy, recall, precision, or cross-validation. Evaluating the effectiveness of a model is therefore also an essential step, since it allows the model, and hence the quality of the results, to be improved. In some cases, additional operations have to be performed, such as dimensionality reduction, to reduce the number of predictor variables in the training data.
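An illustrative sketch, assuming scikit-learn, of how precision, recall and cross-validation scores can be computed for a classifier; the synthetic dataset and logistic regression model are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic binary classification data, used only to make the example runnable.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

print("precision:", precision_score(y_test, y_pred))
print("recall:   ", recall_score(y_test, y_pred))
# 5-fold cross-validation gives a more stable estimate of generalization accuracy.
print("cv scores:", cross_val_score(model, X, y, cv=5))
```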
Many challenges to overcome
If machine learning systems are assumed to be designed to minimize error rates and maximize accuracy, how can this be achieved when discrimination is present from the start? Moreover, training data regularly incorporates the outcomes of previous human decisions.
Moreover, good-quality data does not necessarily guarantee good-quality results: this is the paradox of machine learning technologies, where the data must also be available in sufficient quantity. This is the phenomenon of « data sparsity » or data scarcity, which refers to situations where only a small fraction of the data contains relevant information.
Scarcity can come from missing values but can also appear during the data generation process, as is often the case when users generate data. Scarcity is problematic for transactional data, which is traditionally the type of data used in marketing to define consumer behaviour, but it is just as tricky for images, audio and video, because it complicates the identification of the characteristics of this data. This potentially impacts the predictive power of the machine learning algorithms that use this data (Hair and Sarstedt, 2021). However, there are techniques to remedy these problems of scarcity or dispersion of data, in particular by relying on machine learning itself.
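To make the notion concrete, here is a minimal sketch of how sparsity can be quantified on a user-item interaction matrix, a typical transactional structure; the toy matrix and the use of SciPy are assumptions for the illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy user-item matrix: rows are users, columns are products, values are purchases.
interactions = csr_matrix(np.array([
    [1, 0, 0, 0, 0],
    [0, 0, 2, 0, 0],
    [0, 0, 0, 0, 1],
]))

n_cells = interactions.shape[0] * interactions.shape[1]
sparsity = 1.0 - interactions.nnz / n_cells
print(f"sparsity: {sparsity:.1%}")   # here 80% of the cells carry no information
```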
Another challenge pinpointed in quality management research – which is regularly overlooked in the corporate world – is that many people play different roles in the collection, development, and maintenance of these different datasets (Kim et al., 2017).
Machine learning therefore presents a distinct set of data quality problems, which amount to as many challenges to be overcome throughout the process.
See the page on the MASTIC – ULB website
References
- Boydens, I., & Van Hooland, S. (2011). Hermeneutics applied to the quality of empirical databases. Journal of Documentation.
- Elouataoui, W., Alaoui, I. E., & Gahi, Y. (2022). Data Quality in the Era of Big Data: A Global Review. Big Data Intelligence for Smart Applications, 1-25.
- Foidl, H., & Felderer, M. (2019, August). Risk-based data validation in machine learning-based software systems. In Proceedings of the 3rd ACM SIGSOFT International Workshop on Machine Learning Techniques for Software Quality Evaluation (pp. 13-18).
- Gudivada, V., Apon, A., & Ding, J. (2017). Data quality considerations for big data and machine learning: Going beyond data cleaning and transformations. International Journal on Advances in Software, 10(1), 1-20.
- Gupta, N., Patel, H., Afzal, S., Panwar, N., Mittal, R. S., Guttula, S., … & Saha, D. (2021). Data Quality Toolkit: Automatic assessment of data quality and remediation for machine learning datasets. arXiv preprint arXiv:2108.05935.
- Gupta, N., Mujumdar, S., Patel, H., Masuda, S., Panwar, N., Bandyopadhyay, S., … & Munigala, V. (2021, August). Data quality for machine learning tasks. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining (pp. 4040-4041).
- Hair Jr, J. F., & Sarstedt, M. (2021). Data, measurement, and causal inferences in machine learning: opportunities and challenges for marketing. Journal of Marketing Theory and Practice, 29(1), 65-77.
- Kim, M., Zimmermann, T., DeLine, R., & Begel, A. (2017). Data scientists in software teams: State of the art and challenges. IEEE Transactions on Software Engineering, 44(11), 1024-1038.
- Lantz, B. (2019). Machine learning with R: expert techniques for predictive modeling. Packt Publishing Ltd.
- Lease, M. (2011, August). On quality control and machine learning in crowdsourcing. In Workshops at the Twenty-Fifth AAAI Conference on Artificial Intelligence.
- Northcutt, C. G., Athalye, A., & Mueller, J. (2021). Pervasive label errors in test sets destabilize machine learning benchmarks. arXiv preprint arXiv:2103.14749.
- Polyzotis, N., Roy, S., Whang, S. E., & Zinkevich, M. (2018). Data lifecycle challenges in production machine learning: a survey. ACM SIGMOD Record, 47(2), 17-28.
- Ridzuan, F., Wan Zainon, W. M. N., & Zairul, M. (2022). A Thematic Review on Data Quality Challenges and Dimension in the Era of Big Data. In Proceedings of the 12th National Technical Seminar on Unmanned System Technology 2020 (pp. 725-737). Springer, Singapore.