Optimizing Data Quality and Interpretability with Data Valuation

In the fast-evolving field of AI, the AI-DAPT framework stands out for treating data quality and valuation as driving factors for more accurate and robust Machine Learning (ML) models. Data valuation is the process of assessing the quality of data and its contribution to the performance of an ML model or an entire AI system. It encompasses several aspects [6], including:

Feature Importance: Evaluating which features have the most impact on model results.
Identification of Relevant Data Points: Assessing how each data point contributes to the overall performance of the model. When the application at hand involves images or videos, data valuation identifies the importance and utility of each image or video in contributing to the overall prediction.
Bias Detection and Fairness: Identifying any bias in the data that may unfairly influence the model’s predictions.
Data Quality: Assessing how accurate, complete, and reliable the data is.
Economic and Business Value: Evaluating the economic impact of the data (the cost of collecting the data, the cost of determining which data points are the most valuable, etc.).
Ethical and Legal Compliance: Ensuring that the data complies with ethical standards and legal regulations.
Data Traceability: Identifying where the data comes from, how it was collected, and whether it has been modified.

In the context of the AI-DAPT framework, data valuation is considered multi-dimensional and supportive of the whole lifecycle of ML models. It features methods for assessing data quality, improving feature selection, detecting biases and optimizing model interpretability. This blog post discusses state-of-the-art data valuation methods and outlines how the AI-DAPT project will apply these methodologies to assess and enhance data quality.

DATA VALUATION METHODS AND PURPOSES

Data valuation is crucial for model performance: the higher the data quality, the more accurate and precise the model’s predictions will be [1]. A key aspect of assessing data quality is ensuring that models are not trained on irrelevant features. One common approach is to train models systematically while excluding one feature at a time [2]. This allows the identification of the features that have the most significant impact, as well as those that might cause overfitting. Such methods are computationally expensive, however, because a separate model must be retrained for each excluded feature. One of the most widely used techniques for evaluating feature importance and data relevance is the Shapley value, a game-theoretic approach that fairly attributes contributions to individual features [3]. This method addresses the “black-box” challenge associated with deep learning models, enhancing data and model explainability while also supporting tasks such as feature importance analysis and bias detection, which are crucial in the context of AI-DAPT.
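
To make the contrast between these two approaches concrete, the following is a minimal Python sketch, not AI-DAPT's implementation. The feature names (age, income, salary) and the scoring function toy_score are hypothetical stand-ins for "train a model on these features and measure its validation accuracy"; in practice every call would mean retraining a model. The sketch computes leave-one-feature-out importance alongside exact Shapley values.

```python
from itertools import combinations
from math import factorial


def toy_score(features):
    """Hypothetical stand-in for 'train on these features, return accuracy'."""
    score = 0.5  # assumed baseline accuracy with no features
    if "age" in features:
        score += 0.10
    # "income" and "salary" are assumed to be redundant: either one
    # alone provides the full gain, having both adds nothing extra.
    if "income" in features or "salary" in features:
        score += 0.20
    return score


def leave_one_out(players, value):
    """Importance of each feature = score drop when only that feature is removed."""
    full = frozenset(players)
    return {p: value(full) - value(full - {p}) for p in players}


def shapley_values(players, value):
    """Exact Shapley values: weighted average of marginal gains over all subsets."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):  # k = size of the subset that p joins
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


features = ["age", "income", "salary"]
loo = leave_one_out(features, toy_score)
shap = shapley_values(features, toy_score)
print("leave-one-out:", loo)
print("shapley:      ", shap)
```

Here income and salary are deliberately redundant: removing either one alone leaves the score unchanged, so leave-one-out assigns both an importance of zero, whereas the Shapley values split the shared gain between them. Exact computation enumerates every feature subset and is only feasible for a handful of features; larger problems need approximations.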

DATA VALUATION IN AI-DAPT

The AI-DAPT framework is based on the principle that data quality plays a critical role in the performance of ML models. In domains where data quality and interpretability are particularly crucial, robust data valuation methods are essential. Additionally, AI-DAPT emphasizes computational efficiency and scalability, recognizing that effective data valuation must account for the resources required to process, analyze, and interpret large datasets in real time. By combining computational techniques, statistical methods, and domain-specific considerations, AI-DAPT takes a multifaceted approach that aims to ensure both the integrity and the effectiveness of its models, while optimizing for performance and scalability in complex, data-driven environments.

The AI-DAPT project will collect datasets from different domains (healthcare, robotics, energy, and manufacturing). It will be important to measure the value of each data point, as well as of each dataset as a whole. To quantify the contribution of each data point toward specific tasks, we will utilize Shapley techniques. Moreover, we will assess feature relevance and, where comparable public datasets exist, analyze and compare the overall impact and value that each dataset offers.
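
The per-point valuation described above can be approximated by permutation sampling, in the spirit of Data Shapley [2]. The sketch below is illustrative only and is not the AI-DAPT pipeline: the one-dimensional toy dataset, the 1-nearest-neighbour utility knn1_accuracy, and all function and variable names are assumptions introduced here. The last training point is deliberately mislabeled.

```python
import random


def knn1_accuracy(train, valid):
    """Accuracy of a 1-nearest-neighbour classifier; 0.0 for an empty training set."""
    if not train:
        return 0.0
    correct = 0
    for x, y in valid:
        # Predict the label of the closest training point.
        _, pred = min(train, key=lambda point: abs(point[0] - x))
        correct += (pred == y)
    return correct / len(valid)


def monte_carlo_data_shapley(train, valid, utility, n_permutations=1000, seed=0):
    """Estimate each training point's Shapley value by averaging its
    marginal utility gain over random orderings of the training set."""
    rng = random.Random(seed)
    n = len(train)
    values = [0.0] * n
    for _ in range(n_permutations):
        order = list(range(n))
        rng.shuffle(order)
        subset = []
        previous = utility(subset, valid)
        for idx in order:
            subset.append(train[idx])
            current = utility(subset, valid)
            values[idx] += current - previous  # marginal gain of point idx
            previous = current
    return [v / n_permutations for v in values]


# Hypothetical toy data: (feature, label). Class 0 lives near 0-1, class 1 near 4-5.
train_points = [(0.0, 0), (1.0, 0), (4.0, 1), (5.0, 1),
                (0.6, 1)]  # last point: class-0 region, wrong label
valid_points = [(0.55, 0), (0.7, 0), (4.4, 1), (4.8, 1)]

values = monte_carlo_data_shapley(train_points, valid_points, knn1_accuracy)
for point, value in zip(train_points, values):
    print(point, round(value, 3))
```

Because the marginal gains along each permutation telescope, the estimated values sum to the utility of the full training set minus that of the empty set. In this toy setup the mislabeled point ends up with a negative value, which is the kind of signal that could flag low-quality data for review.
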
AI-DAPT’s data valuation will also assess the quality and fairness of datasets, aiming to identify potential biases that may affect outcomes. The motivation is that if the input data is biased, the output is likely to be biased as well [4]. Several methods will be used for this. First, exploratory data analysis will help detect anomalies and missing values. Next, class imbalances will be assessed, and reweighting or resampling techniques will be applied when necessary. Finally, we will use state-of-the-art open-source tools such as IBM AI Fairness 360 [5], which offers metrics and algorithms that detect and mitigate biases in data, helping ensure fairness across different attributes.
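
As a hedged illustration of the fairness and reweighting checks mentioned above, the sketch below computes two common group-fairness metrics (statistical parity difference and disparate impact) and reweighing-style instance weights in plain Python. It does not call the AI Fairness 360 API; the group names, the toy records, and the function names are hypothetical, and the weighting formula (expected joint frequency under independence divided by observed joint frequency) is one standard choice rather than necessarily the one AI-DAPT will use.

```python
from collections import Counter


def positive_rate(records, group, weights=None):
    """Weighted share of positive labels (label == 1) within one group."""
    if weights is None:
        weights = [1.0] * len(records)
    total = sum(w for (g, _), w in zip(records, weights) if g == group)
    positives = sum(w for (g, y), w in zip(records, weights)
                    if g == group and y == 1)
    return positives / total


def reweighing_weights(records):
    """Per-record weight = P(group) * P(label) / P(group, label),
    which makes group and label independent in the weighted data."""
    n = len(records)
    group_counts = Counter(g for g, _ in records)
    label_counts = Counter(y for _, y in records)
    joint_counts = Counter(records)
    return [group_counts[g] * label_counts[y] / (n * joint_counts[(g, y)])
            for g, y in records]


# Hypothetical toy data: (group, label). Group "a" is treated as privileged.
records = ([("a", 1)] * 6 + [("a", 0)] * 4 +
           [("b", 1)] * 3 + [("b", 0)] * 7)

rate_a = positive_rate(records, "a")
rate_b = positive_rate(records, "b")
spd = rate_b - rate_a   # statistical parity difference (0 means parity)
di = rate_b / rate_a    # disparate impact (1 means parity)
print("statistical parity difference:", round(spd, 3))
print("disparate impact:", round(di, 3))

weights = reweighing_weights(records)
print("reweighted positive rate, group a:", round(positive_rate(records, "a", weights), 3))
print("reweighted positive rate, group b:", round(positive_rate(records, "b", weights), 3))
```

After reweighting, both groups have the same weighted positive rate, equal to the overall positive rate of the dataset. The weights can then be passed to any learner that accepts per-sample weights.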

Our findings will be instrumental in optimizing data collection strategies for our demonstrators. The datasets collected and used in AI-DAPT across healthcare, robotics, energy, and manufacturing will therefore contribute to unbiased, high-quality findings and support future research in these emerging domains.

CONCLUSION

AI-DAPT’s focus on data valuation reflects a simple yet important reality of AI: whatever comes out is only as good as what goes in. By improving the methodology behind data valuation, AI-DAPT is setting the stage for future AI applications that are not only powerful and efficient but also reliable and fair.

REFERENCES

[1] K. Jiang, W. Liang, J. Y. Zou and Y. Kwon, “OpenDataVal: A unified benchmark for data valuation,” Advances in Neural Information Processing Systems, vol. 36, 2023.
[2] A. Ghorbani and J. Zou, “Data Shapley: Equitable valuation of data for machine learning,” International Conference on Machine Learning, PMLR, 2019.
[3] R. Jia et al., “Towards efficient data valuation based on the Shapley value,” The 22nd International Conference on Artificial Intelligence and Statistics, PMLR, 2019.
[4] M. Huang and R. Rust, “A strategic framework for artificial intelligence in marketing,” Journal of the Academy of Marketing Science, vol. 49, pp. 30-50, 2021.
[5] “AI Fairness 360 – IBM,” [Online]. Available: https://aif360.res.ibm.com/
[6] R. Miller et al., “A framework for current and new data quality dimensions: An overview,” Data, vol. 9, no. 12, p. 151, 2024.
