Privacy-Aware Data Cleaning
Despite the proliferation of sensitive user information, data privacy has largely remained unexplored in data cleaning techniques. We explore a new privacy-aware data cleaning framework that aims to resolve data inconsistencies while protecting sensitive information. We investigate an information exchange model that allows two parties A and B to work together to clean the data from A while disclosing minimal information from B. We propose a set of new repair operations that increase data utility while preserving data privacy. In a sister project, we consider an extended set of repair operations that provide more fine-grained choices on how to clean the data. This provides the user with options to improve data utility (i.e., cleanliness) while carefully controlling the level of information disclosure from sensitive data values.
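One way to picture the utility-vs-disclosure trade-off is value generalization: rather than disclosing its exact master value, party B may return an ancestor from a value-generalization hierarchy. The sketch below is purely illustrative; the hierarchy, the `disclosure_budget` parameter, and the function names are assumptions, not the framework's actual protocol.

```python
# Hypothetical value-generalization hierarchy (child -> parent).
VGH = {
    "Toronto": "Ontario",
    "Ottawa": "Ontario",
    "Ontario": "Canada",
}

def generalize(value, levels):
    """Walk `levels` steps up the hierarchy (illustrative only)."""
    for _ in range(levels):
        value = VGH.get(value, value)
    return value

def privacy_aware_repair(master_value, disclosure_budget):
    """Return a repair from B's master data, generalized so that a
    smaller disclosure budget reveals a coarser (less sensitive) value."""
    levels = max(0, 2 - disclosure_budget)
    return generalize(master_value, levels)

# Repairing a misspelled city "Torotno" held by party A:
print(privacy_aware_repair("Toronto", disclosure_budget=2))  # "Toronto"
print(privacy_aware_repair("Toronto", disclosure_budget=1))  # "Ontario"
```

A larger budget yields a more specific (and more useful) repair, at the cost of revealing more of B's data.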
Data Quality Metrics for Watson Analytics
Duplicated data leads to poor data quality, as two or more references to the same entity may contain inconsistent information. In collaboration with IBM, we are developing new data quality metrics for IBM’s cloud-based data analytics platform, Watson Analytics. We are developing new measures that provide finer-grained measurement of duplicate values within an attribute, and that accurately identify duplicate records in a dataset. The metrics may be customized according to desirable properties based on a user’s data analysis task. This project aims to provide organizations with more accurate measurements to clean their data, saving time and money and enabling faster decision making.
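As a baseline intuition for an attribute-level duplication measure, consider the fraction of values in a column that are repeats of an earlier value. This simple sketch is an assumption for illustration, not one of the metrics developed for Watson Analytics.

```python
from collections import Counter

def attribute_duplication(values):
    """Fraction of values in an attribute that duplicate another value:
    0.0 means all values are distinct; values near 1.0 indicate heavy
    duplication within the column."""
    counts = Counter(values)
    return 1.0 - len(counts) / len(values)

cities = ["NYC", "NYC", "Boston", "NYC", "Boston", "Chicago"]
print(attribute_duplication(cities))  # 0.5 (3 distinct among 6 values)
```

A task-aware metric could weight values differently (e.g., near-duplicates like "NYC" vs. "N.Y.C."), which is where finer-grained measures become necessary.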
CurrentClean: Spatio-temporal Cleaning of Stale Data
Data currency is essential for up-to-date and accurate data analysis. Identifying and repairing stale data goes beyond simply having timestamps: individual entities each have their own update patterns in both space and time. We develop CurrentClean, a probabilistic system for identifying and cleaning stale values. We introduce a spatio-temporal probabilistic model that captures database update patterns to infer stale values, and propose a set of inference rules that model spatio-temporal update patterns commonly seen in real data. We recommend repairs to clean stale values by learning from past updates to cells.
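To illustrate why per-entity update patterns matter, the sketch below scores a cell's staleness under a simple Poisson update model: a cell that is usually updated every 30 days but has not changed in 90 days is very likely stale, while the same 90-day gap is unremarkable for a cell updated yearly. This toy model is an assumption for exposition, not CurrentClean's actual spatio-temporal model.

```python
import math

def staleness_probability(elapsed, mean_interval):
    """P(at least one unseen update has occurred) assuming updates to
    this cell follow a Poisson process with the learned mean interval."""
    rate = 1.0 / mean_interval          # learned per-cell update rate
    return 1.0 - math.exp(-rate * elapsed)

# Cell usually updated every 30 days, last observed 90 days ago:
frequent = staleness_probability(elapsed=90, mean_interval=30)   # ~0.95
# Same gap for a cell updated roughly once a year:
rare = staleness_probability(elapsed=90, mean_interval=365)      # ~0.22
```

A spatial component could then share statistical strength across related cells (e.g., other attributes of the same entity), refining these per-cell estimates.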
Ontology Functional Dependencies
Integrity constraints such as functional dependencies capture attribute relationships based on syntactic equivalence. In this project, we go beyond equality relationships and study a new class of dependencies called Ontology Functional Dependencies (OFDs) that capture attribute relationships based on synonym and hierarchical (is-a) relationships defined in an ontology. We study the theoretical foundations of OFDs, and present a discovery algorithm that mines for OFDs in a data instance.
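To see how an OFD relaxes a classic functional dependency, the simplified sketch below checks a City → Country dependency where values are compared up to ontology synonyms, so "NYC" and "New York" no longer register as a violation. The synonym classes and function names are invented for illustration, and the sketch omits the is-a (hierarchical) case handled by full OFD semantics.

```python
# Hypothetical synonym classes drawn from an ontology.
SYNONYMS = [
    {"NYC", "New York", "Big Apple"},
    {"USA", "United States"},
]

def same_sense(a, b):
    """Values match if equal or if they share a synonym class."""
    return a == b or any(a in s and b in s for s in SYNONYMS)

def satisfies_ofd(tuples, lhs, rhs):
    """Every pair of tuples agreeing (up to synonyms) on `lhs`
    must also agree (up to synonyms) on `rhs`."""
    for t1 in tuples:
        for t2 in tuples:
            if same_sense(t1[lhs], t2[lhs]) and not same_sense(t1[rhs], t2[rhs]):
                return False
    return True

data = [{"city": "NYC",      "country": "USA"},
        {"city": "New York", "country": "United States"}]
print(satisfies_ofd(data, "city", "country"))  # True: no violation under synonyms
```

Under plain syntactic equality this pair would violate the dependency; the ontology-aware comparison is what prevents the false positive.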