Privacy-Aware Data Cleaning

Despite the proliferation of sensitive user information, data privacy concerns have remained largely unexplored in data cleaning techniques. We explore a new privacy-aware data cleaning framework that aims to resolve data inconsistencies while protecting sensitive information. We investigate an information exchange model that allows two parties, A and B, to work together to clean A's data while disclosing minimal information from B. We propose a set of new repair operations that increase data utility while preserving data privacy. In a sister project, we consider an extended set of repair operations that provide more fine-grained choices for how to clean the data. This gives the user options for improving data utility (i.e., cleanliness) while carefully controlling the level of information disclosed from sensitive data values.


Data Quality Metrics for Watson Analytics

Duplicated data leads to poor data quality, as two or more references to the same entity may contain inconsistent information. In collaboration with IBM, we are developing new data quality metrics for IBM’s cloud-based data analytics platform, Watson Analytics. The new measures provide a finer-grained measurement of duplicate values within an attribute and more accurately identify duplicate records in a dataset. The metrics may be customized according to the properties desired for a user’s data analysis task. This project aims to provide organizations with more accurate measurements for cleaning their data, saving money and time and enabling faster decision making.
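
To make the idea of an attribute-level duplicate measure concrete, here is a minimal sketch in Python. It is not the metric developed for Watson Analytics; the functions `normalize` and `duplication_ratio` and the normalization rule (lowercasing and collapsing whitespace) are assumptions chosen purely for illustration. The score is the fraction of values in a column that repeat an earlier value after normalization.

```python
from collections import Counter


def normalize(value):
    """Hypothetical normalization: lowercase and collapse whitespace."""
    return " ".join(str(value).lower().split())


def duplication_ratio(values):
    """Fraction of values that duplicate another value in the same attribute.

    0.0 means every value is distinct; values close to 1.0 mean the
    column is dominated by repeats. This is an illustrative measure only.
    """
    normalized = [normalize(v) for v in values]
    if not normalized:
        return 0.0
    counts = Counter(normalized)
    # Each distinct value contributes (occurrences - 1) redundant copies.
    redundant = sum(c - 1 for c in counts.values())
    return redundant / len(normalized)


companies = ["IBM", "ibm ", "Acme", "Acme", "Zed"]
ratio = duplication_ratio(companies)
```

A task-specific variant could swap in a different `normalize` function (for example, one that strips punctuation or applies a similarity threshold), which is one way the customization described above might look in practice.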

Query-Driven Temporal Data Cleaning

For some applications, it is more cost-effective to partially clean a dirty database (i.e., approximate data cleaning), either because (1) cleaning the entire database incurs substantial costs, or (2) performance requirements demand a fast response time. In this project, we develop a new temporal cleaning model that approximately cleans a data instance according to a set of temporal conditions while respecting a limited budget.
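
As a rough illustration of budget-limited cleaning under a temporal condition, the sketch below picks repairs greedily by benefit per unit cost. It is an assumption-laden example, not the project's cleaning model: the `Repair` record, its `cost` and `benefit` fields, the time-window condition, and the greedy strategy are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Repair:
    """Hypothetical candidate repair for one dirty tuple."""
    tuple_id: int
    timestamp: int   # when the dirty value was recorded
    cost: float      # effort needed to apply the repair
    benefit: float   # estimated gain in data quality


def select_repairs(candidates, window, budget):
    """Greedily choose repairs inside a time window without exceeding budget.

    Returns the chosen repairs and the total cost spent. Greedy selection
    is a heuristic here and is not guaranteed to be optimal.
    """
    start, end = window
    eligible = [r for r in candidates
                if start <= r.timestamp <= end and r.cost > 0]
    # Prefer repairs that deliver the most benefit per unit of cost.
    eligible.sort(key=lambda r: r.benefit / r.cost, reverse=True)
    chosen, spent = [], 0.0
    for r in eligible:
        if spent + r.cost <= budget:
            chosen.append(r)
            spent += r.cost
    return chosen, spent


candidates = [
    Repair(1, 5, 2.0, 10.0),
    Repair(2, 6, 1.0, 3.0),
    Repair(3, 50, 1.0, 100.0),  # outside the window, so never selected
    Repair(4, 7, 3.0, 6.0),
]
chosen, spent = select_repairs(candidates, window=(0, 10), budget=3.0)
```

With these inputs the selection keeps tuples 1 and 2 and spends the full budget of 3.0; tuple 4 is skipped because it would exceed the budget.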

Discovery of Ontology Functional Dependencies

Integrity constraints such as functional dependencies capture attribute relationships based on syntactic equivalence. In this project, we go beyond equality relationships and study a new class of dependencies called Ontology Functional Dependencies (OFDs), which capture attribute relationships based on synonym and hierarchical (is-a) relationships defined in an ontology. We study the theoretical foundations of OFDs, along with a discovery algorithm that mines OFDs from a data instance.
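
The following Python sketch shows one simplified way a synonym-based dependency could be checked against data. It is an illustration under stated assumptions rather than the project's definition or discovery algorithm: the function `holds_synonym_ofd`, the `senses` dictionary (mapping each value to the set of ontology concepts it may denote), and the rule that all right-hand-side values in a group must share at least one concept are hypothetical choices. Hierarchical (is-a) relationships are not modeled here.

```python
from collections import defaultdict


def holds_synonym_ofd(rows, lhs, rhs, senses):
    """Check whether lhs -> rhs holds up to synonymy.

    rows:   list of dicts (one per tuple)
    lhs:    list of attribute names on the left-hand side
    rhs:    single attribute name on the right-hand side
    senses: dict mapping a value to the set of concepts it can denote
            (a toy stand-in for an ontology)
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[a] for a in lhs)
        groups[key].append(row[rhs])

    for values in groups.values():
        common = None
        for v in values:
            # A value missing from the ontology only matches itself.
            s = senses.get(v, {("literal", v)})
            common = set(s) if common is None else common & s
        if not common:
            return False  # no single concept covers every value in the group
    return True


senses = {
    "United States": {"usa"},
    "USA": {"usa"},
    "America": {"usa", "americas"},
    "Canada": {"canada"},
}
rows = [
    {"code": "US", "country": "United States"},
    {"code": "US", "country": "USA"},
    {"code": "US", "country": "America"},
    {"code": "CA", "country": "Canada"},
]
ok = holds_synonym_ofd(rows, ["code"], "country", senses)
```

A traditional functional dependency code -> country would be violated by these rows because the three strings for "US" differ syntactically, whereas the synonym-based check accepts them since they share the concept "usa".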