Privacy-Aware Data Cleaning
Despite the proliferation of sensitive user information, data privacy concerns have remained largely unexplored in data cleaning techniques. We explore a new privacy-aware data cleaning framework that aims to resolve data inconsistencies while protecting sensitive information. We investigate an information exchange model that allows two parties, A and B, to work together to clean the data from A while disclosing minimal information from B. We propose a set of new repair operations that increase data utility while preserving data privacy. In a sister project, we consider an extended set of repair operations that provide more fine-grained choices on how to clean the data. This gives the user options for how to improve data utility (i.e., cleanliness) while carefully controlling the level of information disclosure from sensitive data values.
Data Integrity over Graphs
Data dependencies play a fundamental role in preserving and enforcing data integrity. In relational data, integrity constraints such as functional dependencies capture attribute relationships based on syntactic equivalence. In this project, we explore new classes of data dependencies over graphs, possessing topological and syntactic constraints. We go beyond just equality relationships and study a new class of dependencies called Ontology Functional Dependencies (OFDs) that capture attribute relationships based on synonym and hierarchical (is-a) relationships defined in an ontology. We also study new dependencies over temporal graphs to capture topological and attribute constraints that persist over time.
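To make the contrast between syntactic equality and synonym-aware dependencies concrete, the following is a minimal sketch, not the project's implementation. It assumes a toy ontology in which every value maps to a single canonical concept; the names SYNONYMS, canonical, and violations are hypothetical, and the sketch ignores hierarchical (is-a) relationships and values with multiple senses.

```python
from collections import defaultdict

# Hypothetical toy ontology: each surface value maps to one canonical concept.
SYNONYMS = {
    "USA": "United States",
    "United States": "United States",
    "America": "United States",
    "Canada": "Canada",
}

def canonical(value, synonyms):
    """Return the concept a value names, or the value itself if unknown."""
    return synonyms.get(value, value)

def violations(rows, lhs, rhs, synonyms=None):
    """Return the LHS values whose tuples disagree on RHS.

    With synonyms=None this is a plain functional dependency check
    (syntactic equality). With a synonym map, RHS values that name the
    same concept are treated as equal.
    """
    synonyms = synonyms or {}
    groups = defaultdict(set)
    for row in rows:
        groups[row[lhs]].add(canonical(row[rhs], synonyms))
    return sorted(key for key, values in groups.items() if len(values) > 1)

rows = [
    {"country_code": "US", "country": "USA"},
    {"country_code": "US", "country": "United States"},
    {"country_code": "CA", "country": "Canada"},
    {"country_code": "CA", "country": "Mexico"},
]

# Plain FD check: "USA" and "United States" are different strings.
fd_violations = violations(rows, "country_code", "country")
# Synonym-aware check: only the genuine Canada/Mexico conflict remains.
ofd_violations = violations(rows, "country_code", "country", SYNONYMS)
```

Under these assumptions, the plain check flags both country codes, while the synonym-aware check flags only the code whose tuples name different concepts.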
Spatio-temporal Cleaning of Stale Data
Data currency is essential to up-to-date and accurate data analysis. Identifying and repairing stale data goes beyond simply having timestamps, because individual entities each have their own update patterns in both space and time. We develop CurrentClean, a probabilistic system for identifying and cleaning stale values. We introduce a spatio-temporal probabilistic model that captures database update patterns to infer stale values, and propose a set of inference rules that model spatio-temporal update patterns commonly seen in real data. We recommend repairs to clean stale values by learning from past update values over cells.
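As a rough illustration of inferring staleness from update patterns, the sketch below estimates how likely a cell is to have missed an update. This is not CurrentClean's model: it assumes, purely for illustration, that updates to a single cell follow a Poisson process, and it ignores the spatial dimension and the inference rules entirely. The function names and the sample history are hypothetical.

```python
import math

def update_rate(timestamps):
    """Estimate updates per unit time from a cell's sorted past update times."""
    if len(timestamps) < 2:
        raise ValueError("need at least two past updates to estimate a rate")
    span = timestamps[-1] - timestamps[0]
    if span <= 0:
        raise ValueError("last update must be later than the first")
    return (len(timestamps) - 1) / span

def stale_probability(timestamps, now):
    """Probability that at least one update was missed since the last
    recorded one, under the Poisson assumption stated above."""
    rate = update_rate(timestamps)
    elapsed = max(0.0, now - timestamps[-1])
    return 1.0 - math.exp(-rate * elapsed)

history = [0.0, 10.0, 20.0, 30.0]  # hypothetical update times, in hours
p_recent = stale_probability(history, now=31.0)  # shortly after last update
p_old = stale_probability(history, now=60.0)     # long after last update
```

A cell that is normally updated every ten hours but has not changed in thirty receives a much higher staleness score than one updated an hour ago, which is the intuition behind going beyond raw timestamps.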
A Data System for Blood Monitoring
According to the 2020 Auditor General Report on Blood Management and Safety, hospitals are using a variety of information systems to monitor blood inventory, usage, and patient clinical data. This heterogeneity has led to disparate and disjoint systems that hinder data sharing among hospitals, Canadian Blood Services (CBS), and government. Furthermore, these localized views and limited data exchange make it difficult to meet current and predicted demand across hospitals, and to ensure that usage of blood components and products adheres to provincial guidelines. Greater transparency is needed to understand the safety issues and determining factors around blood usage. In this project, we develop a data system to understand how blood components and products are used to treat specific conditions, follow-on prognosis, and clinical outcomes with respect to patient demographics.