DIMACS/CCICADA Workshop on Data Quality Metrics

February 3 - 4, 2011
DIMACS Center, CoRE Building, Rutgers University

Organizers:
  Tamraparni Dasu, AT&T Research, tamr at research.att.com
  Lukasz Golab, AT&T Research, lgolab at research.att.com

Presented under the auspices of The Homeland Security Center for
Command, Control, and Interoperability Center for Advanced Data
Analysis (CCICADA).

*********************************************************************

Announcement:

Large-scale databases and data warehouses often suffer from data
quality problems, caused by the data-collecting mechanism (e.g.,
inaccurate sensor readings), by incorporating inconsistent data
sources over time, or by a lack of understanding of the semantics of
the data. Before "cleaning" the data, it is important to understand
the extent of these problems. In the simplest case, we can construct a
set of data quality rules that determine whether a data record is
"clean" or "dirty". However, these rules may be complex and
domain-specific. Furthermore, we may not be able to judge data quality
by examining individual records; each record may be correct on its
own, but inconsistent with other records. Thus, measuring the quality
of a data set or a database is a challenging task.

Data quality metrics have been the focus of research in the database
and statistics communities, resulting in complementary approaches. In
general, database metrics are motivated by constraint satisfaction
while statistical metrics quantify departure from distributional and
model assumptions. In this workshop, we explore both types of
approaches, with an emphasis on:

  - recent advances in research,
  - role of data quality metrics in data cleaning,
  - applications & case studies, and
  - tools.

Relevant topics include the following.

* Factors that affect data quality:
  (1) Database perspective: completeness, consistency among data
      records, timeliness, ease of use, redundancy.
  (2) Statistical perspective: assumptions about censoring,
      truncation, data gathering (e.g., sampling method), data
      distributions and models.

* Techniques for measuring data quality:
  (1) Database perspective: business rules, logical constraints,
      lineage, impact on applications, business impact.
  (2) Statistical perspective: residuals, distributional shifts,
      anomalies and outliers.

* Case studies: financial, sensor and Internet data, time series and
  data stream quality, data mining quality, data quality for real-time
  alerting applications.

*********************************************************************

Registration:

Information on participation, registration, accommodations, and travel
can be found at:

  http://dimacs.rutgers.edu/Workshops/Metrics/

**PLEASE BE SURE TO PRE-REGISTER EARLY**

*********************************************************************