[CCICADA-announce] DIMACS/CCICADA Workshop on Data Quality Metrics

Linda Casals lindac at
Mon Jan 10 11:46:14 EST 2011

DIMACS/CCICADA Workshop on Data Quality Metrics
  February 3 - 4, 2011
  DIMACS Center, CoRE Building, Rutgers University

    Tamraparni Dasu, AT&T Research, tamr at
    Lukasz Golab, AT&T Research, lgolab at

Presented under the auspices of The Homeland Security Center for
Command, Control, and Interoperability Center for Advanced Data
Analysis (CCICADA).


Large-scale databases and data warehouses often suffer from data
quality problems, caused by the data-collecting mechanism (e.g.,
inaccurate sensor readings), by incorporating inconsistent data
sources over time, or by a lack of understanding of the semantics of
the data. Before "cleaning" the data, it is important to understand
the extent of these problems. In the simplest case, we can construct a
set of data quality rules that determine whether a data record is
"clean" or "dirty". However, these rules may be complex and
domain-specific. Furthermore, we may not be able to judge data quality
by examining individual records; each record may be correct on its
own, but inconsistent with other records. Thus, measuring the quality
of a data set or a database is a challenging task.

Data quality metrics have been the focus of research in the database
and statistics communities, resulting in complementary approaches. In
general, database metrics are motivated by constraint satisfaction
while statistical metrics quantify departure from distributional and
model assumptions. In this workshop, we explore both types of
approaches, with an emphasis on:
  - recent advances in research,
  - role of data quality metrics in data cleaning,
  - applications & case studies, and
  - tools.

Relevant topics include the following.

    * Factors that affect data quality:
      (1) Database perspective: completeness, consistency among data 
          records, timeliness, ease of use, redundancy.
      (2) Statistical perspective: assumptions about censoring,
          truncation, data gathering (e.g., sampling method), data
          distributions and models.
    * Techniques for measuring data quality:
      (1) Database perspective: business rules, logical constraints,
          lineage, impact on applications, business impact.
      (2) Statistical perspective: residuals, distributional shifts,
          anomalies and outliers.
    * Case studies: financial, sensor and Internet data, time series 
      and data stream quality, data mining quality, data quality for 
      real-time alerting applications. 


 For details and to register see:


 Information on participation, registration, accommodations, and travel can be found at:
