
[Sy-cg-global] [Publicity-list] DIMACS Workshop on Data Quality, Data Cleaning and Treatment of Noisy Data

Linda Casals lindac at
Thu Oct 2 10:10:22 EDT 2003

DIMACS Workshop on Data Quality, Data Cleaning and
Treatment of Noisy Data 

November 3 - 4, 2003 
DIMACS Center, CoRE Bldg, Rutgers University, Piscataway, NJ

Tamraparni Dasu, AT&T Labs, tamr at

WWW Information:

The word "data" has taken on a broad meaning in the last five
years. It is no longer a set of numbers or even text. New data
paradigms include data streams characterized by a high rate of
accumulation, web scraped documents and tables, web server logs,
images, audio and video, to name a few. Well-known challenges of
heterogeneity and scale continue to grow as data are integrated from
disparate sources and become more complex in size and content.

While new paradigms have enriched data, the quality of data has
declined considerably. In earlier times, data were collected as a part
of pre-designed experiments where data collection could be monitored
to enforce data quality standards. The data sets themselves were small
enough that even if data collection was unsupervised, the data could
be quickly scrubbed through highly manual methods. Today, neither
monitoring of data collection nor manual scrubbing of data is feasible
due to the sheer size and complexity of the data.

An additional challenge in addressing data quality is the domain
dependence of problems and solutions. Metadata and domain expertise
have to be discovered and incorporated into the solutions, entailing
an extensive interaction with widely scattered experts. This
particular aspect of data quality makes it difficult to find general
one-size-fits-all solutions. However, the process of discovering
metadata and domain expertise can be automated through the development
of appropriate tools and techniques such as data browsing and
exploration, knowledge representation and rule based programming.

Many disciplines have taken piecemeal approaches to data quality. The
areas of process management, statistics, data mining, database research
and metadata coding have all developed their own ad hoc approaches to
solve different pieces of the data quality puzzle. These include
statistical techniques for process monitoring, treatment of incomplete
data and outliers, techniques for monitoring and auditing data
delivery processes, database research for integration, discovery of
functional dependencies and join paths, and languages for data
exchange and metadata representation.

We need an integrated end-to-end approach within a common framework,
where the various disciplines can complement and leverage each other's
strengths. In this workshop, our broad objective is to bring together
experts from different research disciplines to initiate a
comprehensive technical discussion on data quality, data cleaning and
treatment of noisy data. Specifically,

* To provide an overview of the existing research in data quality

* To present data quality as a continuous, end-to-end concept

* To discuss and update the definition of data quality, and to develop
metrics for measuring data quality

* To emphasize data exploration, data browsing and data profiling for 
validating schema specific constraints and identifying aberrations

* To focus on disciplines such as knowledge representation and rule 
based programming for capturing and validating domain specific constraints

* To highlight applications and case studies

* To present research tools and techniques

* To identify research problems in data quality and data cleaning

Workshop Format

The format of the workshop will be a combination of invited talks,
contributed papers and posters. Invited and contributed talks will be 
published in the workshop proceedings.

Workshop Program:

 Monday, November 3, 2003

 9:00 -  9:50  Breakfast and Registration

 9:50 - 10:00  Opening Remarks
               Tamraparni Dasu, AT&T Labs - Research

10:00 - 10:50  Managing Inconsistency in Data Exchange and
               Rene Miller, University of Toronto

10:50 - 11:40  Data Quality and Data Mining in Finance*
               Grace Zhang, Morgan Stanley

11:40 - 12:30  Bellman - A Data Quality Browser
	       Ted Johnson, AT&T Labs

12:30 -  2:00  Lunch

 2:00 -  3:00  The Data Cleaning Problem --
	       Some Key Issues and Practical Approaches
               Ron Pearson, Daniel Baugh Institute for Functional Genomics and
	       Computational Biology, Thomas Jefferson University

 3:00 -  3:50  Pre-processing of Microarray Data
               Dhammika Amaratunga, Javier Cabrera, Nandini Raghavan,
	       Johnson & Johnson, Rutgers, Johnson & Johnson

 3:50 -  4:00  Break

 4:00 -  4:50  Data Quality Challenges in the Analysis of Streaming Data
               S. Muthukrishnan, Rutgers University
 5:00	       Wine & Cheese

 Tuesday, November 4, 2003

 9:30 -  9:50  Breakfast and Registration

 9:50 - 10:00  Opening Remarks

10:00 - 11:00  Data Mining: A Powerful Tool for Data Cleaning
	       Jiawei Han, University of Illinois at Urbana-Champaign

11:00 - 12:00  A $220 Million Success Story*
               Jon Hill, British Telecommunications

12:00 -  1:00  Knowledge Engineering, Rule Based Systems and
	       Data Quality
               Gregg Vesonder and Jon Wright, AT&T Labs - Research

 1:00 -  2:30  Lunch

 2:30 -  3:20  Managing Data Streams*
	       Andrew Hume, AT&T Labs

 3:20 -  4:10  Web page cleaning for web data mining
	       Bing Liu, University of Illinois at Chicago

 4:10 -  4:20  Break

 4:20 -  5:10  Relational Nonlinear FIR Filters
	       R.K. Pearson and M. Gabbouj
	       Daniel Baugh Institute for Functional Genomics and
               Computational Biology, Thomas Jefferson University
	       and Tampere University of Technology

* Tentative titles


Registration Fees: 

(Pre-registration deadline: October 27, 2003) 

Regular rate
Preregister before deadline $120/day
After preregistration deadline $140/day

Reduced Rate*
Preregister before deadline $60/day
After preregistration deadline $70/day

Preregister before deadline $10/day
After preregistration deadline $15/day

DIMACS Postdocs $0

Non-Local Graduate & Undergraduate students
Preregister before deadline $5/day
After preregistration deadline $10/day

Local Graduate & Undergraduate students $0
(Rutgers & Princeton)

DIMACS partner institution employees** $0

DIMACS long-term visitors*** $0

Registration fees will be collected on site by cash, check, or VISA/Mastercard.

Our funding agencies require that we charge a registration fee for the
workshop. Registration fees cover participation in the workshop, all
workshop materials, breakfast, lunch, breaks, and any scheduled social
events (if applicable).

* College/University faculty and employees of non-profit organizations
will automatically receive the reduced rate. Other participants may
apply for a reduction of fees. They should email their request for the
reduced fee to the Workshop Coordinator at
workshop at  Include your name, the Institution you
work for, your job title and a brief explanation of your situation.
All requests for reduced rates must be received before the
preregistration deadline. You will be notified promptly of the
decision.

** Fees for employees of DIMACS partner institutions are waived.
DIMACS partner institutions are: Rutgers University, Princeton
University, AT&T Labs - Research, Bell Labs, NEC Laboratories America
and Telcordia Technologies. Fees for employees of DIMACS affiliate
members Avaya Labs, IBM Research and Microsoft Research are also

*** DIMACS long-term visitors are those in residence at DIMACS for two
or more weeks, inclusive of the workshop dates.


Information on participation, registration, accommodations, and travel
can be found at:

