DIMACS Workshop on Data Quality, Data Cleaning and Treatment of Noisy Data

November 3 - 4, 2003
DIMACS Center, CoRE Bldg, Rutgers University, Piscataway, NJ

Organizer:
  Parni Dasu, AT&T Labs, tamr at research.att.com

WWW Information:
  http://dimacs.rutgers.edu/Workshops/DataCleaning/

*****************************************************

The word "data" has taken on a broad meaning in the last five years. It is no longer a set of numbers or even text. New data paradigms include data streams characterized by a high rate of accumulation, web scraped documents and tables, web server logs, images, audio and video, to name a few. Well-known challenges of heterogeneity and scale continue to grow as data are integrated from disparate sources and become more complex in size and content.

While new paradigms have enriched data, the quality of data has declined considerably. In earlier times, data were collected as a part of pre-designed experiments where data collection could be monitored to enforce data quality standards. The data sets themselves were small enough that even if data collection was unsupervised, the data could be quickly scrubbed through highly manual methods. Today, neither monitoring of data collection nor manual scrubbing of data is feasible due to the sheer size and complexity of the data.

An additional challenge in addressing data quality is the domain dependence of problems and solutions. Metadata and domain expertise have to be discovered and incorporated into the solutions, entailing an extensive interaction with widely scattered experts. This particular aspect of data quality makes it difficult to find general one-size-fits-all solutions. However, the process of discovering metadata and domain expertise can be automated through the development of appropriate tools and techniques such as data browsing and exploration, knowledge representation and rule based programming.

Many disciplines have taken piecemeal approaches to data quality.
The areas of process management, statistics, data mining, database research and metadata coding have all developed their own ad hoc approaches to solve different pieces of the data quality puzzle. These include statistical techniques for process monitoring, treatment of incomplete data and outliers, techniques for monitoring and auditing data delivery processes, database research for integration, discovery of functional dependencies and join paths, and languages for data exchange and metadata representation. We need an integrated end-to-end approach within a common framework, where the various disciplines can complement and leverage each other's strengths.

In this workshop, our broad objective is to bring together experts from different research disciplines to initiate a comprehensive technical discussion on data quality, data cleaning and treatment of noisy data. Specifically,

  * To provide an overview of the existing research in data quality
  * To present data quality as a continuous, end-to-end concept
  * To discuss and update the definition of data quality, and to develop metrics for measuring data quality
  * To emphasize data exploration, data browsing and data profiling for validating schema specific constraints and identifying aberrations
  * To focus on disciplines such as knowledge representation and rule based programming for capturing and validating domain specific constraints
  * To highlight applications and case studies
  * To present research tools and techniques
  * To identify research problems in data quality and data cleaning

Workshop Format

The format of the workshop will be a combination of invited talks, contributed papers and posters. Invited and contributed talks will be published in the workshop proceedings.

***********************************************************************

Workshop Program:

Monday, November 3, 2003

 9:00 -  9:50  Breakfast and Registration

 9:50 - 10:00  Opening Remarks
               Tamraparni Dasu, AT&T Labs - Research

10:00 - 10:50  Managing Inconsistency in Data Exchange and Integration
               Rene Miller, University of Toronto

10:50 - 11:40  Data Quality and Data Mining in Finance*
               Grace Zhang, Morgan Stanley

11:40 - 12:30  Bellman - A Data Quality Browser
               Ted Johnson, AT&T Labs

12:30 -  2:00  Lunch

 2:00 -  3:00  The Data Cleaning Problem -- Some Key Issues and
               Practical Approaches
               Ron Pearson, Daniel Baugh Institute for Functional
               Genomics and Computational Biology,
               Thomas Jefferson University

 3:00 -  3:50  Pre-processing of Microarray Data
               Dhammika Amaratunga (Johnson & Johnson),
               Javier Cabrera (Rutgers),
               Nandini Raghavan (Johnson & Johnson)

 3:50 -  4:00  Break

 4:00 -  4:50  Data Quality Challenges in the Analysis of Streaming Data
               S. Muthukrishnan, Rutgers University

 5:00          Wine & Cheese

Tuesday, November 4, 2003

 9:30 -  9:50  Breakfast and Registration

 9:50 - 10:00  Opening Remarks

10:00 - 11:00  Data Mining: A Powerful Tool for Data Cleaning
               Jiawei Han, University of Illinois at Urbana-Champaign

11:00 - 12:00  A $220 Million Success Story*
               Jon Hill, British Telecommunications

12:00 -  1:00  Knowledge Engineering, Rule Based Systems and Data Quality
               Gregg Vesonder and Jon Wright, AT&T Labs - Research

 1:00 -  2:30  Lunch

 2:30 -  3:20  Managing Data Streams*
               Andrew Hume, AT&T Labs

 3:20 -  4:10  Web page cleaning for web data mining
               Bing Liu, University of Illinois at Chicago

 4:10 -  4:20  Break

 4:20 -  5:10  Relational Nonlinear FIR Filters
               R.K. Pearson and M. Gabbouj
               Daniel Baugh Institute for Functional Genomics and
               Computational Biology, Thomas Jefferson University
               and Tampere University of Technology

* Tentative titles

**************************************************************

Registration Fees:

(Pre-registration deadline: October 27, 2003)

Regular rate
  Preregister before deadline                  $120/day
  After preregistration deadline               $140/day

Reduced Rate*
  Preregister before deadline                  $60/day
  After preregistration deadline               $70/day

Postdocs
  Preregister before deadline                  $10/day
  After preregistration deadline               $15/day

DIMACS Postdocs                                $0

Non-Local Graduate & Undergraduate students
  Preregister before deadline                  $5/day
  After preregistration deadline               $10/day

Local Graduate & Undergraduate students        $0
(Rutgers & Princeton)

DIMACS partner institution employees**         $0

DIMACS long-term visitors***                   $0

Registration fees will be collected on site; cash, check, VISA and Mastercard are accepted.

Our funding agencies require that we charge a registration fee for the workshop. Registration fees cover participation in the workshop, all workshop materials, breakfast, lunch, breaks, and any scheduled social events (if applicable).

* College/University faculty and employees of non-profit organizations will automatically receive the reduced rate. Other participants may apply for a reduction of fees. They should email their request for the reduced fee to the Workshop Coordinator at workshop at dimacs.rutgers.edu. Include your name, the institution you work for, your job title and a brief explanation of your situation. All requests for reduced rates must be received before the preregistration deadline. You will be notified promptly of the decision.

** Fees for employees of DIMACS partner institutions are waived. DIMACS partner institutions are: Rutgers University, Princeton University, AT&T Labs - Research, Bell Labs, NEC Laboratories America and Telcordia Technologies.
Fees for employees of DIMACS affiliate members Avaya Labs, IBM Research and Microsoft Research are also waived.

*** DIMACS long-term visitors are those in residence at DIMACS for two or more weeks, inclusive of the workshop dates.

***************************************************************

Information on participation, registration, accommodations, and travel can be found at:

  http://dimacs.rutgers.edu/Workshops/DataCleaning/

**PLEASE BE SURE TO PRE-REGISTER EARLY**

***************************************************************