ACM
SIGKDD Explorations
Newsletter of the ACM Special Interest Group on Knowledge Discovery and Data Mining
December 2000. Volume 2, Issue 2
 

Editorial  by P. S. Bradley, U. Fayyad, S. Sarawagi, K. Shim  (available in PDF and Postscript formats or HTML)

Contributed Articles on "Scalable Data Mining"

Support Vector Machines:  Hype or Hallelujah?
      K. P. Bennett and C. Campbell
(available in PDF and Postscript formats)
ABSTRACT: Support Vector Machines (SVMs) and related kernel methods have become increasingly popular tools for data mining tasks such as classification, regression, and novelty detection.  The goal of this tutorial is to provide an intuitive explanation of SVMs from a geometric perspective.  The classification problem is used to investigate the basic concepts behind SVMs and to examine their strengths and weaknesses from a data mining perspective.  While this overview is not comprehensive, it does provide resources for those interested in further exploring SVMs.
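
For readers who want the algebra next to the geometry, the classification problem the tutorial builds intuition for is conventionally posed as the standard soft-margin optimization below.  This is the textbook formulation rather than a summary of the tutorial's own derivation; w and b define the separating hyperplane, the xi_i are slack variables, and C trades margin width against training error.

    \min_{w,\,b,\,\xi}\; \tfrac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
    \quad \text{subject to} \quad
    y_i\,(w \cdot \phi(x_i) + b) \;\ge\; 1 - \xi_i, \qquad \xi_i \ge 0, \quad i = 1, \dots, n.

Kernel methods enter through the dual of this problem, where the mapped points \phi(x_i) appear only through inner products that can be replaced by a kernel function.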


Mining Frequent Patterns by Pattern Growth:  Methodology and Implications
      J. Han and J. Pei
(available in PDF and Postscript formats)
ABSTRACT:   Mining frequent patterns has been a focused topic in data mining research in recent years, with the development of numerous interesting algorithms for mining associations, correlation, causality, sequential patterns, partial periodicity, constraint-based frequent pattern mining, associative classification, emerging patterns, etc.  Most of the previous studies adopt an Apriori-like, candidate generation-and-test approach.  However, based on our analysis, candidate generation and test may still be expensive, especially when encountering long and numerous patterns.

A new methodology, called frequent pattern growth, which mines frequent patterns without candidate generation, has been developed.  The method adopts a divide-and-conquer philosophy to project and partition databases based on the currently discovered frequent patterns and grow such patterns to longer ones in the projected databases.  Moreover, efficient data structures have been developed for effective database compression and fast in-memory traversal.  Such a methodology may eliminate or substantially reduce the number of candidate sets to be generated and also reduce the size of the database to be iteratively examined, and, therefore, lead to high performance.

In this paper, we provide an overview of this approach and examine its methodology and implications for mining several kinds of frequent patterns, including associations, frequent closed itemsets, max-patterns, sequential patterns, and constraint-based mining of frequent patterns.  We show that frequent pattern growth is efficient at mining large databases and that its further development may lead to scalable mining of many other kinds of patterns as well.
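
To make the projection idea above concrete, the following minimal Python sketch grows frequent itemsets depth-first over projected transaction lists.  It is an illustration of pattern growth by database projection, not the authors' FP-tree-based implementation; the toy database, item names, and min_support threshold are made up for the example.

    from collections import Counter

    def pattern_growth(transactions, min_support, prefix=()):
        # Count item supports in the current (projected) database.
        counts = Counter(item for t in transactions for item in set(t))
        frequent = {}
        for item in sorted(counts):
            support = counts[item]
            if support < min_support:
                continue
            pattern = prefix + (item,)
            frequent[pattern] = support
            # Project on `item`: from each transaction containing it, keep only
            # the items that sort after it, so each itemset is enumerated once.
            projected = [[j for j in t if j > item] for t in transactions if item in t]
            frequent.update(pattern_growth(projected, min_support, pattern))
        return frequent

    db = [["a", "b", "c"], ["a", "c"], ["a", "d"], ["b", "c"]]   # toy database (made up)
    print(pattern_growth(db, min_support=2))
    # {('a',): 3, ('a', 'c'): 2, ('b',): 2, ('b', 'c'): 2, ('c',): 3}

Each recursive call works only on the projected transactions for the current prefix, which is the divide-and-conquer behaviour the abstract describes; the paper's FP-tree adds database compression and fast in-memory traversal on top of this idea.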


The Segment Support Map:  Scalable Mining of Frequent Itemsets
      L. V. S. Lakshmanan, C. K. S. Leung and R. T. Ng
(available in PDF and Postscript formats)
ABSTRACT:   Since its introduction, frequent set mining has been generalized to many forms, including online mining with Carma and constrained mining with CAP.  Regardless of the form, scalability is always an important aspect of the development.  In this paper, we propose a novel structure called the segment support map to facilitate the mining of frequent itemsets in these various forms.  A lightweight structure, the segment support map improves the performance of frequent-set mining algorithms by: (i) obtaining sharper bounds on the support of itemsets, and/or (ii) better exploiting properties of constraints.  Our experimental results show the effectiveness of the segment support map.
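
To picture property (i), a segment support map can be thought of as a per-segment count of every singleton item: summing, over segments, the smallest count among an itemset's members gives an upper bound on that itemset's support that is never looser than the usual bound from global singleton counts.  The Python sketch below is our illustration of that bound only, not the authors' data structure or algorithms; the segmentation and toy data are made up.

    from collections import Counter

    def build_segment_support_map(transactions, num_segments):
        # Partition the database into contiguous segments and count each
        # item's support within every segment.
        size = max(1, -(-len(transactions) // num_segments))  # ceiling division
        segments = [transactions[i:i + size] for i in range(0, len(transactions), size)]
        return [Counter(item for t in seg for item in set(t)) for seg in segments]

    def support_upper_bound(itemset, segment_map):
        # Sum over segments of the smallest per-segment count among the
        # itemset's members: a valid upper bound on the itemset's support.
        return sum(min(seg.get(item, 0) for item in itemset) for seg in segment_map)

    db = [["a", "b"], ["a"], ["b", "c"], ["a", "b", "c"]]   # toy transactions (made up)
    ssm = build_segment_support_map(db, num_segments=2)
    print(support_upper_bound({"a", "b"}, ssm))  # 2, versus 3 from global counts

A candidate itemset whose upper bound already falls below the support threshold can be pruned without any further counting.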


Mining Patterns in Long Sequential Data with Noise
      W. Wang, J. Yang and P. S. Yu
(available in PDF and Postscript formats)
ABSTRACT: Pattern discovery in time series data has been a problem of great importance in many fields, e.g., computational biology, performance analysis, and consumer behavior.  A considerable amount of research has recently been carried out in this area.  The facts that the input data is typically very large and that noise may be present in various forms pose great challenges to the mining process.  We have made several new research advances in this area, and in this paper we present some of them.  We survey new models proposed to address different types of noise, as well as scalable algorithms developed for efficiently mining patterns under each model.


Distributed Data Clustering can be Efficient and Exact
      G. Forman and B. Zhang
(available in PDF and Postscript formats)
ABSTRACT:  Data clustering is one of the fundamental techniques in scientific data analysis and data mining.  It partitions a data set into groups of similar items, as measured by some distance metric.  Over the years, data set sizes have grown rapidly with the exponential growth of computer storage and increasingly automated business and manufacturing processes.  Many of these datasets are geographically distributed across multiple sites, e.g., different sales or warehouse locations.  To cluster such large and distributed data sets, efficient distributed algorithms are called for to reduce the communication overhead, central storage requirements, and computation time, as well as to bring the resources of multiple machines to bear on a given problem as data set sizes scale up.  We describe a technique for parallelizing a family of center-based data clustering algorithms.  The central idea is to communicate only sufficient statistics, yielding linear speed-up with excellent efficiency.  The technique does not involve approximation and may be used orthogonally in conjunction with sampling or aggregation-based methods, such as BIRCH, to lessen the quality degradation of their approximation or to handle larger data sets.  We demonstrate in this paper that even for relatively small problem sizes, it can be more cost effective to cluster the data in place using an exact distributed algorithm than to collect the data in one location for clustering.
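
The "communicate only sufficient statistics" idea is easiest to see for k-means, one member of the center-based family: for a fixed set of centers, each site only needs to report, per center, the count and vector sum of its locally assigned points, and the merged statistics yield exactly the same center update as centralized k-means.  The sketch below illustrates that observation under the k-means assumption and is not the paper's framework; the sites, data, and initial centers are synthetic.

    import numpy as np

    def local_statistics(points, centers):
        # Per-center sufficient statistics computed on one site:
        # (count of assigned points, vector sum of assigned points).
        k, d = centers.shape
        counts, sums = np.zeros(k), np.zeros((k, d))
        assign = np.argmin(((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2), axis=1)
        for j in range(k):
            mask = assign == j
            counts[j] = mask.sum()
            sums[j] = points[mask].sum(axis=0)
        return counts, sums

    def distributed_kmeans_step(sites, centers):
        # One exact k-means iteration using only the merged per-site statistics.
        counts, sums = np.zeros(len(centers)), np.zeros_like(centers)
        for points in sites:              # in practice each site computes this locally
            c, s = local_statistics(points, centers)
            counts += c
            sums += s
        new_centers = centers.copy()
        nonempty = counts > 0
        new_centers[nonempty] = sums[nonempty] / counts[nonempty][:, None]
        return new_centers

    rng = np.random.default_rng(0)
    sites = [rng.normal(loc, 0.5, size=(50, 2)) for loc in (0.0, 5.0)]  # two synthetic "sites"
    centers = np.array([[0.0, 0.0], [1.0, 1.0]])
    for _ in range(5):
        centers = distributed_kmeans_step(sites, centers)
    print(centers)

Only a small count vector and sum matrix per site cross the network each iteration, which is where the low communication overhead and linear speed-up come from.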


Scalable Data Mining with Model Constraints
      M. Garofalakis and R. Rastogi
(available in PDF and Postscript formats)
ABSTRACT:  Data mining can be abstractly defined as the process of extracting concise and interesting models (or patterns) from large amounts of data.  Unfortunately, conventional mining systems provide users with only very restricted mechanisms for specifying models of interest.  As a consequence, the mining process is typically characterized by lack of focus, and users often end up paying computational costs that are inordinately high compared to the specific models/patterns of interest.  Exploiting user-defined model constraints during the mining process can help alleviate this problem and ensure system performance that is commensurate with the level of user focus.  Attaining such performance goals, however, is not straightforward and typically requires the design of novel data mining algorithms that make effective use of the model constraints.  In this paper, we provide an overview of our recent work on scalable, constraint-based algorithms for (1) decision tree construction with size and accuracy constraints for the desired decision tree model, and (2) sequential pattern extraction in the presence of structural, regular expression constraints for the target patterns.  By "pushing" the model constraints inside the mining process, our algorithms give mining users exactly the models that they are looking for, while achieving performance speedups that often exceed one order of magnitude.  Further, our work on sequential pattern mining has uncovered some valuable insights into the tradeoffs that arise when complex constraints that do not subscribe to "nice" properties (like anti-monotonicity) are integrated into the mining process.  We argue that, in general, a cost-based approach (similar to that employed in conventional query optimizers) is needed to explore these tradeoffs in a principled manner and produce effective execution plans for ad hoc mining queries.
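
As a small illustration of why constraints with "nice" properties are easy to push inside the mining loop, consider an anti-monotone constraint on itemsets such as a cap on total item price: once a candidate violates it, every superset must violate it too, so the whole branch can be pruned before any support counting.  The regular-expression constraints studied in the paper lack this property and require the more careful, cost-based treatment the authors describe.  The price table and budget below are made up.

    def violates_budget(itemset, price, budget):
        # Anti-monotone constraint: if an itemset already exceeds the budget,
        # every superset does as well, so its branch can be pruned.
        return sum(price[i] for i in itemset) > budget

    price = {"a": 3, "b": 8, "c": 2}   # hypothetical item prices
    budget = 9
    for candidate in [("a",), ("a", "b"), ("a", "c")]:
        if violates_budget(candidate, price, budget):
            print(candidate, "pruned before counting: no superset can qualify")
        else:
            print(candidate, "kept for support counting")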


An Efficient and Scalable Data Compression Approach to Classification
      C. Diamantini and M. Panti 
(available in PDF and Postscript formats)
ABSTRACT:  Learning algorithms are effective means of inducing predictive models of a phenomenon starting from a set of instances of the phenomenon itself.  However, the impressive growth in the amount of stored data makes scalability of both the learning and classification procedures a compelling requisite for their effective application in data mining tasks, at least as important as the accuracy of the induced model.  In this paper, we examine the features, in terms of both scalability and accuracy, of a stochastic gradient algorithm that minimizes the average misclassification risk of a Labeled Vector Quantizer.  Its performance is compared with that of other related algorithms, often adopted in data mining, in experiments on both artificial and real data.
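
As a concrete reference point for prototype-based classifiers trained by stochastic updates, the snippet below implements the classic LVQ1 rule: pull the nearest labeled prototype toward a training sample when their labels agree, push it away otherwise.  This is the textbook rule, shown only for orientation; it is not the risk-minimization algorithm evaluated in the paper, and the toy data stream and learning rate are made up.

    import numpy as np

    def lvq1_step(prototypes, proto_labels, x, y, lr=0.05):
        # Classic LVQ1 update applied to the single nearest prototype.
        j = np.argmin(((prototypes - x) ** 2).sum(axis=1))
        sign = 1.0 if proto_labels[j] == y else -1.0
        prototypes[j] += sign * lr * (x - prototypes[j])
        return prototypes

    rng = np.random.default_rng(1)
    prototypes = rng.normal(size=(4, 2))          # four labeled codebook vectors (toy setup)
    proto_labels = np.array([0, 0, 1, 1])
    for _ in range(200):                          # stream of toy samples
        y = int(rng.integers(0, 2))
        x = rng.normal(loc=2.0 * y, scale=0.5, size=2)
        prototypes = lvq1_step(prototypes, proto_labels, x, y)
    print(prototypes)

Classification then assigns a new point the label of its nearest prototype, which keeps both training and prediction costs low as the data grow.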


Systems Support for Scalable Data Mining 
      W. A. Maniatty and M. J. Zaki 
(available in PDF and Postscript formats)
ABSTRACT:  The current generation of data mining tools has limited capacity and performance, since these tools tend to be sequential.  This paper explores a migration path out of this bottleneck by considering an integrated hardware and software approach to parallelizing data mining.  Our analysis shows that parallel data mining solutions require the following components:  parallel data mining algorithms, parallel and distributed databases, parallel file systems, parallel I/O, tertiary storage, management of online data, support for heterogeneous data representations, security, quality of service, and pricing metrics.  State-of-the-art technology in these areas is surveyed with an eye towards an integration strategy leading to a complete solution.


Mining Frequent Patterns with Counting Inference
      Y. Bastide, R. Taouil, N. Pasquier, G. Stumme, and L. Lakhal 
(available in PDF and Postscript formats)
ABSTRACT:  In this paper, we propose the algorithm Pascal, which introduces a novel optimization of the well-known algorithm Apriori.  This optimization is based on a new strategy called pattern counting inference that relies on the concept of key patterns.  We show that the support of frequent non-key patterns can be inferred from frequent key patterns without accessing the database.  Experiments comparing Pascal with the three algorithms Apriori, Close, and Max-Miner show that Pascal is among the most efficient algorithms for mining frequent patterns.
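
Our reading of the counting-inference rule is the following: a pattern is a key pattern when its support is strictly smaller than the support of every proper subset; for a pattern known to be non-key, its support equals the minimum support among its immediate subsets, so it never has to be counted against the database.  The snippet below only illustrates that inference step under this reading; deciding which candidates are key still requires counting, and the toy supports are made up.

    def inferred_support(pattern, subset_support):
        # For a pattern known to be non-key, infer its support as the minimum
        # over its immediate (size - 1) subsets -- no database access needed.
        immediate_subsets = [tuple(x for x in pattern if x != item) for item in pattern]
        return min(subset_support[s] for s in immediate_subsets)

    counted = {("a", "b"): 2, ("a", "c"): 3, ("b", "c"): 2}   # hypothetical counts
    print(inferred_support(("a", "b", "c"), counted))          # -> 2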


Measuring Lift Quality in Database Marketing
      G. Piatetsky-Shapiro and  S. Steingold 
(available in PDF and Postscript formats)
ABSTRACT:  Database marketers often select predictive models based on the lift in the top 5%, 10%, or 20% of the list.  However, different models may be better at different thresholds.  Absent a good cost function, or when multiple cost functions are possible, we want a measure that helps to compare models by looking at the entire lift curve.  In this paper, we propose such a measure, called L-quality, which indicates how close the lift curve is to the random and perfect models.  We also examine the practical issues of computing L-quality from the discrete quantiles available in a typical lift table.
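
The idea of locating a whole lift curve between the random and perfect models can be illustrated with a simple area normalization: roughly 0 for a model no better than random ranking, 1 for a model that puts every responder at the top of the list.  This sketch shows one natural way to do that normalization; the paper's L-quality is defined from the discrete lift table and may differ in detail, and the scores and targets below are made up.

    import numpy as np

    def cumulative_gains(scores, targets):
        # Fraction of all responders captured in the top-scored portion of the list.
        order = np.argsort(-np.asarray(scores))
        hits = np.cumsum(np.asarray(targets)[order])
        return hits / hits[-1]

    def lift_quality(scores, targets):
        # Area between the model's gains curve and the random baseline,
        # normalized by the same area for the perfect model.
        targets = np.asarray(targets)
        n = len(targets)
        depth = np.arange(1, n + 1) / n                       # fraction of list contacted
        model = cumulative_gains(scores, targets)
        random_gains = depth                                  # random targeting
        perfect = np.minimum(depth * n / targets.sum(), 1.0)  # all responders ranked first
        return (model - random_gains).sum() / (perfect - random_gains).sum()

    scores  = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]        # made-up model scores
    targets = [1,   1,   0,   1,   0,   0,   0,   0]          # made-up responses
    print(round(lift_quality(scores, targets), 3))            # ~0.867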


The UCI KDD Archive of Large Data Sets for Data Mining Research and Experimentation
      S. D. Bay, D. Kibler, M. J. Pazzani, and P. Smyth 
(available in PDF and Postscript formats)
ABSTRACT:  Advances in data collection and storage have allowed organizations to create massive, complex, and heterogeneous databases, which have stymied traditional methods of data analysis.  This has led to the development of new analytical tools that often combine techniques from a variety of fields, such as statistics, computer science, and mathematics, to extract meaningful knowledge from the data.  To support research in this area, UC Irvine has created the UCI Knowledge Discovery in Databases (KDD) Archive (http://kdd.ics.uci.edu/), a new online archive of large and complex datasets that encompasses a wide variety of data types, analysis tasks, and application areas.  This article describes the objectives and philosophy of the UCI KDD Archive.  We draw parallels with the development of the UCI Machine Learning Repository and its effect on the Machine Learning community.


Reports from KDD-2000

KDD-Cup 2000 Organizer's Report:  Peeling the Onion
      R. Kohavi, C. E. Brodley, B. Frasca, L. Mason, and Z. Zheng
(available in PDF and Postscript formats)
ABSTRACT: We describe KDD-Cup 2000, the yearly competition in data mining.  For the first time the Cup included insight problems in addition to prediction problems, thus posing new challenges in both the knowledge discovery and the evaluation criteria, and highlighting the need to "peel the onion" and drill deeper into the reasons for the initial patterns found.  We chronicle the data generation phase starting from the collection at the site through its conversion to a star schema in a warehouse through data cleansing, data obfuscation for privacy protection, and data aggregation.  We describe the information given to the participants, including the questions, site structure, the marketing calendar, and the data schema.  Finally, we discuss interesting insights, common mistakes, and lessons learned.   Three winners were announced and they describe their own experiences and lessons in the pages following this paper.

Question 1 Winner's Report
Amdocs: A. Inger, N. Vatnik, S. Rosset, and E. Neumann
Question 2 Winner's Report
Salford Systems: D. Steinberg, N. S. Cardell, and M. Golovnya
Question 3 Winner's Report
Salford Systems: D. Steinberg, R. Carson, D. Agarwal, J. Yan, and A. Rupp
Question 4 Winner's Report
e-steam, inc.: R. Kustra, J. A. Picazo, B. E. Popescu
Question 5 Winner's Report
Amdocs: E. Neumann, N. Vatnik, S. Rosset, M. Duenias, I. Sassoon, A. Inger

Text Mining as Integration of Several Related Research Areas:  Report on KDD'2000 Workshop on Text Mining 
      M. Grobelnik, D. Mladenic, and N. Milic-Frayling 
(available in PDF and Postscript formats)
ABSTRACT:  In this paper we give an overview of the KDD'2000 Workshop on Text Mining that was held in Boston, MA on August 20, 2000. We report in detail on the research issues covered in the papers presented at the Workshop and during the group discussion held in the final session of the Workshop. 


Report on MDM/KDD2000:  The 1st International Workshop on Multimedia  Data Mining
      S. J. Simoff and O. R. Zaïane 
(available in PDF and Postscript formats)
ABSTRACT:  This short report on the First International Workshop on Multimedia Data Mining summarizes the activities that took place during the workshop and gives pointers to resources where more information can be found. 


WEBKDD 2000 -- Web Mining for E-Commerce 
M. Spiliopoulou, J. Srivastava, R. Kohavi, and B. M. Masand 
(available in PDF and Postscript formats)
ABSTRACT:  In this paper, we provide a summary of the WEBKDD 2000 workshop, whose theme was "Web Mining for E-Commerce". This workshop was held in conjunction with the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2000).


Report from the Workshop on Distributed and Parallel Knowledge Discovery, ACM SIGKDD-2000 
H. Kargupta, J. Ghosh, V. Kumar, and Z. Obradovic
(available in PDF and Postscript formats)
ABSTRACT:  This report presents a summary of the activities that took place in the 2000 ACM SIGKDD workshop on Distributed and Parallel Knowledge Discovery.


Postprocessing in Machine Learning and Data Mining 
I. Bruha and A. Famili 
(available in PDF and Postscript formats)
ABSTRACT:  This article surveys the contents of the workshop "Post-Processing in Machine Learning and Data Mining:  Interpretation, Visualization, Integration, and Related Topics" held within KDD-2000:  The Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, MA, USA, 20-23 August 2000.  The corresponding website is at www.acm.org/sigkdd/kdd2000.

First, this survey paper introduces the state of the art of the workshop topics, emphasizing that postprocessing forms a significant component in Knowledge Discovery in Databases (KDD).  Next, the article reports on the contents, analysis, discussion, and other aspects of the workshop.  Afterwards, we survey all workshop papers.  They can be found at (and downloaded from) www.cas.mcmaster.ca/~bruha/kdd2000/kddrep.html.

The authors of this report worked as the organizers of the workshop; the programme committee was formed by three additional researchers in this field.


Report on the KDD-2000 Panel "Personalization and Data Mining:  Exploring the Synergies"
A. Tuzhilin
(available in PDF and Postscript formats)


Events and Announcements
(available in PDF and Postscript formats)

Send comments and suggestions to sunita@it.iitb.ernet.in