ACM
SIGKDD Explorations
Newsletter of the ACM Special Interest Group on Knowledge Discovery and Data Mining
June 2002. Volume 4, Issue 1
 

Editorial by Roberto J. Bayardo Jr.  (available in PDF and Postscript formats)

Contributed Articles on Constraints in Data Mining

Classification Trees for Problems with Monotonicity Constraints
      R. Potharst and A. J. Feelders
(available in PDF and Postscript formats)
ABSTRACT:  In classification problems with ordinal attributes, the class attribute should very often increase with some or all of the explanatory attributes. Such problems are called classification problems with monotonicity constraints. Classical decision tree algorithms such as CART or C4.5 generally do not produce monotone trees, even if the dataset is completely monotone. This paper surveys the methods proposed so far for generating decision trees that satisfy monotonicity constraints. A distinction is made between methods that work only for monotone datasets and methods that work for monotone and non-monotone datasets alike.
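
The monotonicity property targeted by the survey can be stated directly: whenever one example dominates another on every attribute, its class label must be at least as large. A minimal, hypothetical sketch of that check (function and data names are illustrative, not from the paper):

```python
# Sketch: test whether a labelled dataset is monotone, i.e. whenever
# x <= y on every attribute, the class of x is <= the class of y.

def dominates(x, y):
    """True if every attribute value of x is <= the matching value of y."""
    return all(a <= b for a, b in zip(x, y))

def is_monotone(examples):
    """examples: list of (attribute_tuple, ordinal_class_label) pairs."""
    return all(cx <= cy
               for x, cx in examples
               for y, cy in examples
               if dominates(x, y))

data = [((1, 1), 0), ((1, 2), 1), ((2, 2), 1)]
print(is_monotone(data))                    # True
print(is_monotone(data + [((3, 3), 0)]))    # False: (3,3) dominates (1,2) but has a lower class
```

A tree is monotone in the same sense when the classifier it induces satisfies this property over the whole attribute space.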


SPARTAN:  Using Constrained Models for Guaranteed Error Semantic Compression
     S. Babu, M. Garofalakis and R. Rastogi
(available in PDF and Postscript formats)
ABSTRACT:   While a variety of lossy compression schemes have been developed for certain forms of digital data (e.g., images, audio, video), the area of lossy compression techniques for arbitrary data tables has been left relatively unexplored. Nevertheless, such techniques are clearly motivated by the ever-increasing data collection rates of modern enterprises and the need for effective, guaranteed-quality approximate answers to queries over massive relational data sets.

In this paper, we propose SPARTAN, a system that takes advantage of attribute semantics and data-mining models to perform lossy compression of massive data tables. SPARTAN is based on the novel idea of exploiting predictive data correlations and prescribed error-tolerance constraints for individual attributes to construct concise and accurate Classification and Regression Tree (CaRT) models for entire columns of a table.  More precisely, SPARTAN selects a certain subset of attributes (referred to as predicted attributes) for which no values are explicitly stored in the compressed table; instead, concise error-constrained CaRTs that predict these values (within the prescribed error tolerances) are maintained.  To restrict the huge search space of possible CaRT predictors, SPARTAN uses a Bayesian network structure to guide the selection of CaRT models that minimize the overall storage requirement, based on the prediction and materialization costs for each attribute.  SPARTAN's CaRT-building algorithms employ novel integrated pruning strategies that take advantage of the given error constraints on individual attributes to minimize the computational effort involved.  Our experimentation with several real-life data sets offers convincing evidence of the effectiveness of SPARTAN's model-based approach -- SPARTAN is able to consistently yield substantially better compression ratios than existing semantic or syntactic compression tools (e.g., gzip) while utilizing only small samples of the data for model inference.
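
The core storage trade described above, a per-column predictor plus a prescribed error bound, can be illustrated with a toy sketch (the names and the linear predictor are hypothetical; SPARTAN itself learns CaRT models and selects them via a Bayesian network):

```python
# Sketch: store a predictor for a numeric column plus an explicit list of
# "outlier" rows whose prediction error exceeds the tolerance; every
# reconstructed value is then within the tolerance of the true value.

def compress_column(rows, column, predictor, tolerance):
    """Return the outlier values that must still be stored explicitly."""
    return {i: row[column] for i, row in enumerate(rows)
            if abs(predictor(row) - row[column]) > tolerance}

def decompress_value(rows, column, predictor, outliers, i):
    """Reconstruct row i's value: exact for outliers, predicted otherwise."""
    return outliers.get(i, predictor(rows[i]))

rows = [{'qty': 1, 'total': 10.0},
        {'qty': 2, 'total': 20.4},
        {'qty': 3, 'total': 35.0}]
predict = lambda r: 10.0 * r['qty']          # assumed correlation: total ~ 10 * qty

outliers = compress_column(rows, 'total', predict, tolerance=0.5)
print(outliers)                              # {2: 35.0} -- only one value stored explicitly
print(decompress_value(rows, 'total', predict, outliers, 1))  # 20.0, within 0.5 of the true 20.4
```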


Learning Missing Values from Summary Constraints
      X. Wu and D. Barbará
(available in PDF and Postscript formats)
ABSTRACT:   Real-world data sets often contain errors and inconsistencies. Although this is a very important problem, it has received relatively little attention in the research community. In this paper we examine the problem of learning missing values when some summary information is available. We use linear algebra and constraint programming techniques to learn the missing values from summary information that is known a priori or derived from the raw data. We reconstruct the missing values by different methods in three scenarios: ideal-constrained, under-constrained, and over-constrained. Furthermore, for a range query involving missing values, we also give lower and upper bounds on the values using constraint programming techniques. We believe that the theory of linear algebra together with constraint programming constitutes a sound basis for learning missing values when summary information is available.
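
In the ideal-constrained scenario the idea reduces to solving a linear system; when the summaries are row and column sums, a simple propagation already recovers every cell that a single constraint pins down. A hypothetical pure-Python sketch (not the authors' implementation):

```python
# Sketch: fill missing cells (None) of a table whose row sums and column
# sums are known, by repeatedly solving any constraint with one unknown.

def fill_missing(table, row_sums, col_sums):
    """table: list of rows; missing cells are None. Mutates and returns it."""
    changed = True
    while changed:
        changed = False
        # A row with exactly one missing cell determines that cell.
        for i, row in enumerate(table):
            holes = [j for j, v in enumerate(row) if v is None]
            if len(holes) == 1:
                row[holes[0]] = row_sums[i] - sum(v for v in row if v is not None)
                changed = True
        # Likewise for columns.
        for j in range(len(table[0])):
            col = [table[i][j] for i in range(len(table))]
            holes = [i for i, v in enumerate(col) if v is None]
            if len(holes) == 1:
                table[holes[0]][j] = col_sums[j] - sum(v for v in col if v is not None)
                changed = True
    return table

t = [[1, None],
     [None, 4]]
print(fill_missing(t, row_sums=[3, 7], col_sums=[4, 6]))   # [[1, 2], [3, 4]]
```

The under-constrained and over-constrained cases are where the paper's constraint programming machinery takes over.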


Constrained Frequent Pattern Mining:  A Pattern-Growth View
      J. Pei and J. Han
(available in PDF and Postscript formats)
ABSTRACT: It has been well recognized that frequent pattern mining plays an essential role in many important data mining tasks. However, frequent pattern mining often generates a very large number of patterns and rules, which reduces not only the efficiency but also the effectiveness of mining. Recent work has highlighted the importance of the constraint-based mining paradigm in the context of mining frequent itemsets, associations, correlations, sequential patterns, and many other interesting patterns in large databases.

Recently, we developed efficient pattern-growth methods for frequent pattern mining. Interestingly, pattern-growth methods are not only efficient but also effective in mining with various constraints. Many tough constraints which cannot be handled by previous methods can be pushed deep into the pattern-growth mining process. In this paper, we overview the principles of pattern-growth methods for constrained frequent pattern mining and sequential pattern mining. Moreover, we explore the power of pattern-growth methods towards mining with tough constraints and highlight some interesting open problems.
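
The benefit of pushing a constraint into the growth process, rather than filtering afterwards, shows up even in a toy depth-first enumeration with an anti-monotone constraint (all names here are illustrative; the paper's methods operate on FP-tree style projected databases):

```python
# Sketch: depth-first frequent itemset enumeration that pushes an
# anti-monotone constraint -- sum of item prices <= budget -- into the
# search, pruning every extension that violates it.

def mine(transactions, min_support, price, budget):
    items = sorted({i for t in transactions for i in t})

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    results = []

    def grow(prefix, cost, start):
        for k in range(start, len(items)):
            new_cost = cost + price[items[k]]
            if new_cost > budget:
                continue  # anti-monotone: no superset of this extension qualifies
            candidate = prefix | {items[k]}
            if support(candidate) >= min_support:
                results.append(frozenset(candidate))
                grow(candidate, new_cost, k + 1)

    grow(frozenset(), 0, 0)
    return results

txns = [{'a', 'b'}, {'a', 'b', 'c'}, {'a', 'c'}]
found = mine(txns, min_support=2, price={'a': 2, 'b': 3, 'c': 5}, budget=5)
print(sorted(''.join(sorted(s)) for s in found))   # ['a', 'ab', 'b', 'c']
```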


Exploiting Succinct Constraints using FP-trees
      C. K.-S. Leung, L. V. S. Lakshmanan, and R. T. Ng
(available in PDF and Postscript formats)
ABSTRACT:  Since its introduction, frequent-set mining has been generalized in many directions, including constrained data mining.  The use of constraints permits user focus and guidance, enables user exploration and control, and leads to effective pruning of the search space and efficient mining of frequent itemsets.  In this paper, we focus on the use of succinct constraints.  In particular, we propose a novel algorithm called FPS to mine frequent itemsets satisfying succinct constraints.  The FPS algorithm avoids the generate-and-test paradigm by exploiting succinctness properties of the constraints in an FP-tree based framework.  In terms of functionality, our algorithm is capable of handling not just succinct aggregate constraints, but any succinct constraint in general; moreover, it handles multiple succinct constraints. In terms of performance, our algorithm is more efficient and effective than existing FP-tree based constrained frequent-set mining algorithms.


Is Pushing Constraints Deeply into the Mining Algorithms Really What We Want?
      J. Hipp and U. Güntzer
(available in PDF and Postscript formats)
ABSTRACT:  The common approach to exploiting mining constraints is to push them deeply into the mining algorithms. In our paper we argue that this approach is based on an understanding of KDD that is no longer up-to-date. In fact, today KDD is seen as a human-centered, highly interactive and iterative process. Blindly enforcing constraints already during the mining runs neglects the process character of KDD and therefore is no longer state of the art. Constraints can make a single algorithm run faster, but in fact we are still far from response times that would allow true interactivity in KDD. In addition, we pay the price of repeated mining runs and moreover risk reducing data mining to some kind of hypothesis testing. Taking all of the above into consideration, we propose to do exactly the contrary of constrained mining: we accept an initial, (nearly) unconstrained, and costly mining run, but instead of a sequence of subsequent and still expensive constrained mining runs, we answer all further mining queries from this initial result set. Whereas this is straightforward for constraints that can be implemented as filters on the result set, things get more complicated when we restrict the underlying mining data. In practice such constraints are very important, e.g., generating rules only for certain days of the week, or only for families, singles, or male or female customers. We show how to postpone such row-restriction constraints on the transactions from rule generation to rule retrieval from the initial result set.
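
A minimal sketch of the proposed workflow (the rule representation is hypothetical): mine once, cache the result set, and answer later constrained queries by filtering rather than re-mining:

```python
# Sketch: answer constrained queries against a cached, (nearly)
# unconstrained mining result instead of re-running the miner.

cached_rules = [
    # (antecedent, consequent, confidence) -- illustrative association rules
    ({'beer'}, {'chips'}, 0.80),
    ({'milk'}, {'bread'}, 0.60),
    ({'beer', 'milk'}, {'chips'}, 0.90),
]

def query(rules, must_contain=None, min_conf=0.0):
    """Filter-style constraints, applied at retrieval time."""
    return [r for r in rules
            if r[2] >= min_conf
            and (must_contain is None or must_contain in (r[0] | r[1]))]

print(len(query(cached_rules, must_contain='beer', min_conf=0.7)))   # 2
```

Row-restriction constraints, i.e. rules about a subset of the transactions, are exactly the case this simple filtering does not cover directly, which is the gap the paper addresses.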


Discovery in Multi-Attribute Data with User-defined Constraints
      C.-S. Perng, H. Wang, S. Ma, and J. L. Hellerstein
(available in PDF and Postscript formats)
ABSTRACT:  There has been a growing interest in mining frequent itemsets in relational data with multiple attributes. A key step in this approach is to select a set of attributes that group data into transactions and a separate set of attributes that labels data into items. Unsupervised and unrestricted mining, however, is stymied by the combinatorial complexity and the quantity of patterns as the number of attributes grows. In this paper, we focus on leveraging the semantics of the underlying data for mining frequent itemsets. For instance, there are usually taxonomies in the data schema and functional dependencies among the attributes. Domain knowledge and user preferences often have the potential to significantly reduce the exponentially growing mining space.  These observations motivate the design of a user-directed data mining framework that allows such domain knowledge to guide the mining process and control the mining strategy. We show examples of tremendous reduction in computation by using domain knowledge in mining relational data with multiple attributes.


Contributed Articles

Why so many clustering algorithms -- A Position Paper
      V. Estivill-Castro
(available in PDF and Postscript formats)
ABSTRACT:  We argue that there are many clustering algorithms because the notion of "cluster" cannot be precisely defined. Clustering is in the eye of the beholder, and as such, researchers have proposed many induction principles and models whose corresponding optimization problems can only be approximately solved, by an even larger number of algorithms. Therefore, comparing clustering algorithms must rest on a careful understanding of the induction principles involved.


News, Events and Announcements

(available in PDF and Postscript formats)

SIGKDD Explorations home page
Send comments and suggestions to sunita@it.iitb.ernet.in