June 2002. Volume 4, Issue 1
Editorial by Roberto J. Bayardo Jr.
Contributed Articles on Constraints in Data Mining
Classification Trees for Problems with Monotonicity Constraints
R. Potharst and A. J. Feelders
ABSTRACT: In classification problems with ordinal attributes, the class attribute should very often increase with some or all of the explanatory attributes. These are called classification problems with monotonicity constraints. Classical decision tree algorithms such as CART or C4.5 generally do not produce monotone trees, even if the dataset is completely monotone. This paper surveys the methods that have so far been proposed for generating decision trees that satisfy monotonicity constraints. A distinction is made between methods that work only for monotone datasets and methods that work for monotone and non-monotone datasets alike.
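As a hedged illustration of the constraint itself (the helper below is hypothetical, not taken from the paper): a labeled dataset is monotone when no example that dominates another on every ordinal attribute carries a smaller class label. A brute-force check in Python:

    # Sketch: test whether a labeled dataset is monotone, i.e. whether
    # x <= x' on every ordinal attribute implies label(x) <= label(x').
    def is_monotone(X, y):
        """X: list of attribute tuples, y: list of ordinal class labels."""
        for xi, yi in zip(X, y):
            for xj, yj in zip(X, y):
                dominated = all(a <= b for a, b in zip(xi, xj))
                if dominated and yi > yj:
                    return False  # a dominating example has a smaller label
        return True

    # Toy data: (income level, years employed) -> credit class.
    X = [(1, 1), (2, 1), (2, 3), (3, 3)]
    y = [0, 0, 1, 2]
    print(is_monotone(X, y))  # True: raising an attribute never lowers the class

A tree trained even on a monotone dataset can still violate this pairwise condition on unseen inputs, which is exactly the gap the surveyed methods address.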
SPARTAN: Using Constrained Models for Guaranteed Error Semantic Compression
S. Babu, M. Garofalakis and R. Rastogi
ABSTRACT: While a variety of
lossy compression schemes have been developed for certain forms of digital
data (e.g., images, audio, video), the area of lossy compression techniques
for arbitrary data tables has been left relatively unexplored. Nevertheless,
such techniques are clearly motivated by the ever-increasing data collection
rates of modern enterprises and the need for effective, guaranteed-quality
approximate answers to queries over massive relational data sets.
In this paper, we propose SPARTAN, a system that takes advantage of
attribute semantics and data-mining models to perform lossy compression
of massive data tables.
SPARTAN is based on the novel idea of exploiting
predictive data correlations and prescribed error-tolerance constraints
for individual attributes to construct concise and accurate Classification
and Regression Tree (CaRT) models for entire columns of a table.
More precisely, SPARTAN selects a certain subset of attributes (referred
to as predicted attributes) for which no values are explicitly stored
in the compressed table; instead, concise error-constrained CaRTs
that predict these values (within the prescribed error tolerances) are
maintained. To restrict the huge search space of possible CaRT predictors,
SPARTAN uses a Bayesian network structure to guide the selection of CaRT
models that minimize the overall storage requirement, based on the prediction
and materialization costs for each attribute. SPARTAN's CaRT-building
algorithms employ novel integrated pruning strategies that take advantage
of the given error constraints on individual attributes to minimize the
computational effort involved. Our experimentation with several real-life
data sets offers convincing evidence of the effectiveness of SPARTAN's
model-based approach -- SPARTAN is able to consistently yield substantially
better compression ratios than existing semantic or syntactic compression
tools (e.g., gzip) while utilizing only small
samples of the data for model inference.
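A minimal sketch of the storage decision at SPARTAN's core, under my own simplifying assumptions (one candidate predictor per column and abstract cost units; none of these names come from the paper):

    # Sketch: keep a column only if no predictor reproduces it within the
    # user's per-attribute error tolerance at a lower storage cost.
    def within_tolerance(actual, predicted, e):
        """Guaranteed error: every reconstructed value is off by at most e."""
        return all(abs(a - p) <= e for a, p in zip(actual, predicted))

    def choose_storage(column, predictor, inputs, e, model_cost, column_cost):
        predicted = [predictor(x) for x in inputs]
        if within_tolerance(column, predicted, e) and model_cost < column_cost:
            return "store model"   # column becomes a predicted attribute
        return "store column"      # materialize the raw values

    # Toy table: 'duration' tracks roughly 10 * 'packets'; tolerance e = 2.
    packets = [1, 2, 3, 4, 5]
    duration = [11, 19, 31, 40, 52]
    print(choose_storage(duration, lambda p: 10 * p, packets,
                         e=2, model_cost=1, column_cost=5))  # -> store model

In SPARTAN the predictors are CaRT models and the model-versus-materialization trade-off is made jointly over all columns, guided by a Bayesian network, rather than per column as above.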
Learning Missing Values from Summary Constraints
X. Wu and D. Barbará
ABSTRACT: Real-world data sets often contain errors and inconsistencies. Even though this is a very important problem, it has received relatively little attention in the research community. In this paper we examine the problem of learning missing values when some summary information is available. We use linear algebra and constraint programming techniques to learn the missing values, using both a priori known summary information and summary information derived from the raw data. We reconstruct the missing values by different methods in three scenarios: ideal-constrained, under-constrained, and over-constrained. Furthermore, for a range query involving missing values, we also give lower and upper bounds for the values using constraint programming techniques. We believe that the theory of linear algebra and constraint programming constitutes a sound basis for learning missing values when summary information is available.
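A toy of the over-constrained scenario, in my own notation (not the paper's example): each known summary sum contributes one linear equation over the missing cells, and least squares picks the values most consistent with all of them.

    import numpy as np

    # Table with two missing cells m0, m1:   [[ 3, m0],   row sums: [8, 9]
    #                                         [m1,  4]]   col sums: [10, 7]
    # Each summary yields one equation over x = [m0, m1].
    A = np.array([[1.0, 0.0],    # 3 + m0 = 8   -> m0 = 5
                  [0.0, 1.0],    # m1 + 4 = 9   -> m1 = 5
                  [0.0, 1.0],    # 3 + m1 = 10  -> m1 = 7
                  [1.0, 0.0]])   # m0 + 4 = 7   -> m0 = 3
    b = np.array([5.0, 5.0, 7.0, 3.0])

    # Four equations, two unknowns: over-constrained (the sums here are
    # deliberately inconsistent), so solve in the least-squares sense.
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    print(x)  # [4. 6.], the compromise between the conflicting summaries

With consistent summaries and as many independent equations as unknowns (the ideal-constrained case) the residual is zero; with fewer equations (under-constrained) the system only bounds the values, which is where the paper's constraint programming bounds come in.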
Constrained Frequent Pattern Mining: A Pattern-Growth View
J. Pei and J. Han
ABSTRACT: It
has been well recognized that frequent pattern mining plays an essential
role in many important data mining tasks. However, frequent pattern mining
often generates a very large number of patterns and rules, which reduces
not only the efficiency but also the effectiveness of mining. Recent work
has highlighted the importance of the constraint-based mining paradigm
in the context of mining frequent itemsets, associations, correlations,
sequential patterns, and many other interesting patterns in large databases.
Recently, we developed efficient pattern-growth methods
for frequent pattern mining. Interestingly, pattern-growth methods are
not only efficient but also effective in mining with various constraints.
Many tough constraints that cannot be handled by previous methods can be pushed deep into the pattern-growth mining process. In this paper, we give an overview of the principles of pattern-growth methods for constrained frequent pattern mining and sequential pattern mining. Moreover, we explore the power of pattern-growth methods for mining with tough constraints and highlight some interesting open problems.
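As a hedged sketch of what "pushing a constraint deep" means here (a bare-bones projection-based miner of my own, not the authors' implementation): an anti-monotone constraint such as sum(price) <= budget can prune a prefix together with its entire subtree during growth.

    from collections import Counter

    PRICE = {'a': 3, 'b': 5, 'c': 2, 'd': 8}  # made-up item prices

    def mine(db, prefix, min_sup, budget, out):
        counts = Counter(item for tx in db for item in set(tx))
        for item, sup in counts.items():
            if sup < min_sup:
                continue
            pattern = prefix + [item]
            if sum(PRICE[i] for i in pattern) > budget:
                continue  # anti-monotone: no extension can repair this
            out.append((sorted(pattern), sup))
            # project the database on the new prefix and grow recursively
            proj = [[i for i in tx if i > item] for tx in db if item in tx]
            mine([tx for tx in proj if tx], pattern, min_sup, budget, out)

    db = [['a', 'b', 'c'], ['a', 'c'], ['a', 'b', 'd'], ['b', 'c']]
    out = []
    mine(db, [], 2, 8, out)
    print(out)  # frequent itemsets whose total price stays within budget

The point is that the constraint check happens inside the recursion, so violating prefixes are never extended; a generate-and-test approach would enumerate and then discard them.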
Exploiting Succinct Constraints using FP-trees
C. K.-S. Leung, L. V. S. Lakshmanan, and R. T. Ng
ABSTRACT: Since its introduction, frequent-set
mining has been generalized to many forms, which include constrained data
mining. The use of constraints
permits user focus and guidance,
enables user exploration and control, and leads to effective pruning of
the search space and efficient mining of frequent itemsets. In this
paper, we focus on the use of succinct constraints. In particular,
we propose a novel algorithm called FPS to mine frequent itemsets
satisfying succinct constraints. The FPS algorithm avoids the generate-and-test
paradigm by exploiting succinctness properties of the constraints in an FP-tree-based framework. In terms of functionality, our algorithm
is capable of handling not just the
succinct aggregate constraint, but any succinct constraint
in general. Moreover, it handles multiple succinct constraints. In
terms of performance, our algorithm is more efficient and effective than
existing FP-tree based constrained frequent-set mining algorithms.
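Succinctness means the satisfying itemsets can be enumerated directly from the items, with no generate-and-test. A toy of my own construction (not the FPS algorithm itself) for the constraint C = min(S.price) <= 25:

    from itertools import chain, combinations

    PRICE = {'a': 10, 'b': 40, 'c': 20, 'd': 55}  # made-up prices

    def subsets(items):
        return chain.from_iterable(combinations(items, r)
                                   for r in range(len(items) + 1))

    low = [i for i in PRICE if PRICE[i] <= 25]   # items satisfying C alone
    high = [i for i in PRICE if PRICE[i] > 25]   # the remaining items

    # Every itemset satisfying C is exactly one non-empty subset of `low`
    # unioned with any subset of `high`, so valid candidates are built
    # directly rather than generated and then filtered.
    valid = [tuple(sorted(m + o))
             for m in subsets(low) if m
             for o in subsets(high)]
    print(valid)

Roughly speaking, FPS weaves this kind of member-generating structure into FP-tree construction and traversal so that only constraint-satisfying itemsets are ever counted.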
Is Pushing Constraints Deeply into the Mining Algorithms Really What We Want?
J. Hipp and U. Güntzer
ABSTRACT: The common approach to exploiting mining constraints is to push them deeply into the mining algorithms. In our paper we argue that this approach is based on an understanding of KDD that is no longer up to date. Today, KDD is seen as a human-centered, highly interactive and iterative process. Blindly enforcing constraints during the mining runs neglects the process character of KDD and is therefore no longer state of the art. Constraints can make a single algorithm run faster, but we are still far from response times that would allow true interactivity in KDD. In addition, we pay the price of repeated mining runs and risk reducing data mining to a kind of hypothesis testing. Taking all of the above into consideration, we propose to do exactly the contrary of constrained mining: we accept an initial (nearly) unconstrained and costly mining run, but instead of a sequence of subsequent and still expensive constrained mining runs, we answer all further mining queries from this initial result set. Whereas this is straightforward for constraints that can be implemented as filters on the result set, things get more complicated when we restrict the underlying mining data. In practice such constraints are very important, e.g., generating rules for certain days of the week, or for families, singles, and male or female customers. We show how to postpone such row-restriction constraints on the transactions from rule generation to rule retrieval from the initial result set.
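The retrieval side of this proposal can be pictured as follows (hypothetical rule records of mine; the paper's actual data model may differ): mine once, cache every rule, and answer later constrained queries as filters over the cache.

    # Initial, (nearly) unconstrained run: cache all discovered rules as
    # (antecedent, consequent, support, confidence).
    rules = [
        ({'beer'}, {'chips'}, 0.08, 0.61),
        ({'diapers'}, {'beer'}, 0.05, 0.72),
        ({'bread', 'butter'}, {'milk'}, 0.12, 0.85),
    ]

    def query(rules, min_conf=0.0, must_contain=frozenset()):
        """Item and quality constraints become filters: no re-mining."""
        return [r for r in rules
                if r[3] >= min_conf and must_contain <= (r[0] | r[1])]

    print(query(rules, min_conf=0.7, must_contain={'beer'}))

Row restrictions on the underlying transactions (weekend baskets only, female customers only) cannot be expressed as such filters, which is the harder case the paper addresses.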
Discovery in Multi-Attribute Data with User-defined Constraints
C.-S. Perng, H. Wang, S. Ma, and J. L. Hellerstein
ABSTRACT: There has been a growing interest
in mining frequent itemsets in relational data with multiple attributes.
A key step in this approach is to select one set of attributes that groups data into transactions and a separate set of attributes that labels data as items. Unsupervised and unrestricted mining, however, is stymied by
the combinatorial complexity and the quantity of patterns as the number
of attributes grows. In this paper, we focus on leveraging the semantics
of the underlying data for mining frequent itemsets. For instance, there
are usually taxonomies in the data schema and functional dependencies among
the attributes. Domain knowledge and user preferences often have the potential
to significantly reduce the exponentially growing mining space. These
observations motivate the design of a user-directed data mining framework
that allows such domain knowledge to guide the mining process and control
the mining strategy. We show examples of tremendous reduction in computation
by using domain knowledge in mining relational data with multiple attributes.
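The transaction-forming step described above can be sketched as follows (column roles and names are my own toy, not the paper's): one attribute set groups rows into transactions, another labels rows as items, and user constraints restrict which pairings get explored.

    from collections import defaultdict

    # Toy multi-attribute event log: (host, date, error_type).
    events = [
        ('h1', 'mon', 'disk'), ('h1', 'mon', 'net'),
        ('h2', 'mon', 'disk'), ('h1', 'tue', 'net'),
    ]

    def to_transactions(rows, tx_attrs, item_attrs):
        txs = defaultdict(set)
        for row in rows:
            key = tuple(row[i] for i in tx_attrs)     # groups rows
            item = tuple(row[i] for i in item_attrs)  # labels rows
            txs[key].add(item)
        return dict(txs)

    # Group by (host, date) and treat error_type as the item. A user
    # constraint like "only group on time attributes" would prune most
    # other attribute pairings up front.
    print(to_transactions(events, tx_attrs=(0, 1), item_attrs=(2,)))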
Contributed Articles
Why so many clustering algorithms -- A Position Paper
V. Estivill-Castro
ABSTRACT: We argue that there are many clustering algorithms because the notion of "cluster" cannot be precisely defined. Clustering is in the eye of the beholder, and as such, researchers have proposed many induction principles and models whose corresponding optimization problems can only be approximately solved, by an even larger number of algorithms. Therefore, comparing clustering algorithms requires a careful understanding of the inductive principles involved.
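A hedged illustration of the thesis (the library choice is mine): two induction principles applied to the same points yield different, equally defensible clusterings.

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering, KMeans

    # Two thin parallel strips: variance minimization and connectivity
    # tell different stories about what the "clusters" are.
    X = np.array([[x, 0.0] for x in range(10)] +
                 [[x, 1.5] for x in range(10)])

    by_variance = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    by_linkage = AgglomerativeClustering(n_clusters=2,
                                         linkage='single').fit_predict(X)

    print(by_variance)  # k-means cuts the strips into left and right halves
    print(by_linkage)   # single-link keeps each strip whole

Neither answer is wrong; each optimizes a different induction principle, which is precisely why comparisons must name the principle first.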
News, Events and Announcements