July 2001. Volume 3,
Issue 1
Contributed
Articles
What's Interesting About Cricket?
- On Thresholds and Anticipation in Discovered Rules
J. F. Roddick and S. Rice
(available in PDF
and Postscript
formats)
ABSTRACT: Despite significant
progress, determining the interestingness of a rule remains a difficult problem.
This short paper investigates the lessons that may be learned from analysing
the (largely manual) selection of interesting statistics for cricket (or
any other data rich sport) by experts. In particular, the effect of thresholds
on the interestingness of rules describing events in the sporting arena is
discussed. Anticipation is also shown to be critical in this selection:
it varies the level of interest in events that may
contribute to the achievement of a threshold value during a match, thus
adding a temporal dimension to interestingness. This temporal aspect can
best be modelled on the single-past-branching-future model of time. As a
result of this investigation, a few new general ideas are discussed that
add to the research in this area. Significantly, some of the new criteria
are implicitly temporal in that they rely on a model of behaviour over time.
The applicability of threshold values for detecting uncharacteristically
poor performances is canvassed as an area of interest yet to be explored.
Resource Description Framework:
Metadata and Its Applications
K. S. Candan, H. Liu, and R. Suvarna
(available in PDF
and Postscript
formats)
ABSTRACT: Universality, the property of the Web that makes it the largest
data and information source in the world, is also the property
behind the lack of a uniform organization scheme that would allow easy access
to data and information. A semantic web, wherein different applications and
Web sites can exchange information and hence exploit Web data and information
to their full potential, requires the information about Web resources to
be represented in a detailed and structured manner. Resource Description
Framework (RDF), an effort in this direction supported by the World Wide
Web Consortium, provides a means for the description of metadata, which is
a necessity for the next generation of interoperable Web applications. The
success of RDF and the semantic web will depend on (1) the development of
applications that prove the applicability of the concept, (2) the availability
of application interfaces which enable the development of such applications,
and (3) databases and inference systems that exploit RDF to identify and
locate the most relevant Web resources. In addition, many practical issues, such
as security, ease of use, and compatibility, will be crucial to the success
of RDF. This survey aims to provide a glimpse of the past, present, and
future of this upcoming technology and highlights why we believe that the
next generation of the Web will be more organized, informative, searchable,
accessible, and, most importantly, useful. It is expected that knowledge
discovery and data mining can benefit from RDF and the Semantic Web.
Towards Long Pattern Generation
in Dense Databases
C. C. Aggarwal
(available in PDF
and Postscript
formats)
ABSTRACT: This paper discusses the problem of long pattern generation
in dense databases. In recent years, there has been
increasing interest in techniques for maximal pattern generation. We present
a survey of this class of methods for long pattern generation, which differ
considerably from the level-wise approach of traditional methods. Many of
these techniques are rooted in combinatorial tricks which can be applied
only when the generation of frequent patterns is not forced to be level-wise.
We present an overview of the different kinds of methods which can be used
in order to improve the counting and search space exploration methods for
long patterns.
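As a rough illustration of why long patterns strain level-wise search: every non-empty subset of a frequent pattern is itself frequent, so a single frequent pattern of length k implies 2^k - 1 frequent patterns, whereas the maximal patterns (those with no frequent superset) summarize them compactly. The sketch below is a brute-force demonstration on made-up data, not one of the surveyed techniques; the function names `frequent_itemsets` and `maximal_only` and the toy transactions are assumptions introduced here.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Brute-force, level-by-level enumeration of all frequent itemsets."""
    items = sorted(set().union(*transactions))
    frequent = []
    for size in range(1, len(items) + 1):
        level = [
            frozenset(candidate)
            for candidate in combinations(items, size)
            if sum(1 for t in transactions if set(candidate) <= t) >= min_support
        ]
        if not level:
            # No frequent itemset of this size means none of any larger size.
            break
        frequent.extend(level)
    return frequent

def maximal_only(frequent):
    """Keep only itemsets that have no frequent proper superset."""
    return [s for s in frequent if not any(s < other for other in frequent)]

# Toy data (hypothetical): one long pattern repeated, plus a short transaction.
transactions = [frozenset("abcde")] * 3 + [frozenset("ab")]
frequent = frequent_itemsets(transactions, min_support=3)
maximal = maximal_only(frequent)

print(len(frequent))  # 31 frequent itemsets (2**5 - 1)
print(maximal)        # a single maximal itemset containing a, b, c, d, e
```

Here one maximal pattern stands in for 31 frequent ones; the gap grows exponentially with pattern length, which is the motivation for methods that avoid visiting every level.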
A Preprocessing Scheme for High-Cardinality
Category Attributes in Classification and Prediction Problems
D. Micci-Barreca
(available in PDF
and Postscript
formats)
ABSTRACT: Categorical data fields characterized by a large number
of distinct values represent a serious challenge for many classification
and regression algorithms that require numerical inputs. On the other hand,
these types of data fields are quite common in real-world data mining applications
and often contain potentially relevant information that is difficult to represent
for modeling purposes.
This paper presents a simple preprocessing scheme
for high-cardinality categorical data that allows this class of attributes
to be used in predictive models such as neural networks and linear or logistic
regression. The proposed method is based on a well-established statistical
method (empirical Bayes) that is straightforward to implement as an in-database
procedure. Furthermore, for categorical attributes with an inherent hierarchical
structure, like ZIP codes, the preprocessing scheme can directly leverage
the hierarchy by blending statistics at the various levels of aggregation.
While the statistical methods discussed in this
paper were first introduced in the mid-1950s, the use of these methods as
a preprocessing step for complex models, like neural networks, has not been
previously discussed in the literature.
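A minimal sketch of the general idea, assuming a binary or numeric target: each category value is replaced by a blend of its own target mean and the overall mean, with rarely seen values pulled more strongly toward the overall mean. The weighting `n / (n + m)`, the smoothing parameter `m`, the function name, and the toy data are all assumptions made for illustration; the paper's exact weighting function may differ.

```python
from collections import defaultdict

def shrunken_category_means(categories, targets, m=5.0):
    """Map each category to a blend of its own target mean and the global mean."""
    if len(categories) != len(targets) or not targets:
        raise ValueError("categories and targets must be non-empty and equal in length")
    prior = sum(targets) / len(targets)  # overall target mean
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, y in zip(categories, targets):
        sums[category] += y
        counts[category] += 1
    encoding = {}
    for category, n in counts.items():
        # Assumed weighting: the more rows a category has, the more its own mean counts.
        weight = n / (n + m)
        encoding[category] = weight * (sums[category] / n) + (1 - weight) * prior
    return encoding, prior

# Toy data (hypothetical): ZIP code versus a 0/1 outcome.
zips = ["10001", "10001", "10001", "10001", "94105"]
bought = [1, 1, 1, 0, 0]
encoding, prior = shrunken_category_means(zips, bought, m=2.0)
print(encoding)
```

Values unseen in training can fall back to `prior`. For a hierarchical attribute such as ZIP codes, one plausible extension is to use the estimate of the enclosing region, rather than the global mean, as the prior for each code.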
Genetic Subtyping using Cluster
Analysis
T. Burr, J. R. Gattiker, and G. S. LaBerge
(available in PDF
and Postscript
formats)
ABSTRACT: In this paper we (1) describe state-of-the-art methods to
identify clusters in DNA sequence data for taxonomic analysis; (2) describe
a new method with better scaling properties based on model-based clustering;
and (3) present examples using the nucleoprotein and hemagglutinin regions
of influenza and the env and gag regions of human immunodeficiency virus
(HIV).
Report on the Workshop on Research
Issues in Data Mining and Knowledge Discovery Workshop (DMKD 2001)
R. Bayardo and J. E. Gehrke
(available in PDF
and Postscript
formats)
ABSTRACT: This short article summarizes the program for the Sixth Workshop
on Research Issues in Data Mining and Knowledge Discovery (DMKD
2001).
KDnuggets Interview with
Usama Fayyad
G. Piatetsky-Shapiro
(available in PDF
and Postscript
formats)
ABSTRACT: The KDnuggets newsletter has a new section of interviews with
leaders in the field. This article presents the interview with Usama Fayyad,
President and CEO of digiMine.
KDnuggets Interview with Jesus
Mena
G. Piatetsky-Shapiro
(available in PDF
and Postscript
formats)
ABSTRACT: The KDnuggets newsletter has a new section of interviews with
leaders in the field. This article presents the interview with Jesus Mena,
CEO of WebMiner.