About SIGKDD Explorations

Newsletter of the ACM Special Interest Group on Knowledge Discovery and Data Mining

July 2001. Volume 3, Issue 1

Contributed Articles

What's Interesting About Cricket? - On Thresholds and Anticipation in Discovered Rules
J. F. Roddick and S. Rice
(available in PDF and Postscript formats)
ABSTRACT: Despite significant progress, determining the interestingness of a rule remains a difficult problem. This short paper investigates the lessons that may be learned from analysing the (largely manual) selection of interesting statistics for cricket (or any other data-rich sport) by experts. In particular, the effect of thresholds on the interestingness of rules describing events in the sporting arena is discussed. The concept of anticipation is also shown to be critical in this selection and to vary the level of interest in events that may contribute to the achievement of a threshold value during a match, thus adding a temporal dimension to interestingness. This temporal aspect can be best modelled on the single-past-branching-future model of time. As a result of this investigation, a few new general ideas are discussed that add to the research in this area. Significantly, some of the new criteria are implicitly temporal in that they rely on a model of behaviour over time. The applicability of threshold values for detecting uncharacteristically poor performances is canvassed as an area of interest yet to be explored.

Resource Description Framework: Metadata and Its Applications
K. S. Candan, H. Liu, and R. Suvarna
(available in PDF and Postscript formats)
ABSTRACT: Universality, the property of the Web that makes it the largest data and information source in the world, is also the property behind the lack of a uniform organization scheme that would allow easy access to data and information. A semantic web, wherein different applications and Web sites can exchange information and hence exploit Web data and information to their full potential, requires the information about Web resources to be represented in a detailed and structured manner. Resource Description Framework (RDF), an effort in this direction supported by the World Wide Web Consortium, provides a means for the description of metadata which is a necessity for the next generation of interoperable Web applications. The success of RDF and the semantic web will depend on (1) the development of applications that prove the applicability of the concept, (2) the availability of application interfaces which enable the development of such applications, and (3) databases and inference systems that exploit RDF to identify and locate most relevant Web resources. In addition, many practical issues, such as security, ease of use, and compatibility, will be crucial in the success of RDF. This survey aims at providing a glimpse at the past, present, and future of this upcoming technology and highlights why we believe that the next generation of the Web will be more organized, informative, searchable, accessible, and, most importantly, useful. It is expected that knowledge discovery and data mining can benefit from RDF and the Semantic Web.

Towards Long Pattern Generation in Dense Databases
C. C. Aggarwal
(available in PDF and Postscript formats)
ABSTRACT: This paper discusses the problem of long pattern generation in dense databases. In recent years, there has been an increase in interest in techniques for maximal pattern generation. We present a survey of this class of methods for long pattern generation, which differ considerably from the level-wise approach of traditional methods. Many of these techniques are rooted in combinatorial tricks which can be applied only when the generation of frequent patterns is not forced to be level-wise. We present an overview of the different kinds of methods which can be used in order to improve the counting and search space exploration methods for long patterns.

A Preprocessing Scheme for High-Cardinality Category Attributes in Classification and Prediction Problems
D. Micci-Barreca
(available in PDF and Postscript formats)
ABSTRACT: Categorical data fields characterized by a large number of distinct values represent a serious challenge for many classification and regression algorithms that require numerical inputs. On the other hand, these types of data fields are quite common in real-world data mining applications and often contain potentially relevant information that is difficult to represent for modeling purposes.

This paper presents a simple preprocessing scheme for high-cardinality categorical data that allows this class of attributes to be used in predictive models such as neural networks, linear and logistic regression. The proposed method is based on a well-established statistical method (empirical Bayes) that is straightforward to implement as an in-database procedure. Furthermore, for categorical attributes with an inherent hierarchical structure, like ZIP codes, the preprocessing scheme can directly leverage the hierarchy by blending statistics at the various levels of aggregation.

While the statistical methods discussed in this paper were first introduced in the mid-1950s, the use of these methods as a preprocessing step for complex models, like neural networks, has not been previously discussed in any literature.
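
The abstract above describes replacing each category value with an empirical Bayes estimate of the target. A minimal Python sketch of that general idea follows. It is not the paper's own procedure: the shrinkage weight n / (n + m), the smoothing parameter m, and the function names fit_target_encoding and encode are all assumptions made for illustration, and the hierarchical blending mentioned for ZIP codes is not covered.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, m=10.0):
    """Map each category to a shrunken estimate of the mean target.

    Assumption (not taken from the abstract): the blend weight is
    n / (n + m), where n is the category count and m is a hypothetical
    smoothing parameter. Small categories are pulled toward the prior.
    """
    if len(categories) != len(targets):
        raise ValueError("categories and targets must have equal length")
    if not targets:
        raise ValueError("at least one training row is required")

    # Prior: the overall mean of the target across all training rows.
    prior = sum(targets) / len(targets)

    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, y in zip(categories, targets):
        sums[category] += y
        counts[category] += 1

    encoding = {}
    for category, n in counts.items():
        weight = n / (n + m)  # approaches 1 as the category grows
        category_mean = sums[category] / n
        encoding[category] = weight * category_mean + (1 - weight) * prior
    return encoding, prior


def encode(values, encoding, prior):
    """Replace each value by its estimate; unseen values get the prior."""
    return [encoding.get(value, prior) for value in values]


if __name__ == "__main__":
    cats = ["a", "a", "a", "b"]
    ys = [1, 1, 0, 0]
    enc, prior = fit_target_encoding(cats, ys, m=1.0)
    print(encode(["a", "b", "zzz"], enc, prior))
```

In this sketch a category seen only once ("b") is pulled halfway toward the overall mean when m = 1, while an unseen value falls back to the prior entirely. The resulting numeric column can then be fed to a model that requires numerical inputs.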

Genetic Subtyping using Cluster Analysis
T. Burr, J. R. Gattiker, and G. S. LaBerge
(available in PDF and Postscript formats)
ABSTRACT: In this paper we (1) describe state-of-the-art methods to identify clusters in DNA sequence data for taxonomic analysis; (2) describe a new method with better scaling properties based on model-based clustering; and (3) present examples using the nucleoprotein and hemagglutinin regions of influenza and the env and gag regions of human immunodeficiency virus (HIV).

Report on the Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD 2001)
R. Bayardo and J. E. Gehrke
(available in PDF and Postscript formats)
ABSTRACT: This short article summarizes the program for the Sixth Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD 2001).

KDnuggets Interview with Usama Fayyad
G. Piatetsky-Shapiro
(available in PDF and Postscript formats)
ABSTRACT: The KDnuggets newsletter has a new section of interviews with leaders in the field. This article presents the interview with Usama Fayyad, President and CEO of digiMine.

KDnuggets Interview with Jesus Mena
G. Piatetsky-Shapiro
(available in PDF and Postscript formats)
ABSTRACT: The KDnuggets newsletter has a new section of interviews with leaders in the field. This article presents the interview with Jesus Mena, CEO of WebMiner.

Send comments and suggestions to sunita@it.iitb.ernet.in