About SIGKDD Explorations

Newsletter of the ACM Special Interest Group on Knowledge Discovery and Data Mining

About SIGKDD Explorations

About SIGKDD

Officers

Current Issue

Previous Issues

Upcoming Issues

Submission instructions

Related Links

January 2003. Volume 4, Issue 2

Editorial by Johannes Gehrke. (available in PDF format)

Contributed Articles on Privacy and Security

Data Mining, National Security, Privacy and Civil Liberties
B. Thuraisingham
(available in PDF and Postscript formats)
ABSTRACT: In this paper, we describe the threats to privacy that can occur through data mining and then view the privacy problem as a variation of the inference problem in databases.

The Inference Problem: A Survey
C. Farkas and S. Jajodia
(available in PDF and Postscript formats)
ABSTRACT: Access control models protect sensitive data from unauthorized disclosure via direct access, however, they fail to prevent indirect accesses. Indirect data disclosure via inference channels occurs when sensitive information can be inferred from non-sensitive data and metadata. Inference channels are often low-bandwidth and complex; nevertheless, detection and removal of inference channels is necessary to guarantee data security. This paper presents a survey of the current and emerging research in data inference control and emphasizes the importance of targeting this so often overlooked problem during database security design.

Cryptographic Techniques for Privacy-Preserving Data Mining
B. Pinkas
(available in PDF and Postscript formats)
ABSTRACT: Research in secure distributed computation, which was done as part of a larger body of research in the theory of cryptography, has achieved remarkable results. It was shown that non-trusting parties can jointly compute functions of their different inputs while ensuring that no party learns anything but the defined output function. These results were shown using generic constructions that can be applied to any function that has an efficient representation as a circuit. We describe these results, discuss their efficiency, and demonstrate their relevance to privacy preserving computation of data mining algorithms. We also show examples of secury computation of data mining algorithms that use these generic constructions.

Database Privacy
M. Olivier
(available in PDF and Postscript formats)
ABSTRACT: The emphasis in database privacy should fall on a balance between confidentiality, integrity and availability of personal data, rather than on confidentiality alone. This balance should not necessarily be a trade-off, but should take into account the sensitive nature of the data being stored and attemp to increase all three dimensions to the highest level possible.
To achieve such a balance, technological means should be developed.
The paper illustrates some of the inherent problems in database privacy that should be addressed by technical solutions. It next demonstrates that the notion of privacy is complex; this complexity is likely to impede development of technical solutions.
Finally, the paper finally uses the notion of informed consent to illustrate how the privacy problem can be viewed from multiple angles to flesh out the underlying problems that may be addressed by technical solutions.

Tools for Privacy Preserving Data Mining
C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin and M.Y. Zhu
(available in PDF and Postscript formats)
ABSTRACT: Privacy preserving mining of distributed data has numerous applications. Each application poses different constraints: What is meant by privacy, what are the desired results, how is the data distributed, what are the constraints on collaboration and cooperative computing, etc. We suggest that the solution to this is a toolkit of components that can be combined for specific privacy-preserving data mining applications. This paper presents some components of such a toolkit, and shows how they can be used to solve several privacy- preserving data mining problems.

Applying Data Mining to Intrusion Detection: The Quest for Automation, Efficiency, and Credibility
W. Lee
(available in PDF and Postscript formats)
ABSTRACT: Intrusion detection is an essential component of the layered computer security mechanisms. It requires accurate and efficient models for analyzing a large amount of system and network audit data. This paper is an overview of our research in applying data mining techniques to build intrusion detection models. We describe a framework for mining patterns from system and network audit data, and constructing features according to analysis of intrusion patterns. We discuss approaches for improving the run-time efficiency as well as the credibility of detection models. We report the ideas, algorithms, and prototype systems we have developed, and discuss open research problems.

Randomization in Privacy-Preserving Data Mining
A. Evfimievski
(available in PDF and Postscript formats)
ABSTRACT: Suppose there are many clients, each having some personal information, and one server, which is interested only in aggregate, statistically significant, properties of this information. The clients can protect privacy of their data by perturbing it with a randomization algorithm and then submitting the randomized version. The randomization algorithm is chosen so that aggregate properties of the data can be recovered with sufficient precision, while individual entries are significantly distorted. How much distortion is neeed to protect privacy can be determined using a privacy measure. Several possible privacy measures are known; finding the best measure is an open question. This paper presents some methods and results in randomization for numerical and categorical data, and discusses the issue of measuring privacy.

Contributed Articles

A Survey on Wavelet Applications in Data Mining
T. Li, Q. Li, S. Zhu, and M. Ogihara
(available in PDF and Postscript formats)
ABSTRACT: Recently there have been significant development in the use of wavelet methods in various data mining processes. However, there has been written no comprehensive survey available on the topic. The goal of this paper is to fill the void. First, the paper presents a high-level data- mining framework that reduces the overall process into smaller components. The paper concludes by discussing the impact of wavelets on data mining research and outlining potential future research directions and applications.

A Perspective on Inductive Databases
L. De Raedt
(available in PDF and Postscript formats)
ABSTRACT: Inductive databases tightly integrate databases with data mining. The key ideas are that data and patterns (or models) are handled in the same way and that an inductive query language allows the user to query and manipulate the patterns (or models) of interest.
This paper proposes a simple and abstract model for inductive databases. We describe the basic formalism, a simple but fairly powerful inductive query language, some basics of reasoning for query optimization, and discuss some memory organization and implementation issues.

The True Lift Model - A Novel Data Mining Approach to Response Modeling in Database Marketing
V. S. Y. Lo
(available in PDF and Postscript formats)
ABSTRACT: In database marketing, data mining has been used extensively to find the optimal customer targets so as to maximize return on investment. In particular, using marketing campaign data, models are typically developed to identify characteristics of customers who are most likely to respond. While these models are helpful in identifying the likely responders, they may be targeting customers who have decided to take the desirable action or not regardless of whether they receive the campaign contact (e.g. mail, call). Based on many years of business experience, we identify the appropriate business objective and its associated mathematical objective function. We point out that the current approach is not directly designed to solve the appropriate business objective. We then propose a new methodology to identify the customers whose decisions will be positively influenced by campaigns. The proposed methodology is easy to implement and can be used in conjunction with most commonly used supervised learning algorithms. An example using simulated data is used to illustrate the proposed methodology. This paper may provide the database marketing industry with a simple but significant methodological improvement and open a new area for further research and development.

Reports from KDD-2002

Background and Overview for KDD Cup 2002 Task 1: Information Extraction from Biomedical Articles
A. Yeh, L. Hirschman, A. Morgan
(available in PDF and Postscript formats)
ABSTRACT: This paper presents a background and overview for task 1 (of 2 tasks) of the KDD Challenge Cup 2002, a competition held in conjunction with the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), July 23-26, 2002. Task 1 dealt with detecting which papers, in a set of fruit fly genetics papers (texts), contained experimental results about gene products (transcripts and proteins), and also within each paper, which genes had experimental results about their products mentioned.

Rule-based Extraction of Experimental Evidence in the Biomedical Domain - the KDD Cup (Task 1)
Y. Regev, M. Finkelstein-Landau, and R. Feldman
(available in PDF and Postscript formats)
ABSTRACT: Below we describe the winning system that we built for the KDD Cup 2002 Task 1 competition. Our system is a Rule-based Information Extraction (IE) system. It combines pattern matching, Natural Language Processing (NLP) tools, semantic constraints based on the domain and the specific task, and a post-processing stage for making the final curation decision based on the various evidence (positive and negative) found within the document. Development and implementation were made using the DIAL IE language and the ClearLab development environment. The results achieved were significantly superior than those achieved using categorization approaches.

A Machine Learning Approach for the Curation of Biomedical Literature - KDD Cup 2002 (Task 1)
S.S. Keerthi, C.J. Ong, K.B. Siah, D.B.L. Lim, W. Chu, M. Shi, D.S. Edwin, R. Menon, L. Shen, J.Y.K. Lim, and H.T. Loh
(available in PDF and Postscript formats)
ABSTRACT: In this paper, we present an automated text classification system for the classification of biomedical papers. This classification is based on whether there is experimental evidence for the expression of molecular gene products for specified genes within a given paper. The system performs pre-processing and data cleaning, followed by feature extraction from the raw text. It subsequently classifies the paper using the extracted features with a Naïve Bayes Classifier. Our approach has made it possible to classify (and curate) biomedical papers automatically, thus potentially saving considerable time and resources.

Automatic Scientific Text Classification Using Local Patterns: KDD CUP 2002 (Task 1)
M. M. Ghanem, Y. Guo, H. Lodhi, and Y. Zhang
(available in PDF and Postscript formats)
ABSTRACT: In this paper, we describe our approach for addressing Task 1 in the KDD CUP 2002 competition. The approach is based on developing and using an improved automatic feature selection method in conjunction with traditional classifiers. The feature selection method used is based on capturing frequently occurring keyword combinations (or motifs) within short segments of the text of a document and has proved to produce more accurate classification results than approaches relying solely on using keyword-based features.

The Genomics of a Signaling Pathway: A KDD Cup Challenge Task
M. Craven
(available in PDF and Postscript formats)
ABSTRACT: This article describes Task 2 of the KDD Cup 2002 data- mining competition. The task involved predicting whether deletions of individual genes in a yeast genome would affect a particular signaling pathway in the cell. With its rich data sources and abundance of missing information, the task is representative of data mining problems in genomics and proved to be quite challenging.

One Class SVM for Yeast Regulation Prediction
A. Kowalczyk and B. Raskutti
(available in PDF and Postscript formats)
ABSTRACT: In this paper, we outline the main steps leading to the development of the winning solution for Task 2 of KDD Cup 2002 (Yeast Gene Regulation Prediction). Our unusual solution was a pair of linear classifiers in high dimensional space (~14,000), developed with just 38 and 84 training examples, respectively, all belonging to the target class only. The classifiers were built using the support vector machine approach outlined in the paper.

Predicting the Effects of Gene Deletion
D. S. Vogel and R. C. Axelrod
(available in PDF and Postscript formats)
ABSTRACT: In this paper, we describe techniques that can be used to predict the effects of gene deletion. We will focus mainly on the creation of predictive variables, and then briefly discuss different modeling techniques that have been used successfully on this data.

Combining Data and Text Mining Techniques for Yeast Gene Regulation Prediction: A Case Study
M.-A. Krogel, M. Denecke, M. Landwehr, and T. Scheffer
(available in PDF and Postscript formats)
ABSTRACT: In order to solve task 2 of the KDD Cup 2002, we exploited various available information sources. In particular, use of relational information describing the interactions among genes and information automatically extracted from scientific abstracts improves the accuracy of our predictions.

Feature Engineering for a Gene Regulation Prediction Task
G. Forman
(available in PDF and Postscript formats)
ABSTRACT: This paper describes an approach that won honorable mention for the gene regulation prediction task of the 2002 KDD Cup competition. Our methodology used extensive cross-validation to direct the search for an appropriate problem representation and the selection of an 'off-the-shelf' induction algorithm. A prominent trait of the dataset is the presence of three hierarchical attributes, for each of which we generated a novel predictive feature: the percentage of positives hierarchically aggregated at the node specified by the instance.

P-tree Classification of Yeast Gene Deletion Data
A. Perera, A. Denton, P. Kotala, W. Jockheck, W. V. Granda, and W. Perrizo
(available in PDF and Postscript formats)
ABSTRACT: Genomics data has many properties that make it different from "typical" relational data. The presence of multi-valued attributes as well as the large number of null values led us to a P-tree-based bit-vector representation in which matching 1-values were counted to evaluate similarity between genes. Quantitative information such as the number of interactions was also included in the classifier. Interaction information allowed us to extend the known properties of one protein with information on its interacting neighbors. Different feature attributes were weighted independently. Relevance of different attributes was systematically evaluated through optimization of weights using a genetic algorithm. The AROC value for the classified list was used as the fitness function for the genetic algorithm.

Report on the SIGKDD-2002 Panel The Perfect Data Mining Tool: Interactive or Automated
M. Ankerst
(available in PDF and Postscript formats)
ABSTRACT: Commercial and academic data mining tools range from being fully automated to highly interactive. We will discuss the role of human involvement in the data mining process. On the one hand, providing interactivity/visualization enables domain knowledge transfer and the use of the human's perceptual capabilities. On the other hand, the vast amount of data to be mined today makes real-time interactivity hard to achieve and unnecessarily burdens the user to perform tasks that may be done automatically. Questions to be discussed include: What are the ideal roles of the computer and of the user in the data mining process? Which data mining methods (clustering, classification, association rules,…) can be improved by more human involvement? What kind of applications requires more human involvement? What kind of applications requires little or no human involvement?
The panel was organized by Mihael Ankerst (The Boeing Company) and the participants were Surajit Chauduri (Microsoft Research), Georges Grinstein (University of Massachusetts Lowell & AnVil, Inc.), Jiawei Han (University of Illinois at Urbana-Champaign), and Gregory Piatetsky-Shapiro (KDnuggets).

BIOKDD 2002: Recent Advances in Data Mining for Bioinformatics
M. J. Zaki, J. T. L. Wang, H. T. T. Toivonen
(available in PDF and Postscript formats)
ABSTRACT: Bioinformatics provides opportunities for developing novel data mining methods. Some of the grand challenges in bioinformatics include protien structure prediction, homology search, multiple alignment and phylogeny construction, genomic sequence analysis, gene finding and gene mapping, as well as applications in gene expression data analysis, drug discovery in pharmaceutical industry, etc. Following the great successful 1st BIOKDD Workshop, the 2nd BIOKDD Workshop on Data Mining in Bioinformatics was held in July 2002, at Edmonton, Canada, in conjunction with the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

KDD-2002 Workshop Report Fractals and Self-similarity in Data Mining: Issues and Approaches
J. Adibi and C. Faloutsos
(available in PDF and Postscript formats)
ABSTRACT: In this report we provide a summary of the first workshop on application of self-similarity and fractals in data mining: issues and approaches held in conjunction with ACM SIGKDD 2002, July 23 at Edmonton, Alberta, Canada.

MDM/KDD2002: Multimedia Data Mining between Promises and Problems
S. J. Simoff and C. Djeraba
(available in PDF and Postscript formats)
ABSTRACT: This report presents a brief overview of multimedia data mining and the corresponding workshop series at ACM SIGKDD conference series on data mining and knowledge discovery. It summarizes the presentations, conclusions and directions for future work that were discussed during the 3rd edition of the International Workshop on Multimedia Data Mining, conducted in conjunction with KDD-2002 in Edmonton, Alberta, Canada.

Multi-Relational Data Mining: a Workshop Report
S. Dzeroski and L. De Raedt
(available in PDF and Postscript formats)
ABSTRACT: In this report, we briefly review the Multi-Relational Data Mining workshop, which was held in Edmonton, Canada on July, 23, 2002 as part of the workshop program of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-02).

WEBKDD 2002 - Web Mining for Usage Patterns & Profiles
B. M. Masand, M. Spiliopoulou, J. Srivastava, and O. R. Zaiane
(available in PDF and Postscript formats)
ABSTRACT: In this paper, we provide a summary of the WEBKDD 2002 workshop, whose theme was 'Web Mining for Usage Patterns and Profiles'. This workshop was held in conjunction with the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002).

News, Events and Announcements

Advertisements
(available in PDF format)

Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003)
(available in PDF format)

SIGKDD Explorations home page
Send comments and suggestions to sunita@it.iitb.ac.in