|
January 2003. Volume 4, Issue 2
Editorial by
Johannes Gehrke. (available in PDF format)
Contributed Articles on Privacy and Security
Data Mining, National Security, Privacy and Civil Liberties B. Thuraisingham
(available in PDF
and Postscript
formats) ABSTRACT: In this paper, we describe the threats to privacy that can occur through data mining and then view the privacy problem as a variation of the inference problem in databases.
The Inference Problem: A Survey
C. Farkas and S. Jajodia (available in PDF
and Postscript
formats) ABSTRACT:
Access control models protect sensitive data from
unauthorized disclosure via direct access, however, they
fail to prevent indirect accesses. Indirect data
disclosure via inference channels occurs when sensitive
information can be inferred from non-sensitive data and
metadata. Inference channels are often low-bandwidth and
complex; nevertheless, detection and removal of inference
channels is necessary to guarantee data security. This
paper presents a survey of the current and emerging
research in data inference control and emphasizes the
importance of targeting this so often overlooked problem
during database security design.
Cryptographic Techniques for Privacy-Preserving Data Mining B. Pinkas (available in PDF
and Postscript
formats) ABSTRACT:
Research in secure distributed computation, which was
done as part of a larger body of research in the theory of
cryptography, has achieved remarkable results. It was
shown that non-trusting parties can jointly compute
functions of their different inputs while ensuring that no
party learns anything but the defined output function.
These results were shown using generic constructions that
can be applied to any function that has an efficient
representation as a circuit. We describe these results,
discuss their efficiency, and demonstrate their relevance
to privacy preserving computation of data mining
algorithms. We also show examples of secury computation
of data mining algorithms that use these generic
constructions.
Database Privacy
M. Olivier
(available in PDF
and Postscript
formats) ABSTRACT:
The emphasis in database privacy should fall on a balance
between confidentiality, integrity and availability of
personal data, rather than on confidentiality alone. This
balance should not necessarily be a trade-off, but should
take into account the sensitive nature of the data being
stored and attemp to increase all three dimensions to the
highest level possible.
To achieve such a balance, technological means should be
developed.
The paper illustrates some of the inherent problems in
database privacy that should be addressed by technical
solutions. It next demonstrates that the notion of
privacy is complex; this complexity is likely to impede
development of technical solutions.
Finally, the paper finally uses the notion of informed
consent to illustrate how the privacy problem can be
viewed from multiple angles to flesh out the underlying
problems that may be addressed by technical solutions.
Tools for Privacy Preserving Data Mining
C. Clifton, M.
Kantarcioglu, J. Vaidya, X. Lin and M.Y. Zhu
(available in PDF
and Postscript
formats) ABSTRACT:  
Privacy preserving mining of distributed data has numerous
applications. Each application poses different
constraints: What is meant by privacy, what are the
desired results, how is the data distributed, what are
the constraints on collaboration and cooperative
computing, etc. We suggest that the solution to this is
a toolkit of components that can be combined for
specific privacy-preserving data mining applications.
This paper presents some components of such a toolkit,
and shows how they can be used to solve several privacy-
preserving data mining problems.
Applying Data Mining to Intrusion Detection: The Quest for Automation, Efficiency, and Credibility
W. Lee
(available in PDF
and Postscript
formats) ABSTRACT:  
Intrusion detection is an essential component of the
layered computer security mechanisms. It requires
accurate and efficient models for analyzing a large amount
of system and network audit data. This paper is an
overview of our research in applying data mining
techniques to build intrusion detection models. We
describe a framework for mining patterns from system and
network audit data, and constructing features according
to analysis of intrusion patterns. We discuss approaches
for improving the run-time efficiency as well as the
credibility of detection models. We report the
ideas, algorithms, and prototype systems we have
developed, and discuss open research problems.
Randomization in Privacy-Preserving Data Mining
A. Evfimievski
(available in PDF
and Postscript
formats) ABSTRACT:  
Suppose there are many clients, each having some personal
information, and one server, which is interested only in
aggregate, statistically significant, properties of this
information. The clients can protect privacy of their
data by perturbing it with a randomization algorithm and
then submitting the randomized version. The randomization
algorithm is chosen so that aggregate properties of the
data can be recovered with sufficient precision, while
individual entries are significantly distorted. How much
distortion is neeed to protect privacy can be determined
using a privacy measure. Several possible privacy
measures are known; finding the best measure is an open
question. This paper presents some methods and results
in randomization for numerical and categorical data, and
discusses the issue of measuring privacy.
Contributed
Articles
A Survey on Wavelet Applications in Data Mining
T. Li, Q. Li, S. Zhu, and M. Ogihara
(available in PDF
and Postscript
formats) ABSTRACT:  
Recently there have been significant development in the
use of wavelet methods in various data mining processes.
However, there has been written no comprehensive survey
available on the topic. The goal of this paper is to fill
the void. First, the paper presents a high-level data-
mining framework that reduces the overall process into
smaller components. The paper concludes by discussing the
impact of wavelets on data mining research and outlining
potential future research directions and applications.
A Perspective on Inductive Databases
L. De Raedt
(available in PDF
and Postscript
formats) ABSTRACT:  
Inductive databases tightly integrate databases with data
mining. The key ideas are that data and patterns (or
models) are handled in the same way and that an inductive
query language allows the user to query and manipulate
the patterns (or models) of interest.
This paper proposes a simple and abstract model for
inductive databases. We describe the basic formalism, a
simple but fairly powerful inductive query language, some
basics of reasoning for query optimization, and discuss
some memory organization and implementation issues.
The True Lift Model - A Novel Data Mining Approach to Response Modeling in Database Marketing
V. S. Y. Lo
(available in PDF
and Postscript
formats) ABSTRACT:  
In database marketing, data mining has been used extensively to find
the optimal customer targets so as to maximize return on
investment. In particular, using marketing campaign data, models are
typically developed to identify characteristics of customers who are
most likely to respond. While these models are helpful in identifying
the likely responders, they may be targeting customers who have
decided to take the desirable action or not regardless of whether they
receive the campaign contact (e.g. mail, call). Based on many years of
business experience, we identify the appropriate business objective
and its associated mathematical objective function. We point out that
the current approach is not directly designed to solve the appropriate
business objective. We then propose a new methodology to identify the
customers whose decisions will be positively influenced by
campaigns. The proposed methodology is easy to implement and can be
used in conjunction with most commonly used supervised learning
algorithms. An example using simulated data is used to illustrate the
proposed methodology. This paper may provide the database marketing
industry with a simple but significant methodological improvement and
open a new area for further research and development.
Reports from KDD-2002
Background and Overview for KDD Cup 2002 Task 1: Information Extraction from Biomedical Articles
A. Yeh, L. Hirschman, A. Morgan
(available in PDF
and Postscript
formats) ABSTRACT:  
This paper presents a background and overview for task 1
(of 2 tasks) of the KDD Challenge Cup 2002, a competition
held in conjunction with the ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (KDD),
July 23-26, 2002. Task 1 dealt with detecting which
papers, in a set of fruit fly genetics papers (texts),
contained experimental results about gene products
(transcripts and proteins), and also within each paper,
which genes had experimental results about their
products mentioned.
Rule-based Extraction of Experimental Evidence in the Biomedical Domain - the KDD Cup (Task 1)
Y. Regev, M. Finkelstein-Landau, and R. Feldman
(available in PDF
and Postscript
formats) ABSTRACT:  
Below we describe the winning system that we built for
the KDD Cup 2002 Task 1 competition. Our system is a
Rule-based Information Extraction (IE) system. It
combines pattern matching, Natural Language Processing
(NLP) tools, semantic constraints based on the domain
and the specific task, and a post-processing stage for
making the final curation decision based on the various
evidence (positive and negative) found within the
document. Development and implementation were made
using the DIAL IE language and the
ClearLab development environment. The results
achieved were significantly superior than those
achieved using categorization approaches.
A Machine Learning Approach for the Curation of Biomedical Literature - KDD Cup 2002 (Task 1)
S.S. Keerthi, C.J. Ong,
K.B. Siah, D.B.L. Lim, W. Chu,
M. Shi, D.S. Edwin, R. Menon, L. Shen, J.Y.K. Lim, and H.T. Loh
(available in PDF
and Postscript
formats) ABSTRACT:  
In this paper, we present an automated text classification system for the classification of biomedical papers. This classification is based on whether there is experimental evidence for the expression of molecular gene products for specified genes within a given paper. The system performs pre-processing and data cleaning, followed by feature extraction from the raw text. It subsequently classifies the paper using the extracted features with a Naïve Bayes Classifier. Our approach has made it possible to classify (and curate) biomedical papers automatically, thus potentially saving considerable time and resources.
Automatic Scientific Text Classification Using Local Patterns: KDD CUP 2002 (Task 1)
M. M. Ghanem, Y. Guo, H. Lodhi, and Y. Zhang
(available in PDF
and Postscript
formats) ABSTRACT:  
In this paper, we describe our approach for addressing Task 1 in the KDD CUP 2002 competition. The approach is based on developing and using an improved automatic feature selection method in conjunction with traditional classifiers. The feature selection method used is based on capturing frequently occurring keyword combinations (or motifs) within short segments of the text of a document and has proved to produce more accurate classification results than approaches relying solely on using keyword-based features.
The Genomics of a Signaling Pathway: A KDD Cup Challenge Task
M. Craven
(available in PDF
and Postscript
formats) ABSTRACT:  
This article describes Task 2 of the KDD Cup 2002 data-
mining competition. The task involved predicting whether
deletions of individual genes in a yeast genome would
affect a particular signaling pathway in the cell. With
its rich data sources and abundance of missing
information, the task is representative of data mining
problems in genomics and proved to be quite challenging.
One Class SVM for Yeast Regulation Prediction
A. Kowalczyk and B. Raskutti
(available in PDF
and Postscript
formats) ABSTRACT:  
In this paper, we outline the main steps leading to the development of
the winning solution for Task 2 of KDD Cup 2002 (Yeast Gene Regulation
Prediction). Our unusual solution was a pair of linear classifiers in
high dimensional space (~14,000), developed with just 38 and 84
training examples, respectively, all belonging to the target class
only. The classifiers were built using the support vector machine
approach outlined in the paper.
Predicting the Effects of Gene Deletion
D. S. Vogel and R. C. Axelrod
(available in PDF
and Postscript
formats) ABSTRACT:  
In this paper, we describe techniques that can be used to predict the effects of gene deletion. We will focus mainly on the creation of predictive variables, and then briefly discuss different modeling techniques that have been used successfully on this data.
Combining Data and Text Mining Techniques for Yeast Gene Regulation Prediction: A Case Study
M.-A. Krogel, M. Denecke, M. Landwehr, and T. Scheffer
(available in PDF
and Postscript
formats) ABSTRACT:  
In order to solve task 2 of the KDD Cup 2002, we exploited various available information sources. In particular, use of relational information describing the interactions among genes and information automatically extracted from scientific abstracts improves the accuracy of our predictions.
Feature Engineering for a Gene Regulation Prediction Task
G. Forman
(available in PDF
and Postscript
formats) ABSTRACT:  
This paper describes an approach that won honorable mention for the gene regulation prediction task of the 2002 KDD Cup competition. Our methodology used extensive cross-validation to direct the search for an appropriate problem representation and the selection of an 'off-the-shelf' induction algorithm. A prominent trait of the dataset is the presence of three hierarchical attributes, for each of which we generated a novel predictive feature: the percentage of positives hierarchically aggregated at the node specified by the instance.
P-tree Classification of Yeast Gene Deletion Data
A. Perera, A. Denton, P. Kotala, W. Jockheck, W. V. Granda, and W. Perrizo
(available in PDF
and Postscript
formats) ABSTRACT:  
Genomics data has many properties that make it different from "typical" relational data. The presence of multi-valued attributes as well as the large number of null values led us to a P-tree-based bit-vector representation in which matching 1-values were counted to evaluate similarity between genes. Quantitative information such as the number of interactions was also included in the classifier. Interaction information allowed us to extend the known properties of one protein with information on its interacting neighbors. Different feature attributes were weighted independently. Relevance of different attributes was systematically evaluated through optimization of weights using a genetic algorithm. The AROC value for the classified list was used as the fitness function for the genetic algorithm.
Report on the SIGKDD-2002 Panel The Perfect Data Mining Tool: Interactive or Automated
M. Ankerst
(available in PDF
and Postscript
formats) ABSTRACT:  
Commercial and academic data mining tools range from being fully automated to highly interactive. We will discuss the role of human involvement in the data mining process. On the one hand, providing interactivity/visualization enables domain knowledge transfer and the use of the human's perceptual capabilities. On the other hand, the vast amount of data to be mined today makes real-time interactivity hard to achieve and unnecessarily burdens the user to perform tasks that may be done automatically. Questions to be discussed include: What are the ideal roles of the computer and of the user in the data mining process? Which data mining methods (clustering, classification, association rules,…) can be improved by more human involvement? What kind of applications requires more human involvement? What kind of applications requires little or no human involvement?
The panel was organized by Mihael Ankerst (The Boeing Company) and the participants were Surajit Chauduri (Microsoft Research), Georges Grinstein (University of Massachusetts Lowell & AnVil, Inc.), Jiawei Han (University of Illinois at Urbana-Champaign), and Gregory Piatetsky-Shapiro (KDnuggets).
BIOKDD 2002: Recent Advances in Data Mining for Bioinformatics
M. J. Zaki, J. T. L. Wang, H. T. T. Toivonen
(available in PDF
and Postscript
formats) ABSTRACT:  
Bioinformatics provides opportunities for developing
novel data mining methods. Some of the grand challenges
in bioinformatics include protien structure prediction,
homology search, multiple alignment and phylogeny
construction, genomic sequence analysis, gene finding
and gene mapping, as well as applications in gene
expression data analysis, drug discovery in pharmaceutical
industry, etc. Following the great successful 1st
BIOKDD Workshop, the 2nd BIOKDD Workshop on Data Mining
in Bioinformatics was held in July 2002, at Edmonton,
Canada, in conjunction with the 8th ACM SIGKDD
International Conference on Knowledge Discovery and Data
Mining.
KDD-2002 Workshop Report Fractals and Self-similarity in Data Mining: Issues and Approaches
J. Adibi and C. Faloutsos
(available in PDF
and Postscript
formats) ABSTRACT:  
In this report we provide a summary of the first workshop on application of self-similarity and fractals in data mining: issues and approaches held in conjunction with ACM SIGKDD 2002, July 23 at Edmonton, Alberta, Canada.
MDM/KDD2002: Multimedia Data Mining between Promises and Problems
S. J. Simoff and C. Djeraba
(available in PDF
and Postscript
formats) ABSTRACT:  
This report presents a brief overview of multimedia data mining and the corresponding workshop series at ACM SIGKDD conference series on data mining and knowledge discovery. It summarizes the presentations, conclusions and directions for future work that were discussed during the 3rd edition of the International Workshop on Multimedia Data Mining, conducted in conjunction with KDD-2002 in Edmonton, Alberta, Canada.
Multi-Relational Data Mining: a Workshop Report
S. Dzeroski and L. De Raedt
(available in PDF
and Postscript
formats) ABSTRACT:  
In this report, we briefly review the Multi-Relational
Data Mining workshop, which was held in Edmonton, Canada
on July, 23, 2002 as part of the workshop program of the
8th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining (KDD-02).
WEBKDD 2002 - Web Mining for Usage Patterns & Profiles
B. M. Masand, M. Spiliopoulou, J. Srivastava, and O. R. Zaiane
(available in PDF
and Postscript
formats) ABSTRACT:  
In this paper, we provide a summary of the WEBKDD 2002 workshop, whose theme was 'Web Mining for Usage Patterns and Profiles'. This workshop was held in conjunction with the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002).
News, Events and Announcements
Advertisements
(available in PDF format)
Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003)
(available in PDF format)
|