ECCB

Tutorials

Wednesday
September 28, 2005

9:00/13:15 - Room 9 (2nd Floor)
T1: Mining Structured Data in Bioinformatics
Prof. Stefan Kramer

The topic of the tutorial is mining in structured data. This is particularly relevant for data mining applications in bioinformatics, since the majority of biological data is not kept in databases consisting of a single, flat table. Instead, we are frequently dealing with databases of structured and linked objects. In other words, the “objects” in bioinformatics databases often have a rich internal structure and are connected by some relation. (Consider, for instance, databases of proteins, small molecules, metabolic and regulatory networks, text databases, etc.) The tutorial will give an overview of data mining techniques for sequences, trees, graphs and relational databases. We will present techniques for both descriptive and predictive data mining in this context. In descriptive mining, we are looking for local patterns to characterize the data. In predictive mining, we are looking for models that can be used to make predictions for new, unseen cases. Along the two dimensions (types of data and predictive/descriptive), the tutorial is organized as follows: the first two parts of the tutorial are devoted to descriptive mining in databases of itemsets, strings and sequences, trees, graphs and relational databases. The third part of the tutorial deals with predictive mining based on propositionalization (i.e., feature construction using patterns), instance-based learning and kernel methods for graph and relational databases.

Synopsis:

The tutorial will give an overview of data mining techniques for sequences, trees, graphs and relational databases. Tree mining is of interest for analyzing text or linguistic data (XML documents, parse trees, etc.), RNA data and phylogenetic trees. Graph mining is useful for finding network motifs in metabolic and regulatory networks, and substructural fragments in small molecules. Mining in relational databases is important for analyzing data from heterogeneous sources in the life sciences.

In the first part of the tutorial, I will explain, in ascending order of complexity, mining in itemset (transaction) data, in string or sequence data, and in tree data. Typical strategies for searching the space of patterns will be presented for the simple case of itemsets (transaction data): breadth-first (APriori and levelwise search), depth-first, and combined depth-first/breadth-first. Subsequently, the basic ideas will be transferred to mining tree and graph data. Another focus will be on the use of index structures for organizing both the data and the patterns, enabling fast access to information about them. String mining will again be explained in ascending order of complexity: first, we will explain mining interesting substrings (without wildcards) in string or sequence databases, then will extend the approach to substrings containing wildcards. Finally, we will explain serial episodes, i.e., substrings with arbitrary gaps between the characters. In tree mining, different classes of tree patterns can be chosen depending on the application in mind (from embedded to free trees). In the presentation, we will focus on the basic, underlying ideas more than on minor technical details.

In the second part, we will continue with mining general graph data. Several of the previously developed concepts can be applied to graph mining as well. After a brief introduction to the topic, we will sketch two of the most important approaches so far: AGM (performing breadth-first search), and gSpan (performing depth-first search). Subsequently, we will give a gentle introduction to mining relational databases. For convenience, we will restrict ourselves to simple conjunctive datalog queries. The Warmr algorithm for finding frequently successful datalog queries will be explained in terms of inputs and outputs.

In the third part of the tutorial, we will give an overview of predictive mining in structured data. First, we will explain how patterns found in descriptive mining can be used to reformulate data for predictive purposes. Second, instance-based learning techniques for structured data will be presented. Last, but not least, a survey of kernel methods for structured data, in particular graphs, is given.

All theoretical concepts and algorithms will be illustrated by running examples from data mining applications in bioinformatics. A comprehensive reference list will be included, with links to online resources and literature where available.

Part I: Descriptive Mining:
From Itemsets to Tree Mining
(90 minutes)

- Transaction data and frequent itemsets
- APriori and levelwise search
- String mining
- Introduction to string mining
- Version space trees
- Mining with wildcards
- Mining serial episodes
- Tree mining
- Introduction to tree mining
- Mining embedded subtrees (TreeMiner)
- Mining induced subtrees (FREQT)
- Mining unordered subtrees (Unot)
- Mining free trees (FreeTreeMiner)

Coffee Break (30 minutes)

Part II: Descriptive Mining:
Graph Mining and Mining Relational Databases
(60 minutes)

- Graph mining
- Introduction to graph mining
- AGM
- gSpan
- Mining relational databases
- SQL, datalog, conjunctive queries, query containment, theta-subsumption
- Warmr algorithm (inputs and outputs)

Part III: Predictive Mining:
Propositionalization, Instance-Based Learning and
Kernel Methods
(60 minutes)

- Propositionalization: using frequent patterns for predictive mining
- Instance-based learning: distance measures for graphs and relational databases
- Kernel methods: graph kernels

Back to Tutorial Index