Lecture 10, 25.11.2008
Clustering and Self Organizing Maps
Motivation and Objectives
- Objective of exploratory data analysis: simplified descriptions and
summaries of large data sets
- Clustering
Find similarity relations between objects in the high-dimensional signal space
- Self Organizing Maps
Project the high-dimensional signal space onto a two-dimensional grid
Outline
- Clustering
- Self Organizing Maps
- Emotion SOM
Clustering
- Problem Definition
- Clustering Algorithms
o Hierarchical
. Distance Measure
. Variants of Clustering
o Partitional
Problem Definition
- A is a finite set of n data objects [pic]
- A has to be partitioned into k subsets [pic]
- Quality
o Distances between objects within the same cluster should be as small as possible
o Distances between different clusters should be as large as possible
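A plausible reading of the placeholders, in standard notation (a reconstruction, not necessarily the slide's exact symbols):

  A = \{a_1, \dots, a_n\}, \qquad A = C_1 \cup \dots \cup C_k, \qquad C_i \cap C_j = \emptyset \ \ (i \neq j)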
Overview
- Hierarchical
find successive clusters using previously established clusters
o Agglomerative bottom-up merging methods
o Divisive top-down splitting methods
- Partitional
determine all clusters at once
Example: Hierarchical Agglomerative Clustering
i) start with six data objects
ii) merge clusters into successively larger clusters
Hierarchical Agglomerative Clustering Algorithm
1. Each element is a separate cluster: [pic]
2. Find closest pair [pic] with [pic]
3. Replace [pic] with [pic]
4. If [pic] replace [pic] with [pic]
5. [pic]
6. if [pic] goto 2
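A minimal Python sketch of this loop, assuming Euclidean distance and single linkage and stopping once a chosen number of clusters k remains (both the distance measure and the linkage are choices, as discussed next):

import numpy as np

def agglomerative(points, k):
    # points: (n, d) array of data objects
    # Step 1: every object starts as its own cluster (lists of indices into points)
    clusters = [[i] for i in range(len(points))]
    def dist(ci, cj):
        # Single-linkage cluster distance: minimum pairwise Euclidean distance
        return min(np.linalg.norm(points[a] - points[b]) for a in ci for b in cj)
    while len(clusters) > k:
        # Step 2: find the closest pair of clusters
        i, j = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda pair: dist(clusters[pair[0]], clusters[pair[1]]))
        # Steps 3-5: merge the pair and drop the absorbed cluster
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters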
How to find the closest cluster pair
- How to measure similarity
→ Distance Measure
- How distances are measured between clusters
→ Variants of Clustering
Distance Measure
- Demands
o Triangle inequality
o Symmetry
o Positive definite
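Spelled out for a distance function d, these demands are the usual metric axioms:

  d(x, z) \le d(x, y) + d(y, z)                  (triangle inequality)
  d(x, y) = d(y, x)                              (symmetry)
  d(x, y) \ge 0,\ \ d(x, y) = 0 \iff x = y       (positive definiteness)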
Distance Measure: Euclid
- Most common metric
- Often inadequate when dimensions are measured on very different scales,
e.g. the physical dimensions of broomsticks
Distance Measure: Pearson
- Balanced inclusion of all dimensions
- No correction for correlation in the variables
Distance Measure: Mahalanobis
- Takes correlation in the variables into account
- Corrects data for different scales
Distance Measure: City-Block
- Also known as rectilinear distance, L1 distance or Manhattan distance
- Distance between two points is the sum of the absolute differences of
their coordinates
Distance Measure: Minkowski
- More general distance measure
- Special cases: Euclid (p = 2) and City-Block (p = 1)
Distance Measure: Maximum Norm
- Also known as Supremum Norm or Chebyshev Norm
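For reference, the standard formulas behind these measures (s_i denotes the standard deviation of dimension i and S the covariance matrix of the data; the Pearson distance is read here as the standardized Euclidean distance, matching the bullets above):

  d_{Euclid}(x, y)       = \sqrt{\sum_i (x_i - y_i)^2}
  d_{Pearson}(x, y)      = \sqrt{\sum_i (x_i - y_i)^2 / s_i^2}
  d_{Mahalanobis}(x, y)  = \sqrt{(x - y)^T S^{-1} (x - y)}
  d_{CityBlock}(x, y)    = \sum_i |x_i - y_i|
  d_{Minkowski}(x, y)    = \left( \sum_i |x_i - y_i|^p \right)^{1/p}
  d_{Max}(x, y)          = \max_i |x_i - y_i|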
Variants of Clustering
- There are many variants available
- The criteria used differ and hence different clusterings may be
obtained for the same data, even when using the same distance measure!
Variants of Clustering: Single Linkage
- Minimum distance between members of the two clusters
Variants of Clustering: Complete Linkage
- Maximum distance between members of the two clusters
Variants of Clustering: Centroid Linkage
- Distance between the centroids of the two clusters
Variants of Clustering: Average Linkage
- Average distance between each point in the first cluster and each point
in the second cluster
Variants of Clustering: Ward's Method
- Calculate the total sum of squared deviations from the mean of a
cluster
- Fuse the pair of clusters that produces the smallest possible increase
in the error sum of squares E
Partitional Clustering: K-means - Qualitative
- Basic idea: assign each object to the cluster whose centre is nearest
1. Choose the number of clusters k
2. Randomly generate k clusters and determine the cluster centres
3. Assign each object to the nearest cluster centre
4. Recompute the new cluster centres
5. Repeat the two previous steps until the assignment of objects to
clusters no longer changes
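A compact Python sketch of these five steps (illustrative only; the initial centres are drawn from the data, and empty clusters are not handled):

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]  # step 2: random initial centres
    for _ in range(iters):
        # Step 3: assign each object to the nearest cluster centre
        labels = np.argmin(np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2), axis=1)
        # Step 4: recompute the centres as the mean of their assigned objects
        new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop when the centres (and hence the assignment) no longer change
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres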
Partitional Clustering: K-means - Formal
- Initialization
Randomly generate K clusters with means [pic]
- Assignment step
Assign each object [pic] to the nearest cluster centre
[pic]
Set indicator variable [pic] to one if mean k is the closest mean to
[pic], otherwise [pic] is zero
- Update step
Recompute the new cluster centres [pic]
[pic]
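The placeholders above presumably contain the standard K-means equations; in common notation (indicator r_nk, means mu_k) they read:

  r_{nk} = 1 \ \text{if}\ k = \arg\min_j \lVert x_n - \mu_j \rVert^2, \quad \text{otherwise}\ r_{nk} = 0
  \mu_k = \frac{\sum_n r_{nk}\, x_n}{\sum_n r_{nk}}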
Clustering: References
Clustering: Exercises
Self Organizing Maps - Main Aspects
- Wine Example
- Mapping Signal Space to SOM
- SOM Training
- Prediction
- Supervised SOMs
- Wearable Computing Application: Emotion SOM
Fields of Application
Main Aspects
- Visualization
Display the high-dimensional signal space on a two-dimensional grid of nodes
- Abstraction
Preserve the topological relationships of the signal space on the
two-dimensional display
Wine Example
- 178 wines grown in the same region in Italy
- Derived from three different cultivars
- Chemical analysis determined the quantities of 13 constituents
SOM Configuration
- Consists of a set of non-interconnected units
- Units are spatially ordered in a two-dimensional grid
- Each unit in the map is equipped with a weight vector
- Weight vectors have the same dimension as the input objects
Mapping Signal Space to SOM
- Given an input vector
[pic]
- Given a SOM where each unit S is equipped with a weight vector
[pic]
- The image of the input vector x on the SOM array is defined as the
array element s that matches best with x
[pic]
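In the usual Kohonen notation (a reconstruction of the placeholder, with w_i the weight vector of unit i), the best-matching element is

  s(x) = \arg\min_i \lVert x - w_i \rVert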
SOM Training
- Objective
Preserving the topological relationships of the signal space on the
two-dimensional display
- Basic Principle
SOM units that are close in the grid will activate each other to
"learn" from the same input x
- Learning
Adapt weight vectors during the learning process to respond similarly
to certain input patterns
Training: Initialization
- Weight vectors are usually initialized by
o The average input vector plus small random vectors, or
o Sampling from the subspace spanned by the two largest principal
component eigenvectors
Training: Winner unit
- Same procedure as in the mapping process
- The unit with the most similar weight vector is declared the winner
Training: Update of weight vectors - Qualitative
- Subsequently, the weight vectors of the winning unit and its closest
neighbours in the map are updated by
1. Calculating the difference between the current input object and the
respective weight vector
2. Adding this difference attenuated by a certain factor to the original
weight vector
- Thus the weights of the winning unit and its neighbours become
slightly more similar to the presented input object
Training: Size of Neighbourhood
- Initially, the size of the neighbourhood is approximately equal to
the size of the map itself
- During training, the size of the neighbourhood is gradually decreased
- Finally, only the weights of the winning unit itself are adapted
Training: Size of Neighbourhood - Consequences
- Initially, global characteristics of the input signal space are
captured
- During training, local clusters of units are formed
- Finally, the winning unit becomes specialised to those objects which
are frequently mapped onto it
Training: Update of weight vectors - Formal
- Given an input vector x at time t, the update of the weight vector
[pic] of the node i is done by
[pic]
using the neighbourhood function
[pic]
with a learning rate [pic] and with location vectors [pic] of nodes c
and i
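A minimal Python sketch of one such training step, assuming the standard Kohonen update with a Gaussian neighbourhood (the slide's exact neighbourhood function may differ; alpha and sigma stand for the learning rate and neighbourhood radius at time t):

import numpy as np

def som_train_step(weights, grid, x, alpha, sigma):
    # weights: (n_units, d) weight vectors; grid: (n_units, 2) node locations on the map
    c = np.argmin(np.linalg.norm(weights - x, axis=1))  # winner: best-matching unit for x
    # Gaussian neighbourhood h_ci = alpha * exp(-||r_c - r_i||^2 / (2 * sigma^2))
    h = alpha * np.exp(-np.sum((grid - grid[c]) ** 2, axis=1) / (2 * sigma ** 2))
    # w_i(t+1) = w_i(t) + h_ci(t) * (x - w_i(t)): winner and neighbours move towards x
    return weights + h[:, None] * (x - weights)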
Prediction
- Unit-wise averaging of the outputs associated with the mapped input
objects
- Problem: holes corresponding to units onto which none of the training
objects is mapped
Supervised SOMs: Overview
- Supervised Kohonen Network
- Bi-Directional Kohonen Network
- Counter Propagation Network
- XY-fused Kohonen Network
Supervised Kohonen Network - General
- The input map Xmap and the output map Ymap are "glued" together to a
combined input-output map XYmap
- The topological formation of the concatenated map is driven by X and Y
in a truly supervised way
Supervised Kohonen Network - Training
- Each input X and its corresponding output Y are concatenated to serve
as input for the combined XYmap
- Wine example
13-dim input + 3-dim output
- Same training procedure as for unsupervised SOM
- After training, the input and output maps are decoupled
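A small illustrative sketch of the concatenation step for the wine example (array names and the random placeholder values are mine):

import numpy as np

X = np.random.rand(178, 13)             # 13 chemical measurements per wine (placeholder values)
labels = np.random.randint(0, 3, 178)   # one of the three cultivars (placeholder values)
Y = np.eye(3)[labels]                   # 3-dimensional one-hot output
XY = np.concatenate([X, Y], axis=1)     # 16-dimensional input for the combined XYmap
# XYmap is trained like an ordinary unsupervised SOM; after training, the first 13
# weight components of each unit form the Xmap and the last 3 form the Ymap.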
Supervised Kohonen Networks - Limitations
- Objects X and Y in the training set must be scaled properly, for
optimal embedding of the topology
- It remains unclear how to deal with the relative weight of the number
of variables in X and the number of variables in Y
Bi-Directional Kohonen Network - General
- Units in the Xmap and the Ymap are updated in an alternating way
- Update is driven by the topology gradually embedded in the weight
vectors located in the opposite map
Bi-Directional Kohonen Network - Training
- Usage of a "fused" similarity measure based on a weighted sum of
o Similarities [pic] between an object X and all units in the Xmap
o Similarities between output Y and the units in the Ymap
- Location of the winning unit determined by dominating similarity
measure [pic]
- In the first updating pass, only the Xmap is adapted
- In the reverse pass, the Ymap is updated object-wise using the winner
determined by the dominating similarity [pic]
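Written out, the fused similarity is a weighted sum of the two per-map similarities (a reconstruction; the weight alpha(t) shifts during training so that one map dominates the winner selection):

  S_{fused}(i, t) = \alpha(t)\, S(x, w_i^{X}) + (1 - \alpha(t))\, S(y, w_i^{Y})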
Counter Propagation Network - General
Counter Propagation Network - Training
XY-fused Kohonen Network - General
XY-fused Kohonen Network