
Clustering

The great thing about the net is that so much information is available, if you know how to find it. Want more information about HMMs? Search for "HMM tutorial" or "Markov tutorial." Want more information about clustering? Search for "clustering tutorial." I found several good ones in the top hits. I'll go through one clustering tutorial in class.

Biopython includes the Pycluster clustering package. I'll use South African mileage data to show how to use it. Here's the code I wrote to convert the data into a Python Numeric array. I copied and pasted the table from the page into a Python string, then parsed the string to get the table information.

import Numeric

lines = """
bloemfontein
1009	cape town
625	1532	durban
559	1034	650	east london
746	425	1219	615	george
389	1407	567	963	1142	johannesburg
157	963	792	721	726	482	kimberley
409	1307	860	994	1071	296	345	mmabatho
788	1798	760	1250	1534	395	873	689	graskop
716	456	1275	671	57	1115	671	1015	1505	oudtshoorn
632	738	901	297	319	1059	736	1080	1450	376	port elizabeth
508	1214	424	226	795	831	670	933	1024	852	478	umtata
""".strip().splitlines()

names = []
distances = Numeric.zeros( (len(lines), len(lines)), Numeric.Int )
for i in range(len(lines)):
    line = lines[i]
    # Line i holds i distances followed by the city name, which may
    # itself contain spaces, so limit the number of splits.
    words = line.split(None, i)
    names.append(words[-1])
    for j in range(i):
        distances[i][j] = distances[j][i] = float(words[j])

# City names
print names
# used to make the Numeric array for "distances.py"
print repr(distances)

# Visually double-check that a couple entries are correct
def get_distance(from_city, to_city):
    from_index = names.index(from_city)
    to_index = names.index(to_city)
    return distances[from_index][to_index]

print get_distance("cape town", "johannesburg")
print get_distance("port elizabeth", "durban")

I'll cluster this data on the assumption that there are two regions in South Africa. I'll use the k-means algorithm. In Pycluster there are two functions for doing k-means: "kcluster", which takes a point in N-dimensional space for each data element, and "kmedoids", which takes the full distance matrix. The distances here are road distances and not straight-line/great-circle distances, so I'll use kmedoids. I'll find two clusters:

import Numeric
from Bio import Cluster

names = [ "Bloemfontein", "Cape Town", "Durban", "East London",
          "George", "Johannesburg", "Kimberley", "Mmabatho", "Graskop",
          "Oudtshoorn", "Port Elizabeth", "Umtata"]
    
distances = Numeric.array([
[   0, 1009,  625,  559,  746,  389,  157,  409,  788,  716,  632,  508],
[1009,    0, 1532, 1034,  425, 1407,  963, 1307, 1798,  456,  738, 1214],
[ 625, 1532,    0,  650, 1219,  567,  792,  860,  760, 1275,  901,  424],
[ 559, 1034,  650,    0,  615,  963,  721,  994, 1250,  671,  297,  226],
[ 746,  425, 1219,  615,    0, 1142,  726, 1071, 1534,   57,  319,  795],
[ 389, 1407,  567,  963, 1142,    0,  482,  296,  395, 1115, 1059,  831],
[ 157,  963,  792,  721,  726,  482,    0,  345,  873,  671,  736,  670],
[ 409, 1307,  860,  994, 1071,  296,  345,    0,  689, 1015, 1080,  933],
[ 788, 1798,  760, 1250, 1534,  395,  873,  689,    0, 1505, 1450, 1024],
[ 716,  456, 1275,  671,   57, 1115,  671, 1015, 1505,    0,  376,  852],
[ 632,  738,  901,  297,  319, 1059,  736, 1080, 1450,  376,    0,  478],
[ 508, 1214,  424,  226,  795,  831,  670,  933, 1024,  852,  478,    0],
], Numeric.Float)

clusterids, error, nfound = Cluster.kmedoids(distances, 2)
print "Cluster ids:", clusterids
print "error:", error
print "nfound:", nfound

cities_in_cluster = {}
for name, clusterid in zip(names, clusterids):
    cities_in_cluster.setdefault(clusterid, []).append(name)

# Showing off a nice module for when you have long text that
# should fold over multiple lines.
import textwrap
for centroid_id, city_names in cities_in_cluster.items():
    print "Cluster around", names[centroid_id]
    text = ", ".join(city_names)
    for line in textwrap.wrap(text, 70):
        print "  ", line

Here's the output:
Cluster ids: [0 4 0 0 4 0 0 0 0 4 4 0]
error: 4236.0
nfound: 1
Cluster around Bloemfontein
   Bloemfontein, Durban, East London, Johannesburg, Kimberley, Mmabatho,
   Graskop, Umtata
Cluster around George
   Cape Town, George, Oudtshoorn, Port Elizabeth

Looking at a map, this grouping seems to make sense.

I'll try again with 3 clusters.

Cluster around Umtata
   Durban, East London, Umtata
Cluster around George
   Cape Town, George, Oudtshoorn, Port Elizabeth
Cluster around Johannesburg
   Bloemfontein, Johannesburg, Kimberley, Mmabatho, Graskop
and with 4:
Cluster around Bloemfontein
   Bloemfontein, Kimberley
Cluster around Umtata
   Durban, East London, Umtata
Cluster around George
   Cape Town, George, Oudtshoorn, Port Elizabeth
Cluster around Johannesburg
   Johannesburg, Mmabatho, Graskop

I can increase k, but the point of clustering is to simplify the data set in order to understand it. Too many clusters overfit the data and make spurious clusters appear.
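
One quick check on the choice of k is to rerun kmedoids for several values and compare the returned error, which is the total distance from each city to its cluster's medoid. Here's a small sketch reusing the names and distances variables from above; npass repeats the clustering from different random starting points and keeps the best result found.

# Sketch: watch how the k-medoids error changes as k grows.
# Reuses the 'names' and 'distances' variables defined earlier.
for k in range(1, 7):
    clusterids, error, nfound = Cluster.kmedoids(distances, k, npass=10)
    medoids = dict.fromkeys(clusterids).keys()
    medoids.sort()
    print "k =", k, "error =", error, \
          "medoids:", ", ".join([names[i] for i in medoids])

The error always shrinks as k grows (with one cluster per city it reaches zero), so a smaller error by itself doesn't mean a better clustering; look for the point where adding another cluster stops helping much.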

You might think the clusterings aren't quite right. You'll get different results if you use a different metric or include more distinguishing characteristics. For example, you might decide that coastal cities should be considered more similar to each other than to inland ones. Coming up with good descriptors is hard.

By the way, the NxN distance matrix contains duplicate data because distance(x,x) == 0 and distance(x,y) == distance(y,x). Pycluster only needs the lower triangle of the matrix to get everything it needs. This matrix also works for the kmedoids function:
distances = Numeric.array(
[1009,
  625, 1532,
  559, 1034,  650,
  746,  425, 1219,  615,
  389, 1407,  567,  963, 1142,
  157,  963,  792,  721,  726,  482,
  409, 1307,  860,  994, 1071,  296,  345,
  788, 1798,  760, 1250, 1534,  395,  873,  689,
  716,  456, 1275,  671,   57, 1115,  671, 1015, 1505,
  632,  738,  901,  297,  319, 1059,  736, 1080, 1450,  376,
  508, 1214,  424,  226,  795,  831,  670,  933, 1024,  852,  478,
], Numeric.Float)

clusterids, error, nfound = Cluster.kmedoids(distances, 2)

Hierarchical clustering

Here's the code to make a hierarchical clustering of the same data set using Pycluster's treecluster() function. Unlike the k-means code, there isn't one function for when a distance matrix is known and another for when it must be computed from the data points. Both forms are merged into one function, with a different keyword argument for each. The one used here is distancematrix:

tree, linkdist = Cluster.treecluster(distancematrix = distances)
print "Tree:", tree
print "linkdist:", linkdist
which produces
Tree: [[  9   4]
 [  6   0]
 [ 11   3]
 [  7   5]
 [ 10  -1]
 [ -4  -2]
 [ -3   2]
 [ -5   1]
 [  8  -6]
 [ -7  -9]
 [ -8 -10]]
linkdist: [   57.   157.   226.   296.   376.   482.   650.   738. 
  873.  1250.  1798.]

The tree data structure is one commonly used for clustering. Each merge creates a new tree node, indexed starting at "-1" and going progressively more negative. Tree node "-1" here is the node formed by merging cities 9 (Oudtshoorn) and 4 (George), and "-2" is the merger of cities 6 (Kimberley) and 0 (Bloemfontein). Tree node "-5" is the merger of city 10 (Port Elizabeth) and tree node -1.
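
To make the numbering concrete, here's a small helper of my own (not part of Pycluster) that expands a tree node id into the city names beneath it. The expected results in the comments follow from the tree printed above.

# Sketch: expand a tree node into the list of cities beneath it.
# Non-negative ids are original cities; a negative id -k is the
# merge stored in tree[k-1].
def get_members(node_id, tree, names):
    if node_id >= 0:
        return [names[node_id]]
    left, right = tree[-node_id-1]
    return get_members(left, tree, names) + get_members(right, tree, names)

print get_members(-1, tree, names)  # ['Oudtshoorn', 'George']
print get_members(-5, tree, names)  # ['Port Elizabeth', 'Oudtshoorn', 'George']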

Here's some code to convert that tree data structure into a simple tree display. Better code would reduce the clutter.

import Numeric
from Bio import Cluster

names = [ "Bloemfontein", "Cape Town", "Durban", "East London",
          "George", "Johannesburg", "Kimberley", "Mmabatho", "Graskop",
          "Oudtshoorn", "Port Elizabeth", "Umtata"]
    
distances = Numeric.array(
[1009,
  625, 1532,
  559, 1034,  650,
  746,  425, 1219,  615,
  389, 1407,  567,  963, 1142,
  157,  963,  792,  721,  726,  482,
  409, 1307,  860,  994, 1071,  296,  345,
  788, 1798,  760, 1250, 1534,  395,  873,  689,
  716,  456, 1275,  671,   57, 1115,  671, 1015, 1505,
  632,  738,  901,  297,  319, 1059,  736, 1080, 1450,  376,
  508, 1214,  424,  226,  795,  831,  670,  933, 1024,  852,  478,
], Numeric.Float)
tree, linkdist = Cluster.treecluster(distancematrix = distances)

#print "Tree:", tree
#print "linkdist:", linkdist

# Horizontal scale: kilometers of link distance per output character
SCALE = 100.0
def _show_tree(node_id, spacer, tree, linkdist, names):
    # Negative ids are internal (merged) nodes; non-negative ids are cities.
    if node_id < 0:
        i = -node_id-1
        left, right = tree[i]
        d = linkdist[i]

        # Scale the link distance down to a number of characters.
        w = int(d/SCALE)
        if w > 1:
            w -= 1

        _show_tree(left,  spacer+" "*w+"|", tree, linkdist, names)
        if w == 0:
            print spacer + "="
        else:
            print spacer + "-"*w+"|"
        _show_tree(right, spacer+" "*w+"|", tree, linkdist, names)
    else:
        print spacer + "-- " + names[node_id]
        

def show_tree(tree, linkdist, names):
    # The root is the last merge, which has the most negative id.
    _show_tree(-len(tree), "", tree, linkdist, names)

show_tree(tree, linkdist, names)

Its output:
                |      |  |-- Port Elizabeth
                |      |--|
                |      |  ||-- Oudtshoorn
                |      |  |=
                |      |  ||-- George
                |------|
                |      |-- Cape Town
----------------|
                |           |     | |-- Umtata
                |           |     |-|
                |           |     | |-- East London
                |           |-----|
                |           |     |-- Durban
                |-----------|
                |           |       |-- Graskop
                |           |-------|
                |           |       |   | |-- Mmabatho
                |           |       |   |-|
                |           |       |   | |-- Johannesburg
                |           |       |---|
                |           |       |   | |-- Kimberley
                |           |       |   |-|
                |           |       |   | |-- Bloemfontein

Predicting

Given the above tree you could make some sort of categorization, like "South Coast", "Inland", and "East Coast". What if you wanted to categorize a new city? Then the above data set becomes a training set. Get the new city's information, find the nearest element of the training set (or the nearest centroid, depending on your preference), and use its classification.
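
Here's a sketch of that nearest-neighbor lookup, reusing the names and clusterids variables from the two-cluster kmedoids run above. The new city's distances are invented for illustration; in practice they would come from the same road-mileage table.

# Sketch: assign a new city to the cluster of its nearest known city.
# 'names' and 'clusterids' come from the two-cluster kmedoids run above.
def nearest_cluster(new_city_distances, names, clusterids):
    # new_city_distances maps a known city name to its road distance
    # from the new city.
    best_name = None
    best_dist = None
    for name, dist in new_city_distances.items():
        if best_dist is None or dist < best_dist:
            best_name, best_dist = name, dist
    medoid_id = clusterids[names.index(best_name)]
    return names[medoid_id]

# Hypothetical distances for a new city (invented numbers, not real data):
new_city = {"Cape Town": 120, "George": 370, "Johannesburg": 1290}
print "Belongs to the cluster around", nearest_cluster(new_city, names, clusterids)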

Likely you'll have fuzzy data, or classifications which don't quite agree with your metric. For example, you might have a number of chemical compounds of which some are "active" (or "toxic") and others are "inactive" (or "safe"). (These are also called "white" and "black", with "grey" for ambiguous items.) Chemists find that chemical substructures provide useful insight into the activity of the compound as a whole. One frequently used structural descriptor is the MDL MACCS keys, which describes each compound as a set of bits, one bit per key.

Some of these bits are better than others at making the classification. Ideally you would like to take the MACCS keys descriptor for each compound and let the classification algorithm figure out which bits matter more than others. There are a few ways to do that. One is with support vector machines (SVMs). Biopython includes code for classification based on SVMs, but it recommends you use LIBSVM instead. Since you have Biopython installed, I'll show you an example based on it.
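
However the weights are obtained, a linear classifier of this sort boils down to one weight per bit plus an offset, with the prediction given by the sign of the weighted sum. Here's a purely conceptual sketch; the weights and bit patterns are invented for illustration, not the output of any trained model.

# Conceptual sketch of a trained linear classifier; weights and bits are
# invented for illustration, not from any real training run.
weights = [0.8, -1.2, 0.1, 2.3, -0.4]   # one weight per descriptor bit
offset = -0.5

def classify(bits):
    # bits is a list of 0/1 values, one per descriptor bit
    score = offset
    for w, b in zip(weights, bits):
        score += w * b
    return score > 0   # True means "active", False means "inactive"

print classify([1, 0, 0, 1, 0])   # True  (score = 2.6)
print classify([0, 1, 1, 0, 1])   # False (score = -2.0)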

Here's the program. It tries to identify which messages are spam or "ham" (not spam) based on a training vector made by looking at the most common words in a corpus of ham and spam.
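
The program itself isn't reproduced here. To give a rough idea of how the descriptor words might be chosen, here's a minimal sketch; the directory names, the three-letter word pattern, and the helper functions are my own assumptions for illustration, not the original code.

# Rough sketch of building the descriptor word list (not the original program).
import os, re

word_pattern = re.compile(r"[A-Za-z]{3,}")   # words of three or more letters

def get_word_counts(dirname):
    # Count how often each word appears across all messages in a directory.
    counts = {}
    for filename in os.listdir(dirname):
        text = open(os.path.join(dirname, filename)).read()
        for word in word_pattern.findall(text):
            word = word.upper()
            counts[word] = counts.get(word, 0) + 1
    return counts

def top_words(counts, n=10):
    pairs = [(count, word) for (word, count) in counts.items()]
    pairs.sort()
    pairs.reverse()
    return [word for (count, word) in pairs[:n]]

ham_counts = get_word_counts("ham")     # hypothetical directory of ham messages
spam_counts = get_word_counts("spam")   # hypothetical directory of spam messages

# The descriptor is the union of the two top-10 lists; each message is then
# turned into a vector of counts for these words before training.
descriptor_words = dict.fromkeys(top_words(ham_counts) +
                                 top_words(spam_counts)).keys()
print "Using descriptor vector of size", len(descriptor_words)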

Here's the output when I run the full program:

Top 10 words in ham
       THE    2598
       AND    1258
      FONT    1007
       FOR     794
      HTTP     649
       YOU     634
      NBSP     578
      THAT     547
       COM     500
      THIS     464
Top 10 words in spam
      FONT    1590
       THE    1063
     TABLE     615
       YOU     607
       AND     600
     ALIGN     587
    CENTER     553
   CONTENT     545
      SIZE     526
      HTTP     509
Using descriptor vector of size 15
THE 244.07
FONT 173.13
AND 123.87
YOU 82.73
HTTP 77.20
FOR 52.93
TABLE 41.00
ALIGN 39.13
NBSP 38.53
CENTER 36.87
THAT 36.47
CONTENT 36.33
SIZE 35.07
COM 33.33
THIS 30.93

============= HAM TESTS
First ham in training set: 0.120362351148
Unknown ham: 1.05219354544
============================================================
(You have acquired this SF.net update because you requested to be on the
list. Honest! An explanation is at the bottom of this email, and also
info on how to be extracted from the list.)

0. Introduction
1. August Project of the Month (POTM): Gourmet Recipe Manager
2. July Project of the Month (POTM): JasperReports
3. June Project of the Month (POTM): Nagios
4. SourceForge.net Subscriptions - Free T-shirt
5. Statistics


This SF.net update brought to you by HP Linux Reference Architecture
-------------------------------------------------------------------
Get everything you need for a successful Linux implementation right
here!  From downloads, to information on industry leading platforms,
best of breed partner solutions (from BEA, JBoss, Oracle, MySQL), to
comprehensive management tools, and global support services. See the
Linux Reference Architecture resources site at:
http://solutions.itmanagersjournal.com/hplra.tmpl
-----------------------------------------------------------------
============================================================


============= SPAM TESTS
First spam in training set: -1.03056381363
Unknown spam: -0.222528260367
============================================================
PEhUTUw+DQo8cD48Zm9udCBmYWNlPSJBcmlhbCIgc2l6ZT0iMiI+SSBtZXQg
bWUge2Rpc3x0aGlzfSB7Y2hpY2tpZXxnaXJsaWV9LCBzaGUgdGhvdWdodCAN
Ckkgd2FzIHJpY2guJm5ic3A7IEkge2hpdHx0YXBwZWR9IGl0IGFsbCBuaWdo
dCBsb25nLjwvZm9udD48L3A+DQo8cD48Zm9udCBmYWNlPSJBcmlhbCIgc2l6
ZT0iMiI+PGEgaHJlZj0iaHR0cDovL3d3dy5tZWRpYW1ha2VyaHVnbWFya2V0
c2hha2VyLmNvbS9kcmlwcHkiPg0KaHR0cDovL3tnZXRzb21lfGlnb3Rzb21l
fS5kcmlwcHB5eHNsaXR0ei5jb208L2E+PC9mb250PjwvcD4NCjxwPntOb25l
IG9mIHVzIHdpbGwgZXZlcnkgYWNjb21wbGlzaCBhbnl0aGluZyBleGNlbGxl
bnQgb3IgY29tbWFuZGluZyBleGNlcHQgDQp3aGVuIGhlIGxpc3RlbnMgdG8g
dGhpcyB3aGlzcGVyIHdoaWNoIGlzIGhlYXJkIGJ5IGhpbSBhbG9uZS4gfEl0
IGRvZXMgbm90IGRvIHRvIA0KZHdlbGwgb24gZHJlYW1zIGFuZCBmb3JnZXQg
dG8gbGl2ZS4mbmJzcDsgfTwvcD4NCjxwPiZuYnNwOzwvcD4NCjxwPiZuYnNw
OzwvcD4NCjxwPjxmb250IGxhbmc9IjAiIGZhY2U9IkFyaWFsIiBzaXplPSIy
IiBGQU1JTFk9IlNBTlNTRVJJRiIgUFRTSVpFPSIxMCI+aWYgeW91IA0KbmV2
ZXIgd2FudCB0byBzZWUgdGhpczxicj4NCjwvZm9udD4NCjxmb250IGxhbmc9
IjAiIHN0eWxlPSJiYWNrZ3JvdW5kLWNvbG9yOiAjZmZmZmZmIiBmYWNlPSJB
cmlhbCIgY29sb3I9IiMwMDAw
============================================================

Of course, if you really want to filter spam you should look at A Plan For Spam, with Python code available in SpamBayes. If you want to write your own classifier, start with that code base; it's well-written, fast, and powerful.

This field is called "KDD", for "knowledge discovery in databases". It's very wide-ranging and there's far more going on in it than I know about, much less can provide pointers to. Might as well start someplace and explore.



Copyright © 2001-2013 Andrew Dalke Scientific AB