Clustering
The great thing about the net is that so much information is available, if you know how to find it. Want more information about HMMs? Search for "HMM tutorial" or "Markov tutorial." Want more information about clustering? Search for " clustering tutorial. I found several good ones in the top hits. I'll go through one clustering tutorial in class.
Biopython includes the Pycluster clustering package. I'll use South African milage data to show how to use it. Here's the code I wrote to convert the data into a Python Numeric array. I copy-and-pasted the table from the page into a Python string then parsed the string to get the table information.
import Numeric
lines = """
bloemfontein
1009 cape town
625 1532 durban
559 1034 650 east london
746 425 1219 615 george
389 1407 567 963 1142 johannesburg
157 963 792 721 726 482 kimberley
409 1307 860 994 1071 296 345 mmabatho
788 1798 760 1250 1534 395 873 689 graskop
716 456 1275 671 57 1115 671 1015 1505 oudtshoorn
632 738 901 297 319 1059 736 1080 1450 376 port elizabeth
508 1214 424 226 795 831 670 933 1024 852 478 umtata
""".strip().splitlines()
names = []
distances = Numeric.zeros( (len(lines), len(lines)), Numeric.Int )
for i in range(len(lines)):
line = lines[i]
words = line.split(None, i)
names.append(words[-1])
for j in range(i):
distances[i][j] = distances[j][i] = float(words[j])
# City names
print names
# used to make the Numeric array for "distances.py"
print repr(distances)
# Visually double-check that a couple entries are correct
def get_distance(from_city, to_city):
from_index = names.index(from_city)
to_index = names.index(to_city)
return distances[from_index][to_index]
print get_distance("cape town", "johannesburg")
print get_distance("port elizabeth", "durban")
I'll cluster this data on the assumption that there are two regions in South Africa. I'll use the k-means alorithm. In Pycluster there are two functions for doing k-means; "kcluster", which takes a point in N-dimensional space for each data element, and "kmedoids" which takes the full distance matrix. The distances here are road distances and not straight-line/great-circle distances so I'll use kmedoids. I'll find two clusters:
import Numeric
from Bio import Cluster
names = [ "Bloemfontein", "Cape Town", "Durban", "East London",
"George", "Johannesburg", "Kimberley", "Mmabatho", "Graskop",
"Oudtshoorn", "Port Elizabeth", "Umtata"]
distances = Numeric.array([
[ 0, 1009, 625, 559, 746, 389, 157, 409, 788, 716, 632, 508],
[1009, 0, 1532, 1034, 425, 1407, 963, 1307, 1798, 456, 738, 1214],
[ 625, 1532, 0, 650, 1219, 567, 792, 860, 760, 1275, 901, 424],
[ 559, 1034, 650, 0, 615, 963, 721, 994, 1250, 671, 297, 226],
[ 746, 425, 1219, 615, 0, 1142, 726, 1071, 1534, 57, 319, 795],
[ 389, 1407, 567, 963, 1142, 0, 482, 296, 395, 1115, 1059, 831],
[ 157, 963, 792, 721, 726, 482, 0, 345, 873, 671, 736, 670],
[ 409, 1307, 860, 994, 1071, 296, 345, 0, 689, 1015, 1080, 933],
[ 788, 1798, 760, 1250, 1534, 395, 873, 689, 0, 1505, 1450, 1024],
[ 716, 456, 1275, 671, 57, 1115, 671, 1015, 1505, 0, 376, 852],
[ 632, 738, 901, 297, 319, 1059, 736, 1080, 1450, 376, 0, 478],
[ 508, 1214, 424, 226, 795, 831, 670, 933, 1024, 852, 478, 0],
], Numeric.Float)
clusterids, error, nfound = Cluster.kmedoids(distances, 2)
print "Cluster ids:", clusterids
print "error:", error
print "nfound:", nfound
cities_in_cluster = {}
for name, clusterid in zip(names, clusterids):
cities_in_cluster.setdefault(clusterid, []).append(name)
# Showing off a nice module for when you have long text that
# should fold over multiple lines.
import textwrap
for centroid_id, city_names in cities_in_cluster.items():
print "Cluster around", names[centroid_id]
text = ", ".join(city_names)
for line in textwrap.wrap(text, 70):
print " ", line
Here's the output
Cluster ids: [0 4 0 0 4 0 0 0 0 4 4 0] error: 4236.0 nfound: 1 Cluster around Bloemfontein Bloemfontein, Durban, East London, Johannesburg, Kimberley, Mmabatho, Graskop, Umtata Cluster around George Cape Town, George, Oudtshoorn, Port ElizabethLooking at the map it seems to make sense.
I'll try again with 3 clusters.
Cluster around Umtata Durban, East London, Umtata Cluster around George Cape Town, George, Oudtshoorn, Port Elizabeth Cluster around Johannesburg Bloemfontein, Johannesburg, Kimberley, Mmabatho, Graskopand with 4
Cluster around Bloemfontein Bloemfontein, Kimberley Cluster around Umtata Durban, East London, Umtata Cluster around George Cape Town, George, Oudtshoorn, Port Elizabeth Cluster around Johannesburg Johannesburg, Mmabatho, GraskopI can increase k but the point of clustering is to simplify the data set to understand it. Too many points overfits the data and makes false clusters appear.
You might think the clusterings aren't quite correct. You'll get different results if you use a different metric or include more distinguishing characteristics. For example, you might have that coastal cities should be considered more similar than non-coastal ones. Coming up with good descriptors is hard.
By the way, the NxN distance matrix contains duplicate data because distance(x,x) == 0 and distance(x,y) == distance(y,x). Pycluster only needs the lower triagle of the matrix to get everything it needs. This matrix also works for the kmedoids functiondistances = Numeric.array( [1009, 625, 1532, 559, 1034, 650, 746, 425, 1219, 615, 389, 1407, 567, 963, 1142, 157, 963, 792, 721, 726, 482, 409, 1307, 860, 994, 1071, 296, 345, 788, 1798, 760, 1250, 1534, 395, 873, 689, 716, 456, 1275, 671, 57, 1115, 671, 1015, 1505, 632, 738, 901, 297, 319, 1059, 736, 1080, 1450, 376, 508, 1214, 424, 226, 795, 831, 670, 933, 1024, 852, 478, ], Numeric.Float) clusterids, error, nfound = Cluster.kmedoids(distances, 2)
Hierarchical clustering
Here's the code to make a hierarchical clustering based on the same data set using Pycluster's treecluster() function. Unlike the k-means code there isn't one function for when a distance matrix is known and and another when a distance function should be used. They are merged into one function with a different keyword argument for the two forms. This one is distancematrix
tree, linkdist = Cluster.treecluster(distancematrix = distances) print "Tree:", tree print "linkdist:", linkdistwhich produces
Tree: [[ 9 4] [ 6 0] [ 11 3] [ 7 5] [ 10 -1] [ -4 -2] [ -3 2] [ -5 1] [ 8 -6] [ -7 -9] [ -8 -10]] linkdist: [ 57. 157. 226. 296. 376. 482. 650. 738. 873. 1250. 1798.]The tree data structure is one commonly used for clustering. Each of the tree nodes has an index starting with "-1" and going progressively more negative. Tree node "-1" here is the node formed by merging cities 9 (Oudtshoorn) and 4 (George) and "-2" is the merger of cities 6 (Kimberley) and 0 (Bloemfontein). Tree node "-5" is the merger of city 10 (Port Elizabeth) and tree node -1.
Here's some code to convert that tree data structure into a simple tree display. Better code would reduce the clutter.
import Numeric
from Bio import Cluster
names = [ "Bloemfontein", "Cape Town", "Durban", "East London",
"George", "Johannesburg", "Kimberley", "Mmabatho", "Graskop",
"Oudtshoorn", "Port Elizabeth", "Umtata"]
distances = Numeric.array(
[1009,
625, 1532,
559, 1034, 650,
746, 425, 1219, 615,
389, 1407, 567, 963, 1142,
157, 963, 792, 721, 726, 482,
409, 1307, 860, 994, 1071, 296, 345,
788, 1798, 760, 1250, 1534, 395, 873, 689,
716, 456, 1275, 671, 57, 1115, 671, 1015, 1505,
632, 738, 901, 297, 319, 1059, 736, 1080, 1450, 376,
508, 1214, 424, 226, 795, 831, 670, 933, 1024, 852, 478,
], Numeric.Float)
tree, linkdist = Cluster.treecluster(distancematrix = distances)
#print "Tree:", tree
#print "linkdist:", linkdist
# Kilometers / character;
SCALE = 100.0
def _show_tree(node_id, spacer, tree, linkdist, names):
if node_id < 0:
i = -node_id-1
left, right = tree[i]
d = linkdist[i]
w = int(d/SCALE)
if w > 1:
w -= 1
_show_tree(left, spacer+" "*w+"|", tree, linkdist, names)
if w == 0:
print spacer + "="
else:
print spacer + "-"*w+"|"
_show_tree(right, spacer+" "*w+"|", tree, linkdist, names)
else:
print spacer + "-- " + names[node_id]
def show_tree(tree, linkdist, names):
_show_tree(-len(tree), "", tree, linkdist, names)
show_tree(tree, linkdist, names)
Its output
| | |-- Port Elizabeth
| |--|
| | ||-- Oudtshoorn
| | |=
| | ||-- George
|------|
| |-- Cape Town
----------------|
| | | |-- Umtata
| | |-|
| | | |-- East London
| |-----|
| | |-- Durban
|-----------|
| | |-- Graskop
| |-------|
| | | | |-- Mmabatho
| | | |-|
| | | | |-- Johannesburg
| | |---|
| | | | |-- Kimberley
| | | |-|
| | | | |-- Bloemfontein
Predicting
Given the above tree you could make some sort of categorization, like "South Coast", "Inland" and "East Coast". What if you wanted to categorize a new city? Then the above data set becomes known as a training set. Get the new city information, find the nearest elements of the training set (or nearest centroid, depending on your preference) and use its classification.
Likely you'll have fuzzy data, or classifications which don't quite agree with your metric. For example, you might have a number of chemical compounds of which some are "active" (or "toxic") and others are "inactive" (or "safe"). (These are also called "white" and "black", with "grey" for ambiguous items). Chemists find that chemical substructures provide useful insight into the activity of the compound as a whole. One frequently used structural descriptor is the MDL MACCS keys (example code). This ends up with a set of bits like this:
- bit 1 is True iff there is a 4 membered ring
- bit 3 is True iff there is an aliphatic sulpher single bonded to an aliphatic sulpher
- ...
Some of these bits are better than other bits at making the classification. Ideally you would like to take the characteristic MACCS keys descriptor and let the classification algorithm figure out which bits are better than others. There are a few ways to do that. One is with scalable vector machines. Biopython includes code for classification based on SVMs but it recommends you use LIBSVM instead. Since you have Biopython installed, I'll show you an example based on it.
Here's the program. It tries to identify which messages are spam or "ham" (not spam) based on a training vector made by looking at the most common words in a corpus of ham and spam.
Here's the output when I run it
Top 10 words in ham
THE 2598
AND 1258
FONT 1007
FOR 794
HTTP 649
YOU 634
NBSP 578
THAT 547
COM 500
THIS 464
Top 10 words in spam
FONT 1590
THE 1063
TABLE 615
YOU 607
AND 600
ALIGN 587
CENTER 553
CONTENT 545
SIZE 526
HTTP 509
Using descriptor vector of size 15
THE 244.07
FONT 173.13
AND 123.87
YOU 82.73
HTTP 77.20
FOR 52.93
TABLE 41.00
ALIGN 39.13
NBSP 38.53
CENTER 36.87
THAT 36.47
CONTENT 36.33
SIZE 35.07
COM 33.33
THIS 30.93
============= HAM TESTS
First ham in training set: 0.120362351148
Unknown ham: 1.05219354544
============================================================
(You have acquired this SF.net update because you requested to be on the
list. Honest! An explanation is at the bottom of this email, and also
info on how to be extracted from the list.)
0. Introduction
1. August Project of the Month (POTM): Gourmet Recipe Manager
2. July Project of the Month (POTM): JasperReports
3. June Project of the Month (POTM): Nagios
4. SourceForge.net Subscriptions - Free T-shirt
5. Statistics
This SF.net update brought to you by HP Linux Reference Architecture
-------------------------------------------------------------------
Get everything you need for a successful Linux implementation right
here! From downloads, to information on industry leading platforms,
best of breed partner solutions (from BEA, JBoss, Oracle, MySQL), to
comprehensive management tools, and global support services. See the
Linux Reference Architecture resources site at:
http://solutions.itmanagersjournal.com/hplra.tmpl
-----------------------------------------------------------------
============================================================
============= SPAM TESTS
First spam in training set: -1.03056381363
Unknown spam: -0.222528260367
============================================================
PEhUTUw+DQo8cD48Zm9udCBmYWNlPSJBcmlhbCIgc2l6ZT0iMiI+SSBtZXQg
bWUge2Rpc3x0aGlzfSB7Y2hpY2tpZXxnaXJsaWV9LCBzaGUgdGhvdWdodCAN
Ckkgd2FzIHJpY2guJm5ic3A7IEkge2hpdHx0YXBwZWR9IGl0IGFsbCBuaWdo
dCBsb25nLjwvZm9udD48L3A+DQo8cD48Zm9udCBmYWNlPSJBcmlhbCIgc2l6
ZT0iMiI+PGEgaHJlZj0iaHR0cDovL3d3dy5tZWRpYW1ha2VyaHVnbWFya2V0
c2hha2VyLmNvbS9kcmlwcHkiPg0KaHR0cDovL3tnZXRzb21lfGlnb3Rzb21l
fS5kcmlwcHB5eHNsaXR0ei5jb208L2E+PC9mb250PjwvcD4NCjxwPntOb25l
IG9mIHVzIHdpbGwgZXZlcnkgYWNjb21wbGlzaCBhbnl0aGluZyBleGNlbGxl
bnQgb3IgY29tbWFuZGluZyBleGNlcHQgDQp3aGVuIGhlIGxpc3RlbnMgdG8g
dGhpcyB3aGlzcGVyIHdoaWNoIGlzIGhlYXJkIGJ5IGhpbSBhbG9uZS4gfEl0
IGRvZXMgbm90IGRvIHRvIA0KZHdlbGwgb24gZHJlYW1zIGFuZCBmb3JnZXQg
dG8gbGl2ZS4mbmJzcDsgfTwvcD4NCjxwPiZuYnNwOzwvcD4NCjxwPiZuYnNw
OzwvcD4NCjxwPjxmb250IGxhbmc9IjAiIGZhY2U9IkFyaWFsIiBzaXplPSIy
IiBGQU1JTFk9IlNBTlNTRVJJRiIgUFRTSVpFPSIxMCI+aWYgeW91IA0KbmV2
ZXIgd2FudCB0byBzZWUgdGhpczxicj4NCjwvZm9udD4NCjxmb250IGxhbmc9
IjAiIHN0eWxlPSJiYWNrZ3JvdW5kLWNvbG9yOiAjZmZmZmZmIiBmYWNlPSJB
cmlhbCIgY29sb3I9IiMwMDAw
============================================================
Of course if you're really want to filter spam you should
look at A Plan
For Spam, with Python code available in
SpamBayes.
If you want to write your own classifier, start with that
code base – it's well-written, fast and powerful.
This field is called "KDD" for "knowledge-discovery in databases". It's very wide-ranging and there's a lot more going on than I know, much less can provide pointers. Might as well start someplace and explore.
Copyright © 2001-2010 Dalke Scientific Software, LLC.


