
In search of an inverted index library

I'm looking for a Python library for managing inverted indices. I've been using Python sets, which are nice and fast, but they take up too much memory. Lists use 1/10th the space but are a lot slower and still not compact enough. The most likely solution so far is Xapian. Are there other options I should consider? If so, email me or comment on the Reddit thread I started on this topic.

Let me explain what I want and what I've done so far.

Sparse feature fingerprints

I'm working with chemical fingerprints. That's not unusual for me. What's new is that these fingerprints are not fixed length but of indefinite length. For each molecule I find all fragments with no more than 7 atoms and put each fragment into a canonical representation, here a canonical fragment SMILES. Each unique representation gets its own unique integer identifier. The end result is a fingerprint which looks like:

0:4,1:3,5:2,18:2,28:2,124,47298,53119 254846
The original SMILES for this is CCO.CCO.O=[V].Cl. I picked this structure because it has a very small fingerprint.

This fingerprint line says that PubChem record 254846 has 8 substructures ("features"). Each feature is stored along with a count of the number of times it is found; a count of 1 is omitted. Here, features '0' and '1' are found 4 and 3 times, features '5', '18', and '28' are found in two different places in the molecule, and features '124', '47298', and '53119' are found only once. Also, the feature field and the identifier field are separated by a tab.
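To make the format concrete, here is a minimal sketch of parsing one such line into a record id and a feature-to-count map (the helper name is mine, not from the actual loader):

def parse_fingerprint_line(line):
    # "0:4,1:3,5:2,...<tab>254846" -> (record id, {feature id: count})
    features_field, record_id = line.rstrip("\n").split("\t")
    counts = {}
    for term in features_field.split(","):
        if ":" in term:
            feature_id, count = term.split(":")
        else:
            feature_id, count = term, 1   # a count of 1 is omitted
        counts[int(feature_id)] = int(count)
    return record_id, counts

# parse_fingerprint_line("0:4,1:3,5:2,18:2,28:2,124,47298,53119\t254846")
# -> ('254846', {0: 4, 1: 3, 5: 2, 18: 2, 28: 2, 124: 1, 47298: 1, 53119: 1})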

I kept track of the original canonical description. Here's the breakdown:

        canonical
 bit#  description
 ----  -----------
    0    C
    1    O
    5    CC
   18    CO
   28    CCO
  124    Cl
47298    O=[V]
53119    [V]
With a bit of back and forth you can verify that it's correct.

Overall I have about 25 million records and 2 million canonical values, using a subset of PubChem's 35 million compounds. The average fingerprint contains about 300-400 features, though the count ranges from 1 to about 2,000 features. These are "sparse" fingerprints because the number of features per record divided by the total number of features is very small (350/2,000,000) - small enough that it's more efficient to store the list of feature bits (0, 1, 5, etc.) than it is to store the bit pattern of 0s and 1s.

The features are very long-tail. Carbon ("C") exists in nearly every record, in both aromatic and non-aromatic forms. "O", "cc", "ccc", and "CC" are also very popular. But about 1/2 of the features only exist in one record. Here's more on the fragment distribution pattern.

Why do I want an inverted index?

Feature fingerprints are old news in cheminformatics. One application is to improve substructure search performance. Suppose you are searching for a phenol (structure "c1ccccc1O"). You could do a substructure isomorphism test with every single record, but that's slow. Instead, there are some obvious improvements.

You know that if a record matches then it must have at least 7 atoms and 7 bonds, at least 6 carbons, and at least 1 oxygen. It's easy to precompute this information, so the set labeled "at least 6 carbons" contains all records with at least 6 carbons, and so on. When a query comes in, analyze it for those same precomputed features. This gives a list of sets, one for each feature. Find the intersection of those sets, and you've reduced your search space, potentially by a lot.
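Here's a toy sketch of that screening idea. The feature names and record sets are made up; in the real system they come from the precomputed index:

# Map from a precomputed feature to the set of record ids which have it.
records_with = {
    "at least 6 C": {1, 2, 3, 5, 8},
    "at least 1 O": {2, 3, 5, 9},
    "aromatic ring": {3, 5, 8},
}

def candidates_for(query_features):
    # Intersect the per-feature record sets, smallest first.
    sets = sorted((records_with[f] for f in query_features), key=len)
    result = set(sets[0])
    for s in sets[1:]:
        result &= s
    return result

# candidates_for(["at least 6 C", "at least 1 O", "aromatic ring"]) -> {3, 5}
# Only the surviving candidates need the expensive isomorphism test.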

This is exactly when one should use an inverted index.

(There are other optimizations. For example, if the query is itself one of the precomputed features then its set is exactly the answer, and there's no need for any additional substructure search.)

Feature "folding"

Most cheminformatics systems "fold" their sparse bits into a fixed bit length, like 1024 bits. For example, if a feature is assigned the canonical id of "458793484" then some programs will apply modulo 1024 to that to get the folded bit id "524". This saves a lot of space, at the cost of extra subgraph isomorphism search time.
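As a minimal sketch of folding (1024 bits, and the feature id from the example above):

def fold(feature_id, num_bits=1024):
    # Map an unbounded feature id onto a fixed-size fingerprint by modulo.
    return feature_id % num_bits

# fold(458793484) -> 524. Unrelated features can collide in the same folded
# bit, which is why a folded screen still needs the subgraph isomorphism
# verification step.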

I've been wondering, do I really need to fold the bits? Why can't I keep all of the information around? My thoughts are informed by Managing Gigabytes, which points out that there are efficient ways to encode the inverted index. Can I apply those techniques to this problem?

Working code

The first step is to establish that there is a problem. I'll use a subset of 10 files, which represents about 0.5% of my actual data set, and write some code to handle it. Here's the final code and here's the code with the benchmark data set (37 MB, compressed with 7zip).

The simplest inverted index implementation uses Python sets.

from collections import defaultdict

class InvertedIndex(object):
    def __init__(self):
        # Map from a feature string like "ccO" to an integer feature id
        self.feature_to_id = {}
        # Map from an integer feature id to a set of record ids
        self.inverted_indices = defaultdict(set)

    def get_feature_id(self, feature):
        try:
            return self.feature_to_id[feature]
        except KeyError:
            n = len(self.feature_to_id)
            self.feature_to_id[feature] = n
            return n

    def add_record(self, id, feature_ids):
        for feature_id in feature_ids:
            self.inverted_indices[feature_id].add(id)

    def search(self, features):
        # These are *features*, not *feature_ids*.
        # If the feature wasn't seen before then there are no entries
        # and this will raise a KeyError, meaning nothing found.
        try:
            terms = [self.inverted_indices[self.feature_to_id[feature]] for feature in features]
        except KeyError:
            return set()
        terms.sort(key=len)
        return set.intersection(*terms)
Very simple, and it works pretty well. There's a bit of API trickiness to get the feature id for a given feature. This simplifies the loader and handles integer interning. (Interning is outside the scope of this essay.)
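Here's a small usage sketch (the feature strings and record ids are made up):

index = InvertedIndex()

# Loading: convert each record's feature strings to ids, then add the record.
for record_id, features in [(1001, ["C", "O", "CO"]),
                            (1002, ["C", "CC"])]:
    feature_ids = [index.get_feature_id(f) for f in features]
    index.add_record(record_id, feature_ids)

index.search(["C", "O"])    # -> set([1001])
index.search(["C"])         # -> set([1001, 1002])
index.search(["N"])         # -> set(), because "N" was never seen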

Second is the code to parse the data files and load the index. The file format looks like:

#FPC1
#bit-0=13120 C
#bit-1=12986 c
#bit-2=12978 cc
 ...
#bit-46924=1 COCNCP=O
0:6,1:21,2:21,3:22,4:23,5:2,6,...,36858,36869,37247,37429:4,37431,39942   26350001
0:6,1:21,2:21,3:22,4:23,5:2,6,...,36858,36869,37247,37429:4,37431,39942   26350002
 ...
0:5,1:12,2:12,3:12,4:12,5:3,6:3,...,25998,32031,32086:2,33182,33998,38357    26374982
There's a version line, some header lines starting with "#bit-", and then the fingerprint lines. Each fingerprint line uses the sparse format I mentioned earlier, followed by a tab, followed by the record identifier. (Here it's a PubChem id, which is always an integer.)

Note: the bit definitions are not the same across different files. The header lines show that bit 0 corresponds to the feature "C" for this file (and it exists in 13120 records), and bit 1 is "c" (in 12986 records), but in another file I see that bit 0 is the feature "cc" and in a third it's "ccc(c)CCN". So the parser needs to be smart and map the per-file bit numbers to its own internal fragment identifiers.
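The remapping might look something like this sketch (not the actual parser; it assumes header lines in the format shown above):

def read_bit_definitions(header_lines, index):
    # Map this file's bit numbers to the index's own feature ids, using
    # header lines like "#bit-0=13120 C".
    local_to_global = {}
    for line in header_lines:
        bit_part, feature = line.split(None, 1)      # "#bit-0=13120", "C"
        local_bit = int(bit_part[len("#bit-"):].split("=")[0])
        local_to_global[local_bit] = index.get_feature_id(feature.strip())
    return local_to_global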

The rest of the parser details aren't important to this essay, and are left to the implementation.

Benchmark

I need a benchmark. I'll load 10 data files (from the 37 MB benchmark file linked above) to make the inverted indices. I also need some query data. I reused the parser to extract a subset (1%) of the records in a fingerprint file. Actually, two fingerprint files. One timing set reuses one of the data files used to build the inverted indices, which means that every query will match, while the other uses a new data file, which should have many fewer matches.

Again, see the code for full details.

When I run the benchmark I get:

== Sets ==
Load 0 Compound_000000001_000025000.counts.fpc
Load 1 Compound_000025001_000050000.counts.fpc
Load 2 Compound_000050001_000075000.counts.fpc
Load 3 Compound_000075001_000100000.counts.fpc
Load 4 Compound_000100001_000125000.counts.fpc
Load 5 Compound_000125001_000150000.counts.fpc
Load 6 Compound_000150001_000175000.counts.fpc
Load 7 Compound_000175001_000200000.counts.fpc
Load 8 Compound_000200001_000225000.counts.fpc
Load 9 Compound_000225001_000250000.counts.fpc
Loading took 63.0 seconds and 2092896256 memory
100 66575 315.0 per second
200 100909 337.5 per second
all exist: 227 searches took 0.7 seconds (counts 114990) 
100 2323 1161.5 per second
200 3865 1360.9 per second
some exist: 230 searches took 0.2 seconds (counts 4048) 
Some of this is progress information. The important parts are the load time (63.0 seconds), the total amount of memory used in the index (2.1 GB), and the search times of 0.7 and 0.2 seconds for the two test sets. (I use 'psutil' to get the memory usage; note that the exact meaning is OS dependent.) Also, the match counts of "114990" and "4048" are useful checks that each inverted index implementation is correct.
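For reference, the memory number comes from psutil. With a current psutil release the measurement looks something like this (the benchmark script itself may use an older spelling of the API):

import os
import psutil

def memory_used():
    # Resident set size of the current process, in bytes.
    # (What this counts exactly is OS dependent, as noted above.)
    return psutil.Process(os.getpid()).memory_info().rss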

This output shows the problem. I'm using 2.1 GB of memory for 10 data files. I only have 12 GB on this desktop machine, and remember, the full data set is 200 times bigger. Do you have a 500 GB machine I can use?

Use an array instead of a set

Python objects have overhead. For example, the Python integer 300 on my desktop requires 24 bytes, while I know I have about 2**25 records, so I could easily get away with 4 bytes per record id.

>>> import sys
>>> sys.getsizeof(3)
24
The other 20 bytes handle things like a reference to the Python integer class object and the reference count.

The Python set data type uses a lot of space because it's implemented as a hash, and many of the internal slots are empty in order to improve lookup performance. Quick testing shows that a set uses about 30-40 bytes for each element in the set.
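That quick test can be done with sys.getsizeof, which reports the size of the set's internal hash table but not the integer objects it references (exact numbers depend on the Python build):

import sys

n = 1000000
s = set(range(n))
# Size of the hash table divided by the number of elements;
# on a 64-bit CPython this comes out to roughly 30-40 bytes.
print(sys.getsizeof(s) / float(n))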

It's not quite as bad as this because the set elements are Python references, which are 8 byte pointers on my 64-bit Python. But that's still about 42 bytes of space used to store at most 4 bytes of data. I should be able to do better.

The usual way "to do better" in Python, when storing a list of C-like integers is to use an 'array.array.' This is similar to a Python list, except that the contents are all of the same type, so there's no per-item overhead to store the type information. (Note: pje described this approach on the Reddit thread.

The downside is that there's no equivalent to a set.intersection. For now, I'll convert an array into a set, and do the normal set intersection. Here's the code:

# Make sure I can handle 4 byte integers
import array
_test = array.array("I", (0,))
try:
    _test[0] = 2**31+100
    array_code = "I"
except OverflowError:
    array_code = "L"

def make_unsigned_array():
    return array.array(array_code, ())

class InvertedIndexArray(object):
    def __init__(self):
        # Map from a feature string like "ccO" to an integer feature id
        self.feature_to_id = {}
        # Map from an integer feature id to an array of record ids
        self.inverted_indices = defaultdict(make_unsigned_array)


    def get_feature_id(self, feature):
        try:
            return self.feature_to_id[feature]
        except KeyError:
            n = len(self.feature_to_id)
            self.feature_to_id[feature] = n
            return n

    def add_record(self, id, feature_ids):
        for feature_id in feature_ids:
            self.inverted_indices[feature_id].append(id)

    def search(self, features):
        assert features
        try:
            terms = [self.inverted_indices[self.feature_to_id[feature]] for feature in features]
        except KeyError:
            return set()
        terms.sort(key=len)
        # Set conversion is expensive; it's faster to pass an iterable to intersection_update.
        result = set(terms[0])
        for term in terms[1:]:
            result.intersection_update(term)
            # Short-circuit the evaluation when the result set is empty.
            # Gives a 3x speedup for the "some exists" benchmark, with no
            # performance hit for the "all exist" benchmark.
            if not result:
                break
        return result

How well does it work? Here's the benchmark output using the array-based inverted index implementation, trimmed somewhat to make it easier to read:

== Sets ==
Loading took 63.0 seconds and 2092896256 memory
all exist: 227 searches took 0.7 seconds (counts 114990) 
some exist: 230 searches took 0.2 seconds (counts 4048) 
== Arrays ==
Loading took 51.2 seconds and 165552128 memory
all exist: 227 searches took 41.7 seconds (counts 114990) 
some exist: 230 searches took 16.1 seconds (counts 4048) 
The new version takes less than 10% of the memory of the old, which is about what I predicted. It's nice to see the confirmation. It's also a bit faster to build the data structure. But on the other hand, search is over 60x slower. At about 0.1 seconds per query, this performance is not acceptable for the types of highly interactive searches I want to do.

There are of course hybrid solutions. This version spends a lot of time converting arrays to sets, which are then discarded. I convert the 'C', 'c', and a few other inverted indices over and over, so what I could do is use an LRU cache for the top 1,000 or so sets.
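One possible shape for that hybrid is a small cache wrapper around the arrays. This is only a sketch: it keeps the first 1,000 sets it builds rather than doing true least-recently-used eviction:

class CachedSetLookup(object):
    # Cache the set versions of the most heavily used inverted indices,
    # so common features like 'C' and 'c' aren't re-converted per query.
    def __init__(self, inverted_indices, maxsize=1000):
        self.inverted_indices = inverted_indices
        self.maxsize = maxsize
        self.cache = {}

    def get_set(self, feature_id):
        try:
            return self.cache[feature_id]
        except KeyError:
            s = set(self.inverted_indices[feature_id])
            if len(self.cache) < self.maxsize:
                self.cache[feature_id] = s
            return s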

But let's get back to my problem. My data set is 200x bigger than the test set, so simple scaling of 166MB says I'll need 33 GB of RAM in the best of cases. I really would like this to work on my desktop, instead of renting some Amazon compute node every time I want to try things out.

Stop the presses! pypy 1.9 just came out

The pypy-1.9 release notes mention improved set performance. This is good because in previous benchmarking I found that pypy sets were rather slower than cpython's own sets.

I re-ran the benchmark using pypy 1.7 and 1.9. Here are the results, first for 1.7:

== Sets ==
Loading took 278.6s seconds and (psutil not installed) memory
all exist: 227 searches took 1.0 seconds (counts 114990)
some exist: 230 searches took 0.2 seconds (counts 4048)
== Arrays ==
Loading took 32.7s seconds and (psutil not installed) memory
all exist: 227 searches took 299.6 seconds (counts 114990)
some exist: 230 searches took 335.6 seconds (counts 4048)
and then for 1.9:
== Sets ==
Loading took 47.9s seconds and (psutil not installed) memory
all exist: 227 searches took 0.5 seconds (counts 114990)
some exist: 230 searches took 0.1 seconds (counts 4048)
== Arrays ==
Loading took 35.4s seconds and (psutil not installed) memory
all exist: 227 searches took 186.2 seconds (counts 114990)
some exist: 230 searches took 72.9 seconds (counts 4048)
The good news is that 1.9 set operations are faster — 5x faster to load, and 2x faster to search. In fact, the search time is faster than cpython's! The bad news is it looks like pypy's intersection_update isn't yet as well optimized. Also, manual inspection shows that pypy 1.9 uses 2.1 GB of memory, which is about the same as cpython's. pypy on its own isn't the right solution.

Is there an inverted index library for Python?

Inverted indices are used in all sorts of search engines, and there are many well-known ways to improve intersection performance and decrease memory use by storing the bit sets in more compact forms.

However, I haven't found good Python bindings for these approaches. (I wrote some Python/Judy bindings years ago. Unfortunately, they're buggy, and I would have to start from scratch to get them working again.)

Another possibility is to store the 'on' bits as a sorted list. Sorted lists are nice because the intersection is just the merge step of a merge sort, and runs in strictly linear time. Other methods are described in Experimental Analysis of a Fast Intersection Algorithm for Sorted Sequences. There was also a recent posting on how to take advantage of SSE instructions for fast sorted list intersection. Since the lists are sorted, it's also possible to compress the data by, for example, storing deltas using something like Elias gamma coding or perhaps an encoder tuned to the data set.
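For example, the merge-style intersection of two sorted lists of record ids takes only a few lines (a plain sketch; the galloping and SSE variants described in those links are faster):

def intersect_sorted(a, b):
    # Linear-time intersection of two sorted sequences of record ids.
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:
            result.append(a[i])
            i += 1
            j += 1
    return result

# intersect_sorted([1, 4, 7, 9], [4, 5, 9, 20]) -> [4, 9]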

Here too I've not been able to find an existing library which handled this for me, much less one with Python bindings.

Search engine components

Actually, I'm wrong. These are available as part of larger search engine packages like Lucene and Xapian. Lucene is very well known, but I would have to mix the Java runtime with the chemistry code that I'm using from C++. Doable, but it feels like too much weight.

Instead, I downloaded, compiled, and installed Xapian and its Python bindings and wrote an inverted index based on its API. (Justinsaccount was the first to suggest this in the Reddit thread.)

Xapian uses the filesystem to store the database, so this version uses a lot less local memory than the previous two inverted indices. I wonder how much of the data set is stored in memory vs. how much is read from disk. I imagine there are tuning variables, and it looks like there's an in-memory option I could also try. I am quite the newbie with this library.

Here's my implementation:

try:
    import xapian
except ImportError:
    xapian = None

class InvertedIndexXapian(object):
    def __init__(self, dbname):
        self.db = xapian.WritableDatabase(dbname, xapian.DB_CREATE_OR_OPEN)
        
    def get_feature_id(self, feature):
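        # Xapian terms are strings, so the feature string itself serves as its id.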
        return feature
    
    def add_record(self, id, feature_ids):
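        # Reuse the existing document if this record id was seen before,
        # so features from separate calls accumulate on the same record.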
        try:
            doc = self.db.get_document(id)
        except xapian.DocNotFoundError:
            doc = xapian.Document()
        for feature_id in feature_ids:
            doc.add_boolean_term(feature_id)
        self.db.replace_document(id, doc)

    def search(self, features):
        enquire = xapian.Enquire(self.db)
        query = xapian.Query(xapian.Query.OP_AND, features)
        enquire.set_query(query)
        matches = enquire.get_mset(0, 200000)
        return matches
and when I run the benchmark I get this timing information:
Loading took 349.9 seconds and 46465024 memory
all exist: 227 searches took 15.7 seconds (counts 114990) 
some exist: 230 searches took 20.7 seconds (counts 4048) 
Queries are roughly 20-100x slower than the cpython set implementation, and slower still compared to pypy 1.9's sets. But on the other hand, this uses only about 46 MB of RAM, although it does build a 340 MB data file. Simple scaling suggests that I'll need 65 GB of disk space for the entire data set. That's quite doable.

The Xapian load time is a lot slower, but as the data is persistent, reload time from a built database is trivially fast. The documentation says I can set XAPIAN_FLUSH_THRESHOLD to a larger value for better indexing performance. Are there other tuning variables I should consider?
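If I understand correctly, that's an environment variable read by Xapian itself, so presumably something like this before opening the database (the value here is just a guess):

import os

# Flush every 100,000 documents instead of the default; the right
# value for this data set would need testing.
os.environ["XAPIAN_FLUSH_THRESHOLD"] = "100000"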

Xapian might be the way to go. Value ranges might be useful when I think about how to handle fragment counts. I'll need to do more evaluation. For example, I don't like that it's an order of magnitude slower than Python sets. After all, Python sets aren't optimized for this task, so I expected faster search times.

Sadly, there isn't much documentation about Xapian - and nowhere near as much as there is for Lucene. (For example, there appears to be no page which describes the tuning environment variables. There's XAPIAN_MAX_CHANGESETS and XAPIAN_FLUSH_THRESHOLD ... are these, or something else, relevant for what I want?)

I still want a good inverted index library for Python

Xapian might work well for this project, but there are other projects I work on which use inverted indices. In one project I try to optimize the ~1024 fixed patterns used in a traditional feature-based substructure screen. Both my greedy algorithm and my genetic algorithm for this used a lot of temporary intersections to evaluate a given selection. I'm not sure how well that would work on top of a persistent database.

If you know of a good library, please email me or comment on the Reddit thread I started on the topic.


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me


