
Inverted index using Python sets

I am working on a problem which is very similar to an inverted index. What's an inverted index? Suppose I want to find a book on "sharks." I could search every book in the library until I find one on sharks, but that's tedious, and every time I do a new search I would need to re-search all of the books again. (Hopefully I would remember some of what I read the first time!)

Instead, I can invert the task. For every word in every book, make a list of the books which contain that word. That also takes a lot of work, but I only need to do it once. To search for a book about sharks, I look up the word "shark" and get the small list of candidate books, which I then search manually. I still need to search them because while 20,000 Leagues Under the Sea mentions sharks, it isn't about sharks, and neither is a book which mentions a "loan shark" or the "land shark" skit.

This mapping from each word to the books which contain it is called an "inverted index."

Compound term searches

I can use it as the basis for more complex queries. Suppose I want to find books which mention "whale shark". If the inverted index contains only single words then I could get the list of books containing the word "shark" and manually search those for "whale shark", but it's better to combine the list of books containing "whale" with the list of books containing "shark" to get a new list of the books containing both "whale" and "shark."

In other words, I find the intersection of those two lists.
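Here's a toy sketch of that idea in Python. The titles and texts are made up purely for illustration:

import collections

# A tiny made-up corpus: title -> text.
books = {
    "Sea Creatures": "the whale and the shark swim in the sea",
    "Loan Sharks": "a loan shark is not a fish",
    "Whale Songs": "the whale sings to another whale",
}

# Build the inverted index: word -> set of titles containing that word.
index = collections.defaultdict(set)
for title, text in books.items():
    for word in text.split():
        index[word].add(title)

# Titles containing both "whale" and "shark".
print(index["whale"] & index["shark"])   # set(['Sea Creatures'])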

An inverted index for letters in words

It's very easy to create an inverted index using Python's "set type." Instead of the usual case of searching a book (or document) for words, I'll show an example of how to search words for letters. On my computer, the file "/usr/share/dict/words" contains 234936 different English words, with one word per line, of which 233614 are unique after I convert everything to lowercase.
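As a quick check of those numbers (your word list may give different values):

all_lines = [line.strip() for line in open("/usr/share/dict/words")]
print(len(all_lines))                                  # 234936 words in the file
print(len(set(word.lower() for word in all_lines)))    # 233614 unique after lowercasing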

I'll turn that into an inverted index where each letter is mapped to the set of words which contain that letter:

import collections
inverted_index = collections.defaultdict(set)
for line in open("/usr/share/dict/words"):
    word = line.strip().lower()  # ignore case
    for letter in word:
        inverted_index[letter].add(word)
I'll check the number of keys in the inverted index (there should be one for each letter), and I'll show the sizes of a couple of the sets.
>>> len(inverted_index)
26
>>> len(inverted_index["a"])
144086
>>> len(inverted_index["j"])
2993
This means there are 144086 unique lower-cased words with an "a" or "A" in them, but only 2993 with a "j" or "J". (From here on I'll only mention the lower-case letter even when I mean either lower or upper case.) How many words have both an "a" and a "j" in them?
>>> len(inverted_index["a"] & inverted_index["j"])
1724

sorted() and heapq.nsmallest()

What are the first 5 of them, alphabetically?
>>> sorted(inverted_index["a"] & inverted_index["j"])[:5]
['abject', 'abjectedness', 'abjection', 'abjective', 'abjectly']

I used sorted() because it's a builtin function. However, while it works, it's a bit wasteful to sort the entire list of 1724 items when I only want the first 5. If you need something faster, try heapq.nsmallest() (and heapq.nlargest()). It can be faster because it only keeps track of the smallest items seen so far, rather than sorting the entire list.

>>> import heapq
>>> heapq.nsmallest(5, inverted_index["a"] & inverted_index["j"])
['abject', 'abjectedness', 'abjection', 'abjective', 'abjectly']
A quick timing test shows that heapq.nsmallest is about 40% faster than sorted()[:5].
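If you want to reproduce that sort of comparison, here's a rough timing sketch using the timeit module; the helper variable names are mine and the exact numbers will vary from machine to machine:

import heapq
import timeit

aj_words = inverted_index["a"] & inverted_index["j"]

t_sorted = timeit.timeit(lambda: sorted(aj_words)[:5], number=1000)
t_nsmallest = timeit.timeit(lambda: heapq.nsmallest(5, aj_words), number=1000)
print(t_sorted)      # total time for 1000 runs of sorted(...)[:5]
print(t_nsmallest)   # total time for 1000 runs of heapq.nsmallest(5, ...)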

Inverted index as a search filter

What about a harder search? Which words contain all 6 vowels (including y) in alphabetical order? The simplest solution is a linear search using a regular expression:

>>> import re
>>> sequential_vowels = re.compile("a.*e.*i.*o.*u.*y")
>>> words = [line.strip() for line in open("/usr/share/dict/words")
...                             if sequential_vowels.search(line)]
>>> len(words)
8
>>> words
['abstemiously', 'adventitiously', 'auteciously', 'autoeciously', 'facetiously',
'pancreaticoduodenostomy', 'paroeciously', 'sacrilegiously']
This works, but if this type of search occurs often, there's enough memory, and there's a performance need, then it's easy to speed it up using an inverted index.

Filter using the inverted index

I'll split this search into two stages. The first will filter out the obvious mismatches, and leave a smaller set of candidates for the second stage.

For the first stage, I'll use the inverted index to find the words which contain all of the vowels. The inverted index doesn't know anything about character order, so once I have the candidates containing all of the letters, I'll apply the regular expression test from before.

To find the set of words with all of the vowels, I could continue the chain of "&" binary operators as I did earlier, but that gets clumsy with six terms. Instead, I'll call set.intersection() with all of the sets as parameters:

>>> vowels = [inverted_index[c] for c in "aeiouy"]
>>> len(set.intersection(*vowels))
670

The variable "vowels" contains a list of sets, the "*vowels" in the function call turns that list into individual arguments, and "set.intersection" creates a new set which is the intersection of all of the sets in vowels.

("set.intersection" used here is actually an unbound method, and it's pretty rare to find Python code where using an unbound method makes sense. The above code is almost identical to "vowels[0].intersection(*vowels[1:])".)

I used two lines above, mostly for clarity. For myself I would probably put it into a single line:

>>> len(set.intersection(*(inverted_index[c] for c in "aeiouy")))
670
Yes, the inverted index search can be done in a single line!

Testing the candidates

The first stage reduced the search space from 233614 words to 670. In the second stage I'll use the regular expression to check which ones contain the vowels in order.

>>> candidates = set.intersection(*(inverted_index[c] for c in "aeiouy"))
>>> [word for word in candidates if sequential_vowels.search(word)]
['autoeciously', 'adventitiously', 'facetiously', 'abstemiously', 'sacrilegiously',
'auteciously', 'pancreaticoduodenostomy', 'paroeciously']
You can verify that it finds the same matches as before, although because sets are unordered, the result order has changed.

I've shown that the inverted index can be used to make the code more complicated. Is it worthwhile? That is, how much faster is the new code?

My timings show that the old version (which does 233614 regular expression searches) takes 0.092 seconds to run, while the new one takes 0.056 seconds. It's about 40% faster.
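A simple harness along these lines will reproduce that sort of comparison; the function names are mine, the repeat count is arbitrary, and the absolute numbers depend on the machine:

import timeit

all_words = [line.strip().lower() for line in open("/usr/share/dict/words")]

def linear_scan():
    return [w for w in all_words if sequential_vowels.search(w)]

def inverted_index_search():
    candidates = set.intersection(*(inverted_index[c] for c in "aeiouy"))
    return [w for w in candidates if sequential_vowels.search(w)]

print(timeit.timeit(linear_scan, number=10) / 10)            # seconds per linear search
print(timeit.timeit(inverted_index_search, number=10) / 10)  # seconds per indexed search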

Use integers as set elements, not strings

It's easy to go faster still. The core of Python's set intersection works something like this:

new_set = set()
for element in set1:
    if element in set2:
        new_set.add(element)
This requires a hash lookup in set2 for every element in set1. If the hashes match then there's an equality test, and if the element really is in set2 then there's another hash lookup and possible equality test to insert it into new_set.

The set elements are currently strings. String hashing and comparisons are well optimized in Python, but integers are even faster. What if I used an index into a word list rather than the word itself? The corresponding code is:

import collections

# Get all of the words into a list of words.
# Ignore words which are the same except for capitalization.
unique_words = set(line.strip().lower() for line in open("/usr/share/dict/words"))
words = sorted(unique_words)

# Map from character to the set of word indices
inverted_index = collections.defaultdict(set)
for i, word in enumerate(words):
    for c in word:   # already lowercased above
        inverted_index[c].add(i)
I can do the inverted index operations just like before, but the result is a set of indices into the "words" list. For example:
>>> set.intersection(*(inverted_index[c] for c in "jkx"))
set([99716, 98492])
>>> for i in set.intersection(*(inverted_index[c] for c in "jkx")):
...   print words[i]
... 
jukebox
jackbox

I'll modify the "sequential_vowels" code to use the index-based inverted index instead of the string-based version:

>>> candidates = set.intersection(*(inverted_index[c] for c in "aeiouy"))
>>> [words[i] for i in candidates if sequential_vowels.search(words[i])]
['pancreaticoduodenostomy', 'adventitiously', 'abstemiously', 'auteciously', 'sacrilegiously', 'autoeciously', 'facetiously', 'paroeciously']
My timing numbers give 0.029 seconds per search, with nearly all of the time spent in the set intersection. Remember that the brute-force linear search takes 0.094 seconds and the original inverted index takes 0.056 seconds, so switching to integer-based indices brings another factor-of-two performance gain. The overall search with an inverted index is about 3x faster than the original regex-based linear search.

Order-dependent performance

Python's set.intersection() actually works more like this:

def intersection(*args):
  left = args[0]
  # Perform len(args)-1 pairwise-intersections
  for right in args[1:]:

    # The loop below does len(left) membership tests,
    # so minimize that by iterating over the smaller set
    if len(left) > len(right):
      left, right = right, left

    # Do the pairwise intersection
    result = set()
    for element in left:
      if element in right:
        result.add(element)
    
    left = result  # Use as the start for the next intersection

  return left
That is, it does a series of pairwise reductions over the list of sets. The order of the set operations affects the performance! Recall that there are only 1724 words with "j" in them. If I search for words with "m", "a", and "j" in them, as
>>> len(set.intersection(*(inverted_index[c] for c in "maj")))
286
then Python computes the intersection of inverted_index["m"] and inverted_index["a"], giving an intermediate set with 41148 hits, which it then intersects with the 1724 "j" elements.

However, if the search order were "jma" then the intersection of "j" and "m" gives an intermediate set of only 450 elements, which means only 450 tests against inverted_index["a"].

Both give the same answer, but one requires a lot more work than the other.
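One way to see the difference directly is to look at the sizes of the intermediate sets (the counts are the ones quoted above, from my copy of the word list):

# "maj" order: intersect "m" and "a" first, then filter by "j"
print(len(inverted_index["m"] & inverted_index["a"]))   # 41148 candidates to test against "j"
# "jma" order: intersect "j" and "m" first, then filter by "a"
print(len(inverted_index["j"] & inverted_index["m"]))   # 450 candidates to test against "a"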

Here's evidence of just how big that impact is:

>>> import time
>>> def go(letters):
...   t1 = time.time()
...   for i in range(1000):
...     x = len(set.intersection(*(inverted_index[c] for c in letters)))
...   return x, (time.time()-t1)
... 
>>> go("jma")
(286, 0.12344503402709961)
>>> go("jam")
(286, 0.2954399585723877)
>>> go("amj")
(286, 6.223098039627075)
>>> 
Yes, it's a factor of 50 between the slowest and fastest ordering!

There are some obvious ways to make Python's intersection code faster. The easiest is to process the sets from smallest to largest. That was proposed in Issue3069 on 2008-06-10, but the patch was not integrated into the Python codebase.
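Without patching Python itself, a small wrapper gets most of that effect by pre-sorting the arguments by size before handing them to the built-in intersection. (The helper name is mine, and it only works when every argument is an actual set, since it needs len().)

def intersection_smallest_first(*sets):
    # Intersect starting from the smallest set so that the
    # intermediate results stay as small as possible.
    return set.intersection(*sorted(sets, key=len))

# e.g. intersection_smallest_first(*(inverted_index[c] for c in "amj"))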

However, that's not necessarily the best strategy. Suppose I want words with "q", "u", and "c" in them. There are 3619, 75144, and 85679 words with those letters, respectively, so you might think the best sort order is "quc". Testing that hypothesis:

>>> go("quc")
(842, 0.40471911430358887)
>>> go("qcu")
(842, 0.2493000030517578)
shows that "qc" is the more selective pair. This is because "q" and "u" are highly correlated; there are only
>>> len(inverted_index["q"] - inverted_index["u"])
14
14 words in the data set with a 'q' but without a 'u', while there are 2776 words with a 'q' and no 'c'.
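The corresponding check for the second number, using the same kind of set difference:
>>> len(inverted_index["q"] - inverted_index["c"])
2776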

Some years ago I came up with a dynamic algorithm which prefers to test each element against the set which most recently rejected an element, on the idea that it's likely to reject the next one too.

For the case of three sets, it simplifies to:

def set_intersection3(*input_sets):
    N = len(input_sets)
    assert N == 3
    min_index = min(range(len(input_sets)), key=lambda x: len(input_sets[x]))
    best_mismatch = (min_index+1)%N

    new_set = set()
    for element in input_sets[min_index]:
        # This failed to match last time; perhaps it's a mismatch this time?
        if element not in input_sets[best_mismatch]:
            continue

        j = 3-best_mismatch-min_index
        # If the element isn't in the set then perhaps this
        # set is a better rejection test for the next input element
        if element not in input_sets[j]:
            best_mismatch = j
        else:
            # The element is in all of the other sets
            new_set.add(element)
    return new_set
while the version which sorts the sets by size first is:
def set_intersection_sorted(*input_sets):
    input_sets = sorted(input_sets, key=len)
    new_set = set()
    for element in input_sets[0]:
        if element in input_sets[1]:
            if element in input_sets[2]:
                new_set.add(element)
    return new_set
Here's a head-to-head comparison between the three versions (times in seconds):

 pattern   set.intersection   set_intersection3   set_intersection_sorted
   quc          0.462               0.852                  1.032
   qcu          0.312               0.842                  1.032
   ucq          7.152               0.772                  0.962
which shows how (for this case) my dynamic algorithm is less sensitive to the initial query order and faster than the sorted version. In fact, if each of the three orderings is equally likely, then the average search time for the built-in C code is about half the speed of the Python code.

Of course it's not possible to simply drop in one of these alternate algorithms, because set.intersection accepts arbitrary iterables, not just sets. These other algorithms would only work for the special case where all of the inputs are sets.
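For example (a small demonstration of my own), the built-in version happily accepts a list or even a generator, where len() isn't available:
>>> set([1, 2, 3]).intersection([2, 3, 4], (x for x in range(3, 10)))
set([3])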

I've submitted this observation as Issue13653. I'm curious to see what will become of it.


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me


