
Martel and Mindy status

I was asked recently about the status of Mindy, the Martel indexing project I worked on a few years ago. Here's my reply.

Thank you for your interest in Mindy. It's a project I worked on a few years ago but stopped working on because I didn't get any feedback from people saying they would pay money for it, much less that they felt it was needed. To make a useful product I feel developers need to work with end-users, and I didn't know the right people in that category.

So while I do feel it's a useful product, I've instead been working on things which are either more interesting or more likely to pay for my mortgage.

The proof-of-concept prototype showed that it can be successful, and if you would like to work on it I can provide some pointers.

The main problem is Martel. As you've read, it's a special-purpose regular expression engine that emits SAX events. This makes it possible to parse existing flat-file formats as if they were in XML.

I based Martel on the mxTextTools engine. As it turns out, my implementation doesn't quite match standard regexp behavior. As I recall, one bug is that ".*" doesn't backtrack, and another is that "((a)|(ab))c" also doesn't backtrack and so fails to match "abc". (Why? It assumes that once a match occurs there is no need for backtracking.)
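For contrast, Python's standard re module handles that second pattern correctly, because it backtracks into the alternation:

  import re

  # A backtracking engine first tries the "a" alternative, fails to
  # match "c" against "b", then backs up and retries with "ab".
  print(re.match(r"((a)|(ab))c", "abc").group(0))  # prints "abc"

  # The Martel bug described above: an engine that commits to the first
  # successful alternative never retries, so it fails on this input.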

In addition, mxTextTools requires that the entire string be in memory, so Martel inherits that limitation. I have a record parser layer to break large files into smaller records and synthesize the extra bits needed to make the XML correct, but it's still a hack and doesn't work for, say, the FASTA file of human chromosome 1.

Ideally I would like a feed-oriented parser so I can pass it a chunk of text at a time and it will emit the SAX events as soon as the input is no longer ambiguous.

Supporting that means getting rid of mxTextTools and managing the parsing state on the heap rather than on the stack. It's a lot of work. See the zlib code or one of the feed-oriented XML parsers for what that type of API looks like.
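As a rough illustration, here's a minimal sketch of what such a feed-style API might look like; the class and method names are hypothetical, modeled on the incremental XML parsers, and nothing like this exists in Martel:

  class FeedParser:
      # Hypothetical incremental interface: buffer input and emit SAX
      # events once a prefix of the parse is no longer ambiguous.
      def __init__(self, grammar, content_handler):
          self.grammar = grammar
          self.handler = content_handler
          self._buffer = ""

      def feed(self, text):
          # Accept the next chunk: match as far as possible, emit SAX
          # events for the unambiguous prefix, and keep only the still
          # ambiguous tail in the buffer.
          self._buffer += text

      def close(self):
          # End of input: flush any pending events or raise a parse error.
          pass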

I conjecture the need for some sort of 'cut' operation (a term taken from Prolog). Consider a FASTA file format definition with two forms for the header line:

NCBI_header = ...>gi|123456 followed by description ...
generic_header = ...>just the description ...

header = NCBI_header | generic_header
body = ...
record = header + body
format = Rep(record)

One might do this to have special code that detects the NCBI gi form.
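In Martel-style combinators that definition might look something like the following sketch; Str, Re, Group, and Rep are real Martel operations, but the leaf patterns are made up since the real definitions are longer:

  import Martel

  # Both header forms start with ">"; the NCBI form then requires
  # "gi|" and a numeric id.  (Illustrative patterns only.)
  NCBI_header    = Martel.Group("NCBI_header",
                       Martel.Str(">gi|") + Martel.Re(r"\d+[^\n]*\n"))
  generic_header = Martel.Group("generic_header",
                       Martel.Str(">") + Martel.Re(r"[^\n]*\n"))

  header = NCBI_header | generic_header
  body   = Martel.Group("body", Martel.Re(r"([A-Za-z*-]+\n)+"))
  record = Martel.Group("record", header + body)
  format = Martel.Rep(record)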

When parsing an NCBI header there are two paths to take: one through NCBI_header and the other through generic_header. The engine will take the NCBI_header path but still keep the state around to allow a backtrack. If the FASTA file comes from NCBI then there will be one stack push for every record in the file.

In theory it's possible for Martel to figure out that one path is a subset of the other, so there is no need to backtrack; or it could generate 3 paths through an or pattern (only the left of '|', both sides of '|', or only the right of '|'). Either one is hard. Even if done, there will still be an ambiguity in grouping. That is,

  (ab)(c)x
  (a)(bc)y
both match the same text, but the groupings don't resolve until the last character ('x' or 'y') is seen.
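Python's re module shows the effect; the same "abc" prefix ends up grouped differently depending on the final character:

  import re

  print(re.match(r"(ab)(c)x|(a)(bc)y", "abcx").groups())
  # -> ('ab', 'c', None, None)
  print(re.match(r"(ab)(c)x|(a)(bc)y", "abcy").groups())
  # -> (None, None, 'a', 'bc')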

So another approach is to have something like

header = cut(NCBI_header | generic_header)
which says that any ambiguities found in that part of the grammar are discarded after the end of the cut group.

Martel has no support at all for this.
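If it did, one plausible implementation, assuming the engine kept an explicit stack of backtrack states, would be to record the stack depth on entering the cut group and truncate back to it on a successful match. A sketch, with entirely hypothetical names:

  def cut(match_fn):
      # match_fn(text, pos, saved_states) returns the end position of a
      # match, or None, and may push backtrack states onto saved_states.
      def wrapped(text, pos, saved_states):
          depth = len(saved_states)      # states saved before the group
          end = match_fn(text, pos, saved_states)
          if end is not None:
              # Commit: alternatives recorded inside the group can never
              # be revisited, so drop them.
              del saved_states[depth:]
          return end
      return wrapped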

Even assuming these are fixed, there's still a problem in describing the languages. These are tree structures constructed in Python. Because of the forward reference problem, the leaves are built first, with the root at the bottom of the file. This makes it harder to read, though not too bad because most of the formats are very flat.

What's harder is modifying a grammar. For example, some time ago SWISS-PROT changed the AC line to allow multiple lines when there are a lot of accession numbers. Making that new format in Martel requires a lot of duplication. I can reuse the leaves from the older definition, but any branch which has the AC definition as a leaf needs to be rewritten. So I end up doing things like

from older_sp import ID, AC, DE, ....

AC_group = Rep1(AC)

record = (ID + AC_group + DT_created + .... + DE + ... + sequence + END)
format = RecordReader.StartsWith("ID", record)
while I would rather do something like
sprot_new = sprot_old.copy()

AC_group = Rep1(AC)
sprot_new.replace("AC", AC_group)

that is, to do tree editing from an existing format, rather than rebuild the tree with the small change.
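A sketch of what that replace could look like, assuming grammar nodes exposed a name and a list of children (Martel's expression classes don't actually support this kind of editing):

  import copy

  def replace(node, name, new_subtree):
      # Return a copy of the grammar tree with every node named `name`
      # swapped for `new_subtree`.
      if getattr(node, "name", None) == name:
          return new_subtree
      clone = copy.copy(node)
      if hasattr(clone, "children"):
          clone.children = [replace(child, name, new_subtree)
                            for child in clone.children]
      return clone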

I've considered recasting the tree into a DOM-like object that supports some subset of XSLT. Except I don't like XSLT. It's just the only tree editing language I know about.

The format definitions also need something that understands what to do with the XML events. Having one parser for every grammar is still tedious. Better would be if I could have one parser that supports a wide range of formats.

For example, I was toying with the idea of a standard set of names. It should be possible to have either a tag or an attribute so I know that a given block of text contains DNA sequences. This would convert a made-up format like

123_BOVINE_TPRH  DNA
ATCGATCGATGCATGC
into
<record><name bio:idtype="id">123_BOVINE_TPRH</name>  <seqtype bio:seqtype="dna">DNA</seqtype>
<sequence>ATCGATCGATGCATGC</sequence></record>
where the tags "name" (or "bio:name"?), "seqtype", and "sequence" had defined meanings that the parser could depend upon.
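To make that concrete, here is a minimal SAX handler sketch that would work unchanged for any format whose grammar used those reserved names; the tag vocabulary is the made-up part:

  import xml.sax.handler

  class SequenceCollector(xml.sax.handler.ContentHandler):
      # Collect the text of every <sequence> element, no matter which
      # flat-file format the events originally came from.
      def __init__(self):
          self.sequences = []
          self._parts = None

      def startElement(self, name, attrs):
          if name == "sequence":
              self._parts = []

      def characters(self, content):
          if self._parts is not None:
              self._parts.append(content)

      def endElement(self, name):
          if name == "sequence":
              self.sequences.append("".join(self._parts))
              self._parts = None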

It turns out that some formats are really slow to parse via a SAX-like interface. The best example was PIR, where the secondary structure annotation was mixed in with the sequence data, something like:

  A C<P G F A J K L R R G>A H V
where the '<' means "start of helix" and '>' means "end of helix". (That's an approximation; it's been a few years since I looked at it.)

In Martel this creates three function callbacks per character, like this:

  startElement("sequence", {})
  text("A")
  endElement("sequence")

  startElement("annotation", {})
  text(" ")
  endElement("annotataion")

  startElement("sequence", {})
  text("C")
  endElement("sequence")

  startElement("annotation", {})
  text("<")
  endElement("annotataion")

I prototyped a sort of "pragma" interface where the format could say to use a user-defined function to parse a block of data rather than use the XML parser. The result could look more like

  startElement("sequence", {})
  text("ACPGFA...")
  endElement("sequence")
  startElement("annotation")
   .. some special way to handle the annotation data ..
  endElement("annotation")

It would not be convertible back to the original data, but it is (according to my tests) a lot faster, and the pragma idea allows parsers to choose performance over precision.
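As a sketch, the user-defined function for the PIR case might look something like this; the hook that would invoke it is the hypothetical part:

  def parse_pir_annotation(text, handler):
      # One pass over the mixed sequence/annotation text: letters are
      # residues; '<' and '>' mark the start and end of a helix.
      residues = []
      helices = []
      for ch in text:
          if ch == "<":
              helix_start = len(residues)
          elif ch == ">":
              helices.append((helix_start, len(residues)))
          elif not ch.isspace():
              residues.append(ch)
      # Emit one sequence element plus one element per helix, instead
      # of three events per character.
      handler.startElement("sequence", {})
      handler.characters("".join(residues))
      handler.endElement("sequence")
      for start, end in helices:
          handler.startElement("helix",
                               {"start": str(start), "end": str(end)})
          handler.endElement("helix")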

Finally, from experience, regular expressions are hard for non-experts to develop. Regexps are hard to learn. It wasn't until I took an automata theory course that I understood them, and few people in bioinformatics have that background.

I believe it is possible to develop a tool that helps build the grammars, perhaps by ignoring the actual Python definition altogether.

It might be some sort of tree widget like this

+- format
   +- 'header'
   |
   |
   +- zero or more
      +- 'record'
         +- 'ID line'
         +- one or more 'AC block'
         |  +- 'AC line'
         +- 'ID line'
  ...
along with ways to input test cases. For example, when a given node/leaf in the tree is selected there's a place to paste in text that should be matched by that node. It also shows the XML tree of the matched text in some sort of browser, and/or shows where the parsing failed, both in the text and in the pattern.

To be a complete debugger it should also show the step-by-step parse, with something that shows the correspondence between text and pattern.

These are all interesting. Most are non-trivial. I just don't have the time or enthusiasm for this project because the work I'm doing now doesn't need this sort of tool. Hence no development these last few years.

If you are interested in either funding this work or doing some of it yourself, please let me know.


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me.


