Fun with SMILES I: Does an element exist?
I first learned about SMILES in 1998, when I went to work for Bioreason and used the Daylight toolkit as part of a system to apply machine learning methods to high-throughput screening. Yet even after 18 years, I continue to find new ways to work with SMILES. I recently came up with methods which I haven't elsewhere, so figured I would share those discoveries. Then I started adding more examples. This is part 1.
Check for an element using a substring search
I like to work with SMILES at the syntax level. This is a light-weight approach based on string processing. A heavy-weight approach is to use a cheminformatics toolkit which can interpret the full SMILES chemistry model.
For example, if you have a set of valid SMILES strings and you want to select only those which contain at least one bromine atom, then the light-weight approach is to find the strings which contain the substring "Br" while the heavy-weight approach is to parse the string into a molecule object, iterate over the atoms, and select the molecule if one of the atoms is a bromine.
def lightweight_has_bromine_atom(smiles): return "Br" in smiles
from rdkit import Chem def heavyweight_has_bromine_atom(smiles): mol = Chem.MolFromSmiles(smiles) if mol is None: # SMILES could not be parsed return False return any(atom for atom in mol.GetAtoms() if atom.GetAtomicNum() == 35)The light-weight approach works because a valid SMILES string contains the two characters "Br" if and only if the structure has a bromine.
Test the 'Br' search code
It's easy to make a subtle mistake with light-weight methods, so it's extra important to include good tests, and preferably also use a cross-comparison with a heavy-weight equivalent to verify that the tests are correct. Here is a simple test case for the above functions:
# Pass in the function to test def _test_has_bromine_atom(has_bromine): for smiles in ( "Br", "[Br]", "[80Br]", "CCBr", "BCCBr", "B[Br]", ): assert has_bromine(smiles), smiles for smiles in ( "B", "c1ccccc1B", "B[Cr]", ): assert not has_bromine(smiles), smiles # Test the light-weight and heavy-weight functions def test_has_bromine_atom(): _test_has_bromine_atom(lightweight_has_bromine_atom) _test_has_bromine_atom(heavyweight_has_bromine_atom)(Once you've cross-verified that the test cases are correct, you can often remove away the heavy-weight versions that were part of the bootstrap process.)
Why use a light-weight method?
You might use a light-weight approach like this because string processing is much faster than fully processing a molecule. It's fast enough that it can even be used as part of substructure screening before doing the full atom-by-atom substructure test.
For an example that came up recently in one of my projects, the user can skip any structures with more than N non-hydrogen atoms, typically around 50, and structures with fewer then R rings, typically 1. If a couple of simple requirements about the SMILES format are met, then it's possible to do both tests before even touching the toolkit, which can save a lot of time if many structures can be skipped.
In future essays I'll describe how to do those two tests.
The heavy-weight approach has its own advantages. It can detect if an input like "BrBr" is valid or invalid, it can handle a wider variety of notations, and it requires less cleverness to use correctly. Indeed, this entire set of essays is all about "clever" ways to work on notation, and part of a tradition that goes back to at least the 1950s when people would do an often appoximate substructure search directly on the WLN syntax using punched-card machines.
Check for an element using a regular expression
You can test for many other elements using a simple substring search, like chlorine/"Cl", vanadium/"V" and mercury/"Hg" because Br, Cl, and Hg and many other element symbols exist if and only if the given element is part of the structure.Some elements can't be matched by a simple substring test, either because the element symbol is a substring of a larger element symbol or because there are other ways for the characters of the element symbol to appear in a SMILES string. These can sometimes be handled with a regular expression.
You cannot identify fluorine-containing structure by a substring test for "F", unless you knew there were no iron/"Fe", francium/"Fr", fermium/"Fm", or flerovium/"Fl" atoms in the SMILES data set. Realistically, the last three have maximum known half-lives of 22 minutes, 100.5 days, and 5.5 seconds, so you're very likely to know that your compound collection doesn't contain those structures, but iron is pretty common.
Instead of using a substring test, a light-weight search might look for an "F" which either at the end of the string (as in "CF") or isn't immediately followed by one of letters in "[erml]". I'll implement this using the regular expression:
F($|[^erml]) # regular expression to test if a SMILES contains a fluorine atom
Here's the full implementation and a small test suite:
import re def has_fluorine_atom(smiles): return _f_atom_pat.search(smiles) is not None def test_has_fluorine_atom(): for smiles in ( "F", "[F]", "[19F]", "[19F-]", "F[Fe]", "[Fe]CF", "Fc1ccccc1", "c1ccccc1F", "c1cc(F)ccc1", ): assert has_fluorine_atom(smiles), smiles for smiles in ( "[Fe]", "[Fr]", "[Fm]", "[Fl]", ): assert not has_fluorine_atom(smiles), smiles
You might think you could check for an "F" so long as it isn't followed by a lowercase letter, such as with "F[^a-z]". That would match "F", but not "Fe" or "Fr". This does not work because it would also ignore the "F" in "Fc1ccccc1".
The hard part of using a light-weight approch is figuring out corner cases like this. In addition to the hand-made tests like the ones you just saw, I typically cross-compare against a heavy-weight toolkit using a large, diverse SMILES collection, to help reduce the chances that I overlooked something.
It's almost true that any test for a two letter symbol can be written as a simple substring test. One of the most notable counter-examples is scandium.
A search for "Sc" will indeed identify all SMILES with a scandium atom. It will also match structures like "Sc1ccncc1" where a sulfer is attached to an aromatic carbon. Most data sets contain many more organosulfurs than scandium-containing molecules, so a subsearch for "Sc" mostly contain false positives.
bracket_atom ::= '[' isotope? symbol chiral? hcount? charge? class? ']'(I use the OpenSMILES definition instead of the Daylight SMILES definition because the latter is less strict about the order of the modifiers after the atom symbol. Daylight SMILES allows both "[CH+]" and "[C+H]" while OpenSMILES only allows the first. There is near universal agreement among SMILES generation tools to produce only the first form, and the restriction is both less ambiguous and less error prone (what does "[C+H-]" mean?) and easier to reason with at the syntax level.)
I can match a scandium by looking for the character "[" followed by an optional isotope number, followed by "Sc". There's no need to match the rest of the bracket atom term, so I'll use the following regular expression:
\[[0-9]*Sc # regular expression to test if a SMILES contains a scandium atomwhich leads to the following implementation and test:
_scandium_atom_pat = re.compile(r"\[[0-9]*Sc") def has_scandium_atom(smiles): return _scandium_atom_pat.search(smiles) is not None def test_has_scandium_atom(): for smiles in ( "[Sc]", "C[Sc]", "[Sc]C", "[45Sc]", "[45ScH+]", "[46Sc-3]", "Cl[Scl](Cl)Cl", ): assert has_scandium_atom(smiles), smiles for smiles in ( "Sc1ccccc1", "[S]c1ccccc1", "[S]C", "[Sb]C", ): assert not has_scandium_atom(smiles), smiles
I used scandium as the example because it's likely to give a large number of false positives. I could have given similar examples with aromatic boron, because lead/"Pb" looks the same as a phosphorous/"P" connected to an aromatic boron/"b", and niobium/"Nb" looks the same as a nitrogen/"N" connected to an aromatic boron.
If you don't want any false positives for Pb or Nb, you'll have to use a regular expression like "\[[0-9]*Pb" or "\[[0-9]*Nb". On the other hand, aromatic boron is rare, especially compared to lead, a substring search for "Pb" or "Nb" will have few false positives. That said, if you justify the approximate substring search for performance reasons over the exact regular expression tesst , please make sure to measure the difference to see if it's meaningful.
Cobalt and osmium
Cobalt/"Co" and osmium/"Os" are interesting middle case which highlight the difference between valid SMILES syntax and valid chemistry. In principle the substring "Co" can also match an aliphatic carbon attached to an aromatic oxygen, and "Os" match an aliphatic oxygen attached to an aromatic sulfer.
However, that's chemically unreasonable. An aromatic atom must be part of an aromatic ring. If the aliphatic notation indicates the first atom isn't part of a ring, then the second atom must be connected to two other (aromatic) bonds. For oxygen that can only happen if the atom is charged, but we know from the syntax that's not the case. For sulfer, well, I don't know enough chemistry to reject it outright, but I searched PubChem and found no matches for the SMARTS "So(:a):a".
This means it's safe to search a chemical database for "Co" and likely also for "Os" and not get any false positives.
Aromatic atoms don't have to indicate aromatic ring membership
With one odd exception.
The lower-case letter doesn't really mean the atom must be in an aromatic ring. Most toolkits interpret it as a hint that the atom is sp2 hybridized as part of the input aromaticity processing. Here's an example from RDKit with an aromatic oxygen in the input, but with no other aromatic atoms, and where the toolkit doesn't actually perceive it as aromatic:
>>> from rdkit import Chem >>> Chem.CanonSmiles("C1CCCCCo1") 'C1CCCOCC1'(Here's an obscure detail for you. Daylight SMILES allows "cc" as a SMILES representation of ethene, according to Weininger's chapter at page 85 of the Handbook of Chemoinformatics. This is not universally accepted by other toolkits.)
If you have an aromatic oxygen which isn't part of an aromatic ring then a substring test for "Co" has a higher chance of giving a false positive. However, you are not going to come across cases like this in real data sets and I'm not going to worry about non-portable SMILES like that.
This section uses an advanced regular expression syntax. I don't suggest you use this in real code, at least not without a lot more testing. I present it here more to show off what can be done with a light-weight approach.
It's difficult but not impossible to test if a carbon atom is present. One problem is that the aliphatic carbon "C" is the first letter of 11 other atomic element symbols including chlorine. The other problem is the aromatic carbon "c" which is the second letter of actinium and scandium. That right away tells you that substring tests aren't possible. The third problem is that "C", "c", and "Cl" are all part of the organic subset, that is, the symbols which don't need to be in square brackets.
It's not that hard to detect a "C" or "c" inside of square brackets using a regular expression like you've seen earlier. Here I'll used the "verbose" regular expression syntax, which lets me write the expression over multiple lines and annotate each component:
\[ # inside of brackets [0-9]* # skip any isotope [Cc] # either 'C' or 'c' [^a-z] # and not part of a two-letter element
The problem is in how to detect if a C or c is outside of square brackets. For that, I use a lookahead assertion to test if the match is not followed by some other pattern. For example, here's a pattern to check if a "C" is not followed by an "l":
C # C (?!l) # not followed by an 'l' (which would be chlorine) ...I'll also require that it not be followed by any non-"[" characters and then a "]" character, because that would mean it's a bracket atom.
C # C (?!l) # not followed by an 'l' (which would be chlorine) ... (?![^]*\]) # .. and not inside of bracketsThis isn't the best of regular expressions because it will search to the end of the string to look for any square bracket characters, which might not be present. It's possible to construct better regular expressions, perhaps by expressing the optional chirality, optional hydrogen count, etc.. This would be more complicated. A less complicated solution is to note that "[()=#]" are not allowed inside of bracket atoms but common outside of them. The negative lookahead assertion can stop if it finds one of those characters. I'll leave you to work that out.
Here is is all put together, with some basic tests:
# WARNING: I haven't fully tested this. Let me know of any problems. _carbon_atom_pat = re.compile(r""" ( \[ # inside of brackets [0-9]* # skip any isotope [Cc] # either 'C' or 'c' [^a-z] # and not part of a two-letter element )| ( C # C (in the organic subset) (?!l) # not followed by an 'l' (which would be chlorine) ... (?![^]*\]) # .. and not inside of brackets ) | ( c # c (in the organic subset) (?![^]*\]) # .. and not inside of brackets ) """, re.X) # re.X and re.VERBOSE enable verbose mode syntax def has_carbon_atom(smiles): return _carbon_atom_pat.search(smiles) is not None def test_has_carbon_atom(): for smiles in ( "C", "[C]", "[12C]", "[CH4]", "NC", "ClNC", "c1ccccc1", "c1ccccc1[Sc]", "n1nnnc1", "C[Cl]", "C[S]", "[cH]1[cH][cH][cH][cH][cH]1", "n1nnn[cH]1", "N[13C@](P)(O)Br", ): assert has_carbon_atom(smiles), smiles for smiles in ( "Cl", "[Cl]", "[35Cl]", "O=O", "[Al]", "[Cl+]", "[Sc]", ): assert not has_carbon_atom(smiles), smilesIt isn't pretty, but I think it does work, and give you a fast way to find the inorganic structures in your data set.
Digressing a bit, it might be possible use a negative lookbehind assertion. Python's lookbehind assertions, like most regular expression implementations, are limited to fixed-length patterns. That's okay since real-world atoms are limited to three digits for the isotope, so there are only four possibilities to enumerate. Even with that limitation, I wasn't able to make a valid regular expression which passed the tests.
It's impossble to count the number of hydrogens without having most of a SMILES parser. How is simple string analysis supposed to know that "O=O" has no hydrogens while "C=C" has four? Then add support for charges, atoms with multiple allowed valences, and aromaticity.
The only exceptions are if either the hydrogens are written explicitly, like "[C]([H])([H])([H])[H]", or written with an explicitly specified implicit hydrogen count, like "[CH4]". Both forms are rare.
If they are are all explicit, with no isotope, charged, or other atom properties, then a substring test for "[H]" will work. Otherwise, use the scandium-like regular expression:
As for expicitly specified implicit hydrogens, stay tuned for my next essay, where I will go into atom-based properties.
Counts and multiple matches
I started this essay by pointing out that a substring test for "Br" is equivalent to testing if the structure contains a bromine. This can easily be extended to count the number of bromine atoms by counting the number of time "Br" occurs in the SMILES string.
def get_num_bromine_atoms(smiles): return smiles.count("Br")
In fact, every substring test can be converted to a count function, so "smiles.count("V")" gives the number of vanadium atoms.
Count based on regular expressions
The regular expression tests can also be turned into a count function, by using the "findall()" method instead of the "search()" method. For example, the following counts the number of scandium atoms in a SMILES string using the _scandium_atom_pat from earlier:
_scandium_atom_pat = re.compile(r"\[[0-9]*Sc") def get_num_scandium_atoms(smiles): return len(_scandium_atom_pat.findall(smiles))along with some basic tests:
def test_get_num_scandium_atoms(): for expected_count, smiles in ( (0, "Sc1ccccc1"), (0, "[S]c1ccccc1"), (0, "[S]C"), (0, "[Sb]C"), (1, "[Sc]"), (1, "C[Sc]"), (1, "[Sc]C"), (1, "[45Sc]"), (1, "[45ScH+]"), (1, "[46Sc-3]"), (1, "Cl[Scl](Cl)Cl"), (2, "[Sc]Sc[Sc]"), # not a valid SMILES (3, "[Sc][S]Scc[Sc][45S]c[Sc]"), ): got_count = get_num_scandium_atoms(smiles) assert got_count == expected_count, (smiles, got_count, expected_count)If you do this, make sure that your regular expression pattern doesn't accidentally include additional characters. For example, you can test for boron by looking for a "B" followed by something that isn't in "[aehikr]", perhaps with the regular expression:
B[^aehikr] # finds boron, but will also match BBThis would find that the SMILES BB, contains a boron, by matching both the first and second B, since B is not one of those lowercase letters. If you use this regular expression in a findall, then the search would continue after the second B, so you'll end up with one match instead of finding two boron atoms.
To fix this, switch from "[^aehikr]" to the negative lookahead assertion (?![aehikr])".
B(?![aehikr]) # only matches boron element symbolsNOTE! This will also match some chiral specifications, like "@TB1". However, I have never seen those outside of the SMILES specification. If your data set does contain them, then you will need to use more advanced techniques, like the tokenizer approach I will present in a future essay.
Match multiple elements based on regular expressions
It's also possible to combine multiple regular expressions into a large regular expression, to detect if all of them exist in a single call to match(), rather than findall() to get match counts. Here is a regular expression which matches at least two scandium atoms:
\[[0-9]*Sc.*\[[0-9]*Sc(You might also test the performance with the non-greedy ".*?" instead of the greedy ".*"; my tests with "grep -P" showed they were the same.)
However, down this route lies two different forms of combinatorial madness. If there are n scandium atoms, and you want n+1 then the backtracker will still try all 2n possible partial matches. This will slow things down. You may want to consider using a substring count first, then do the regular expression test filter out false positives.
The other combinatorial explosion comes if you want to find a SMILES string which contains multiple different element types. As a example, if you want the (admittedly unlikely) structures with 1 Pb, 1 Sc, and 1 Hg atom, you'll have to write out all six possible orderings of those terms in the combined regular expression.
Really, it's much better to tokenize and process each atom than it is to force everything into a regular expression test. I'll describe how to do that in a future essay (and link it from here once I've written it).
Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me
Copyright © 2001-2020 Andrew Dalke Scientific AB