Parsing Chemical Formulas?

Before I try and reinvent the wheel, I was wondering if anybody has code to parse chemical formulas to get the number of atoms of each type in the compound.

I mean something like: (C8H17N(CH3)3)3H3V10O28
with nested parentheses

Thanks,

  • Karen

I would bet that if anybody does, Eugene Daiken would, based on his job responsibilities and educational background (heavy-duty chemistry, from everything I’ve seen). You might try a PM to him if you get no response here, Karen.

There seem to be two nice characteristics:

  1. The periodic table is well defined, so you don’t have to handle arbitrary chemical names (carbon IS C and always C, etc.). You can easily find unique atom names even when symbols overlap, for instance C, Ca and Co: replace from the longest symbols to the shortest, swapping each for its full name or some other marker you can split on.

  2. None of the atoms have digits in their symbols, so that is one thing you can rule out quite easily when parsing.

Nested parens are fairly easy to handle: as you parse, you just count the nesting level up & down.

This little bit splits it into an array that you can deal with reasonably

  dim parseString as string = "(C8H17N(CH3)3)3H3V10O28"
  
  dim bits() as string
  
  // a tab never appears in a formula, so use it as the separator
  dim tab as string = Encodings.UTF8.Chr(9)
  
  dim tmp as string = parseString
  // put each paren in an entry of its own
  tmp = tmp.ReplaceAll("(", tab + "(" + tab)
  tmp = tmp.ReplaceAll(")", tab + ")" + tab)
  
  // now just for example to handle the atoms in this formula
  // (each symbol gets a separator in front of it)
  tmp = tmp.ReplaceAll("C", tab + "C")
  tmp = tmp.ReplaceAll("H", tab + "H")
  tmp = tmp.ReplaceAll("N", tab + "N")
  tmp = tmp.ReplaceAll("O", tab + "O")
  tmp = tmp.ReplaceAll("V", tab + "V")
  
  bits = split(tmp, tab)
With that array:

  1. ignore empty elements in bits
  2. each atom + any associated count is one entry (which I assume you can deal with)
  3. parens are in an element of their own so you can count & multiply as needed, i.e. when you get to (CH3)3 you’ll have

 (
    C
    H3
 )
 3

so you need to multiply the counts accumulated inside the parens by the trailing 3, which gives you C3 H9
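
For what it’s worth, here is a minimal sketch of that split-then-multiply idea. It is in Python rather than Xojo purely so it stays short and can be run as is. The names `count_atoms`, `BIT` and `ATOM` are made up for illustration, and a regular expression stands in for the ReplaceAll/Split step, so treat it as one possible reading of the approach rather than finished code.

```python
import re
from collections import Counter

# One "bit" is a paren, an element symbol with an optional count, or a bare count.
BIT = re.compile(r"\(|\)|[A-Z][a-z]*\d*|\d+")
ATOM = re.compile(r"([A-Z][a-z]*)(\d*)")


def count_atoms(formula):
    """Count atoms per element symbol using a stack of open paren groups."""
    bits = BIT.findall(formula)  # characters that match nothing are silently skipped
    stack = [Counter()]          # one Counter per open paren level, root at the bottom
    i = 0
    while i < len(bits):
        bit = bits[i]
        if bit == "(":
            stack.append(Counter())          # start accumulating a new group
        elif bit == ")":
            if len(stack) == 1:
                raise ValueError("unmatched ')'")
            group = stack.pop()
            multiplier = 1
            # a bare number right after ')' is the multiplier for the whole group
            if i + 1 < len(bits) and bits[i + 1].isdigit():
                multiplier = int(bits[i + 1])
                i += 1
            for symbol, n in group.items():
                stack[-1][symbol] += n * multiplier
        else:
            match = ATOM.fullmatch(bit)
            if match is None:                # a bare number that does not follow ')'
                raise ValueError(f"unexpected count {bit!r}")
            stack[-1][match.group(1)] += int(match.group(2) or 1)
        i += 1
    if len(stack) != 1:
        raise ValueError("unmatched '('")
    return dict(stack[0])


print(count_atoms("(C8H17N(CH3)3)3H3V10O28"))
```

For the formula from the first post this gives 33 C, 81 H, 3 N, 10 V and 28 O. Note that the symbol pattern accepts any capital letter followed by lowercase letters, so it does not check against the periodic table.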

Seems like this should be pretty easy
Where have I heard those words before ? :slight_smile:

[quote=186679:@Norman Palardy]Seems like this should be pretty easy
Where have I heard those words before ? :)[/quote]
from Geoff, perhaps??? ^^

The nice thing here is that the possible items are well defined. The algorithm is very close to what you need to parse a mathematical expression (see the shunting-yard algorithm), and it really looks like it should be reasonably straightforward.

It’s just a simple matter of code :stuck_out_tongue:

The notation is always the atom or group followed by its quantity in the molecule. Groups are usually enclosed in parentheses and also followed by a number that indicates their quantity in the molecule.

Therefore H3V10O28 is quite straightforward.

(C8H17N(CH3)3)3 must be read as 3 times C8H17N(CH3)3, where (CH3)3 is three times CH3 (therefore, nine methyl groups in total).

That should end up something like: 61 C, 63 H, 3 N, 10 V and 28 O, unless I did not multiply and add correctly.

I forgot: no number after a symbol means 1 occurrence of the atom in the group or molecule.

EDIT: yes, I did miscalculate a few atoms… a recount gives me 33 C, 81 H, 3 N, 10 V, 28 O. Numbers may again be incorrect, but the principle applies.
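
The recount can be checked by doing the multiplication inside out, innermost group first. This is just the hand arithmetic written down in Python (not Xojo, so it can be run directly); the helper name `times` and the variable names are made up for the example.

```python
from collections import Counter


def times(group, n):
    """Multiply every atom count in a group by n."""
    return Counter({symbol: count * n for symbol, count in group.items()})


methyl = Counter(C=1, H=3)                                   # CH3
inner_group = Counter(C=8, H=17, N=1) + times(methyl, 3)     # C8H17N(CH3)3 -> C11 H26 N
total = times(inner_group, 3) + Counter(H=3, V=10, O=28)     # (...)3 H3 V10 O28

print(dict(total))
```

That works out to C 11 × 3 = 33, H 26 × 3 + 3 = 81, N 3, V 10 and O 28, which agrees with the recount.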

Norm,
It’s basically simple but a pain because the nesting can go deep and I hate writing that type of code! :wink:

BTW new elements do get made…

Chemical element symbols all start with a capital letter followed by 0 or more lowercase letters (it used to be at most one lowercase letter, but then they made new elements, so some now have 2 lowercase letters!), with the numbering and grouping conventions mentioned above.

Also, I want to handle common group abbreviations that would follow the same naming conventions as elements. For example:

Et = CH3CH2 (Ethyl group)
So:
Et3N = (CH3CH2)3N = triethylamine
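
One way to handle such abbreviations might be to expand them into a parenthesised group before parsing, so the parser itself only ever sees element symbols. A rough sketch in Python (rather than Xojo, so it can be run as is); the `GROUPS` table and the function name `expand_groups` are assumptions for illustration, and only the Et entry comes from this post.

```python
import re

# Abbreviation table; only "Et" is taken from the example above.
GROUPS = {"Et": "CH3CH2"}


def expand_groups(formula, groups=GROUPS):
    """Replace each known group abbreviation with its formula in parentheses."""
    def swap(match):
        symbol = match.group(0)
        if symbol in groups:
            return "(" + groups[symbol] + ")"
        return symbol                     # ordinary element symbol, leave untouched

    # A symbol is a capital letter followed by zero or more lowercase letters.
    return re.sub(r"[A-Z][a-z]*", swap, formula)


print(expand_groups("Et3N"))   # (CH3CH2)3N
```

Wrapping the expansion in parentheses keeps a trailing count attached to the whole group. One thing to decide up front is what happens when an abbreviation collides with a real element symbol (Pr is praseodymium but is also commonly used for propyl, Ac is actinium but also acetyl); this sketch simply lets the table win.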

Anyway I was just hoping that if someone had done it already they might be willing to share.

  • Karen

It appears that you are in for a few hours of fun. You may also have to account for charges that are sometimes included with the formula. It is easy enough to skip + and -, but the likes of 3- and 2+ could be confusing if they are assigned to atoms or groups within a larger formula. I am assuming of course that you would want to parse very large and complex formulas, the smaller ones being relatively easy to parse manually.

I also mentioned organic chemistry rules, but there may be other rules to take into account to describe the large groups formed by solvation, for example. Electrochemistry reactions may also include special notations. All that is becoming quite dusty and rusty in my memory, but I have vague recollections of funky reactions and formulas.

a recursive algorithm :slight_smile:
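
A minimal sketch of what the recursive version could look like, in Python rather than Xojo so it can be run directly. The function names are made up, and it assumes only element symbols, counts and parentheses (no charges, no dots). Each "(" makes the function call itself, so the nesting depth takes care of itself.

```python
import re
from collections import Counter

SYMBOL = re.compile(r"[A-Z][a-z]*")
NUMBER = re.compile(r"\d+")


def parse_formula(formula):
    """Return a dict of atom counts for a formula with nested parentheses."""
    counts, pos = _parse_group(formula, 0)
    if pos != len(formula):               # stopped early: stray ')' at top level
        raise ValueError(f"unexpected {formula[pos]!r} at position {pos}")
    return dict(counts)


def _read_count(text, pos):
    """Read an optional count at pos; a missing count means 1."""
    match = NUMBER.match(text, pos)
    if match:
        return int(match.group()), match.end()
    return 1, pos


def _parse_group(text, pos):
    """Parse until the end of the text or an unconsumed ')'."""
    counts = Counter()
    while pos < len(text) and text[pos] != ")":
        if text[pos] == "(":
            inner, pos = _parse_group(text, pos + 1)   # recurse into the group
            if pos >= len(text) or text[pos] != ")":
                raise ValueError("missing ')'")
            n, pos = _read_count(text, pos + 1)
            for symbol, count in inner.items():
                counts[symbol] += count * n
        else:
            match = SYMBOL.match(text, pos)
            if match is None:
                raise ValueError(f"unexpected {text[pos]!r} at position {pos}")
            n, pos = _read_count(text, match.end())
            counts[match.group()] += n
    return counts, pos


print(parse_formula("(C8H17N(CH3)3)3H3V10O28"))
```

Compared with splitting into an array first, this reads the string in a single pass and reports where an unexpected character was found.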

Not often

[quote=186696:@Karen Atkocius]Chemical element symbols all start with a capital letter followed by 0 or more lowercase letters (it used to be at most one lowercase letter, but then they made new elements, so some now have 2 lowercase letters!), with the numbering and grouping conventions mentioned above.
[/quote]
Sure - but they are all known & well defined

Changing requirements already :stuck_out_tongue:

But even those groups are well defined & well known (no?) so search, replace, split & count

I don’t know that anyone has done this BUT there are LOTS of things people have done that I don’t know about :slight_smile:

For what I need, I think I can safely ignore charged species.

Let me explain. (For non-chemists - just skip this!)

I just got a new LCMS (Liquid chromatograph mass spectrometer) in my lab. The software that runs it does calculate monoisotopic molecular weight from a formula, but it does not support calculating and displaying an isotope model (showing the isotope pattern expected from the molecule).

I want to write an Xojo app so I can do that for M+ or M- (depending on whether I am running in positive or negative mode) as well as the common adducts for those modes, to help me be sure of identifying peaks.
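
Once the atom counts are available, the monoisotopic mass is just a weighted sum over them, which is also the starting point for an isotope model. A small sketch in Python (not Xojo, so it can be run as is); the table and function names are hypothetical, the masses are rounded values for the most abundant isotope, and only C, H, N and O are listed, so the table would need extending for anything else.

```python
# Rounded masses (in u) of the most abundant isotope of each element.
# Only the elements needed for the example are listed; extend as required.
MONOISOTOPIC_MASS = {
    "H": 1.007825,    # 1H
    "C": 12.000000,   # 12C (exact by definition)
    "N": 14.003074,   # 14N
    "O": 15.994915,   # 16O
}


def monoisotopic_mass(counts, masses=MONOISOTOPIC_MASS):
    """Sum count * isotope mass over a dict of atom counts."""
    missing = [symbol for symbol in counts if symbol not in masses]
    if missing:
        raise ValueError(f"no mass listed for: {', '.join(missing)}")
    return sum(masses[symbol] * n for symbol, n in counts.items())


# Triethylamine, (CH3CH2)3N, i.e. C6H15N
triethylamine = {"C": 6, "H": 15, "N": 1}
print(round(monoisotopic_mass(triethylamine), 4))   # 101.1204
```

This covers only the neutral molecule; adducts and the full isotope pattern are not handled here.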

I doubt I will ever be starting with a charged compound.

  • Karen

BTW I can’t edit the above to fix typos because it says someone else already posted…

Norm posted WHILE I was writing my post - not after it

Wow, LCMS’s are fantastic!! I’ve performed lots of GC and GCMS work, but not LCMS.

There is some interesting software on SourceForge. One possibility is OpenChrom, which looks like it assembles the chemicals (see the screenshots) and provides the CAS number. I’m not sure how reliable it is or what calibration type it uses, and it’s just an option.

Sorry, I haven’t created a program to build the chemicals from their components; I usually did this manually. You’re right, this would save a great amount of time!

Just a note:
Keep in mind that a formula may contain more than one part. See e.g. an alum like this one:

And yes, I did something like this many years ago :wink: