
Finding all chemical symbols in a file

Tags: bash, sed

I have a file containing many chemical formulas throughout. I need to mark any text that is a chemical formula. I want to search a file for any place containing a combination of at least one chemical symbol and at least one number, and add \chemical{} around it. E.g. H2O becomes \chemical{H2O} and FeS2 becomes \chemical{FeS2}.

  • The chemicals are bounded by spaces ( ) or forward slashes (/). E.g.: /Ar becomes /\chemical{Ar}, but Arizona should not be identified as "\chemical{Ar}izona".
  • Combinations containing no numbers should be ignored.
  • I found this list which I think has every possible chemical name: "Ac, Ag, Al, Am, Ar, As, At, Au, B, Ba, Be, Bh, Bi, Bk, Br, C, Ca, Cd, Ce, Cf, Cl, Cm, Cn, Co, Cr, Cs, Cu, Db, Ds, Dy, Er, Es, Eu, F, Fe, Fm, Fr, Ga, Gd, Ge, H, He, Hf, Hg, Ho, Hs, I, In, Ir, K, Kr, La, Li, Lr, Lu, Md, Mg, Mn, Mo, Mt, N, Na, Nb, Nd, Ne, Ni, No, Np, O, Os, P, Pa, Pb, Pd, Pm, Po, Pr, Pt, Pu, Ra, Rb, Re, Rf, Rg, Rh, Rn, Ru, S, Sb, Sc, Se, Sg, Si, Sm, Sn, Sr, Ta, Tb, Tc, Te, Th, Ti, Tl, Tm, U, Uuh, Uuo, Uup, Uuq, Uus, Uut, V, W, Xe, Y, Yb, Zn, Zr".

How can I find all of the chemical formulas appearing in the file?

Village asked Mar 11 '12 05:03


1 Answer

I'd use Perl. It is more monotonous than exciting. You create a regex containing all the alternative symbols, and then build up a more complex regex from that and some other bits and pieces:

#!/usr/bin/env perl
use strict;
use warnings;

my $symbols = "Ac|Ag|Al|Am|Ar|As|At|Au|B|Ba|Be|Bh|Bi|Bk|Br|C|Ca|Cd|Ce|Cf|Cl|Cm|Cn|Co|Cr|Cs|Cu|Db|Ds|Dy|Er|Es|Eu|F|Fe|Fm|Fr|Ga|Gd|Ge|H|He|Hf|Hg|Ho|Hs|I|In|Ir|K|Kr|La|Li|Lr|Lu|Md|Mg|Mn|Mo|Mt|N|Na|Nb|Nd|Ne|Ni|No|Np|O|Os|P|Pa|Pb|Pd|Pm|Po|Pr|Pt|Pu|Ra|Rb|Re|Rf|Rg|Rh|Rn|Ru|S|Sb|Sc|Se|Sg|Si|Sm|Sn|Sr|Ta|Tb|Tc|Te|Th|Ti|Tl|Tm|U|Uuh|Uuo|Uup|Uuq|Uus|Uut|V|W|Xe|Y|Yb|Zn|Zr";

#my $symbols = "Ac|Ag|Al|...|Y|Yb|Zn|Zr";

my $regex = qr{ ([/ ]) ( (?:$symbols) (?: \d (?:$symbols) )* \d? ) ([ /]) }x;

print "$regex\n";   # Debugging aid: show the compiled regex (print, so it is never treated as a format string)

while (<>)
{
    s/$regex/$1\\chemical{$2}$3/g;  # Handles first and third (, ...) in H2O CO2 H2SO4
    s/$regex/$1\\chemical{$2}$3/g;  # Handles second (fourth, ...)
    print $_;
}

The first capture deals with the space or slash before the symbol, and the third with the one after it. The second capture is gruesome, using the humungous string in $symbols twice. The (?:...) are purely for grouping, not capture. The pattern looks for a chemical symbol, followed by zero or more sequences of a digit and another symbol, possibly with a trailing digit. Note that this is what you specified, but it will miss compounds such as H2SO4, CO2, KMnO4, and so on, where two symbols are adjacent with no digit between them. You can pick those up with a simple adaptation:

my $regex = qr{ ([/ ]) ( (?:$symbols) (?: \d* (?:$symbols) )* \d* ) ([ /]) }x;

The first pattern also assumes single digits in all compounds. That works for many, but some of the longer hydrocarbons won't be so good: CH4, C2H6, C3H8, C4H10, ... Again, you can deal with that by replacing the 0-or-1 ? with 0-or-more *, which the \d* in the adapted pattern already does. You still have problems with commas after compounds in lists, compounds at the beginning of a line, compounds at the end of a line, etc.; your specification rules them all out.
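If the space-or-slash rule has to stay, one possibility is to turn the two delimiter captures into zero-width lookarounds. This is only a sketch, not part of the answer as originally written: the helper name mark_delimited is made up for illustration, and treating the start and end of a line as delimiters is an added assumption. Because the delimiters are no longer consumed by the match, a single substitution pass is enough.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $symbols = "Ac|Ag|Al|Am|Ar|As|At|Au|B|Ba|Be|Bh|Bi|Bk|Br|C|Ca|Cd|Ce|Cf|Cl|Cm|Cn|Co|Cr|Cs|Cu|Db|Ds|Dy|Er|Es|Eu|F|Fe|Fm|Fr|Ga|Gd|Ge|H|He|Hf|Hg|Ho|Hs|I|In|Ir|K|Kr|La|Li|Lr|Lu|Md|Mg|Mn|Mo|Mt|N|Na|Nb|Nd|Ne|Ni|No|Np|O|Os|P|Pa|Pb|Pd|Pm|Po|Pr|Pt|Pu|Ra|Rb|Re|Rf|Rg|Rh|Rn|Ru|S|Sb|Sc|Se|Sg|Si|Sm|Sn|Sr|Ta|Tb|Tc|Te|Th|Ti|Tl|Tm|U|Uuh|Uuo|Uup|Uuq|Uus|Uut|V|W|Xe|Y|Yb|Zn|Zr";

# (?<![^ /])   : the previous character, if any, is a space or slash
# (?![^ /\n])  : the next character, if any, is a space, slash or newline
# Neither lookaround consumes text, so neighbouring formulas can share a delimiter.
my $regex = qr{ (?<![^ /]) ( (?:$symbols) (?: \d* (?:$symbols) )* \d* ) (?![^ /\n]) }x;

# Hypothetical helper (name is illustrative): wrap each formula in \chemical{}.
sub mark_delimited
{
    my ($line) = @_;
    $line =~ s/$regex/\\chemical{$1}/g;
    return $line;
}

# Small demonstration on sample strings instead of reading a file.
print mark_delimited($_), "\n" for ("H2O CO2 H2SO4", "/Ar Arizona", "CO2, H2O");
```

A formula followed directly by a comma is still skipped, since a comma is neither a space nor a slash.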

You might do better replacing the first and third captures with \b to mark the boundary between 'words' and 'non-words', where a chemical symbol would count as a word. This deals with the issues with commas and beginning and end of line, but picks up more than you specified.

my $regex = qr{ \b ( (?:$symbols) (?: \d* (?:$symbols) )* \d* ) \b }x;

print "$regex\n";   # Debugging aid: show the compiled regex (print, so it is never treated as a format string)

while (<>)
{
    s/$regex/\\chemical{$1}/g;
    print $_;
}

Note that this formulation doesn't need the double substitution; a single one is sufficient, so it is definitely cleaner.
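The question also asked that combinations containing no numbers be ignored, and none of the patterns above enforce that. With \b in particular, ordinary words such as He or I are valid symbols and would be marked. One way to add the rule is a lookahead that requires a digit somewhere in the run of letters and digits starting at the match position. This is a sketch under that assumption; the lookahead and the helper name mark_with_digit are illustrative and not from the original answer.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $symbols = "Ac|Ag|Al|Am|Ar|As|At|Au|B|Ba|Be|Bh|Bi|Bk|Br|C|Ca|Cd|Ce|Cf|Cl|Cm|Cn|Co|Cr|Cs|Cu|Db|Ds|Dy|Er|Es|Eu|F|Fe|Fm|Fr|Ga|Gd|Ge|H|He|Hf|Hg|Ho|Hs|I|In|Ir|K|Kr|La|Li|Lr|Lu|Md|Mg|Mn|Mo|Mt|N|Na|Nb|Nd|Ne|Ni|No|Np|O|Os|P|Pa|Pb|Pd|Pm|Po|Pr|Pt|Pu|Ra|Rb|Re|Rf|Rg|Rh|Rn|Ru|S|Sb|Sc|Se|Sg|Si|Sm|Sn|Sr|Ta|Tb|Tc|Te|Th|Ti|Tl|Tm|U|Uuh|Uuo|Uup|Uuq|Uus|Uut|V|W|Xe|Y|Yb|Zn|Zr";

# (?= [A-Za-z]* \d ) : zero-width check that a digit follows some (possibly zero) letters,
# so words made only of letters (He, I, Ar) are rejected before the symbols are tried.
my $regex = qr{ \b (?= [A-Za-z]* \d ) ( (?:$symbols) (?: \d* (?:$symbols) )* \d* ) \b }x;

# Hypothetical helper (name is illustrative): wrap formulas that contain a digit.
sub mark_with_digit
{
    my ($line) = @_;
    $line =~ s/$regex/\\chemical{$1}/g;
    return $line;
}

# Small demonstration on sample strings instead of reading a file.
print mark_with_digit($_), "\n" for ("He said H2O and CO2 mix, I think", "FeS2 in Arizona");
```

The rest of the pattern is unchanged, so a single substitution per line is still sufficient.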

Jonathan Leffler answered Nov 14 '22 20:11