I have a file containing many chemical formulas throughout, and I need to mark any text that is a chemical formula. I want to search the file for any place containing a combination of at least one chemical symbol and at least one number, and add \chemical{} around it. E.g. H2O becomes \chemical{H2O} and FeS2 becomes \chemical{FeS2}.
I also want to mark chemical symbols on their own when they are surrounded by spaces ( ) or forward slashes (/). E.g. /Ar becomes /\chemical{Ar}, but Arizona should not be identified as \chemical{Ar}izona.
How can I find all of the chemical formulas appearing in the file?
I'd use Perl. It is more monotonous than exciting. You create a regex containing all the alternative symbols, and then build up a more complex regex from that and some other bits and pieces:
#!/usr/bin/env perl
use strict;
use warnings;
my $symbols = "Ac|Ag|Al|Am|Ar|As|At|Au|B|Ba|Be|Bh|Bi|Bk|Br|C|Ca|Cd|Ce|Cf|Cl|Cm|Cn|Co|Cr|Cs|Cu|Db|Ds|Dy|Er|Es|Eu|F|Fe|Fm|Fr|Ga|Gd|Ge|H|He|Hf|Hg|Ho|Hs|I|In|Ir|K|Kr|La|Li|Lr|Lu|Md|Mg|Mn|Mo|Mt|N|Na|Nb|Nd|Ne|Ni|No|Np|O|Os|P|Pa|Pb|Pd|Pm|Po|Pr|Pt|Pu|Ra|Rb|Re|Rf|Rg|Rh|Rn|Ru|S|Sb|Sc|Se|Sg|Si|Sm|Sn|Sr|Ta|Tb|Tc|Te|Th|Ti|Tl|Tm|U|Uuh|Uuo|Uup|Uuq|Uus|Uut|V|W|Xe|Y|Yb|Zn|Zr";
#my $symbols = "Ac|Ag|Al|...|Y|Yb|Zn|Zr";
my $regex = qr{ ([/ ]) ( (?:$symbols) (?: \d (?:$symbols) )* \d? ) ([ /]) }x;
print STDERR "$regex\n"; # Debug: show the compiled regex without polluting the output
while (<>)
{
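# Each match consumes its trailing space or slash, so the next formula in a
# run of formulas has no leading delimiter left to match in the same pass.
s/$regex/$1\\chemical{$2}$3/g; # First pass: handles the first, third, ... formula in a run
s/$regex/$1\\chemical{$2}$3/g; # Second pass: handles the second, fourth, ...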
print $_;
}
The first capture deals with the space or slash before the symbol, and the third with the one after it. The second capture is gruesome, using the humungous string in $symbols twice. The (?:...) groups are purely for grouping, not capture. The pattern looks for a chemical symbol followed by zero or more sequences of a single digit and another symbol, possibly with a trailing digit. Note that this is what you specified, but it will miss compounds such as H2SO4, CO2, KMnO4, and so on, where two symbols sit next to each other with no digit between them. You can pick those up with a simple adaptation:
my $regex = qr{ ([/ ]) ( (?:$symbols) (?: \d* (?:$symbols) )* \d* ) ([ /]) }x;
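To make the difference between the two patterns concrete, here is a small sketch that tries both on a few formulas. The shortened symbol list and the matches() helper are assumptions made for illustration only; they are not part of the script above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Shortened symbol list, for illustration only; use the full list in practice.
my $symbols = "C|H|K|Mn|O|S";

# Strict form: exactly one digit between symbols, at most one trailing digit.
my $strict  = qr{ ([/ ]) ( (?:$symbols) (?: \d  (?:$symbols) )* \d? ) ([ /]) }x;
# Relaxed form: any number of digits (including none) in both places.
my $relaxed = qr{ ([/ ]) ( (?:$symbols) (?: \d* (?:$symbols) )* \d* ) ([ /]) }x;

# matches() is a hypothetical helper: does the pattern accept " formula "?
sub matches
{
    my ($re, $formula) = @_;
    return " $formula " =~ $re ? 1 : 0;
}

# Report, for each sample formula, whether each pattern accepts it.
for my $formula (qw(H2O H2SO4 CO2 KMnO4 C4H10))
{
    printf "%-6s strict=%d relaxed=%d\n", $formula,
        matches($strict, $formula), matches($relaxed, $formula);
}
```

With this shortened list, only H2O should pass the strict form, while the relaxed form should accept all five.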
The first pattern also assumes single digits in all compounds. That works for many, but some of the longer hydrocarbons won't be so good: CH4, C2H6, C3H8, C4H10, ... Again, you can deal with that by using the 0-or-more * (as in the adaptation) in place of the bare \d and the 0-or-1 ?. You still have problems with commas after compounds in lists, compounds at the beginning of a line, compounds at the end of a line, etc; your specification rules them all out.
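If you want to keep the space-or-slash specification but avoid running the substitution twice, one option is to test for the delimiters with lookbehind and lookahead assertions, which do not consume them. The following is only a sketch under that assumption: the symbol list is shortened for illustration and mark_delimited() is a hypothetical helper name. It still misses formulas at the start or end of a line and those followed by a comma.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Shortened symbol list, for illustration only; use the full list in practice.
my $symbols = "Ar|C|Fe|H|K|Mn|O|S";

# (?<=[/ ]) and (?=[ /]) check for a delimiter without consuming it, so a
# space shared by two adjacent formulas can serve both matches in one pass.
my $regex = qr{ (?<=[/ ]) ( (?:$symbols) (?: \d* (?:$symbols) )* \d* ) (?=[ /]) }x;

# mark_delimited() is a hypothetical helper name, not part of the original answer.
sub mark_delimited
{
    my ($line) = @_;
    $line =~ s/$regex/\\chemical{$1}/g;
    return $line;
}

print mark_delimited("mix H2O CO2 H2SO4 and /Ar/ in Arizona\n");
```

Because the delimiters are no longer part of the match, the replacement only needs the one capture.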
You might do better replacing the first and third captures with \b to mark the boundary between 'words' and 'non-words', where a chemical formula would count as a word. This deals with the issues with commas and beginning and end of line, but picks up more than you specified.
my $regex = qr{ \b ( (?:$symbols) (?: \d* (?:$symbols) )* \d* ) \b }x;
print STDERR "$regex\n"; # Debug: show the compiled regex without polluting the output
while (<>)
{
s/$regex/\\chemical{$1}/g;
print $_;
}
Note that this formulation doesn't need the double substitution; a single one is sufficient, so it is definitely cleaner.
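One thing the \b version will pick up is any bare symbol that happens to be an ordinary word, such as I, He, As or In. Since the question asks for at least one symbol and at least one number, a lookahead can insist on a digit somewhere in the word. This is a sketch only: the symbol list is shortened for illustration and mark_formulas() is a hypothetical helper name. Bare symbols such as /Ar would then still need a delimiter-based pattern.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Shortened symbol list, for illustration only; use the full list in practice.
my $symbols = "As|C|Fe|H|He|I|O|S";

# (?= [A-Za-z]* \d ) insists that a digit follows the leading letters of the
# word, so words made only of letters (He, I, As) are left alone.
my $regex = qr{ \b (?= [A-Za-z]* \d ) ( (?:$symbols) (?: \d* (?:$symbols) )* \d* ) \b }x;

# mark_formulas() is a hypothetical helper name, not part of the original answer.
sub mark_formulas
{
    my ($line) = @_;
    $line =~ s/$regex/\\chemical{$1}/g;
    return $line;
}

print mark_formulas("He said that H2O, CO2 and FeS2 react, but I doubt As does.\n");
```

The trailing \b means the match must cover the whole word, so the digit found by the lookahead is always inside the text that gets wrapped.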