Apologies if this has already been asked and answered but I couldn't find a satisfactory answer.
I have a list of chemical formulas including, in this order: C, H, N and O. And I would like to pull the number after each of these letters. The problem is that not all the formulas contain an N. All contain a C, H and O however. And the number can be either single, double or (in the case of H only) triple digit.
Thus the data looks like this:
I'd like each element number for the list in separate columns. So in the first example it would be:
20 37 1 5
I've been trying:
=IFERROR(MID(LEFT(A2,FIND("H",A2)-1),FIND("C",A2)+1,LEN(A2)),"")
to separate out the C#. However, after this I get stuck as the H# is flanked by either an O or N.
Is there an excel formula or VBA that can do this?
This is a good task for regular expressions (regex). Because VBA doesn't support regular expressions out of the box we need to reference a Windows library first.
Add reference to regex under Tools then References
and selecting Microsoft VBScript Regular Expression 5.5
Add this function to a module
Option Explicit
Public Function ChemRegex(ByVal ChemFormula As String, ByVal Element As String) As Long
Dim strPattern As String
strPattern = "([CNHO])([0-9]*)"
'this pattern is limited to the elements C, N, H and O only.
Dim regEx As New RegExp
Dim Matches As MatchCollection, m As Match
If strPattern <> "" Then
With regEx
.Global = True
.MultiLine = True
.IgnoreCase = False
.Pattern = strPattern
End With
Set Matches = regEx.Execute(ChemFormula)
For Each m In Matches
If m.SubMatches(0) = Element Then
ChemRegex = IIf(Not m.SubMatches(1) = vbNullString, m.SubMatches(1), 1)
'this IIF ensures that in CH4O the C and O are count as 1
Exit For
End If
Next m
End If
End Function
Use the function like this in a cell formula
E.g. in cell B2: =ChemRegex($A2,B$1)
and copy it to the other cells
CH₃OH
or CH₂COOH
Note that the code above cannot count something like CH3OH
where elements occur more than once. Then only the first H3
is count the last is omitted.
If you need also to recognize formulas in the format like CH3OH
or CH2COOH
(and summarize the occurrences of the elements) then you need to change the code to recognize these too …
If m.SubMatches(0) = Element Then
ChemRegex = ChemRegex + IIf(Not m.SubMatches(1) = vbNullString, m.SubMatches(1), 1)
'Exit For needs to be removed.
End If
NaOH
or CaCl₂
In addition to the change above for multiple occurrences of elements use this pattern:
strPattern = "([A-Z][a-z]?)([0-9]*)" 'https://regex101.com/r/nNv8W6/2
Note that they need to be in the correct upper/lower letter case. CaCl2
works but not cacl2
or CACL2
.
Note that this doesn't proof if these letter combinations are existing elements of the periodic table. So this will also recognize eg. Xx2Zz5Q
as fictive elements Xx = 2
, Zz = 5
and Q = 1
.
To accept only combinations that exist in the periodic table use the following pattern:
strPattern = "([A][cglmrstu]|[B][aehikr]?|[C][adeflmnorsu]?|[D][bsy]|[E][rsu]|[F][elmr]?|[G][ade]|[H][efgos]?|[I][nr]?|[K][r]?|[L][airuv]|[M][cdgnot]|[N][abdehiop]?|[O][gs]?|[P][abdmortu]?|[R][abefghnu]|[S][bcegimnr]?|[T][abcehilms]|[U]|[V]|[W]|[X][e]|[Y][b]?|[Z][nr])([0-9]*)"
'https://regex101.com/r/Hlzta2/3
'This pattern includes all 118 elements up to today.
'If new elements are found/generated by scientist they need to be added to the pattern.
Ca(OH)₂
Therefore another RegEx is needed to handle the parenthesis and multiply them.
Public Function ChemRegex(ByVal ChemFormula As String, ByVal Element As String) As Long
Dim regEx As New RegExp
With regEx
.Global = True
.MultiLine = True
.IgnoreCase = False
End With
'first pattern matches every element once
regEx.Pattern = "([A][cglmrstu]|[B][aehikr]?|[C][adeflmnorsu]?|[D][bsy]|[E][rsu]|[F][elmr]?|[G][ade]|[H][efgos]?|[I][nr]?|[K][r]?|[L][airuv]|[M][cdgnot]|[N][abdehiop]?|[O][gs]?|[P][abdmortu]?|[R][abefghnu]|[S][bcegimnr]?|[T][abcehilms]|[U]|[V]|[W]|[X][e]|[Y][b]?|[Z][nr])([0-9]*)"
Dim Matches As MatchCollection
Set Matches = regEx.Execute(ChemFormula)
Dim m As Match
For Each m In Matches
If m.SubMatches(0) = Element Then
ChemRegex = ChemRegex + IIf(Not m.SubMatches(1) = vbNullString, m.SubMatches(1), 1)
End If
Next m
'second patternd finds parenthesis and multiplies elements within
regEx.Pattern = "(\((.+?)\)([0-9]+)+)+?"
Set Matches = regEx.Execute(ChemFormula)
For Each m In Matches
ChemRegex = ChemRegex + ChemRegex(m.SubMatches(1), Element) * (m.SubMatches(2) - 1) '-1 because all elements were already counted once in the first pattern
Next m
End Function
This will also recognize parenthesis. Note that it does not recognize nested parenthesis.
Also have a look at a similar question: Determine total number of atoms in a chemical formula
This seems to work just fine:
Formula in B2
is below. Drag across and down
=IFERROR(IFERROR(--(MID($A2,SEARCH(B$1,$A2)+1,3)),IFERROR(--(MID($A2,SEARCH(B$1,$A2)+1,2)),--MID($A2,SEARCH(B$1,$A2)+1,1))),0)
Or a shorter array formula, which must be entered with ctrl+shift+enter
=MAX(IFERROR(--MID($A2,SEARCH(B$1,$A2)+1,ROW($A$1:$A$3)),0))
If you wanted to keep the VBA super simple, something like this works as well:
Public Function ElementCount(str As String, element As String) As Long
Dim i As Integer
Dim s As String
For i = 1 To 3
s = Mid(str, InStr(str, element) + 1, i)
On Error Resume Next
ElementCount = CLng(s)
On Error GoTo 0
Next i
End Function
Use it like so:
=ElementCount(A1,"C")
I did this in VBA, using regular expressions. You can probably do it like Vityata suggests by looping through the string too, but I suspect that this is slightly faster and easier to read.
Option Explicit
Function find_associated_number(chemical_formula As Range, element As String) As Variant
Dim regex As Object: Set regex = CreateObject("VBScript.RegExp")
Dim pattern As String
Dim matches As Object
If Len(element) > 1 Or chemical_formula.CountLarge <> 1 Then
find_associated_number = CVErr(xlErrName)
Else
pattern = element + "(\d+)\D"
With regex
.pattern = pattern
.ignorecase = True
If .test(chemical_formula) Then
Set matches = .Execute(chemical_formula)
find_associated_number = matches(0).submatches(0)
Else
find_associated_number = CVErr(xlErrNA)
End If
End With
End If
End Function
Then you use the formula in your sheet like normal:
Column C contains the number of carbon atoms, column D the number of nitrogen atoms. Just expand on this by copying the formula and changing the element it searches for.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With