I'm trying to create a function (in Python) that takes its input (a chemical formula) and splits it into a list. For example, if the input was "HC2H3O2", it would turn it into:
molecule_list = ['H', 1, 'C', 2, 'H', 3, 'O', 2]
This works well so far, but if I input an element with two letters in it, for example sodium (Na), it splits it into:
['N', 'a']
I'm searching for a way to make my function look through the string for keys found in a dictionary called elements. I'm also considering using regex for this, but I'm not sure how to implement it. This is what my function is right now:
def split_molecule(inputted_molecule):
    """Take the input and split it into a list
    eg: CO2 => ['C', 1, 'O', 2]
    """
    # step 1: convert inputted_molecule to a list
    # step 2a: if there are two periodic elements next to each other, insert a '1'
    # step 2b: if the last element is an element, append a '1'
    # step 3: convert all numbers in list to ints

    # step 1:
    # problem: it splits Na into 'N', 'a'
    # it needs to split by periodic elements
    molecule_list = list(inputted_molecule)

    # because at most, the list can double when "1" is inserted
    max_length_of_molecule_list = 2*len(molecule_list)

    # step 2a:
    for i in range(0, max_length_of_molecule_list):
        try:
            if (molecule_list[i] in elements) and (molecule_list[i+1] in elements):
                molecule_list.insert(i+1, "1")
        except IndexError:
            break

    # step 2b:
    if (molecule_list[-1] in elements):
        molecule_list.append("1")

    # step 3:
    for i in range(0, len(molecule_list)):
        if molecule_list[i].isdigit():
            molecule_list[i] = int(molecule_list[i])

    return molecule_list
How about
import re
print(re.findall(r'[A-Z][a-z]?|[0-9]+', 'Na2SO4MnO4'))
Result:
['Na', '2', 'S', 'O', '4', 'Mn', 'O', '4']
Regex explained:
Find everything that is either
[A-Z] # A,B,...Z, ie. an uppercase letter
[a-z] # followed by a,b,...z, ie. a lowercase letter
? # which is optional
| # or
[0-9] # 0,1,2...9, ie a digit
+ # and perhaps some more of them
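To get from this pattern to the list shown in the question (with an explicit 1 wherever no count is written), one option is to capture each symbol together with its optional count. This is only a minimal sketch: the grouped pattern is a variation on the one above, and the function name is simply reused from the question.

```python
import re

def split_molecule(formula):
    """Split e.g. 'HC2H3O2' into ['H', 1, 'C', 2, 'H', 3, 'O', 2]."""
    molecule_list = []
    # Each match is a (symbol, count) pair; count is '' when no digits follow.
    for symbol, count in re.findall(r'([A-Z][a-z]?)([0-9]*)', formula):
        molecule_list.append(symbol)
        # A missing count means exactly one atom of that element.
        molecule_list.append(int(count) if count else 1)
    return molecule_list

print(split_molecule('HC2H3O2'))  # ['H', 1, 'C', 2, 'H', 3, 'O', 2]
print(split_molecule('Na2SO4'))   # ['Na', 2, 'S', 1, 'O', 4]
```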
This expression is pretty dumb since it accepts arbitrary "elements", like "Xy". You can improve it by replacing the [A-Z][a-z]? part with the actual list of element symbols, separated by |, like Ba|Na|Mn...|C|O. Put the two-letter symbols before the one-letter ones, otherwise "Na" would match as just "N".
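A sketch of that idea, assuming `elements` is a dictionary keyed by element symbol as described in the question. The small dictionary below is a hypothetical stand-in (symbol to atomic number), not the asker's actual data. Sorting the symbols longest first keeps "Na" ahead of "N" in the alternation.

```python
import re

# Hypothetical stand-in for the question's `elements` dict (symbol -> atomic number).
elements = {'H': 1, 'C': 6, 'N': 7, 'O': 8, 'Na': 11, 'S': 16, 'Mn': 25, 'Ba': 56}

# Longest symbols first, so 'Na' is tried before 'N'.
symbols = sorted(elements, key=len, reverse=True)
pattern = re.compile('|'.join(map(re.escape, symbols)) + '|[0-9]+')

print(pattern.findall('Na2SO4MnO4'))  # ['Na', '2', 'S', 'O', '4', 'Mn', 'O', '4']
print(pattern.findall('Xy2'))         # ['2']
```

Note that findall silently skips text it cannot match, so an unknown symbol such as "Xy" simply disappears from the result. If you would rather reject bad input, compare the joined matches against the original string.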
Of course, regular expressions can only handle very simple formulas. To parse something like
8(NH4)3P4Mo12O40 + 64NaNO3 + 149NH4NO3 + 135H2O
you're going to need a real parser, e.g. pyparsing (be sure to check "chemical formulas" under "Examples"). Good luck!
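To illustrate why nesting needs more than a flat regex, here is a small hand-written sketch (not pyparsing, and not a full parser) that uses a stack to handle parenthesised groups in a single formula. The function name `count_atoms` is made up for this example. It assumes balanced parentheses and does not handle leading coefficients or "+" between formulas.

```python
import re
from collections import Counter

def count_atoms(formula):
    """Count atoms in a single formula that may contain parenthesised groups."""
    stack = [Counter()]  # one Counter per currently open group
    # Each token is an element symbol or a parenthesis, plus any digits after it.
    for token, digits in re.findall(r'([A-Z][a-z]?|\(|\))([0-9]*)', formula):
        n = int(digits) if digits else 1
        if token == '(':
            stack.append(Counter())        # start a new group
        elif token == ')':
            group = stack.pop()            # close the group and scale it by n
            for symbol, count in group.items():
                stack[-1][symbol] += count * n
        else:
            stack[-1][token] += n          # plain element symbol
    return dict(stack[0])

print(count_atoms('(NH4)3P4Mo12O40'))
# {'N': 3, 'H': 12, 'P': 4, 'Mo': 12, 'O': 40}
```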