Break string into list elements based on keywords

Tags: python, regex

I'm trying to create a function (in Python) that takes its input (a chemical formula) and splits it into a list. For example, if the input was "HC2H3O2", it would turn it into:

molecule_list = ['H', 1, 'C', 2, 'H', 3, 'O', 2]

This works well so far, but if I input an element with two letters in it, for example sodium (Na), it would split it into:

['N', 'a']

I'm searching for a way to make my function look through the string for keys found in a dictionary called elements. I'm also considering using regex for this, but I'm not sure how to implement it. This is what my function is right now:

def split_molecule(inputted_molecule):
    """Take the input and split it into a list
    eg: CO2 => ['C', 1, 'O', 2]
    """
    # step 1: convert inputted_molecule to a list
    # step 2a: if there are two periodic elements next to each other, insert a '1'
    # step 2b: if the last element is an element, append a '1'
    # step 3: convert all numbers in list to ints

    # step 1:
    # problem: it splits Na into 'N', 'a'
    # it needs to split by periodic elements
    molecule_list = list(inputted_molecule)

    # because at most, the list can double when "1" is inserted
    max_length_of_molecule_list = 2*len(molecule_list)
    # step 2a:
    for i in range(0, max_length_of_molecule_list):
        try:
            if (molecule_list[i] in elements) and (molecule_list[i+1] in elements):
                molecule_list.insert(i+1, "1")
        except IndexError:
            break
    # step 2b:
    if (molecule_list[-1] in elements):
        molecule_list.append("1")

    # step 3:
    for i in range(0, len(molecule_list)):
        if molecule_list[i].isdigit():
            molecule_list[i] = int(molecule_list[i])

    return molecule_list
ohblahitsme asked Mar 20 '12 07:03


1 Answer

How about

import re
print(re.findall(r'[A-Z][a-z]?|[0-9]+', 'Na2SO4MnO4'))

result

['Na', '2', 'S', 'O', '4', 'Mn', 'O', '4']
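
To get from these tokens to the list format asked for in the question (integer counts, with an implicit 1 filled in), one option is to capture each symbol together with its optional count. This is only a minimal sketch for flat formulas without parentheses; it reuses the question's function name split_molecule, and the two-group pattern is my own variation on the expression above:

```python
import re

def split_molecule(formula):
    """Split a flat formula into [symbol, count, symbol, count, ...]."""
    result = []
    # Each match is a (symbol, digits) pair; digits is '' when no count follows.
    for symbol, digits in re.findall(r'([A-Z][a-z]?)([0-9]*)', formula):
        result.append(symbol)
        result.append(int(digits) if digits else 1)
    return result

print(split_molecule('HC2H3O2'))  # ['H', 1, 'C', 2, 'H', 3, 'O', 2]
print(split_molecule('Na2SO4'))   # ['Na', 2, 'S', 1, 'O', 4]
```

Because the count is captured alongside the symbol, the separate "insert a 1" and "convert to int" passes from the question are no longer needed.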

Regex explained:

Find everything that is either

    [A-Z]   # A,B,...Z, ie. an uppercase letter
    [a-z]   # followed by a,b,...z, ie. a lowercase letter
    ?       # which is optional
    |       # or
    [0-9]   # 0,1,2...9, ie a digit
    +       # and perhaps some more of them

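This expression is pretty dumb since it accepts arbitrary "elements", like "Xy". You can improve it by replacing the [A-Z][a-z]? part with the actual list of element symbols, separated by |, like Ba|Na|Mn...|C|O. Put the two-letter symbols before the one-letter ones: alternation tries its branches from left to right, so N would otherwise match before Na gets a chance.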
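
A rough sketch of building such a pattern from a list of symbols follows. ELEMENTS here is a hypothetical, deliberately incomplete list for illustration; in the question's setup the keys of the elements dictionary would take its place. Sorting by length puts the two-letter symbols first.

```python
import re

# Hypothetical, incomplete list of symbols, for illustration only.
ELEMENTS = ['H', 'C', 'N', 'O', 'S', 'Na', 'Mn', 'Ba', 'Cl']

# Longest symbols first, so that 'Na' is tried before 'N'.
symbols = sorted(ELEMENTS, key=len, reverse=True)
TOKEN_RE = re.compile('|'.join(re.escape(s) for s in symbols) + '|[0-9]+')

print(TOKEN_RE.findall('Na2SO4MnO4'))
# ['Na', '2', 'S', 'O', '4', 'Mn', 'O', '4']

print(TOKEN_RE.findall('Xy2'))
# ['2']
```

Note that findall silently skips text it cannot match, which is why "Xy" simply disappears in the second call. If unknown symbols should be reported as errors, the joined length of the tokens can be compared with the length of the input.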

Of course, regular expressions can only handle very simple formulas. To parse something like

  8(NH4)3P4Mo12O40 + 64NaNO3 + 149NH4NO3 + 135H2O

you're going to need a real parser, e.g. pyparsing (be sure to check "chemical formulas" under "Examples"). Good luck!
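
To give an idea of what such a parser has to do, here is a small hand-written sketch that handles one term of that expression (a leading coefficient plus nested parentheses) using a stack. It does not use pyparsing and is not its API; count_atoms and TOKEN_RE are names made up for this example, and splitting the full expression on " + " is left out.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r'([A-Z][a-z]?)|([0-9]+)|([()])')

def count_atoms(term):
    """Count atoms in a single term such as '8(NH4)3P4Mo12O40'."""
    # A leading number is a coefficient that applies to the whole term.
    lead = re.match(r'[0-9]+', term)
    coefficient = int(lead.group()) if lead else 1
    body = term[lead.end():] if lead else term

    stack = [Counter()]  # one Counter per open parenthesis level
    last = None          # the symbol or group that a following number multiplies
    pos = 0
    while pos < len(body):
        m = TOKEN_RE.match(body, pos)
        if m is None:
            raise ValueError('unexpected character %r at %d' % (body[pos], pos))
        symbol, number, paren = m.groups()
        if symbol:
            last = Counter({symbol: 1})
            stack[-1].update(last)
        elif number:
            if last is None:
                raise ValueError('number with nothing to multiply at %d' % pos)
            # 'last' was already added once, so add it (n - 1) more times.
            for atom, n in last.items():
                stack[-1][atom] += n * (int(number) - 1)
            last = None
        elif paren == '(':
            stack.append(Counter())
            last = None
        else:  # ')'
            if len(stack) == 1:
                raise ValueError('unbalanced ")" at %d' % pos)
            last = stack.pop()
            stack[-1].update(last)
        pos = m.end()
    if len(stack) != 1:
        raise ValueError('unbalanced "("')
    return {atom: n * coefficient for atom, n in stack[0].items()}

print(count_atoms('8(NH4)3P4Mo12O40'))
# {'N': 24, 'H': 96, 'P': 32, 'Mo': 96, 'O': 320}
```

Unlike the findall approach, this raises an error on characters it does not understand instead of skipping them. A parser library becomes more attractive once charges, hydrates, or whole reaction equations need to be supported.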

georg answered Sep 24 '22 14:09