Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to find consecutive repeats of unknown substring

I'm trying to make a system to factorise (if that term is correct in Chemistry) a given expanded chemical formula, such as C6H2NO2NO2NO2CH3 into brackets so that it is C₆H₂(NO₂)₃CH₃. (Ignore the subscript, or lack thereof in the first instance). Problem is, I don't know what the repeated molecule is going to be, or even how long it will be. How would I find and count repeats?

For context, here's my code so far which generates the formula from a 2D list of elements:

private String getFormula(List<List<Element>> elements)
{
    String formula = ""; //TODO Switch out for StringBuilder

    for(List<Element> currentElement : elements)
    {
        formula += currentElement.get(0).getSymbol(); //Every element per list is identical, so looking at 0 will always be safe
        if(currentElement.size() > 1) formula += currentElement.size(); //Only display a number if there is more than 1 element
    }

    return formula;
}
like image 913
Jake Stanger Avatar asked Mar 29 '16 19:03

Jake Stanger


3 Answers

EDIT Updated so that it considers order. Thanks @JakeStanger!
2nd EDIT Updated to reflect new condition where molecule ends with |


I used a regular expression to split by | since from the given String we know that a new molecule starts after |. I used a Hashmap to keep track of how many molecules of each type were given. In the end I iterated through each value in the Hashmap and appended to String result depending on whether it was one molecule or not. Cheers!

   public static String factorise(String input) {
        String result = "";
        Map<String, Integer> molecules = new LinkedHashMap<>();
        String[] res = input.split("\\|");
        for (String t : res) {
            //Check if we already have this element in our map
            if (!molecules.containsKey(t)) {
                //If not then add it and set the count to 1
                molecules.put(t, 1);
            } else {
                //If we do then update the count by 1
                molecules.put(t, molecules.get(t) + 1);
            }
        }
        //Iterate through each molecule
        for (String key : molecules.keySet()) {
            if (molecules.get(key) == 1) {
                //If the count is only at one, then we just need to append it.
                result += key;
            } else {
                //Otherwise, we need the parentheces and the number of repetitions followed after
                result = result + "(" + key + ")" + molecules.get(key);
            }
        }
        return result;
    }

Running

    System.out.println(factorise("C6|H2|NO2|NO2|NO2|CH3|OH|OH"));
    System.out.println(factorise("HO|HO"));

Yields the following when run:

run:
C6H2(NO2)3CH3(OH)2
(HO)2
BUILD SUCCESSFUL (total time: 0 seconds)

like image 168
robotlos Avatar answered Nov 01 '22 21:11

robotlos


This answer gives the code: String result = source.replaceAll("(.+)\\1+", "$1"), which replaces all repeated substrings.

I'm thinking that a slight modification of that code should do what you want. If you use "($1)" as the replacement, it will wrap the match in parenthesis. You could probably step through the replacement and determine what number should come after the parenthesis.

To prevent the regex from capturing preceding numbers, try "([A-Za-z]+[1-9]*)\\1+".

This link explains how to count number of matches. It's a bit more complex:

 Pattern pattern = Pattern.compile("([A-Za-z]+[1-9]*)\\1+");
    Matcher  matcher = pattern.matcher(YOUR_CHEM_STRING);

   int count = 0;
   String prior="";
    while (matcher.find()){
       if(m.group().equals(prior){
             count++;
       }else{
           YOUR_CHEM_STRING.replaceAll("([A-Za-z]+[1-9]*)\\1+","($1)"+count);
           count=0;
       }
    }
like image 2
Laurel Avatar answered Nov 01 '22 22:11

Laurel


You count split up the formula elements into a list and then parse it, counting consecutive repeats. When the count is greater than 1 you will add it with parenthesis.

String formula = "C6H2NO2NO2NO2CH3";
Pattern p = Pattern.compile("[a-zA-Z]+[0-9]+");
Matcher m = p.matcher(formula);
List<String> parts = new ArrayList<String>();
while(m.find()) {
    parts.add(m.group());
}

String shrink = "";
int count = 0;
for(int i=0; i<parts.size(); i++) {
    count++;
    if(i+1 == parts.size() || !parts.get(i+1).equals(parts.get(i))) {
        if(count == 1)
            shrink += parts.get(i);
        else
            shrink += "("+parts.get(i)+")"+count;
        count = 0;
    }
}
System.out.println(shrink); // result = "C6H2(NO2)3CH3"

If you are able to send the elements list, try this:

public static String shortForumla(List<List<Element>> elements) {
    String shrink = "";
    int count = 0;
    for(int i=0; i<elements.size(); i++) {
        String symbol = elements.get(i).get(0).symbol();
        if(i+1 == elements.size() || !elements.get(i+1).get(0).symbol().equals(symbol)) {
            if(count == 1)
                shrink += symbol;
            else
                shrink += "("+symbol+")"+count;
            count = 0;
        }
    }
    return shrink;
}
like image 1
Matthew Wright Avatar answered Nov 01 '22 21:11

Matthew Wright