Extract sub-string between two certain words using regex in java

Tags:

java

regex

I would like to extract the substring between two given words using Java.

For example:

This is an important example about regex for my work.

I would like to extract everything between "an" and "for".

What I did so far is:

String sentence = "This is an important example about regex for my work and for me";
Pattern pattern = Pattern.compile("(?<=an).*.(?=for)");
Matcher matcher = pattern.matcher(sentence);

boolean found = false;
while (matcher.find()) {
    System.out.println("I found the text: " + matcher.group().toString());
    found = true;
}
if (!found) {
    System.out.println("I didn't find the text");
}

It works well.

But I want to do two additional things:

  1. If the sentence is "This is an important example about regex for my work and for me", I want to extract only up to the first "for", i.e. "important example about regex".

  2. Sometimes I want to limit the number of words between the two delimiters to 3, i.e. "important example about".

Any ideas please?

Daisy asked Aug 15 '11 08:08


3 Answers

For your first question, make the quantifier lazy. Put a question mark after the quantifier and it will match as little as possible.

(?<=an).*?(?=for)

I have no idea what the additional . at the end of .*. is good for; it's unnecessary.
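
To make the difference concrete, here is a minimal sketch that runs both the greedy and the lazy variant against the sentence from the question. The class name LazyVsGreedy and the helper firstMatch are hypothetical names introduced only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazyVsGreedy {

    // Hypothetical helper: returns the first match of the regex, or null if none.
    public static String firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String sentence = "This is an important example about regex for my work and for me";

        // Greedy: .* runs to the end and backtracks to the LAST "for".
        // prints "[ important example about regex for my work and ]"
        System.out.println("[" + firstMatch("(?<=an).*(?=for)", sentence) + "]");

        // Lazy: .*? stops at the FIRST "for".
        // prints "[ important example about regex ]"
        System.out.println("[" + firstMatch("(?<=an).*?(?=for)", sentence) + "]");
    }
}
```

Note that both results keep the spaces next to "an" and "for"; call trim() on the result if they are not wanted.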

For your second question, you have to define what a "word" is. Here I would probably say it is just a sequence of non-whitespace characters followed by a whitespace character. Something like this:

\S+\s

and repeat this 3 times like this

(?<=an)\s(\S+\s){3}(?=for)

To ensure that the pattern matches only on whole words, use word boundaries:

(?<=\ban\b)\s(\S+\s){1,5}(?=\bfor\b)

See it online here on Regexr

{3} matches exactly 3 words. For a minimum of 1 and a maximum of 3, use {1,3}.
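
A short sketch of how the word-count quantifier behaves on the sentence from the question. Because these patterns require "for" to follow directly after the counted words, and four words lie between "an" and "for" in that sentence, an upper limit of 3 finds nothing there. The last pattern, which drops the lookahead to return just the first three words, is my own assumption about what the question is after and is not part of the answer above. WordLimit and firstMatch are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordLimit {

    // Hypothetical helper: returns the first match (trimmed), or null if none.
    public static String firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group().trim() : null;
    }

    public static void main(String[] args) {
        String sentence = "This is an important example about regex for my work and for me";

        // Up to 5 words, then "for": the 4 words in between fit.
        // prints "important example about regex"
        System.out.println(firstMatch("(?<=\\ban\\b)\\s(\\S+\\s){1,5}(?=\\bfor\\b)", sentence));

        // At most 3 words, then "for": no match, since 4 words lie in between.
        // prints "null"
        System.out.println(firstMatch("(?<=\\ban\\b)\\s(\\S+\\s){1,3}(?=\\bfor\\b)", sentence));

        // Assumption: without the lookahead, the first 3 words after "an" are returned.
        // prints "important example about"
        System.out.println(firstMatch("(?<=\\ban\\b)\\s(\\S+\\s){3}", sentence));
    }
}
```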

Alternative:

As dma_k correctly stated, in your case it's not necessary to use lookbehind and lookahead. See the Matcher documentation about groups.

You can use capturing groups instead. Just put the part you want to extract in parentheses and it will be put into a capturing group.

\ban\b(.*?)\bfor\b

See it online here on Regexr

You can then access this group like this:

System.out.println("I found the text: " + matcher.group(1).toString());
                                                        ^

You have only one pair of parentheses, so it's simple: just pass 1 to matcher.group(1) to access the first capturing group.
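
Put together, the capturing-group approach might look like the following sketch. The class name CaptureBetween and the method between are hypothetical; the trim() call is an addition to drop the spaces that the group captures next to the delimiters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaptureBetween {

    // Whole word "an", a lazily matched group, then whole word "for".
    private static final Pattern BETWEEN = Pattern.compile("\\ban\\b(.*?)\\bfor\\b");

    // Hypothetical helper: returns the trimmed text of group 1, or null if no match.
    public static String between(String sentence) {
        Matcher matcher = BETWEEN.matcher(sentence);
        return matcher.find() ? matcher.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String sentence = "This is an important example about regex for my work and for me";
        // prints "I found the text: important example about regex"
        System.out.println("I found the text: " + between(sentence));
    }
}
```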

stema answered Oct 05 '22 00:10


Your regex is "an\\s+(.*?)\\s+for". It extracts all characters between an and for, excluding the surrounding whitespace (\s+). The question mark makes the quantifier lazy (non-greedy). It is needed to prevent the pattern .* from consuming everything up to the last "for".
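
A small sketch of this pattern in use. One caveat worth seeing in code: without word boundaries, "an" can also match at the end of a longer word such as "plan". The second pattern with \b is my own variation for comparison, and the names WhitespaceDelimited and extract are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WhitespaceDelimited {

    // Hypothetical helper: returns group 1 of the first match, or null if none.
    public static String extract(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String plain = "an\\s+(.*?)\\s+for";
        // Assumption: same pattern with word boundaries added for comparison.
        String bounded = "\\ban\\b\\s+(.*?)\\s+\\bfor\\b";

        String sentence = "This is an important example about regex for my work and for me";
        // prints "important example about regex" (no trim needed, \s+ is outside the group)
        System.out.println(extract(plain, sentence));

        String tricky = "A plan to test for bugs";
        // prints "to test" because "an" matches inside "plan"
        System.out.println(extract(plain, tricky));
        // prints "null" because there is no standalone word "an"
        System.out.println(extract(bounded, tricky));
    }
}
```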

AlexR answered Oct 05 '22 00:10


public class SubStringBetween {

    // Returns the text between the first occurrence of 'before' and the first
    // occurrence of 'after' that follows it, or "" if either delimiter is missing.
    // This is plain substring matching: "an" is also found inside longer words.
    public static String subStringBetween(String sentence, String before, String after) {
        int startSub = subStringStartIndex(sentence, before);
        if (startSub < 0) {
            return "";
        }
        // Search for 'after' only in the text that follows 'before'.
        int offset = subStringEndIndex(sentence.substring(startSub), after);
        if (offset < 0) {
            return "";
        }
        return sentence.substring(startSub, startSub + offset);
    }

    // Index just past the first occurrence of the delimiter, or -1 if it is absent.
    public static int subStringStartIndex(String sentence, String delimiterBeforeWord) {
        int index = sentence.indexOf(delimiterBeforeWord);
        return index < 0 ? -1 : index + delimiterBeforeWord.length();
    }

    // Index at which the first occurrence of the delimiter begins, or -1 if it is absent.
    public static int subStringEndIndex(String sentence, String delimiterAfterWord) {
        return sentence.indexOf(delimiterAfterWord);
    }
}

FilippoE answered Oct 04 '22 23:10