
How to use regular expressions to match everything before a certain type of word

I am new to regular expressions.

Is it possible to match everything before a word that meets certain criteria?

E.g.

THIS IS A TEST - - +++ This is a test

I would like to find the first word that begins with an uppercase letter followed by a lowercase letter; I'll call this a proper word. I would then like to delete everything before that word.

The example above should produce: This is a test

I only want to do this processing until it finds the first proper word, and then stop.

Any help would be appreciated.

Thanks

John Daly asked Feb 17 '09 23:02


2 Answers

Replace

^.*?(?=[A-Z][a-z])

with the empty string. This works for ASCII input. For non-ASCII input (Unicode, other languages), [A-Z] and [a-z] miss letters such as É or ü, so Unicode-aware character classes are needed instead.

Explanation

^      From the start of the string,
.*?    as little of anything as possible, until the position is
(?=    followed by
[A-Z]  one of A .. Z and
[a-z]  one of a .. z
)

The Java Unicode-enabled variant would be this:

^.*?(?=\p{Lu}\p{Ll})
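
As a minimal sketch of how either pattern might be applied in Java, assuming single-line input: the class name `StripBeforeProperWord`, the helper `strip`, and the extra sample strings are made up for illustration and are not part of the answer above.

```java
import java.util.regex.Pattern;

class StripBeforeProperWord {
    // ASCII-only: lazily consume the prefix up to an A-Z letter followed by a-z.
    static final Pattern ASCII = Pattern.compile("^.*?(?=[A-Z][a-z])");

    // Unicode-aware: any uppercase letter followed by any lowercase letter.
    static final Pattern UNICODE = Pattern.compile("^.*?(?=\\p{Lu}\\p{Ll})");

    // Replaces the matched prefix with the empty string.
    // If the pattern does not match, the input is returned unchanged.
    static String strip(Pattern prefix, String input) {
        return prefix.matcher(input).replaceFirst("");
    }

    public static void main(String[] args) {
        // The example from the question.
        System.out.println(strip(ASCII, "THIS IS A TEST - - +++ This is a test"));

        // No uppercase-then-lowercase pair here, so nothing is removed.
        System.out.println(strip(ASCII, "NO PROPER WORD"));

        // \u00DC is U-umlaut; only the Unicode pattern treats it as uppercase.
        System.out.println(strip(UNICODE, "\u00C4\u00D6\u00DC +++ \u00DCber uns"));
    }
}
```

Because the lookahead consumes nothing, the proper word itself stays in the result. Note that `.` does not match line terminators by default, so multi-line input would need a different approach.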

Tomalak answered Sep 21 '22 10:09


Now that I've woken up a bit: you don't need to delete anything, or even create a capturing group. Just find the pattern given in the other answer and take everything from there to the end of the line. Here's a complete example:

import java.util.regex.*;

public class Test
{
    public static void main(String[] args)
    {
        // An uppercase letter, a lowercase letter, then the rest of the line.
        Pattern pattern = Pattern.compile("[A-Z][a-z].*");

        String original = "THIS IS A TEST - - +++ This is a test";
        Matcher match = pattern.matcher(original);
        // find() scans for the first place the pattern matches.
        if (match.find())
        {
            System.out.println(match.group());
        }
        else
        {
            System.out.println("No match");
        }
    }
}

EDIT: Original answer

This looks like it's doing the right thing:

import java.util.regex.*;

public class Test
{
    public static void main(String[] args)
    {
        // Group 1 captures everything from the proper word to the end.
        Pattern pattern = Pattern.compile("^.*?([A-Z][a-z].*)$");

        String original = "THIS IS A TEST - - +++ This is a test";
        // Replace the whole match with just the captured group.
        String replaced = pattern.matcher(original).replaceAll("$1");

        System.out.println(replaced);
    }
}

Basically the trick is not to ignore everything before the proper word - it's to group everything from the proper word onwards, and replace the whole text with that group.

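The above would fail with "*** FOO *** I am fond of peanuts": no uppercase letter there is followed by a lowercase one, so "I" isn't considered a proper word and the string comes back unchanged. Changing the [a-z] to [a-z\s] (written [a-z\\s] in a Java string literal) accepts whitespace instead of a letter, but note that it then also matches the last letter of any all-caps word followed by a space, such as the final "O" of "FOO". Adding a word boundary, \b[A-Z][a-z\s], restricts the match to the start of a word and handles this input, but a standalone capital such as the "A" in the original example then counts as a proper word too, so pick whichever rule fits your data.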
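
To see how these variants behave on the two sample strings, here is a small check. The class name `SingleLetterWordCheck` and the helper `keepFromMatch` are hypothetical, and the `\b` variant is an assumption of this sketch rather than something the answer prescribes.

```java
import java.util.regex.Pattern;

class SingleLetterWordCheck {
    // The pattern from the answer, plus two relaxed variants.
    static final String ORIGINAL = "^.*?([A-Z][a-z].*)$";
    static final String RELAXED  = "^.*?([A-Z][a-z\\s].*)$";
    static final String BOUNDED  = "^.*?(\\b[A-Z][a-z\\s].*)$";

    // Replaces the whole input with capture group 1; if the regex does
    // not match at all, the input comes back unchanged.
    static String keepFromMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        String peanuts = "*** FOO *** I am fond of peanuts";
        String test = "THIS IS A TEST - - +++ This is a test";

        System.out.println(keepFromMatch(ORIGINAL, peanuts)); // unchanged
        System.out.println(keepFromMatch(RELAXED, peanuts));  // starts at the last "O" of FOO
        System.out.println(keepFromMatch(BOUNDED, peanuts));  // starts at "I"
        System.out.println(keepFromMatch(BOUNDED, test));     // starts at the standalone "A"
    }
}
```

The last line shows the trade-off: once single-letter words are allowed, the "A" in the question's own example is picked up before "This".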

Jon Skeet answered Sep 21 '22 10:09