
Extract words starting with a particular character from a string

I have the following string:

    String line = "#food was testy. #drink lots of. #night was fab. #three #four";

I want to extract #food, #drink, #night, #three and #four from it.

I tried this code:

    String[] words = line.split("#");
    for (String word: words) {
        System.out.println(word);
    }

But it prints "food was testy.", "drink lots of.", "night was fab.", "three" and "four" instead.

Devendra Singh asked Apr 03 '15 08:04



1 Answer

split only cuts the string wherever it finds a #, and the # itself is discarded. That explains your current result.
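To make that visible, here is a minimal sketch that prints each piece returned by split between brackets. The class name SplitDemo and the helper splitOnHash are made up for this illustration and are not part of the question's code.

```java
import java.util.Arrays;
import java.util.List;

class SplitDemo {
    // Hypothetical helper: splits the line on '#' and returns the raw pieces.
    static List<String> splitOnHash(String line) {
        return Arrays.asList(line.split("#"));
    }

    public static void main(String[] args) {
        String line = "#food was testy. #drink lots of. #night was fab. #three #four";
        for (String piece : splitOnHash(line)) {
            // Brackets make the empty first piece and the trailing spaces visible.
            System.out.println("[" + piece + "]");
        }
    }
}
```

Because the line starts with a #, the first piece is an empty string, and every other piece runs up to the next #, which is why whole phrases come out instead of single words.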

You could extract the first word of each piece, but the right tool for this task is a regular expression (regex).

Here is how you can achieve it:

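    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String line = "#food was testy. #drink lots of. #night was fab. #three #four";

    // "#\\w+" matches a '#' followed by one or more word characters.
    Pattern pattern = Pattern.compile("#\\w+");

    Matcher matcher = pattern.matcher(line);
    // find() advances to the next match; group() returns the matched text.
    while (matcher.find()) {
        System.out.println(matcher.group());
    }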

The output is:

#food
#drink
#night
#three
#four

The magic happens in "#\w+".

  • # The pattern starts with a literal #.
  • \w Matches any letter (a-z, A-Z), digit (0-9), or underscore.
  • + Matches one or more consecutive \w characters.

So we search for a # followed by one or more letters, digits, or underscores.

We write '\\' for '\' in the Java source because a backslash starts an Escape Sequence inside a string literal.
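A small sketch of what the escaping does: the Java literal "#\\w+" is typed with five characters between the quotes, but the resulting string holds only four, and that four-character string is what the regex engine receives. The class name EscapeDemo and the helper regexSource are hypothetical, used only for this illustration.

```java
class EscapeDemo {
    // Hypothetical helper: returns the regex as the regex engine receives it.
    static String regexSource() {
        return "#\\w+"; // "\\" in the source becomes a single backslash
    }

    public static void main(String[] args) {
        String regex = regexSource();
        System.out.println(regex);          // prints #\w+
        System.out.println(regex.length()); // prints 4
    }
}
```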

You can play with it here.

find and group are explained here:

  • The find method scans the input sequence looking for the next subsequence that matches the pattern.
  • group() returns the input subsequence matched by the previous match.
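If you need the hashtags as a collection instead of printing them, the same find/group loop can fill a list. This is only a sketch: the class name HashtagExtractor and the method extract are names invented here, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HashtagExtractor {
    // Compiled once and reused for every call.
    private static final Pattern HASHTAG = Pattern.compile("#\\w+");

    // Hypothetical helper: returns every hashtag in order of appearance.
    static List<String> extract(String line) {
        List<String> tags = new ArrayList<>();
        Matcher matcher = HASHTAG.matcher(line);
        while (matcher.find()) {       // advance to the next match, if any
            tags.add(matcher.group()); // text of the match just found
        }
        return tags;
    }

    public static void main(String[] args) {
        String line = "#food was testy. #drink lots of. #night was fab. #three #four";
        System.out.println(extract(line)); // [#food, #drink, #night, #three, #four]
    }
}
```

A line without any hashtag simply yields an empty list, because find() returns false on the first call.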

[edit]

The use of \w can be an issue if you need to detect accented or non-Latin characters, because by default Java's \w only covers [a-zA-Z_0-9].

For example in:

"Bonjour mon #bébé #chat."

The matches will be:

  • #b
  • #chat

It depends on what you are willing to accept as a hashtag. But that is another question, and multiple discussions exist about it.

For example, if you want letters from any language, #\p{L}+ looks good, but it matches neither the underscore nor digits.
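Two possible ways around this, sketched under the assumption that a hashtag may contain Unicode letters, Unicode digits and the underscore: spell out the character class explicitly, or keep "#\\w+" and compile it with the standard Pattern.UNICODE_CHARACTER_CLASS flag. The class name UnicodeHashtags and the method extract are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeHashtags {
    // Explicit class: any Unicode letter, any Unicode number, or underscore.
    static final Pattern EXPLICIT = Pattern.compile("#[\\p{L}\\p{N}_]+");

    // Same "#\\w+" as before, with \w widened to Unicode word characters.
    static final Pattern UNICODE_W =
            Pattern.compile("#\\w+", Pattern.UNICODE_CHARACTER_CLASS);

    // Hypothetical helper: collects every match of the given pattern.
    static List<String> extract(Pattern pattern, String line) {
        List<String> tags = new ArrayList<>();
        Matcher matcher = pattern.matcher(line);
        while (matcher.find()) {
            tags.add(matcher.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        // \u00e9 is the accented e, escaped so the source file encoding does not matter.
        String line = "Bonjour mon #b\u00e9b\u00e9 #chat.";
        System.out.println(extract(EXPLICIT, line));
        System.out.println(extract(UNICODE_W, line));
    }
}
```

Both patterns find the full first hashtag and #chat on this input. Which characters belong in a hashtag is still your decision, so treat either class as a starting point.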

Orace answered Sep 28 '22 08:09
