
Java regex to extract text between tags

Tags: java, regex

I have a file with some custom tags and I'd like to write a regular expression to extract the string between the tags. For example, if my tag is:

[customtag]String I want to extract[/customtag] 

How would I write a regular expression to extract only the string between the tags? This code seems like a step in the right direction:

    Pattern p = Pattern.compile("[customtag](.+?)[/customtag]");
    Matcher m = p.matcher("[customtag]String I want to extract[/customtag]");

Not sure what to do next. Any ideas? Thanks.

b10hazard asked Jul 03 '11 02:07


People also ask

How do I get a substring between tags in a string?

The StringUtils class in the Apache Commons Lang library can be used to extract a substring from between tags in a String (substringBetween). The opening and closing tags can be the same or different. StringUtils also has a method that returns an array of Strings if multiple substrings are found (substringsBetween).

What is \b in Java regex?

The metacharacter “\b” matches a word boundary when used outside square brackets (a character class). Inside square brackets, many regex flavors instead treat it as the backspace character (0x08).


2 Answers

You're on the right track. Now you just need to extract the desired group, as follows:

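    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // DOTALL lets "." match line breaks, so the tag content may span several lines.
    final Pattern pattern = Pattern.compile("<tag>(.+?)</tag>", Pattern.DOTALL);
    final Matcher matcher = pattern.matcher("<tag>String I want to extract</tag>");
    // group(1) is only valid after a successful find(), so check the result first.
    if (matcher.find()) {
        System.out.println(matcher.group(1)); // Prints String I want to extract
    }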

If you want to extract multiple hits, try this:

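    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class TagValues {

        public static void main(String[] args) {
            final String str = "<tag>apple</tag><b>hello</b><tag>orange</tag><tag>pear</tag>";
            System.out.println(Arrays.toString(getTagValues(str).toArray())); // Prints [apple, orange, pear]
        }

        // Compiled once and reused for every call.
        private static final Pattern TAG_REGEX = Pattern.compile("<tag>(.+?)</tag>", Pattern.DOTALL);

        // Collects the text captured by group 1 for each match, in order of appearance.
        private static List<String> getTagValues(final String str) {
            final List<String> tagValues = new ArrayList<String>();
            final Matcher matcher = TAG_REGEX.matcher(str);
            while (matcher.find()) {
                tagValues.add(matcher.group(1));
            }
            return tagValues;
        }
    }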
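One thing to watch with the square-bracket tags from the question: in a regex, [ and ] delimit a character class, so a literal [customtag] has to be escaped before it will match as a tag. Below is a minimal sketch of one way to do that with Pattern.quote; the class name BracketTagExtractor and the helper extract are made up for illustration and are not part of the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BracketTagExtractor {

    // Hypothetical helper: returns every string found between [tagName] and [/tagName].
    static List<String> extract(String tagName, String input) {
        // Pattern.quote turns the delimiters into literals, so "[" and "]"
        // are no longer interpreted as a character class.
        String open = Pattern.quote("[" + tagName + "]");
        String close = Pattern.quote("[/" + tagName + "]");
        Pattern pattern = Pattern.compile(open + "(.+?)" + close, Pattern.DOTALL);

        List<String> values = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            values.add(matcher.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        String input = "[customtag]String I want to extract[/customtag]";
        System.out.println(extract("customtag", input)); // Prints [String I want to extract]
    }
}
```

Escaping by hand as "\\[customtag\\](.+?)\\[/customtag\\]" should behave the same way; Pattern.quote just avoids having to get the backslashes right.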

However, I agree that regular expressions are not the best answer here. I'd use XPath to find elements I'm interested in. See The Java XPath API for more info.
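For comparison, here is a rough sketch of the XPath route using the javax.xml classes that ship with the JDK. It assumes the input is well-formed XML with a single root element, so the sample string is wrapped in a <root> element that is not in the original example; the class name XPathTagValues is also just illustrative. It would not work on the square-bracket tags from the question, which are not XML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathTagValues {

    // Returns the text content of every <tag> element in the document.
    // Parsing and XPath failures are rethrown unchecked to keep the sketch short.
    static List<String> getTagValues(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // "//tag" selects all <tag> elements at any depth.
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//tag", doc, XPathConstants.NODESET);

            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Assumption: a <root> wrapper is added so the input has a single root element.
        String xml = "<root><tag>apple</tag><b>hello</b><tag>orange</tag><tag>pear</tag></root>";
        System.out.println(getTagValues(xml)); // Prints [apple, orange, pear]
    }
}
```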

hoipolloi answered Sep 21 '22 07:09


To be quite honest, regular expressions are not the best idea for this type of parsing. The regular expression you posted will probably work great for simple cases, but if things get more complex you are going to have huge problems (the same reason you can't reliably parse HTML with regular expressions). I know you probably don't want to hear this, and I know I didn't when I asked the same type of question, but string parsing became WAY more reliable for me after I stopped trying to use regular expressions for everything.
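As a rough idea of what plain string parsing could look like for the simple case in the question, here is a sketch that scans with indexOf and no regex at all. The class name PlainTagScanner and the method between are invented for this example, and it makes no attempt to handle nested or malformed tags beyond stopping at the first unclosed one.

```java
import java.util.ArrayList;
import java.util.List;

public class PlainTagScanner {

    // Hypothetical helper: collects every substring found between 'open' and 'close'.
    static List<String> between(String input, String open, String close) {
        List<String> values = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = input.indexOf(open, from);
            if (start < 0) {
                break; // no more opening tags
            }
            int contentStart = start + open.length();
            int end = input.indexOf(close, contentStart);
            if (end < 0) {
                break; // opening tag without a closing tag: stop scanning
            }
            values.add(input.substring(contentStart, end));
            from = end + close.length(); // continue after the closing tag
        }
        return values;
    }

    public static void main(String[] args) {
        String input = "[customtag]String I want to extract[/customtag]";
        System.out.println(between(input, "[customtag]", "[/customtag]"));
        // Prints [String I want to extract]
    }
}
```

Because the delimiters are treated as plain strings, there is nothing to escape, which sidesteps the square-bracket issue entirely.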

jTopas is an AWESOME tokenizer that makes it quite easy to write parsers by hand (I STRONGLY suggest jTopas over the standard Java Scanner and similar libraries). If you want to see jTopas in action, here are some parsers I wrote using jTopas to parse this type of file.

If you are parsing XML files, you should be using an XML parser library. Don't do it yourself unless you are just doing it for fun; there are plenty of proven options out there.

jdc0589 answered Sep 22 '22 07:09