
Understanding regex in Java: split("\t") vs split("\\t") - when do they both work, and when should they be used

Tags: java, regex, split

I have recently figured out that I haven't been using regex properly in my code. Given a tab-delimited string str, I have been using str.split("\t"). Now I realize that this is wrong, and that to match the tabs properly I should use str.split("\\t").

However, I stumbled upon this fact by pure chance while looking for regex patterns for something else. You see, the faulty code split("\t") has been working quite fine in my case, and now I am confused as to why it works if it's the wrong way to declare a regex for matching the tab character. Hence the question, for the sake of actually understanding how regex is handled in Java, instead of just copying the code into Eclipse and not really caring why it works...

In a similar fashion, I have come upon a piece of text which is not only tab-delimited but also comma-delimited. Put more clearly, the tab-delimited lists I am parsing sometimes include "compound" items which look like item1,item2,item3, and I would like to parse them as separate elements, for the sake of simplicity. In that case the appropriate regex should be line.split("[\\t,]"), or am I mistaken here as well?

Thanks in advance,

asked Sep 21 '10 16:09 by posdef



1 Answer

When using "\t", the Java compiler replaces the escape sequence \t with the character U+0009 (tab), so the regular expression parser receives a literal tab character. When using "\\t", the compiler replaces the escape sequence \\ in \\t with a single \, so the regular expression parser receives the two characters \t and interprets them itself as the character U+0009.
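To see what the regular expression parser actually receives in each case, here is a minimal sketch. The class name TabEscapeDemo and its helper methods are made up for illustration; only String.length() and Pattern.matches() come from the standard library.

```java
import java.util.regex.Pattern;

public class TabEscapeDemo {
    // Number of characters in the string that reaches the regex parser,
    // i.e. after the compiler has processed the string-literal escapes.
    static int patternLength(String regex) {
        return regex.length();
    }

    // True if the given regex matches an input consisting of one tab character.
    static boolean matchesTab(String regex) {
        return Pattern.matches(regex, "\t");
    }

    public static void main(String[] args) {
        System.out.println(patternLength("\t"));   // 1: a single tab character
        System.out.println(patternLength("\\t"));  // 2: a backslash followed by 't'
        System.out.println(matchesTab("\t"));      // true: literal tab in the pattern
        System.out.println(matchesTab("\\t"));     // true: regex escape \t
    }
}
```

The two patterns are different strings, but both describe the same single character to the regex parser.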

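So both notations are interpreted correctly and match a tab. The only difference is when the escape sequence is replaced with the corresponding character: by the Java compiler in the first case, by the regular expression parser in the second.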
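As a quick check of the above, the following sketch splits the same line with both notations and with the character class [\\t,] from the question. The class name TabSplitDemo, the helper method names, and the sample line are assumptions made for illustration.

```java
import java.util.Arrays;

public class TabSplitDemo {
    // The pattern reaching the regex parser is one literal tab character.
    static String[] splitWithLiteralTab(String line) {
        return line.split("\t");
    }

    // The pattern reaching the regex parser is the two-character escape \t.
    static String[] splitWithRegexEscape(String line) {
        return line.split("\\t");
    }

    // Character class matching either a tab or a comma.
    static String[] splitOnTabOrComma(String line) {
        return line.split("[\\t,]");
    }

    public static void main(String[] args) {
        String line = "a\tb,c\td"; // hypothetical sample: tab-delimited with one compound item

        System.out.println(Arrays.toString(splitWithLiteralTab(line)));  // [a, b,c, d]
        System.out.println(Arrays.toString(splitWithRegexEscape(line))); // [a, b,c, d]
        System.out.println(Arrays.toString(splitOnTabOrComma(line)));    // [a, b, c, d]
    }
}
```

On this input the first two calls return the same array, and the character class additionally breaks the compound item b,c into separate elements.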

answered Sep 21 '22 15:09 by Gumbo