
Splitting strings by punctuation and whitespace with regular expressions in Java

I have a text file that I read into a Java application, and I count the words in it line by line. Right now I am splitting the lines into words with

String.split("[\\p{Punct}\\s+]")

But I know I am missing out on some words from the text file. For example, the word "can't" should be divided into two words "can" and "t".

Commas and other punctuation should be completely ignored and treated as whitespace. I have been trying to understand how to form a more precise regular expression to do this, but I am a novice at regexes, so I need some help.

What could be a better regex for the purpose I have described?

Asked by Snorkelfarsan on Sep 12 '11 at 07:09



1 Answer

You have one small mistake in your regex. Try this:

String[] Res = Text.split("[\\p{Punct}\\s]+"); 

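In [\\p{Punct}\\s]+ the + has moved from inside the character class to the outside. Inside the class, + is just a literal plus sign (which \\p{Punct} already covers), not a quantifier, so consecutive delimiter characters are not combined into a single split and you get empty strings in the result.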
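To make the difference concrete, here is a minimal sketch that runs both patterns on the same input. The class and method names (`QuantifierPlacement`, `splitPlusInside`, `splitPlusOutside`) and the sample string are made up for illustration; only `String.split` and the two patterns come from the question and answer.

```java
import java.util.Arrays;

public class QuantifierPlacement {

    // '+' inside the brackets is a literal plus sign, so every single
    // delimiter character is its own split point.
    static String[] splitPlusInside(String text) {
        return text.split("[\\p{Punct}\\s+]");
    }

    // '+' after the brackets is a quantifier, so a run of delimiter
    // characters (such as ". " or ", ") counts as one split point.
    static String[] splitPlusOutside(String text) {
        return text.split("[\\p{Punct}\\s]+");
    }

    public static void main(String[] args) {
        String sample = "know. For example, the";
        // Contains empty strings where two delimiters are adjacent.
        System.out.println(Arrays.toString(splitPlusInside(sample)));
        // Contains only the four words.
        System.out.println(Arrays.toString(splitPlusOutside(sample)));
    }
}
```

With the first pattern, the ". " and ", " sequences each produce an empty string between the two delimiters, which is what throws off a word count based on the array length.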

So for this code

String Text = "But I know. For example, the word \"can\'t\" should";
String[] Res = Text.split("[\\p{Punct}\\s]+");
System.out.println(Res.length);
for (String s : Res) {
    System.out.println(s);
}

I get this result

10
But
I
know
For
example
the
word
can
t
should

Which should meet your requirement.
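One caveat for the line-by-line word count from the question: when a line starts with a delimiter (for example an opening quote), `split` returns an empty string as the first element, so the array length can overcount by one. A minimal sketch of a counter that skips empty tokens follows; `LineWordCounter` and `countWords` are hypothetical names, not part of the original code.

```java
public class LineWordCounter {

    // Counts the non-empty tokens of one line, using the pattern from the answer.
    static int countWords(String line) {
        int count = 0;
        for (String token : line.split("[\\p{Punct}\\s]+")) {
            // A leading delimiter yields an empty first token; skip it.
            if (!token.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // The line starts with a double quote, so the raw split array
        // has one more element than there are words.
        String line = "\"can't\" should";
        System.out.println(line.split("[\\p{Punct}\\s]+").length);
        System.out.println(countWords(line));
    }
}
```

Summing `countWords` over all lines of the file should then give the total, assuming empty and whitespace-only lines are meant to count as zero words.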

As an alternative you can use

String[] Res = Text.split("\\P{L}+"); 

\\P{L} matches any Unicode code point that does not have the property "Letter", so everything that is not a letter (including digits) acts as a separator.
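The two patterns are not interchangeable, and which one fits depends on the text. As a rough comparison, the sketch below runs both on a sample containing an accented letter, a number, and a typographic apostrophe (U+2019). By default, `\p{Punct}` in Java covers only ASCII punctuation, so it does not split on that apostrophe, while `\P{L}+` does but also discards the number. The class name, method names, and sample string are illustrative assumptions.

```java
import java.util.Arrays;

public class SplitAlternatives {

    // Splits on runs of ASCII punctuation and whitespace.
    static String[] splitOnPunctAndSpace(String text) {
        return text.split("[\\p{Punct}\\s]+");
    }

    // Splits on runs of anything that is not a Unicode letter.
    static String[] splitOnNonLetters(String text) {
        return text.split("\\P{L}+");
    }

    public static void main(String[] args) {
        // "café 42 can’t", written with escapes to avoid source-encoding issues.
        String sample = "caf\u00E9 42 can\u2019t";
        // Keeps "42" as a token and leaves "can’t" in one piece.
        System.out.println(Arrays.toString(splitOnPunctAndSpace(sample)));
        // Drops "42" and splits "can’t" into "can" and "t".
        System.out.println(Arrays.toString(splitOnNonLetters(sample)));
    }
}
```

If numbers should count as words, the first pattern is probably the better fit; if the input may contain non-ASCII punctuation, the second one is more forgiving.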

Answered by stema on Sep 20 '22 at 22:09