Java Counting # of occurrences of a word in a string

Tags:

regex

I have a large text file I am reading from, and I need to find out how many times certain words occur, for example the word "the". I'm processing the file line by line, and each line is a string.

I need to make sure that I only count legitimate occurrences of "the"; the "the" in "other" should not count. So I know I need to use regular expressions in some way. What I have tried so far is this:

numSpace += line.split("[^a-z]the[^a-z]").length;

I realize the regular expression may not be correct at the moment, but I also tried without it, just looking for occurrences of the word "the", and I get wrong numbers too. I was under the impression that this would split the string into an array, and that the number of pieces would tell me how many times the word is in the string. I would be grateful for any ideas.

Update: Given some ideas, I've come up with this:

numThe += line.split("[^a-zA-Z][Tt]he[^a-zA-Z]", -1).length - 1;

Though still getting some strange numbers. I was able to get an accurate general count (without the regular expression), now my issue is with the regexp.


asked Apr 14 '10 05:04

Doug

1 Answer

Using split to count isn't the most efficient, but if you insist on doing that, the proper way is this:

haystack.split(needle, -1).length - 1

If you don't pass a limit, split uses a limit of 0, which discards trailing empty strings and throws off your count.

From the API:

The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. [...] If n is zero then [...] trailing empty strings will be discarded.

You also need to subtract 1 from the length of the array, because N occurrences of the delimiter splits the string into N+1 parts.
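To make the effect of the limit concrete, here is a minimal sketch. The class name SplitCountDemo, the helper countBySplit, and the sample string are made up for illustration; only String.split itself comes from the answer.

```java
class SplitCountDemo {
    // Counts matches of a regex delimiter: N matches yield N + 1 parts,
    // provided trailing empty strings are kept (limit -1).
    static int countBySplit(String haystack, String needleRegex) {
        return haystack.split(needleRegex, -1).length - 1;
    }

    public static void main(String[] args) {
        String line = "a,b,,"; // three commas, two of them trailing

        // Default limit 0: trailing empty strings are dropped, parts are "a", "b".
        System.out.println(line.split(",").length - 1); // prints 1

        // Limit -1: parts are "a", "b", "", "".
        System.out.println(countBySplit(line, ","));    // prints 3
    }
}
```

The undercount only shows up when the delimiter matches at the very end of the string, which is probably why the numbers look right for some lines and wrong for others.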

As for the regex itself (i.e. the needle), you can put \b, the word boundary anchor, on both sides of the word. If the word may contain regex metacharacters (e.g. you want to count occurrences of "$US"), you may want to Pattern.quote it.
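A rough sketch of both suggestions, assuming a case-sensitive match. The class WordBoundaryDemo and the helpers countWord and countLiteral are hypothetical names used only for this example.

```java
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Whole-word count: \b on both sides, with the word quoted so it is
    // treated literally. Case-sensitive in this sketch.
    static int countWord(String line, String word) {
        String needle = "\\b" + Pattern.quote(word) + "\\b";
        return line.split(needle, -1).length - 1;
    }

    // Literal count with no boundary check. Note that \b only matches
    // between a word character and a non-word character, so wrapping
    // "$US" in \b would require a word character right before the "$".
    static int countLiteral(String line, String literal) {
        return line.split(Pattern.quote(literal), -1).length - 1;
    }

    public static void main(String[] args) {
        System.out.println(countWord("the other the", "the"));       // prints 2
        System.out.println(countLiteral("$US 5 and $US 10", "$US")); // prints 2
    }
}
```

Without Pattern.quote, the "$" in "$US" would be read as an end-of-input anchor and the pattern would not match the literal text.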

I've come up with this:
numThe += line.split("[^a-zA-Z][Tt]he[^a-zA-Z]", -1).length - 1;
Though still getting some strange numbers. I was able to get an accurate general count (without the regular expression), now my issue is with the regexp.

Now the issue is that you're not counting [Tt]he that appears as the first or last word, because the regex says that it has to be preceded/followed by some character, something that matches [^a-zA-Z] (that is, your match must be of length 5!). You're not allowing the case where there isn't a character at all!

You can try something like this instead:

"(^|[^a-zA-Z])[Tt]he([^a-zA-Z]|$)"

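This isn't the most concise solution, and because each match also consumes the characters on either side of the word, it can miss the second of two occurrences separated by a single character (e.g. "the the"), but it does handle the first and last word.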

Something like this (using negative lookarounds) also works:

"(?<![a-zA-Z])[Tt]he(?![a-zA-Z])"

This has the benefit of matching just [Tt]he, without any extra characters around it like your previous solution did. This is relevant in case you actually want to process the tokens returned by split, because the delimiter in this case isn't "stealing" anything from the tokens.
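The difference between the two patterns can be checked with a small sketch. The lookahead is written here as (?![a-zA-Z]), i.e. "not followed by a letter"; the class name LookaroundDemo, the constant names, and the sample strings are assumptions made for the example.

```java
class LookaroundDemo {
    // Consumes one neighbouring character on each side (or an anchor).
    static final String CONSUMING = "(^|[^a-zA-Z])[Tt]he([^a-zA-Z]|$)";
    // Zero-width checks: the match itself is just [Tt]he.
    static final String LOOKAROUND = "(?<![a-zA-Z])[Tt]he(?![a-zA-Z])";

    static int countBySplit(String line, String regex) {
        return line.split(regex, -1).length - 1;
    }

    public static void main(String[] args) {
        // Tokens: the consuming pattern eats the surrounding spaces.
        String[] a = "See the cat".split(CONSUMING, -1);  // "See", "cat"
        String[] b = "See the cat".split(LOOKAROUND, -1); // "See ", " cat"
        System.out.println(a[0].length() + " " + b[0].length()); // prints 3 4

        // Two occurrences separated by a single space: the space is consumed
        // by the first match, so the consuming pattern misses the second one.
        System.out.println(countBySplit("The the end", CONSUMING));  // prints 1
        System.out.println(countBySplit("The the end", LOOKAROUND)); // prints 2
    }
}
```

So the lookaround version is worth preferring not only when the tokens matter, but also when the same word can appear twice in a row.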

Non-`split`

Though using split to count is rather convenient, it isn't the most efficient (e.g. it does all kinds of work to build the strings that you then discard). Since, as you said, you're counting line by line, the pattern also gets recompiled and thrown away for every line.

A more efficient way is to keep the same regex, compile it once with Pattern.compile, and count with the usual while (matcher.find()) count++; loop.
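A minimal sketch of that approach, assuming the lookaround regex with the lookahead written as (?![a-zA-Z]). The class MatcherCountDemo, the helper names, and the sample lines are hypothetical; in the real program the lines would come from the file being read.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherCountDemo {
    // Compiled once and reused for every line.
    private static final Pattern THE =
            Pattern.compile("(?<![a-zA-Z])[Tt]he(?![a-zA-Z])");

    // Counts matches in a single line without building any substrings.
    static int countInLine(String line) {
        Matcher matcher = THE.matcher(line);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    // Sums the per-line counts, mirroring a line-by-line file read.
    static int countInLines(Iterable<String> lines) {
        int total = 0;
        for (String line : lines) {
            total += countInLine(line);
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "The cat and the other cat", "the the", "breathe");
        System.out.println(countInLines(lines)); // prints 4
    }
}
```

Matching one line at a time means an occurrence can never straddle a line break, which is fine for single words.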

answered Sep 24 '22 21:09

polygenelubricants

