Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to find duplicates inside a string?

Tags:

java

string

I want to find out if a string that is comma separated contains only the same values:

test,asd,123,test
test,test,test

Here the 2nd string contains only the word "test". I'd like to identify these strings.

As I want to iterate over 100GB, performance matters a lot.

Which might be the fastest way of determining a boolean result if the string contains only one value repeatedly?

public static boolean stringHasOneValue(String string) {
   String value = null;
   for (split : string.split(",")) {
      if (value == null) {
         value = split;
      } else {
         if (!value.equals(split)) return false;
      }
   }
   return true;
}
like image 553
membersound Avatar asked Nov 11 '15 16:11

membersound


1 Answers

No need to split the string at all, in fact no need for any string manipulation.

  • Find the first word (indexOf comma).
  • Check the remaining string length is an exact multiple of that word+the separating comma. (i.e. length-1 % (foundLength+1)==0)
  • Loop through the remainder of the string checking the found word against each portion of the string. Just keep two indexes into the same string and move them both through it. Make sure you check the commas too (i.e. bob,bob,bob matches bob,bobabob does not).
  • As assylias pointed out there is no need to reset the pointers, just let them run through the String and compare the 1st with 2nd, 2nd with 3rd, etc.

Example loop, you will need to tweak the exact position of startPos to point to the first character after the first comma:

for (int i=startPos;i<str.length();i++) {
   if (str.charAt(i) != str.charAt(i-startPos)) {
      return false;
   }
}
return true;

You won't be able to do it much faster than this given the format the incoming data is arriving in but you can do it with a single linear scan. The length check will eliminate a lot of mismatched cases immediately so is a simple optimization.

like image 111
Tim B Avatar answered Oct 01 '22 14:10

Tim B