
Handling large String lists in java

I've got a task where I have to go through several billion string lines and check whether each of them is unique. The lines cannot all fit into the RAM of the PC. Also, the number of lines is likely to exceed Integer.MAX_VALUE.

I'm assuming that the best way to handle this amount of data is to put the hash code of each string into some sort of hash table.

So, here are my questions:

  1. What should I use instead of String.hashCode()? (the return value is int, but I probably will need long)
  2. What is the fastest way/framework for working with lists of this size? What I mostly need is the ability to quickly check whether the list contains an element or not.
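(For illustration of question 1: a 64-bit string hash could be sketched as below. This is not from the question itself; the constants are the standard FNV-1a 64-bit parameters, and hashing UTF-16 code units rather than bytes is a simplification.)

```java
// A minimal sketch of a 64-bit string hash (FNV-1a), as one possible
// long-valued replacement for the int-valued String.hashCode().
// Note: FNV-1a is defined over bytes; hashing UTF-16 code units here
// is a simplification for illustration.
public class LongHash {
    static final long FNV_OFFSET = 0xcbf29ce484222325L;
    static final long FNV_PRIME  = 0x100000001b3L;

    static long hash64(String s) {
        long h = FNV_OFFSET;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);   // fold in the next code unit
            h *= FNV_PRIME;     // mix via the FNV prime
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(hash64("hello")));
    }
}
```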
Arsen Zahray asked Oct 01 '11

2 Answers

You are overthinking the problem. This can be done very simply with a single MySQL table, which stores the data on disk instead of holding everything in memory. That much data was never meant to be efficiently handled by a standalone application.

CREATE TABLE TONS_OF_STRINGS
(
  unique_string varchar(255) NOT NULL,
  UNIQUE (unique_string)
)

Just loop through the values (assuming a comma-separated list here) and try to insert each token. Each failed insert is a duplicate.

import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Scanner;

public static void main(String[] args) throws Exception {
  Connection con = DriverManager.getConnection("jdbc:mysql://localhost/database", "username", "password");
  FileReader file = new FileReader("SomeGiantFile.csv");
  Scanner scan = new Scanner(file);
  scan.useDelimiter(",");
  // Prepare the statement once, outside the loop
  PreparedStatement ps = con.prepareStatement("INSERT INTO TONS_OF_STRINGS (unique_string) VALUES (?)");
  while (scan.hasNext()) {
    String token = scan.next();
    try {
      ps.setString(1, token);
      ps.executeUpdate();
    } catch (SQLException e) {
      // The UNIQUE constraint rejects the insert, so a failed insert means a duplicate
      System.out.println("Found duplicate: " + token);
    }
  }
  scan.close();
  con.close();
  System.out.println("Well that was easy, I'm all done!");
}

Don't forget to clear the table when you are done though; that's a lot of data.

rwyland answered Nov 15 '22


It is not sufficient to simply store 32 or 64 bit hashcodes because two distinct strings (out of a few billion) can easily have the same hashcode. Once you have two strings with the same hashcode, you need to compare the actual strings to see if they are actually equal.
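To see why hashcodes alone aren't enough: "Aa" and "BB" are a classic pair of distinct Java strings with the same hashCode(), because String.hashCode() folds each character in as 31*h + c:

```java
// Demonstrates a real collision in String.hashCode():
// "Aa": 31*'A' + 'a' = 31*65 + 97 = 2112
// "BB": 31*'B' + 'B' = 31*66 + 66 = 2112
public class HashCollision {
    public static void main(String[] args) {
        System.out.println("Aa".hashCode());   // 2112
        System.out.println("BB".hashCode());   // 2112
        System.out.println("Aa".equals("BB")); // false: same hash, different strings
    }
}
```

With billions of strings, such collisions are not just possible but expected, so the actual strings must be compared whenever hashes match.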

Here's the way I'd solve this problem:

  1. Read the file / stream of strings:

    1. Read each line

    2. Calculate the hash code for the line

    3. Write the hashcode and the string to a temporary file with a suitable field separator in between

  2. Use a decent external sort program to sort the temporary file using the hashcode field as the primary sort key and the string field as the secondary sort key.

  3. Read the temporary file a line at a time. If two successive lines have the same hashcode field and different string fields then you've found a duplicate string.

Note: This approach will work equally well with 32 or 64 bit hashcodes.
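The final scan in step 3 could be sketched like this (a minimal sketch, assuming each temp-file line has the form `<hashcode>\t<string>`; the tab separator and file layout are illustrative choices, not prescribed by the answer):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Scans the externally sorted temp file one line at a time.
// Because the file is sorted by (hashcode, string), duplicate strings
// show up as identical adjacent lines.
public class DuplicateScan {
    public static List<String> findDuplicates(Path sortedFile) throws IOException {
        List<String> duplicates = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(sortedFile)) {
            String prev = null;
            String line;
            while ((line = in.readLine()) != null) {
                if (line.equals(prev)) {
                    // Same hashcode AND same string -> a true duplicate.
                    // (A string occurring N times is reported N-1 times here.)
                    duplicates.add(line.substring(line.indexOf('\t') + 1));
                }
                prev = line;
            }
        }
        return duplicates;
    }

    public static void main(String[] args) throws IOException {
        for (String s : findDuplicates(Path.of(args[0]))) {
            System.out.println("Duplicate: " + s);
        }
    }
}
```

Only one line of comparison state is held at a time, so memory use stays constant no matter how large the input is.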

Stephen C answered Nov 15 '22