I've got a task, where I've got to go through several billion string lines and check, whether each of those is unique. All the lines themselves cannot be accommodated within the RAM memory of the PC. Also, the number of lines is likely to be larger than Integer.MAX_VALUE.
I'm assuming that the best way to handle this amount of data is to put hash codes of each of the strings into some sort of HashTable.
So, here are my questions:
String.hashCode()
? (the return value is int, but I probably will need long)You are over thinking the problem, this can all be done very simply with one MySQL table which saves data to the disk instead of holding everything in memory. That much data was never meant to be efficiently handled by a standalone application.
CREATE TABLE TONS_OF_STRINGS
(
unique_string varchar(255) NOT NULL,
UNIQUE (unique_string)
)
Just loop through the values (assuming a comma separated list here) and try to insert each token. Each failed token is a duplicate.
public static void main(args) {
Connection con = DriverManager.getConnection("jdbc:mysql://localhost/database","username","password");
FileReader file = new FileReader("SomeGiantFile.csv");
Scanner scan = new Scanner(file);
scan.useDelimiter(",");
String token;
while ( scan.hasNext() ) {
token = scan.next();
try {
PreparedStatement ps = con.prepareStatement("Insert into TONS_OF_STRING (UNIQUE_STRING) values (?)");
ps.setString(1, token);
ps.executeUpdate();
} catch (SQLException e) {
System.out.println("Found duplicate: " + token );
}
}
con.close();
System.out.println("Well that was easy, I'm all done!");
return 0;
}
Don't forget to clear the table when you are done though, thats a lot of data.
It is not sufficient to simply store 32 or 64 bit hashcodes because two distinct strings (out of a few billion) can easily have the same hashcode. Once you have two strings with the same hashcode, you need to compare the actual strings to see if they are actually equal.
Here's the way I'd solve this problem:
Read the file / stream of strings:
Read each line
Calculate the hash code for the line
Write the hashcode and the string to a temporary file with a suitable field separator in between
Use a decent external sort program to sort the temporary file using the hashcode field as the primary sort key and the string field as the secondary sort key.
Read the temporary file a line at a time. If two successive lines have the same hashcode field and different string fields then you've found a duplicate string.
Note: This approach will work equally well with 32 or 64 bit hashcodes.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With