Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

construct a unique number for a string in java

We have a requirement of reading/writing more than 10 million strings into a file. Also we do not want duplicates in the file. Since the strings would be flushed to a file as soon as they are read we are not maintaining it in memory.

We cannot use hashcode because of collisions in the hash code due to which we might miss a string as duplicate. Two other approaches i found in my googling:

1.Use a message digest algorithm like MD5 - but it might be too costly to calculate and store.

2.Use a checksum algorithm. [i am not sure if this produces a unique key for a string- can someone please confirm]

Is there any other approach avaiable. Thanks.

like image 843
praveen Avatar asked Jun 14 '10 13:06

praveen


People also ask

How do you generate unique strings every time?

Using randomUUID() java. util. UUID is another Java class that can be used to generate a random string. It offers a static randomUUID() method that returns a random alphanumeric string of 32 characters.

How do you create a unique string in Java?

Method 1: Using Math. random() Here the function getAlphaNumericString(n) generates a random number of length a string. This number is an index of a Character and this Character is appended in temporary local variable sb. In the end sb is returned.


3 Answers

If you're okay with a microscopical risk of collisions, you could use some hash function such as MD5 as you suggest, and rely on the hashes.

Another alternative, possibly with a larger memory footprint, is to store the, already encountered strings, in a trie (a special type of tree).


Update: Yet another alternative, would be to use a Bloom filter. This however, still relies on hashing but can be adjusted to have an arbitrarily small probability of collisions.

like image 93
aioobe Avatar answered Oct 23 '22 18:10

aioobe


Storing 10 million strings in memory is indeed a lot, so I understand the reason to write it to file immediately instead of storing in e.g. a TreeSet<String> first, but where would you like to store the 10 million unique numerical keys which you want to compare with? When you want to keep it unique and numerical (which has much littler base/radix than letters), you can't make the key shorter than the string itself already is, so you won't save any memory. Or maybe at highest with data compression like GZIP, but this would only add a lot of overhead. MD5 is also inappropriate since two different strings can yield the same hash.

I really see no better solution for this than using a decent RDBMS (SQL database) wherein you set the column as UNIQUE and handle the constraint violation accordingly. A RDBMS is highly optimized for this kind of tasks.

If you really can't consider a database, then you need to re-read the file for any existing entry before the write/flush. Maybe not very fast, but certainly memory efficient.

like image 27
BalusC Avatar answered Oct 23 '22 18:10

BalusC


There is no way to make a function that would produce a unique key for a string, which is shorter than that string.
There are data structures which can solve your task. B-tree might fit if you data is large enough. Depending on the nature of your input, there might be more effective ways.

like image 1
unbeli Avatar answered Oct 23 '22 18:10

unbeli