Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the fastest way to count the unique elements in a list of billion elements?

My problem is not usual. Let's imagine few billions of strings. Strings are usually less then 15 characters. In this list I need to find out the number of the unique elements.

First of all, what object should I use? You shouldn't forget if I add a new element I have to check if it is already existing in the list. It is not a problem in the beginning, but after few millions of words it can really slow down the process.

That's why I thought that Hashtable would be the ideal for this task because checking the list is ideally only log(1). Unfortunately a single object in .net can be only 2GB.

Next step will be to implement a custom hashtable which contains a list of 2GB hashtables.

I am wondering maybe some of you know a better solution. (Computer has extremely high specification.)

like image 442
Andras Csehi Avatar asked Jan 12 '10 22:01

Andras Csehi


People also ask

How do you count unique values in a list?

You can use the combination of the SUM and COUNTIF functions to count unique values in Excel. The syntax for this combined formula is = SUM(IF(1/COUNTIF(data, data)=1,1,0)). Here the COUNTIF formula counts the number of times each value in the range appears.

How do you count unique values in Java?

Method 1 : To check the status of visited elements create a array of size n and a variable say count_dis=0 to count the distinct element. Run a loop from index 0 to n and check if (visited[i]==1) then skip that element. Check if(arr[i]==arr[j]), then set visited[j]=1.


1 Answers

I would skip the data structures exercise and just use an SQL database. Why write another custom data structure that you have to analyze and debug, just use a database. They are really good at answering queries like this.

like image 82
D.Shawley Avatar answered Oct 14 '22 21:10

D.Shawley