Print unique lines of a 10GB file

Tags: perl

I have a 10GB file with 200 million lines. I need to get the unique lines of this file.

My code:

 while(<>) {
     chomp;
     $tmp{$_}=1;
 }
 #print...

I only have 2GB of memory. How can I solve this problem?

asked Apr 05 '12 by John Water

3 Answers

As I commented on David's answer, a database is the way to go, but a nice one might be DBM::Deep, since it's pure Perl and easy to install and use; it's essentially a Perl hash tied to a file.

use DBM::Deep;
tie my %lines, 'DBM::Deep', 'data.db';

while(<>) {
    chomp;
    $lines{$_}=1;
}

This is basically what you already had, but the hash is now a database tied to a file (here data.db) rather than kept in memory.
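
If you then need to emit the unique lines (the #print step from the question), you can walk the tied hash with each, which steps through the keys one at a time instead of pulling them all into memory at once. A rough, untested sketch:

# After the input loop: each key of the tied hash is one unique line.
while (my ($line) = each %lines) {
    print "$line\n";
}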

answered by Joel Berger

If you don't care about preserving order, I bet the following is faster than the previously posted solutions (e.g. DBM::Deep):

sort -u file
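
If the rest of the processing happens in Perl, one option is to read the deduplicated stream back through a pipe; sort does an external merge sort and spills to temporary files, so memory use stays bounded. A sketch, assuming a Unix-like system with sort on the PATH (bigfile.txt is a made-up name):

#!/usr/bin/env perl
use strict;
use warnings;

# Let the external sort do the heavy lifting; it spills to temp files,
# so it never needs to hold the whole 10GB in memory.
open my $uniq, '-|', 'sort', '-u', 'bigfile.txt'
    or die "Cannot run sort: $!";

print while <$uniq>;

close $uniq;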

answered by ikegami


In most cases, you could store the line as a key in a hash. However, when you get this big, this really isn't very efficient. In this case, you'd be better off using a database.

One thing to try is Berkeley DB (BDB), which used to be included with Unix. Now, it's apparently owned by Oracle.

Perl can use the BerkeleyDB module to talk with a BDB database. In fact, you can even tie a Perl hash to a BDB database. Once this is done, you can use normal Perl hashes to access and modify the database.

BDB is pretty robust. Bitcoin uses it, and so does SpamAssassin, so it is very possible that it can handle the type of database you have to create in order to find duplicate lines. If you already have BDB installed, writing a program to handle your task shouldn't take that long, and if it doesn't work, you won't have wasted too much time on it.
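
For the task in the question, the change from the original code is roughly this (a sketch, not tested; unique_lines.db is just a made-up file name, and it assumes the BerkeleyDB module is installed):

use strict;
use warnings;
use BerkeleyDB;

# Tie the hash to an on-disk BDB file so the keys live on disk, not in RAM.
tie my %tmp, 'BerkeleyDB::Hash',
    -Filename => 'unique_lines.db',
    -Flags    => DB_CREATE
    or die "Cannot open BDB file: $BerkeleyDB::Error\n";

while (<>) {
    chomp;
    $tmp{$_} = 1;
}

# The unique lines are now the keys of the tied hash.
while (my ($line) = each %tmp) {
    print "$line\n";
}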

The only other thing I can think of is using an SQL database, which would be slower and much more complex.
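
For what that might look like, here is a hedged sketch using DBI with DBD::SQLite (lines.db and the table name are made up); the database enforces uniqueness on disk rather than in memory:

#!/usr/bin/env perl
use strict;
use warnings;
use DBI;

# Store seen lines in an on-disk SQLite table with a uniqueness constraint
# instead of an in-memory hash.
my $dbh = DBI->connect("dbi:SQLite:dbname=lines.db", "", "",
    { RaiseError => 1, AutoCommit => 0 });

$dbh->do("CREATE TABLE IF NOT EXISTS seen (line TEXT PRIMARY KEY)");
my $insert = $dbh->prepare("INSERT OR IGNORE INTO seen (line) VALUES (?)");

while (my $line = <>) {
    chomp $line;
    $insert->execute($line);
}
$dbh->commit;

# Stream the unique lines back out of the table.
my $select = $dbh->prepare("SELECT line FROM seen");
$select->execute;
while (my ($line) = $select->fetchrow_array) {
    print "$line\n";
}
$dbh->disconnect;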


Addendum

Maybe I'm overthinking this...

I decided to try a simple hash. Here's my program:

#! /usr/bin/env perl
use strict;
use warnings;
use feature qw(say);
use autodie;

use constant DIR => "/usr/share/dict";

use constant WORD_LIST => qw(words web2a propernames connectives);

my %word_hash;
for my $count (1..100) {
    for my $file (WORD_LIST) {
        open my $file_fh, "<", DIR . "/$file";
        while (my $word = <$file_fh>) {
            chomp $word;
            $word_hash{"$file-$word-$count"} = $word;
        }
    }
}

The files read in contain a total of about 313,000 lines. I do this 100 times to get a hash with 31,300,000 keys in it. It is about as inefficient as it can be. Each and every key will be unique. The amount of memory will be massive. Yet...

It worked. It took about 10 minutes to run despite the massive inefficiencies in the program, and it maxed out at around 6 gigabytes. However, most of that was in virtual memory. Strangely, even though it was running, gobbling memory, and taking 98% of the CPU, my system didn't really slow down all that much. I guess the question really is what type of performance are you expecting? If taking about 10 minutes to run isn't that much of an issue for you, and you don't expect this program to be used that often, then maybe go for simplicity and use a simple hash.

I'm now downloading BDB from Oracle, compiling it, and installing it. I'll try the same program using BDB and see what happens.


Using a BDB Database

After doing the work, I think if you have MySQL installed, using Perl DBI would be easier. I had to:

  • Download Berkeley DB from Oracle; you need an Oracle account. I didn't remember my password and told it to email me. I never got the email, and I spent 10 minutes trying to remember my email address.
  • Once downloaded, it has to be compiled. I found directions for compiling it on the Mac, and they seemed pretty straightforward.
  • Running CPAN crashed. It turns out that CPAN was looking for /usr/local/BerkeleyDB, but it was installed as /usr/local/BerkeleyDB.5.3. Creating a symlink fixed the issue.

All told, it took about half an hour to get BerkeleyDB installed. Once installed, modifying my program was fairly straightforward:

#! /usr/bin/env perl
use strict;
use warnings;
use feature qw(say);
use autodie;

use BerkeleyDB;

use constant {
    DIR       => "/usr/share/dict",
    BDB_FILE  => "bdb_file",
};

use constant WORD_LIST => qw(words web2a propernames connectives);

unlink BDB_FILE if -f BDB_FILE;

our %word_hash;
tie %word_hash, "BerkeleyDB::Hash",
    -Filename => BDB_FILE,
    -Flags    => DB_CREATE
        or die qq(Cannot create BerkeleyDB file ") . BDB_FILE . qq("\n);

for my $count (1..10) {
    for my $file (WORD_LIST) {
        open my $file_fh, "<", DIR . "/$file";
        while (my $word = <$file_fh>) {
            chomp $word;
            $word_hash{"$file-$word-$count"} = $word;
        }
    }
}

All I had to do was add a few lines.

Running the program was a disappointment. It wasn't faster; it was much, much slower. It took over two minutes, while using a pure hash took a mere 13 seconds.

However, it used a lot less memory. While the old program gobbled gigabytes, the BDB version barely used a megabyte. Instead, it created a 20MB database file.

But, in these days of VM and cheap memory, did it accomplish anything? In the old days before virtual memory and good memory handling, a program would crash your computer if it used all of the memory (and memory was measured in megabytes and not gigabytes). Now, if your program wants more memory than is available, it simply is given virtual memory.

So, in the end, using a Berkeley database is not a good solution. Whatever I saved in programming time by using tie was wasted with the installation process. And, it was slow.

Using BDB simply meant using a BDB file instead of memory. A modern OS will do the same thing, and faster. Why do the work when the OS will handle it for you?

The only reason to use a database is if your system really doesn't have the required resources. 200 million lines is a big file, but a modern OS will probably be okay with it. If your system really doesn't have the resources, use an SQL database on another system, not a BDB database.

answered by David W.