
What is an efficient way to process large text files?

I have two files:
1- with 1,400,000 lines (records) --- 14 MB
2- with 16,000,000 lines (records) --- 170 MB

I want to find out whether each record (line) in file 1 is also in file 2 or not.

I developed a Java app that does the following: read file 1 line by line and pass each line to a method that loops through file 2.

Here is my code:

public boolean hasIDin(String bioid) throws Exception {
    BufferedReader br = new BufferedReader(new FileReader("C://AllIDs.txt"));
    long bid = Long.parseLong(bioid);
    String thisLine;
    while ((thisLine = br.readLine()) != null) {
        if (Long.parseLong(thisLine) == bid)
            return true;
    }
    return false;
}

public void getMBD() throws Exception {
    BufferedReader br = new BufferedReader(new FileReader("C://DIDs.txt"));
    OutputStream os = new FileOutputStream("C://MBD.txt");
    PrintWriter pr = new PrintWriter(os);
    String thisLine;
    int count = 1;
    while ((thisLine = br.readLine()) != null) {
        String bioid = thisLine;
        System.out.println(count);
        if (!hasIDin(bioid))
            pr.println(bioid);
        count++;
    }
    pr.close();
}

When I run it, it looks like it will take more than 1,944 hours to complete, since each line takes about 5 seconds to process (1,400,000 lines × 5 s ≈ 7,000,000 s ≈ 1,944 hours, roughly 81 days). That is about three months!

Are there any ideas to get this done in much less time?

Thanks in advance.

Abu Muhammad asked Dec 16 '22

2 Answers

Why don't you:

  • read all the lines of one of the files into a set (the code below loads the smaller DID.txt); a regular Set works, but TLongHashSet (a primitive long set from the GNU Trove library) is more memory-efficient.
  • for each line of the other file, see if it is in the set.

Here is a tuned implementation which prints the following and uses < 64 MB.

Generating 1400000 ids to /tmp/DID.txt
Generating 16000000 ids to /tmp/AllIDs.txt
Reading ids in /tmp/DID.txt
Reading ids in /tmp/AllIDs.txt
Took 8794 ms to find 294330 valid ids

Code

// TLongHashSet is a primitive long hash set from the GNU Trove library (gnu.trove).
public static void main(String... args) throws IOException {
    generateFile("/tmp/DID.txt", 1400000);
    generateFile("/tmp/AllIDs.txt", 16000000);

    long start = System.currentTimeMillis();
    TLongHashSet did = readLongs("/tmp/DID.txt");
    TLongHashSet validIDS = readLongsUnion("/tmp/AllIDs.txt",did);

    long time = System.currentTimeMillis() - start;
    System.out.println("Took "+ time+" ms to find "+ validIDS.size()+" valid ids");
}

// Reads every id in the file into a primitive long set.
private static TLongHashSet readLongs(String filename) throws IOException {
    System.out.println("Reading ids in "+filename);
    BufferedReader br = new BufferedReader(new FileReader(filename), 128*1024);
    TLongHashSet ids = new TLongHashSet();
    for(String line; (line = br.readLine())!=null;)
        ids.add(Long.parseLong(line));
    br.close();
    return ids;
}

// Keeps only the ids that are also present in validSet, i.e. the intersection of the two files.
private static TLongHashSet readLongsUnion(String filename, TLongHashSet validSet) throws IOException {
    System.out.println("Reading ids in "+filename);
    BufferedReader br = new BufferedReader(new FileReader(filename), 128*1024);
    TLongHashSet ids = new TLongHashSet();
    for(String line; (line = br.readLine())!=null;) {
        long val = Long.parseLong(line);
        if (validSet.contains(val))
            ids.add(val);
    }
    br.close();
    return ids;
}

// Generates a test file with the given number of random ids, one per line.
private static void generateFile(String filename, int number) throws IOException {
    System.out.println("Generating "+number+" ids to "+filename);
    PrintWriter pw = new PrintWriter(new BufferedWriter(new FileWriter(filename), 128*1024));
    Random rand = new Random();
    for(int i=0;i<number;i++)
        pw.println(rand.nextInt(1<<26));
    pw.close();
}
Peter Lawrey answered Jan 07 '23


170 MB + 14 MB are not such huge files. My suggestion is to load the smaller file into a java.util.Map, parse the bigger file line by line (record by record), and check whether the current line is present in that Map.
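A minimal sketch of that idea, assuming the file paths from the question; a plain java.util.HashSet of the ids stands in for the Map here, since only membership matters, and the class name is illustrative:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.HashSet;
import java.util.Set;

public class FindMissingIds {
    public static void main(String[] args) throws IOException {
        // Load the smaller file (file 1, ~1.4M ids) into memory once.
        Set<Long> missing = new HashSet<>();
        try (BufferedReader br = new BufferedReader(new FileReader("C://DIDs.txt"))) {
            for (String line; (line = br.readLine()) != null; )
                missing.add(Long.parseLong(line.trim()));
        }

        // Stream the bigger file (file 2, ~16M ids) once and cross off every id we see.
        try (BufferedReader br = new BufferedReader(new FileReader("C://AllIDs.txt"))) {
            for (String line; (line = br.readLine()) != null; )
                missing.remove(Long.parseLong(line.trim()));
        }

        // Whatever is left never appeared in file 2.
        try (PrintWriter pw = new PrintWriter(new FileWriter("C://MBD.txt"))) {
            for (Long id : missing)
                pw.println(id);
        }
    }
}

Each file is read exactly once, so the work is on the order of n + m set operations instead of the roughly n × m line comparisons of the original nested loop.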

P.S. The question looks like something trivial in terms of an RDBMS - maybe it's worth using one?
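For completeness, a hypothetical sketch of the RDBMS route using plain JDBC; the embedded H2 database, the jdbc:h2:mem:ids URL, and the table/column names are assumptions made only for illustration, and any database with a JDBC driver would do:

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class RdbmsSketch {
    public static void main(String[] args) throws Exception {
        // In-memory database for the sketch; assumes an H2 driver on the classpath.
        try (Connection con = DriverManager.getConnection("jdbc:h2:mem:ids")) {
            try (Statement st = con.createStatement()) {
                st.execute("CREATE TABLE did (id BIGINT)");
                st.execute("CREATE TABLE allids (id BIGINT)");
            }
            load(con, "C://DIDs.txt", "INSERT INTO did VALUES (?)");
            load(con, "C://AllIDs.txt", "INSERT INTO allids VALUES (?)");
            try (Statement st = con.createStatement()) {
                st.execute("CREATE INDEX idx_allids ON allids(id)");
            }
            // Ids from file 1 that never appear in file 2.
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(
                         "SELECT d.id FROM did d LEFT JOIN allids a ON d.id = a.id WHERE a.id IS NULL")) {
                while (rs.next())
                    System.out.println(rs.getLong(1));
            }
        }
    }

    // Bulk-loads one id per line from a text file into the given table.
    private static void load(Connection con, String file, String insertSql) throws Exception {
        try (BufferedReader br = new BufferedReader(new FileReader(file));
             PreparedStatement ps = con.prepareStatement(insertSql)) {
            int n = 0;
            for (String line; (line = br.readLine()) != null; ) {
                ps.setLong(1, Long.parseLong(line.trim()));
                ps.addBatch();
                if (++n % 50_000 == 0)   // flush in chunks so the batch stays small
                    ps.executeBatch();
            }
            ps.executeBatch();
        }
    }
}

The LEFT JOIN ... IS NULL query returns exactly the ids from file 1 that are missing from file 2, which is what the question's MBD.txt is meant to contain.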

Vadim answered Jan 07 '23