
Big text file processing

I need to implement lazy loading in Mathematica. I have a 600 MB CSV text file that I need to process. The file contains many duplicated records:

1;0;0;13;6
1;0;0;13;6
..........
2;0;0;13;6
2;0;0;13;6
..........
etc.

So instead of loading them all into memory, I'd like to create a list that pairs each distinct record with the number of times it occurs in the file:

{{10000,{1,0,0,13,6}}, {20000,{2,0,0,13,6}}, ...}

I couldn't find a way to do this with the Import function. I'm looking for something like

Import["my_file.csv", "CSV", myProcessingFunction]

where myProcessingFunction would take one record at a time and build up the dataset. Is it possible to do this with Import or any other Mathematica function?

Max, asked Nov 26 '10 12:11


2 Answers

If it were me, I'd probably do this with Unix sort and uniq, but since you ask about Mathematica: I'd use ReadList[] to read blocks of lines, and define downvalues to find the unique strings and keep track of how many times each one has been seen.

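(* Create some test data: each of the numbers 1..1000 appears twice *)
Export["/tmp/test.txt", Flatten[{Range[1000], Range[1000]}], "Lines"];

(* Count how often each distinct line occurs, reading blockSize lines at a time *)
countUniqueLines[file_String, blockSize_Integer] := Module[{stream, map, block, keys, out},
    map[_] := 0;  (* default count for a line not seen yet *)
    stream = OpenRead[file];
    (* On abort, release the stream and the table, then re-raise the abort
       so the stream is not closed a second time below *)
    CheckAbort[
        While[(block = ReadList[stream, String, blockSize]) =!= {},
            Scan[(map[#] = map[#] + 1) &, block]],
        Close[stream]; Clear[map]; Abort[]];
    Close[stream];
    (* The distinct lines are the string arguments stored in the downvalues of map *)
    keys = Cases[DownValues[map][[All, 1, 1, 1]], _String];
    out = {#, map[#]} & /@ keys;
    Clear[map];
    out
]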

countUniqueLines["/tmp/test.txt", 500]


(* Alternative implementation if you have a little more memory *)
Tally[Import["/tmp/test.txt", "Lines"]]
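Both variants return {line, count} pairs with the line still a string, whereas the question asks for {count, {fields}}. A minimal sketch of that last conversion step, assuming every field is an integer separated by ";" (toRecordCounts is a hypothetical helper name, not part of the answer above):

```mathematica
(* Hypothetical helper: turn {line, count} pairs into {count, {fields}}.
   Assumes each line holds integer fields separated by ";". *)
toRecordCounts[pairs_List] :=
  {#[[2]], ToExpression /@ StringSplit[#[[1]], ";"]} & /@ pairs

toRecordCounts[{{"1;0;0;13;6", 10000}, {"2;0;0;13;6", 20000}}]
(* {{10000, {1, 0, 0, 13, 6}}, {20000, {2, 0, 0, 13, 6}}} *)
```

It can be applied directly to the result, for example toRecordCounts[countUniqueLines["my_file.csv", 500]].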
Joshua Martell, answered Dec 01 '22 16:12


I think you want the Read[] function.
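As a rough sketch of how Read[] could be used here: Read[stream, String] returns one line per call and the symbol EndOfFile once the file is exhausted, so only the table of distinct records stays in memory. The function name countRecords is hypothetical, and the sketch assumes Mathematica 10.1 or later (for associations and KeyValueMap) and integer fields separated by ";".

```mathematica
(* Sketch: count distinct records while reading one line at a time.
   countRecords is a hypothetical name; needs Mathematica 10.1+. *)
countRecords[file_String] := Module[{stream, counts = <||>, line},
  stream = OpenRead[file];
  (* Read returns EndOfFile when there are no more lines *)
  While[(line = Read[stream, String]) =!= EndOfFile,
    counts[line] = Lookup[counts, line, 0] + 1];
  Close[stream];
  (* Convert "1;0;0;13;6" -> count into {count, {1, 0, 0, 13, 6}} *)
  KeyValueMap[{#2, ToExpression /@ StringSplit[#1, ";"]} &, counts]
]

countRecords["my_file.csv"]
```

Reading line by line keeps memory use proportional to the number of distinct records, at the cost of more calls than the block-wise ReadList approach.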

High Performance Mark, answered Dec 01 '22 15:12