
Indexing a huge text file

I have one huge text file (over 100 GB) with 6 columns of data (tab as separator). The first column holds an integer value (2500 distinct values in the set). I need to split this file into multiple smaller files depending on the value in the first column (note that the rows are NOT sorted). Each of these smaller files will be used to prepare a plot in MATLAB.

I have only 8 GB of RAM.

The problem is how to do this efficiently. Any ideas?

asked Apr 11 '11 by gozwei

3 Answers

Using bash:

# Read the file line by line; IFS= and -r keep tabs and backslashes intact.
while IFS= read -r line; do
  intval="$( printf '%s\n' "$line" | cut -f 1 )"   # first (tab-separated) column
  chunkfile="$( printf '%010u.txt' "$intval" )"    # e.g. 0000000012.txt
  printf '%s\n' "$line" >> "$chunkfile"
done < 100gigfile

That will split your 100 GB file into (as you say) 2500 individual files named according to the value of the first field. You may have to adjust the format argument to printf to your taste.

answered Oct 11 '22 by odrm

A one-liner with bash + awk:

awk '{ print $0 >> ($1 ".dat") }' 100gigfile

This will append every line of your large file to a file named after the first column's value plus a ".dat" extension; e.g. the line "12 aa bb cc dd ee ff" will go to the file 12.dat.

answered Oct 11 '22 by davka


For 64-bit Linux (I am not sure whether it works on Windows), you can mmap the file and copy blocks to new files. I think this would be the most efficient way of doing it.
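A minimal C sketch of that idea, under some assumptions the answer does not spell out: the input path ("huge.txt"), the "<key>.dat" output naming, and the open/append/close-per-line strategy are all illustrative choices, not part of the original answer.

/* Sketch: mmap the big file, scan it once, append each line to <key>.dat. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "huge.txt";  /* placeholder name */
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* Map the whole file read-only; the kernel pages it in on demand,
       so the 100 GB never has to fit in the 8 GB of RAM. */
    char *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }
    madvise(base, st.st_size, MADV_SEQUENTIAL);  /* hint: one linear pass */

    const char *p = base, *end = base + st.st_size;
    while (p < end) {
        /* Find the end of the current line. */
        const char *nl = memchr(p, '\n', end - p);
        size_t len = nl ? (size_t)(nl - p) + 1 : (size_t)(end - p);

        /* The key is everything up to the first tab. */
        const char *tab = memchr(p, '\t', len);
        if (tab) {
            char name[64];
            snprintf(name, sizeof name, "%.*s.dat", (int)(tab - p), p);

            /* Simplest possible strategy: open/append/close per line.
               A real version would cache open FILE* handles per key. */
            FILE *out = fopen(name, "a");
            if (out) {
                fwrite(p, 1, len, out);
                fclose(out);
            }
        }
        p += len;
    }

    munmap(base, st.st_size);
    close(fd);
    return 0;
}

Compile with something like "cc -O2 split.c -o split" (file name is hypothetical) and run it as "./split huge.txt".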

answered Oct 11 '22 by BЈовић