I have one huge text file (over 100 GB) with 6 columns of data (tab-separated). The first column holds an integer value (2500 distinct values in the set). I need to split this file into multiple smaller files depending on the value in the first column (note that the rows are NOT sorted). Each of these smaller files will be used to prepare a plot in MATLAB.
I have only 8 GB of RAM.
The problem is how to do this efficiently. Any ideas?
Using bash:
# read the big file line by line; -r keeps backslashes intact
while IFS= read -r line; do
    intval="$( printf '%s\n' "$line" | cut -f 1 )"     # first tab-separated field
    chunkfile="$( printf '%010u.txt' "$intval" )"      # e.g. 0000000012.txt
    printf '%s\n' "$line" >> "$chunkfile"
done < 100gigfile
That will split your 100 GB file into (as you say) 2500 individual files, named according to the value of the first field. You may have to adjust the printf format string to your taste. Be aware that spawning cut and printf for every line makes this quite slow on a file this size.
One-liner with bash + awk:
awk '{ print $0 >> ($1 ".dat") }' 100gigfile
This appends every line of your large file to a file named after the first column's value plus a ".dat" extension; e.g. the line "12 aa bb cc dd ee ff" will go to the file 12.dat.
On 64-bit Linux (I am not sure whether it works on Windows), you can mmap the file and copy blocks out to new files. I think this would be the most efficient way of doing it.
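For illustration, here is a rough Python 3 sketch of that idea (the input name 100gigfile and the per-value .dat outputs are taken from the answers above; everything else is an assumption, not the original poster's code). The kernel pages the mapping in on demand, so memory use stays far below 8 GB:

import mmap

INPUT = "100gigfile"   # assumed input path, as in the answers above
out_files = {}         # first-column value (bytes) -> open output file

with open(INPUT, "rb") as f:
    # Map the file read-only; pages are loaded on demand, so the 100 GB
    # file is never held in RAM all at once.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        for line in iter(mm.readline, b""):      # readline() returns b"" at EOF
            key = line.split(b"\t", 1)[0]        # integer value in the first column
            out = out_files.get(key)
            if out is None:
                # ~2500 distinct keys means ~2500 open files; you may need to
                # raise the per-process limit first (ulimit -n).
                out = out_files[key] = open(key.decode() + ".dat", "ab")
            out.write(line)                      # copy the whole line verbatim
    finally:
        mm.close()
        for out in out_files.values():
            out.close()

Keeping all of the roughly 2500 output files open at once avoids reopening a file for every line, which is where most of the time would otherwise go.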