I have one huge text file (over 100 GB) with 6 columns of data (tab-separated). The first column holds an integer value (2500 distinct values in the set). I need to split this file into multiple smaller files depending on the value in the first column (note that the rows are NOT sorted). Each of these smaller files will be used to prepare a plot in MATLAB.
I have only 8 GB of RAM.
The problem is how to do this efficiently. Any ideas?
Using bash:
# read the big file line by line; -r keeps backslashes intact
while IFS= read -r line; do
    intval="$( printf '%s\n' "$line" | cut -f 1 )"     # first tab-separated field
    chunkfile="$( printf '%010u.txt' "$intval" )"      # e.g. 0000000012.txt
    printf '%s\n' "$line" >> "$chunkfile"
done < 100gigfile
That will split your 100 GB file into (as you say) 2500 individual files, named according to the value of the first field. You may have to adjust the printf format string to your taste. Be aware that spawning cut and printf for every line makes this quite slow on a file this size.
One-liner with bash + awk:
awk '{ print $0 >> ($1 ".dat") }' 100gigfile
This appends every line of your large file to a file named after the first column's value plus a ".dat" extension; e.g. the line "12 aa bb cc dd ee ff" will go to the file 12.dat.
On 64-bit Linux (I am not sure whether it works on Windows), you can mmap the file and copy blocks out to new files. I think this would be the most efficient way of doing it.
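For illustration, here is a rough Python 3 sketch of that idea (the input name 100gigfile and the per-value .dat outputs are taken from the answers above; everything else is an assumption, not the original poster's code). The kernel pages the mapping in on demand, so memory use stays far below 8 GB:

import mmap

INPUT = "100gigfile"   # assumed input path, as in the answers above
out_files = {}         # first-column value (bytes) -> open output file

with open(INPUT, "rb") as f:
    # Map the file read-only; pages are loaded on demand, so the 100 GB
    # file is never held in RAM all at once.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        for line in iter(mm.readline, b""):      # readline() returns b"" at EOF
            key = line.split(b"\t", 1)[0]        # integer value in the first column
            out = out_files.get(key)
            if out is None:
                # ~2500 distinct keys means ~2500 open files; you may need to
                # raise the per-process limit first (ulimit -n).
                out = out_files[key] = open(key.decode() + ".dat", "ab")
            out.write(line)                      # copy the whole line verbatim
    finally:
        mm.close()
        for out in out_files.values():
            out.close()

Keeping all of the roughly 2500 output files open at once avoids reopening a file for every line, which is where most of the time would otherwise go.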