Suppose I have a large text file such as:
variableStep chrom=chr1
sometext1
sometext1
sometext1
variableStep chrom=chr2
sometext2
variableStep chrom=chr3
sometext3
sometext3
sometext3
sometext3
I would like to split this file into 3 files: file 1 has the content
sometext1
sometext1
sometext1
file 2 has the content
sometext2
and file 3 has the content
sometext3
sometext3
sometext3
sometext3
Note that none of the "sometext1" "sometext2" "sometext3" will have the word "variableStep".
I can do this in Python by iterating over the lines and opening a new file handle, then writing the subsequent lines to it, every time I encounter "variableStep" at the beginning of a line. However, I am wondering if this can be done on the command line. Note that the real files are massive (multiple GBs), so reading all the content in one go is not feasible.
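For reference, here is a minimal sketch of that Python approach; the input name input.txt and the output names part_1.txt, part_2.txt, ... are just placeholders:
out = None
count = 0
with open("input.txt") as src:
    for line in src:
        if line.startswith("variableStep"):
            # Start a new output file at every "variableStep" header line.
            if out:
                out.close()
            count += 1
            out = open(f"part_{count}.txt", "w")
        elif out:
            # All other lines go to the most recently opened file.
            out.write(line)
if out:
    out.close()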
Thanks
This will create file1, file2, etc. with the desired content:
awk '/variableStep/{close(f); f="file" ++c;next} {print>f;}' file
/variableStep/{close(f); f="file" ++c;next}
Every time we reach a line that contains variableStep, we close the last file used, assign to f the name of the next file to use, and then skip the rest of the commands and jump to the next line. c is a counter for the current file number; ++ increments it each time we create a new file name.
print>f
For all other lines, we print the line to the file named by the value of the variable f.
Since this processes the file line-by-line, it should be suitable even for massive files.
The first output file looks like:
$ cat file1
sometext1
sometext1
sometext1