How to split a file into blocks defined by a keyword

Tags: bash

Suppose I have a large text file such as:

variableStep chrom=chr1
sometext1
sometext1
sometext1
variableStep chrom=chr2
sometext2
variableStep chrom=chr3
sometext3
sometext3
sometext3
sometext3

I would like to split this file into 3 files: file 1 has the content

sometext1
sometext1
sometext1

file 2 has the content

sometext2

and file 3 has the content

sometext3
sometext3
sometext3
sometext3

Note that none of the "sometext1", "sometext2", or "sometext3" lines will contain the word "variableStep".

I can do this in Python by iterating over the lines and, every time a line begins with "variableStep", opening a new file handle and writing the subsequent lines to it. However, I am wondering whether this can be done on the command line. Note that the real files are massive (multiple GB), so reading all the content in one go is not feasible.

Thanks

Lee Sande asked May 07 '15 04:05



1 Answer

This creates file1, file2, etc. with the desired content:

awk '/variableStep/{close(f); f="file" ++c;next} {print>f;}' file

How it works

  • /variableStep/{close(f); f="file" ++c;next}

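    Every time we reach a line that contains variableStep, we close the last file used, assign to f the name of the next file to use, and then skip the rest of the commands and jump to the next line. Closing matters on large inputs: without it, awk keeps every output file open and can run out of file descriptors.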

    c is a counter telling us the number for the current file. It is incremented by ++ every time that we create a new file name.

  • print>f

    For all other lines, we print them to a file named according to the value of variable f.

Since awk reads the input one line at a time and never holds the whole file in memory, this should be suitable even for massive files.

The first output file looks like:

$ cat file1
sometext1
sometext1
sometext1
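
To try this end to end, here is a minimal sketch that recreates the sample input from the question and runs a slightly more defensive variant of the command. The two tweaks are my own assumptions, not part of the original one-liner: ^ anchors the keyword to the start of the line (the question says the keyword appears there), and the f != "" guard skips any lines that come before the first variableStep header. The scratch directory and the name input.txt are only for illustration.

```shell
# Work in a scratch directory so the generated files do not clutter anything.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the sample input from the question (input.txt is an illustrative name).
cat > input.txt <<'EOF'
variableStep chrom=chr1
sometext1
sometext1
sometext1
variableStep chrom=chr2
sometext2
variableStep chrom=chr3
sometext3
sometext3
sometext3
sometext3
EOF

# Header line: close the previous output file (if any), pick the next name, skip the header.
# Any other line: write it to the current file, but only once a header has been seen.
awk '
  /^variableStep/ { if (f != "") close(f); f = "file" (++c); next }
  f != ""         { print > f }
' input.txt

# List what was produced.
ls
```

The parentheses in "file" (++c) just make the concatenation explicit; the behaviour on the sample input should match the original command.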
John1024 answered Oct 21 '22 12:10
