With common shell tools it is easy to split a big file (in my case a MySQL dump, and thus a TSV file) into smaller parts using the split command. This command also supports splitting a file after every n lines (the -l argument). However, it does not distinguish between escaped and unescaped newline characters, so it may break a single table row into two incomplete parts.
Example (TSV with 2 columns)
cool 2014-12-15 17:31:00
do not censor it ...^M\\n 2016-01-24 22:33:00
watch out ari, you've got compeition! hahah 2001-12-05 19:11:01
Oh God, the poor guy! xD\\nCan't wait to watch this! 2011-07-11 22:01:20
wish i could do that.\\n 2001-02-07 00:24:11
Funny! I will use this reason when I drink something in other houses 2015-06-10 12:20:00
As you can see, there are two columns (the first contains the comment, the second the date), separated by a tab. I have visualised only the escaped newlines; tabs and unescaped newlines are not printed. If you put these lines into a file and split it (e.g., split example.tsv -l 1) you will get 9 files, although there are only 6 comments (3 of them contain an escaped newline). This is because an escaped newline is just a regular newline preceded by a backslash, and split treats it like any other newline. This is a huge problem for me, because splitting the file may leave incomplete table rows in the output files.
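The effect can be reproduced on a smaller scale. The following is only a sketch: it builds a three-row sample in a temporary directory (the file name example.tsv follows the text above, while the part_ prefix and the shortened data are chosen just for illustration) and assumes that an escaped newline is stored as a backslash followed by a real newline.

```shell
#!/bin/sh
# Work in a throwaway directory so no existing files are touched.
dir=$(mktemp -d) && cd "$dir" || exit 1

# Three rows of "comment<TAB>date"; the second comment ends with an
# escaped newline (backslash + newline), so it spans two physical lines.
printf 'cool\t2014-12-15 17:31:00\n'                      >  example.tsv
printf 'do not censor it ...\\\n\t2016-01-24 22:33:00\n'  >> example.tsv
printf 'Funny!\t2015-06-10 12:20:00\n'                    >> example.tsv

# split counts physical lines, not table rows.
split -l 1 example.tsv part_

# A row ends on a physical line that does NOT end with a backslash.
rows=$(grep -vc '\\$' example.tsv)
pieces=$(ls part_* | wc -l | tr -d ' ')
echo "rows: $rows, pieces: $pieces"
```

For this sample the script reports 3 rows but 4 pieces, because the row containing the escaped newline is cut in half.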
Is it somehow possible to ignore escaped newlines or does someone know another command which can do this?
This starts a new output file once the current one holds more than n lines (20 here, or whatever you set n to), but never directly after a line that ends with a backslash:
awk -v n=20 'NR==1 || (c>n && !(last~/\\$/)){c=0; close(f); f="file" ++count ".out"} {c++; print>f; last=$0}' file
-v n=20
This creates an awk variable n, which we use to decide when to split the file.
NR==1 || (c>n && !(last~/\\$/)){c=0; close(f); f="file" ++count ".out"}
Every time we need to start a new file, we (a) reset the line counter, c, to zero, (b) close the previous file, and (c) define the name of the next file.
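We need to start a new file when (i) we are on the first input line, NR==1, or else when (ii) the line counter c exceeds the limit n and the previous line, saved in last, did not end with \. Because the test is c>n rather than c>=n, a file is closed only once it already holds at least n+1 lines; use c>=n if you want files of n lines.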
c++; print>f; last=$0
This increments the line counter, c, prints the current line to file f, and updates last to the value of the current line.
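The same logic can be written out as a small shell function, which is easier to adapt than the one-liner. This is a sketch rather than the command above: the names split_rows, continued and prefix are hypothetical, it uses c >= n so that a piece is closed after n lines, and it assumes that the dump escapes a literal backslash as \\, in which case only an odd number of trailing backslashes marks an escaped newline.

```shell
#!/bin/sh
# Hypothetical helper, not part of the one-liner explained above.
# Usage: split_rows INPUT LINES_PER_PIECE PREFIX
split_rows() {
    awk -v n="$2" -v prefix="$3" '
    # Return 1 when s ends in an odd number of backslashes, i.e. the
    # newline that followed s was escaped (assumption: a doubled
    # backslash encodes a literal backslash).
    function continued(s,    k) {
        k = 0
        while (k < length(s) && substr(s, length(s) - k, 1) == "\\") k++
        return k % 2
    }
    # Start a new piece on the first line, or once the current piece
    # holds n lines and the previous line did not escape its newline.
    NR == 1 || (c >= n && !continued(last)) {
        if (f != "") close(f)
        c = 0
        f = prefix (++count) ".out"
    }
    { c++; print > f; last = $0 }
    ' "$1"
}

# Small demonstration in a throwaway directory (sample data is made up).
dir=$(mktemp -d) && cd "$dir" || exit 1
printf 'text1\t2014-12-15 17:31:01\n'       >  sample.tsv
printf 'text2\\\n\t2014-12-15 17:31:02\n'   >> sample.tsv
printf 'text3\tC:\\\\tmp\\\\\n'             >> sample.tsv
printf 'text4\t2014-12-15 17:31:04\n'       >> sample.tsv

split_rows sample.tsv 1 piece
ls piece*.out
```

With n=1 the sample yields four pieces, one per row: the row with the escaped newline stays together in piece2.out, while the row whose last column ends in an escaped backslash (\\) is treated as complete, so the following row gets its own piece.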
Let's try this test file:
$ cat file
text1 2014-12-15 17:31:01
text2\
2014-12-15 17:31:02
text3 2014-12-15 17:31:03
text4a\
text4b\
2014-12-15 17:31:04
text5 2014-12-15 17:31:05
Now, let's run our command. To keep the example short, we set n=2:
$ awk -v n=2 'NR==1 || (c>n && !(last~/\\$/)){c=0; close(f); f="file" ++count ".out"} {c++; print>f; last=$0}' file
After the command is run, new files appear in the directory:
$ ls
file file1.out file2.out file3.out
The new files contain the original lines in order, split once a file holds more than 2 lines, but never directly after a line ending in \:
$ cat file1.out
text1 2014-12-15 17:31:01
text2\
2014-12-15 17:31:02
$ cat file2.out
text3 2014-12-15 17:31:03
text4a\
text4b\
2014-12-15 17:31:04
$ cat file3.out
text5 2014-12-15 17:31:05
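One way to check a split like this is to glue the pieces back together and compare the result with the original. The sketch below recreates the test file (assuming a tab between text and date), runs the command from above, and then walks the pieces by number; the name rejoined is chosen only for this illustration. Looping over an index avoids relying on shell glob order, which would put file10.out before file2.out once there are more than nine pieces.

```shell
#!/bin/sh
dir=$(mktemp -d) && cd "$dir" || exit 1

# Recreate the test file (TAB between text and date is an assumption).
printf 'text1\t2014-12-15 17:31:01\n'                 >  file
printf 'text2\\\n\t2014-12-15 17:31:02\n'             >> file
printf 'text3\t2014-12-15 17:31:03\n'                 >> file
printf 'text4a\\\ntext4b\\\n\t2014-12-15 17:31:04\n'  >> file
printf 'text5\t2014-12-15 17:31:05\n'                 >> file

# The command from the answer, unchanged.
awk -v n=2 'NR==1 || (c>n && !(last~/\\$/)){c=0; close(f); f="file" ++count ".out"} {c++; print>f; last=$0}' file

# Rejoin the pieces in numeric order and flag any piece that ends
# on an escaped newline, i.e. in the middle of a row.
i=1
: > rejoined
while [ -e "file$i.out" ]; do
    cat "file$i.out" >> rejoined
    if tail -n 1 "file$i.out" | grep -q '\\$'; then
        echo "file$i.out ends inside a row" >&2
    fi
    i=$((i + 1))
done

# cmp -s is silent and only sets the exit status.
if cmp -s file rejoined; then status=identical; else status=different; fi
echo "$status"
```

If the pieces are complete and in order, the script prints identical and emits no warning on standard error.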