
split file in bash after unescaped newline

Given common bash tools, it is easy to split a big file (in my case a MySQL dump, and thus a TSV file) into smaller parts using the split command. This command also supports splitting a file after every n lines (the -l argument). However, it does not distinguish between escaped and unescaped newline characters and thus might break a single table row into two incomplete parts.

Example (TSV with 2 columns)

cool    2014-12-15 17:31:00
do not censor it ...^M\\n      2016-01-24 22:33:00
watch out ari, you've got compeition! hahah     2001-12-05 19:11:01
Oh God, the poor guy!  xD\\nCan't wait to watch this!      2011-07-11 22:01:20
wish i could do that.\\n       2001-02-07 00:24:11
Funny! I will use this reason when I drink something in other houses    2015-06-10 12:20:00

As you can see, there are two columns (the first contains the comment and the second the date), which are separated by a tab. I visualised only the escaped newlines; tabs and unescaped newlines are not shown. If you put these lines into a file and split it (e.g., split -l 1 example.tsv), you will get 9 files, although there are only 6 comments (3 of them contain an escaped newline). This is because split treats an escaped newline as a regular newline that merely happens to be preceded by a backslash. This is a huge problem for me, because splitting the file might leave incomplete table rows in the output files.
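
The miscount is easy to reproduce. The sketch below is only an illustration: it rebuilds a shortened stand-in for the six sample rows, splits it with split -l 1, and compares the number of parts with the number of real rows. The shortened comment texts, the scratch directory and the part_ prefix are assumptions and not part of the original data.

```shell
# Work in a scratch directory so the generated parts do not clutter anything.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Shortened stand-in for the six sample rows. In the printf format, "\\"
# emits a literal backslash and "\n" a real newline, so "\\\n" produces an
# escaped newline inside a comment.
{
  printf 'cool\t2014-12-15 17:31:00\n'
  printf 'do not censor it ...\\\n\t2016-01-24 22:33:00\n'
  printf 'watch out\t2001-12-05 19:11:01\n'
  printf 'Oh God\\\nwait\t2011-07-11 22:01:20\n'
  printf 'wish i could do that.\\\n\t2001-02-07 00:24:11\n'
  printf 'Funny!\t2015-06-10 12:20:00\n'
} > example.tsv

# split only sees physical lines, so every escaped newline costs an extra part.
split -l 1 example.tsv part_

parts=$(ls part_* | wc -l | tr -d ' ')
# Physical lines that do not end in a backslash mark the ends of real rows.
rows=$(grep -c -v '\\$' example.tsv)

echo "parts: $parts, rows: $rows"
```

With this input the two numbers differ by exactly the number of escaped newlines.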

Is it somehow possible to ignore escaped newlines or does someone know another command which can do this?

asked Oct 28 '22 22:10 by NaN


1 Answer

This starts a new output file once the current one already holds more than n lines (here n=20, so each file holds at least 21 lines), but never directly after a line that ends with a backslash:

awk -v n=20 'NR==1 || (c>n && !(last~/\\$/)){c=0; close(f); f="file" ++count ".out"} {c++; print>f; last=$0}' file

How it works

  1. -v n=20

    This creates an awk variable n which we will use to decide when to split the file.

  2. NR==1 || (c>n && !(last~/\\$/)){c=0; close(f); f="file" ++count ".out"}

    Every time that we need to start a new file, we (a) set the line counter, c, to zero, (b) close the previous file, and (c) define a name for the next file.

    We need to start a new file when (i) we are on the first input line, NR==1, or else when (ii) the line counter c exceeds the limit n and the last line did not end with \.

  3. c++; print>f; last=$0

    This increments the line counter, c, prints the current line to file f, and updates last to the value of the current line.
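
The three parts can also be written out as a spelled-out sketch. It is run here on a hypothetical input of 50 plain numbered lines; the scratch directory and that input are assumptions for illustration only. The run also makes one detail of the c>n test visible: c is checked before the current line is counted, so a file without continuation lines receives n+1 lines, and changing the test to c>=n would give exactly n.

```shell
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Hypothetical input: 50 lines, none of which ends in a backslash.
awk 'BEGIN { for (i = 1; i <= 50; i++) print "line" i }' > file

awk -v n=20 '
  # Start a new output file on the first line, or once more than n lines
  # have been written and the previous line did not end in a backslash.
  NR == 1 || (c > n && last !~ /\\$/) {
      c = 0                      # reset the per-file line counter
      close(f)                   # close the previous output file
      count++
      f = "file" count ".out"    # name of the next output file
  }
  # Count the line, append it to the current file, and remember it.
  { c++; print > f; last = $0 }
' file

# Show how many lines each part received.
wc -l file*.out
```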

Example

Let's try this test file:

$ cat file
text1   2014-12-15 17:31:01
text2\
        2014-12-15 17:31:02
text3   2014-12-15 17:31:03
text4a\
text4b\
        2014-12-15 17:31:04
text5   2014-12-15 17:31:05

Now, let's run our command. To keep the example short, we set n=2:

$ awk -v n=2 'NR==1 || (c>n && !(last~/\\$/)){c=0; close(f); f="file" ++count ".out"} {c++; print>f; last=$0}' file

After the command is run, new files appear in the directory:

$ ls
file  file1.out  file2.out  file3.out

The new files contain the original contents, split once a file holds more than 2 lines, but never directly after a line ending in \:

$ cat file1.out
text1   2014-12-15 17:31:01
text2\
        2014-12-15 17:31:02
$ cat file2.out
text3   2014-12-15 17:31:03
text4a\
text4b\
        2014-12-15 17:31:04
$ cat file3.out
text5   2014-12-15 17:31:05
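
As a quick sanity check, the sketch below recreates the test file in a scratch directory (an assumption for illustration), runs the same command, and checks two things: that the parts, concatenated in order, give back the original file, and that no part ends in the middle of a row.

```shell
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Recreate the test file; "\\\n" in the printf format is a backslash plus newline.
{
  printf 'text1\t2014-12-15 17:31:01\n'
  printf 'text2\\\n\t2014-12-15 17:31:02\n'
  printf 'text3\t2014-12-15 17:31:03\n'
  printf 'text4a\\\ntext4b\\\n\t2014-12-15 17:31:04\n'
  printf 'text5\t2014-12-15 17:31:05\n'
} > file

awk -v n=2 'NR==1 || (c>n && !(last~/\\$/)){c=0; close(f); f="file" ++count ".out"} {c++; print>f; last=$0}' file

# The parts, concatenated in order, should be identical to the original.
cat file1.out file2.out file3.out | cmp -s - file && echo "round trip OK"

# No part may end on a line that finishes with a backslash.
for part in file*.out; do
  if tail -n 1 "$part" | grep -q '\\$'; then
    echo "$part ends inside a row"
  fi
done
```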

answered Nov 02 '22 19:11 by John1024