
splitting up big file using AWK, cannot get past 252 split files

Tags:

regex

awk

I want to split a large file (7.5 MB) into multiple smaller files based on a regex timestamp pattern; there are 566 timestamps in the file.

The large file is made up of multiple blocks of data, each block contains: timestamp + data, and it looks like this (line 1 is the first timestamp):

12/20/2022 23:18:56

blah
blah
blah
blah
blah
blah
12/20/2022 23:23:56

blah
blah
blah
12/20/2022 23:28:56
blah
...
...
...

Each smaller, split-up file should only contain one timestamp & one block of data, e.g.:

12/20/2022 23:23:56

blah
blah
blah

I'm using awk to look for each timestamp, and once found, each timestamp + data is written to a split file, until the next timestamp is found, which then creates the next split file:

regex='([0-9]{2}\/[0-9]{2}\/[0-9]{4})'
awk -v regex=$regex '$0 ~ regex{x="split"++i}; i > 0 {print > x;}' $bigfile

This works great (i.e. files split1-252 are exactly what I expected) until awk encounters the 253rd occurrence of the timestamp, and then it errors out:

awk: can't open file split253
 source line number 1

As far as I can tell, there's nothing different about the 253rd timestamp, so I saved the 253rd through 566th timestamp occurrences as a new file (314 occurrences of the timestamp pattern in total) and reran my code against it. Interestingly enough, awk errored out again with the exact same message:

awk: can't open file split253
 source line number 1

It almost seems the way I have written the awk command can only handle creating 252 files based on the regex pattern, but I'm not sure what's causing this limitation. Any advice would be greatly appreciated.

I've been researching/googling this for a couple of days now, and found another post with a similar issue, so I tried setting an initial value for x, but that still gave me the same error. Furthermore, if an initial value for x were needed, I'd expect awk to error out immediately, rather than working correctly for split1-252 and then failing at 253.
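(For context, the magic number 252 lines up neatly with a common per-process open-file limit: stdin, stdout, stderr, and the input file hold four descriptors, so a 256-descriptor limit, the default on macOS for example, leaves room for exactly 252 output files. A quick sketch to check, assuming a POSIX shell:)

```shell
# Show the current per-process open-file descriptor limit (system-dependent)
ulimit -n

# A 256-descriptor limit minus stdin, stdout, stderr, and the input file
# leaves exactly 252 slots for split files
echo $((256 - 3 - 1))    # 252
```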

asked May 06 '26 by katiayx

2 Answers

There's always a limit to how many files one process can have open, and different awk versions also have their own limits which can be as low as 10. Some awks (e.g. GNU awk) handle the external limit internally but it slows them down while other awks just fail as you see. Just close the output files as you go:

regex='[0-9]{2}/[0-9]{2}/[0-9]{4}'
awk -v regex="$regex" '$0 ~ regex{close(x); x="split"(++i)}; i{print > x}' "$bigfile"

I tidied up your regexp, quoting, etc. too. Obviously you don't actually need to declare a shell variable to hold the regexp:

awk '/[0-9]{2}\/[0-9]{2}\/[0-9]{4}/{close(x); x="split"(++i)}; i{print > x}' "$bigfile"
answered May 09 '26 by Ed Morton


You can also use csplit:

csplit -qz ip.txt '/[0-9]\{2\}\/[0-9]\{2\}\/[0-9]\{4\}/' '{*}'

This will create files named xx00, xx01, xx02, etc. You can customize the output names. For example, -n1 -f'split' will give names like split0, split1, etc.
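For example, against a small sample file (names are made up for illustration; this assumes GNU csplit, since the `'{*}'` repeat argument may not be available in BSD csplit):

```shell
# Small sample in the same format as the question
cat > ip.txt <<'EOF'
12/20/2022 23:18:56
blah
12/20/2022 23:23:56
blah
blah
EOF

# -q: no byte counts, -z: elide empty output files (the first pattern match
# is on line 1, so the leading section would be empty),
# -n1: one-digit suffixes, -f'split': use "split" as the file-name prefix
csplit -qz -n1 -f'split' ip.txt '/[0-9]\{2\}\/[0-9]\{2\}\/[0-9]\{4\}/' '{*}'

ls split*    # one file per timestamped block
```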

answered May 09 '26 by Sundeep


