I'm trying to perform a simple literal search/replace on a large (30G) one-line file, using sed.
I would expect this to take some time but, when I run it, it returns after a few seconds and, when I look at the generated file, it's zero length.
The input file is 30G:
$ ls -lha Full-Text-Tokenized-Single-Line.txt
-rw-rw-r-- 1 ubuntu ubuntu 30G Jun 9 19:51 Full-Text-Tokenized-Single-Line.txt
Run the command:
$ sed 's/<unk>/ /g' Full-Text-Tokenized-Single-Line.txt > Full-Text-Tokenized-Single-Line-No-unks.txt
The output file has zero length!
$ ls -lha Full-Text-Tokenized-Single-Line-No-unks.txt
-rw-rw-r-- 1 ubuntu ubuntu 0 Jun 9 19:52 Full-Text-Tokenized-Single-Line-No-unks.txt
Things I've tried:
Using 's/foo/bar/g' instead: doesn't work, a zero-length file is returned.
The return code is 0.
The sed version is (GNU sed) 4.2.2.
Just use awk, it's designed for handling records separated by arbitrary strings. With GNU awk for multi-char RS:
awk -v RS='<unk>' '{ORS=(RT?" ":"")}1' file
The above splits the input into records separated by <unk>, so if enough <unk>s are present in the input then the individual records will be small enough to fit in memory. It then prints each record followed by a blank char (whenever a separator was actually matched), so the overall effect on the data is that all <unk>s become blank chars.
If that direct approach doesn't work for you THEN it'd be time to start looking for alternative solutions.
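Before committing to the 30G run, it may be worth checking the awk command against plain sed on a sample small enough for sed to handle. The sketch below wraps the command from this answer in a helper; the function name unk_to_space_gawk is hypothetical. It calls gawk explicitly because RT is a GNU awk extension, and on systems where awk is a different implementation RT would be empty and the <unk>s would be deleted rather than replaced.

```shell
# Hypothetical helper: replace every literal <unk> on stdin with a space.
# RS splits the input at each <unk>; RT holds the separator text that
# actually terminated the current record (empty for the final record
# when the input does not end with <unk>).
unk_to_space_gawk() {
    gawk -v RS='<unk>' '{ ORS = (RT ? " " : "") } 1'
}

# Sanity check on a small sample: both commands should print the same text.
# printf 'aa<unk>bb <unk> cc' | unk_to_space_gawk
# printf 'aa<unk>bb <unk> cc' | sed 's/<unk>/ /g'
```

If the two outputs match on the sample, the same helper can be fed the large file on stdin with the output redirected to a new file.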
With line-based editors like sed you can't expect this to work, since their unit of work (record) is the line, terminated by a line break.
One suggestion, if you have whitespace in your file (to prevent the searched pattern from being split), is to use:
fold -s file_with_one_long_line |
sed 's/find/replace/g' |
tr -d '\n' > output
P.S. The default fold width is 80; if you have words longer than 80 characters you can add -w 1000, or at least the longest word size, to prevent word splitting.
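The pipeline above can be tried on a tiny sample with a deliberately narrow width to see that the wrap and re-join round-trips. This is only a sketch: the function name replace_via_fold and its width argument are hypothetical, and it assumes no whitespace-free run in the data is longer than the chosen width. Note that tr -d '\n' removes every newline, including a trailing one in the original file.

```shell
# Hypothetical helper: wrap stdin at blanks, run sed on each wrapped
# line, then delete the newlines again to restore a single line.
# $1 is the fold width (defaults to 1000); it must be at least as long
# as the longest whitespace-free run, or the pattern could be split.
replace_via_fold() {
    fold -s -w "${1:-1000}" | sed 's/<unk>/ /g' | tr -d '\n'
}

# Compare with plain sed on a sample small enough for sed to handle:
# printf 'aa <unk> bb <unk> cc' | replace_via_fold 8
# printf 'aa <unk> bb <unk> cc' | sed 's/<unk>/ /g'
```

Both commands should print the same text when the width assumption holds.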
Officially, GNU sed has no line limit (http://www.linuxtopia.org/online_books/linux_tool_guides/the_sed_faq/sedfaq6_005.html). However, the page states that:
"no limit" means there is no "fixed" limit. Limits are actually determined by one's hardware, memory, operating system, and which C library is used to compile sed.
I tried running sed on a single 7GB file and could reproduce the same issue. This page https://community.hpe.com/t5/Languages-and-Scripting/Sed-Maximum-Line-Length/td-p/5136721 suggests using perl instead:
perl -pe 's/start=//g;s/stop=//g;s/<unk>/ /g' file > output