Running sed on a large (30G) one-line file returns an empty output

Question

I'm trying to perform a simple literal search/replace on a large (30G) one-line file, using sed.

I would expect this to take some time but, when I run it, it returns after a few seconds and, when I look at the generated file, it's zero length.

The input file is 30G:

$ ls -lha Full-Text-Tokenized-Single-Line.txt  
-rw-rw-r-- 1 ubuntu ubuntu 30G Jun  9 19:51 Full-Text-Tokenized-Single-Line.txt

run the command:

$ sed 's/<unk>/ /g' Full-Text-Tokenized-Single-Line.txt > Full-Text-Tokenized-Single-Line-No-unks.txt

the output file has zero length!

$ ls -lha Full-Text-Tokenized-Single-Line-No-unks.txt 
-rw-rw-r-- 1 ubuntu ubuntu 0 Jun  9 19:52 Full-Text-Tokenized-Single-Line-No-unks.txt

Things I've tried

running the very same example on a shorter file: works
using -e modifier: doesn't work
escaping "<" and ">": doesn't work
using a simple pattern like ('s/foo/bar/g') instead: doesn't work, a zero-length file is returned.

EDIT (more information)

return code is 0
sed version is (GNU sed) 4.2.2

Ed Morton · Accepted Answer

Just use awk, it's designed for handling records separated by arbitrary strings. With GNU awk for multi-char RS:

awk -v RS='<unk>' '{ORS=(RT?" ":"")}1' file

The above splits the input into records separated by <unk> so if enough <unk>s are present in the input then the individual records will be small enough to fit in memory. It then prints each record followed by a blank char so the overall impact to the data is that all <unk>s become blank chars.

If that direct approach doesn't work for you THEN it'd be time to start looking for alternative solutions.
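
Before committing to a 30G run, it can help to try the record-splitting idea on a tiny sample and compare it with what sed produces on the same text. The sketch below is only an illustration: the sample string and variable names are made up, and it uses a variant of the command that prints the space before every record after the first instead of relying on RT. That is on the assumption that your awk accepts a multi-character RS (gawk, mawk and busybox awk do) but is not necessarily gawk.

```shell
# Hypothetical sample standing in for the real one-line file.
sample='alpha <unk> beta <unk> gamma'

# Split on the literal string <unk>; emit one space in place of each separator.
awk_out=$(printf '%s' "$sample" |
  awk -v RS='<unk>' 'NR > 1 { printf " " } { printf "%s", $0 }')

# Reference result from sed, which copes fine with a line this short.
sed_out=$(printf '%s\n' "$sample" | sed 's/<unk>/ /g')

printf 'awk: [%s]\nsed: [%s]\n' "$awk_out" "$sed_out"
```

If the two bracketed strings match, the same awk program can be pointed at the real file. One difference from the RT version: in gawk at least, this variant does not emit a space for a <unk> that is the very last thing in the input.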

karakfa · Answer

With line-based editors like sed you can't expect this to work, since their unit of work (the record) is a line terminated by a line break.

One suggestion, if you have white space in your file (so that the searched pattern doesn't get split), is to use

fold -s file_with_one_long_line |
sed 's/find/replace/g' |
tr -d '\n' > output

PS: fold's default width is 80. If you have words longer than 80 characters, you can add -w 1000, or at least the length of the longest word, to prevent words from being split.
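
The fold/sed/tr round trip can be checked on a small sample first. This is a minimal sketch with a made-up sample string and variable names, and a deliberately narrow width (-w 12) so that folding actually happens on such a short input. It relies on fold -s breaking after a blank and keeping that blank on the line, so deleting the newlines afterwards restores the original spacing.

```shell
# Hypothetical sample standing in for the real one-line file.
sample='alpha <unk> beta <unk> gamma <unk> delta'

# Break the long line at spaces, substitute per line, then glue the pieces back.
folded_out=$(printf '%s' "$sample" |
  fold -s -w 12 |
  sed 's/<unk>/ /g' |
  tr -d '\n')

# Reference result from running sed on the unfolded line.
direct_out=$(printf '%s\n' "$sample" | sed 's/<unk>/ /g')

printf 'folded: [%s]\ndirect: [%s]\n' "$folded_out" "$direct_out"
```

Note that tr -d '\n' removes every newline, including the one at the very end of the file if there is one.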

Ramast · Answer

Officially, GNU sed has no line length limit: http://www.linuxtopia.org/online_books/linux_tool_guides/the_sed_faq/sedfaq6_005.html However, the page states that:

"no limit" means there is no "fixed" limit. Limits are actually determined by one's hardware, memory, operating system, and which C library is used to compile sed.

I tried running sed on a single 7GB file and could reproduce the same issue. This page https://community.hpe.com/t5/Languages-and-Scripting/Sed-Maximum-Line-Length/td-p/5136721 suggests using perl instead:

perl -pe 's/start=//g;s/stop=//g;s/<unk>/ /g' file > output

Tags: bash, command-line, sed, ubuntu

Asked by Felipe