
Running sed on a large (30G) one-line file returns an empty output

I'm trying to perform simple literal search/replace on a large (30G) one-line file, using sed.

I would expect this to take some time but, when I run it, it returns after a few seconds and, when I look at the generated file, it's zero length.

  • the input file is 30G:

    $ ls -lha Full-Text-Tokenized-Single-Line.txt  
    -rw-rw-r-- 1 ubuntu ubuntu 30G Jun  9 19:51 Full-Text-Tokenized-Single-Line.txt
    
  • run the command:

    $ sed 's/<unk>/ /g' Full-Text-Tokenized-Single-Line.txt > Full-Text-Tokenized-Single-Line-No-unks.txt
    
  • the output file has zero length!

    $ ls -lha Full-Text-Tokenized-Single-Line-No-unks.txt 
    -rw-rw-r-- 1 ubuntu ubuntu 0 Jun  9 19:52 Full-Text-Tokenized-Single-Line-No-unks.txt
    

Things I've tried

  • running the very same command on a smaller file: works
  • using the -e option: doesn't work
  • escaping "<" and ">": doesn't work
  • using a simple pattern ('s/foo/bar/g') instead: doesn't work, a zero-length file is still produced

EDIT (more information)

  • return code is 0

  • sed version is (GNU sed) 4.2.2

Felipe asked Jun 09 '17 20:06


3 Answers

Just use awk; it's designed for handling records separated by arbitrary strings. With GNU awk (needed for the multi-char RS and for RT):

awk -v RS='<unk>' '{ORS=(RT?" ":"")}1' file

The above splits the input into records separated by <unk>, so if enough <unk>s are present in the input, the individual records will be small enough to fit in memory. It then prints each record followed by a blank char (or by nothing when no separator was matched, i.e. after the last record), so the overall effect on the data is that all <unk>s become blank chars.

If that direct approach doesn't work for you THEN it'd be time to start looking for alternative solutions.

Ed Morton answered Oct 05 '22 05:10


With line-based editors like sed you can't expect this to work, since their unit of work (the record) is a line terminated by a line break, and here the whole 30G file is a single line.

One suggestion, if your file contains white space (so that the line can be broken without splitting the search pattern), is to use:

fold -s file_with_one_long_line | 
sed 's/find/replace/g'          | 
tr -d '\n' > output

P.S. fold's default width is 80. If you have words longer than 80 characters, add -w 1000, or at least the length of the longest word, to prevent words from being split.
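As a quick sanity check of the pipeline above, here is a minimal sketch that runs it on a tiny one-line sample. The function name replace_unk_folded, the sample text and the width of 8 are illustrative assumptions, not part of the original answer; the width only needs to be at least as long as the longest word.

```shell
# Hypothetical helper: fold stdin at blanks, substitute, then re-join the lines.
# $1 is the fold width (must be >= the longest word so "<unk>" is never split).
replace_unk_folded() {
  fold -s -w "$1" | sed 's/<unk>/ /g' | tr -d '\n'
}

# Tiny one-line sample (no trailing newline), folded at width 8.
printf 'a <unk> b <unk> c' | replace_unk_folded 8
echo
```

Note that tr -d '\n' removes every newline, so if the original file ends with a newline, that final newline is dropped as well.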

karakfa answered Oct 05 '22 05:10


Officially, GNU sed has no line-length limit: http://www.linuxtopia.org/online_books/linux_tool_guides/the_sed_faq/sedfaq6_005.html However, the page states that:

"no limit" means there is no "fixed" limit. Limits are actually determined by one's hardware, memory, operating system, and which C library is used to compile sed.

I tried running sed on a single 7GB file and could reproduce the same issue. This page https://community.hpe.com/t5/Languages-and-Scripting/Sed-Maximum-Line-Length/td-p/5136721 suggests using perl instead:

perl -pe 's/start=//g;s/stop=//g;s/<unk>/ /g' file > output
Ramast answered Oct 05 '22 04:10