Sed performance optimization

Tags: bash, io, sed

I notice that when I use sed with the -i argument, it achieves MUCH lower disk read/write throughput than when I redirect sed's output into a completely new file, so the latter is MUCH faster (at least in my experience). Why is this?

Here are the specific commands I was using -

     sed -i '/\r/ s///g' file.txt             # slower one
     sed '/\r/ s///g' file.txt > file2.txt    # much faster one

Furthermore, I notice that when I use sed on a file that's, say, ~35MB in size, it processes it in about 0.3 seconds (when I redirect instead of using the -i arg). However, when I process a file that's about 7 times as large, the operation takes around 20 seconds (once again, using redirection instead of the -i arg). Why is this? Does this mean that sed works much faster on a bunch of smaller files than on one huge file? If I have a file that's ~25GB in size, would it be in my best interest to split it up before processing it with sed?

John Doe asked Nov 21 '25 13:11


1 Answer

I tested this on Linux with GNU sed 4.4, which should be similar-ish to your Cygwin. Running strace -o dump sed ... shows what's going on in each case:

With redirection, buffered output results in 2,498 reads/writes for a 5MB file:

openat(AT_FDCWD, "file.txt", O_RDONLY)  = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=5213926, ...}) = 0
read(3, "The Project Gutenberg EBook of T"..., 4096) = 4096
fstat(1, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
read(3, "\nBook 01        Genesis\r\n\r\n01:00"..., 4096) = 4096
write(1, "The Project Gutenberg EBook of T"..., 4096) = 4096
read(3, "wn image, in the image of God\r\n "..., 4096) = 4096
write(1, "002 And the earth was without fo"..., 4096) = 4096
read(3, "cattle, and to the fowl of the a"..., 4096) = 4096
write(1, "replenish the earth, and subdue "..., 4096) = 4096

With -i, unbuffered output (one write per line) results in 115,805 reads/writes for the same file:

openat(AT_FDCWD, "file.txt", O_RDONLY)  = 3
openat(AT_FDCWD, "./sed6RccPF", O_RDWR|O_CREAT|O_EXCL, 0600) = 4
read(3, "The Project Gutenberg EBook of T"..., 4096) = 4096
write(4, "The Project Gutenberg EBook of T"..., 61) = 61
write(4, "of the King James Bible\n", 24) = 24  
write(4, "\n", 1)                       = 1
write(4, "Copyright laws are changing all "..., 69) = 69
write(4, "copyright laws for your country "..., 69) = 69
write(4, "this or any other Project Gutenb"..., 43) = 43 
write(4, "\n", 1)                       = 1                

The latest git commit behaves the same way.
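
If you want to reproduce the comparison, the read/write counts can be tallied from the strace dumps. The following is only a sketch: count_rw is a helper name I made up, and it assumes the dump was written by strace -o without -f, so that each line starts with the syscall name as in the traces above.

```shell
# count_rw DUMP: print the number of read() and write() syscall lines
# in an strace dump (hypothetical helper, not part of sed or strace).
count_rw() {
    grep -cE '^(read|write)\(' "$1"
}

# Possible usage (requires strace; the dump file names are placeholders):
#   strace -o dump.redirect sed '/\r/ s///g' file.txt > file2.txt
#   strace -o dump.inplace  sed -i '/\r/ s///g' file.txt
#   count_rw dump.redirect
#   count_rw dump.inplace
```

The exact numbers will depend on the file and the sed build, so treat them as a relative comparison rather than a benchmark.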

Until sed buffers its -i output, you'll probably want to use redirection (or, better yet, a more suitable tool such as tr for this job).
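
For the carriage-return case in the question, a tr-based version might look like the sketch below. The function names and the ".tmp" suffix are my own placeholders; tr has no in-place mode, so the second helper writes to a temporary file and renames it over the original.

```shell
# strip_cr IN OUT: delete every carriage return from IN, writing to OUT.
strip_cr() {
    tr -d '\r' < "$1" > "$2"
}

# strip_cr_inplace FILE: same, but replace FILE with the result.
# The rename only happens if tr succeeded; ".tmp" is an arbitrary suffix.
strip_cr_inplace() {
    tr -d '\r' < "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}
```

Note that the renamed file is a new file, so ownership and permissions may differ from the original.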

sed processes data at the same speed regardless of file size; any difference you see there is more likely due to caching, either by the OS or by the drive.

that other guy answered Nov 24 '25 05:11