How to make this sed script faster?

Tags: performance, linux, unix, sed

I have inherited this sed script snippet that attempts to remove certain whitespace:

s/[\s\t]*|/|/g
s/|[\s\t]*/|/g
s/[\s] *$//g
s/^|/null|/g

It operates on a file that is around 1 GB in size. The script runs for 2 hours on our Unix server. Any ideas how to speed it up?

Note that \s stands for a space and \t stands for a tab; the actual script uses literal space and tab characters, not those symbols.

The input file is pipe-delimited and is stored locally, not on the network. The 4 lines are in a file executed with sed -f.

asked Dec 01 '09 19:12

erotsppa

1 Answer

The best I was able to do with sed, was this script:

s/[\s\t]*|[\s\t]*/|/g
s/[\s\t]*$//
s/^|/null|/

In my tests, this ran about 30% faster than your sed script. The increase in performance comes from combining the first two regexen and omitting the "g" flag where it's not needed.
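To make the combined script concrete, here is a minimal sketch of it wrapped in a shell function, with the \s and \t placeholders written out as a literal space and tab. The function name clean_pipes and the sample line are made up for illustration; only the three sed commands come from the script above.

```shell
#!/bin/sh
# Sketch: the three-command script from above, with the \s and \t
# placeholders replaced by a literal space and a literal tab.
# clean_pipes is a hypothetical helper name, not part of the original post.

tab=$(printf '\t')   # a literal tab, so we do not rely on sed understanding \t
ws="[ ${tab}]"       # bracket expression matching one space or one tab

clean_pipes() {
    # 1) strip whitespace on both sides of every pipe (needs the g flag)
    # 2) strip trailing whitespace (at most one match per line, so no g)
    # 3) put "null" in front of a leading pipe (at most one match, so no g)
    sed -e "s/${ws}*|${ws}*/|/g" \
        -e "s/${ws}*\$//" \
        -e 's/^|/null|/'
}

# Example: prints null|foo|bar|baz
printf ' | foo \t|  bar|baz  \n' | clean_pipes
```

For a real file, something like clean_pipes < input.txt > output.txt should behave the same as putting the three commands in a file and running it with sed -f.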

However, 30% faster is only a mild improvement (it should still take about an hour and a half to run the above script on your 1GB data file). I wanted to see if I could do any better.

In the end, no other method I tried (awk, perl, and other approaches with sed) fared any better, except -- of course -- a plain ol' C implementation. As would be expected with C, the code is a bit verbose for posting here, but if you want a program that's likely going to be faster than any other method out there, you may want to take a look at it.

In my tests, the C implementation finishes in about 20% of the time it takes for your sed script. So it might take about 25 minutes or so to run on your Unix server.
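If you want to check ratios like these on your own machine before committing to a full run, a rough timing helper along the following lines may be enough. This is only a sketch: the name elapsed_seconds is hypothetical, it assumes your date command supports +%s (GNU and BSD date do), and it only has one-second resolution, so use a sample large enough to run for a while.

```shell
#!/bin/sh
# Sketch of a timing helper; elapsed_seconds is a hypothetical name.
# Assumes `date +%s` (seconds since the epoch) is available, which holds for
# GNU and BSD date but is not guaranteed on every Unix system.

# elapsed_seconds INPUT COMMAND [ARGS...]
# Runs COMMAND with INPUT on stdin, discards its output, and prints the
# wall-clock time it took, in whole seconds.
elapsed_seconds() {
    input=$1
    shift
    start=$(date +%s)
    "$@" < "$input" > /dev/null
    end=$(date +%s)
    echo $((end - start))
}
```

For example, elapsed_seconds sample.txt sed -f old.sed followed by elapsed_seconds sample.txt sed -f new.sed gives two numbers to compare, where sample.txt, old.sed and new.sed are placeholder names for a slice of your data and the two versions of the script.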

I didn't spend much time optimizing the C implementation. No doubt there are a number of places where the algorithm could be improved, but frankly, I don't know if it's possible to shave a significant amount of time beyond what it already achieves. If anything, I think it certainly places an upper limit on what kind of performance you can expect from other methods (sed, awk, perl, python, etc).

Edit: The original version had a minor bug that could cause it to print the wrong thing at the end of the output (e.g. it could print a "null" that shouldn't be there). I had some time today to take a look at it and fixed that. I also optimized away a call to strlen(), which gave it another slight performance boost.

answered Oct 16 '22 09:10

Dan Moulding
