I have a trim function that I sometimes use in awk, but it's kind of slow for big inputs:
#!/bin/bash
time {
yes $'\t Lorem ipsum dolor sit amet consectetur adipiscing elit. Duis dapibus rutrum facilisis. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Etiam tristique libero eu nibh porttitor amet fermentum.\t \r' |
head -n 1000000 |
awk '
    { trim($0) }
    function trim(string) {
        gsub(/^[ \t\r]+|[ \t\r]+$/, "", string)
        return string
    }
'
}
real 0m9.074s
user 0m9.179s
sys 0m0.381s
How can I speed it up?
On my machine (Windows laptop running git bash with gawk 5.0.0), doing 2 separate sub()s seems to be very slightly faster than one gsub():
$ cat gsub.sh
#!/bin/bash
time {
yes $'\t Lorem ipsum dolor sit amet consectetur adipiscing elit. Duis dapibus rutrum facilisis. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Etiam tristique libero eu nibh porttitor amet fermentum.\t \r' |
head -n 1000000 |
awk '
    { trim($0) }
    function trim(string) {
        gsub(/^[ \t\r]+|[ \t\r]+$/, "", string)
        return string
    }
'
}
$ cat subs.sh
#!/bin/bash
time {
yes $'\t Lorem ipsum dolor sit amet consectetur adipiscing elit. Duis dapibus rutrum facilisis. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Etiam tristique libero eu nibh porttitor amet fermentum.\t \r' |
head -n 1000000 |
awk '
    { trim($0) }
    function trim(string) {
        sub(/^[ \t\r]+/, "", string)
        sub(/[ \t\r]+$/, "", string)
        return string
    }
'
}
Timing over 3 runs to remove caching impact:
./gsub.sh            ./subs.sh

real 0m2.288s        real 0m2.213s
user 0m2.325s        user 0m2.419s
sys  0m3.170s        sys  0m3.075s

real 0m2.269s        real 0m2.219s
user 0m2.371s        user 0m2.420s
sys  0m3.310s        sys  0m3.139s

real 0m2.275s        real 0m2.250s
user 0m2.434s        user 0m2.434s
sys  0m3.216s        sys  0m3.199s
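Note that the benchmark scripts only time the function and throw away its return value (awk passes scalars by value, so $0 itself is not modified). As a rough sketch of how the two-sub() version might be used when you actually want the trimmed text, here it is on a short made-up input line; the sample string and the square brackets around the output are just assumptions for illustration, not part of the benchmark:

```shell
#!/bin/bash
# Hypothetical sample input: leading tab + space, trailing tab, space and CR.
result=$(printf '\t Lorem ipsum \t \r\n' |
awk '
    # Print the trimmed record, bracketed so stray whitespace would be visible.
    { print "[" trim($0) "]" }
    function trim(string) {
        sub(/^[ \t\r]+/, "", string)   # strip leading blanks, tabs, CRs
        sub(/[ \t\r]+$/, "", string)   # strip trailing blanks, tabs, CRs
        return string
    }
')
printf '%s\n' "$result"
```

Whitespace inside the line is left alone; only the two ends are touched, because each regex is anchored to one end of the string.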