I need to process volumes of text and one of the steps is to remove all non-alphanumeric characters. I'm trying to find an efficient way to do it.
So far I have two functions:
func stripMap(str, chr string) string {
    return strings.Map(func(r rune) rune {
        if strings.IndexRune(chr, r) < 0 {
            return r
        }
        return -1
    }, str)
}
Here I actually have to feed it a string of all the non-alphanumeric characters I want removed.
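For illustration, a call looks something like this (the character set here is just a small example of what actually has to be passed in):
// Hypothetical example call; the second argument must list every character to remove.
cleaned := stripMap("Hello, World! 42.", ",.!?;:'\"()")
// cleaned == "Hello World 42"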
And the plain old regex approach:
func stripRegex(in string) string {
    reg, _ := regexp.Compile("[^a-zA-Z0-9 ]+")
    return reg.ReplaceAllString(in, "")
}
The regex one turns out to be much slower:
BenchmarkStripMap-8 30000 37907 ns/op 8192 B/op 2 allocs/op
BenchmarkStripRegex-8 10000 131449 ns/op 57552 B/op 35 allocs/op
Looking for suggestions. Is there a better way to do this, or a way to improve the above?
Because the surviving runes are all less than utf8.RuneSelf, this problem can be solved by operating on bytes. If a byte is not in the set [a-zA-Z0-9 ], then it is part of a rune to be removed.
func strip(s string) string {
    var result strings.Builder
    for i := 0; i < len(s); i++ {
        b := s[i]
        if ('a' <= b && b <= 'z') ||
            ('A' <= b && b <= 'Z') ||
            ('0' <= b && b <= '9') ||
            b == ' ' {
            result.WriteByte(b)
        }
    }
    return result.String()
}
A variation on this function is to preallocate the result by calling result.Grow:
func strip(s string) string {
    var result strings.Builder
    result.Grow(len(s))
    ...
This ensures that the function makes one memory allocation, but that memory allocation may be significantly larger than needed if the ratio of surviving runes to source runes is low.
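Written out in full, this variation would look like the following (only the Grow call differs from the strip function above):
func strip(s string) string {
    var result strings.Builder
    result.Grow(len(s)) // preallocate: at most len(s) bytes can survive
    for i := 0; i < len(s); i++ {
        b := s[i]
        if ('a' <= b && b <= 'z') ||
            ('A' <= b && b <= 'Z') ||
            ('0' <= b && b <= '9') ||
            b == ' ' {
            result.WriteByte(b)
        }
    }
    return result.String()
}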
The strip function in this answer is written with string argument and result types because those are the types used in the question. If the application is working with a []byte source text and that source text can be modified, then it is more efficient to update the []byte in place. To do this, copy the surviving bytes to the beginning of the slice and reslice when done. This avoids the memory allocations and overhead of strings.Builder. This variation is similar to the one in peterSO's answer to this question.
func strip(s []byte) []byte {
    n := 0
    for _, b := range s {
        if ('a' <= b && b <= 'z') ||
            ('A' <= b && b <= 'Z') ||
            ('0' <= b && b <= '9') ||
            b == ' ' {
            s[n] = b
            n++
        }
    }
    return s[:n]
}
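Note that this version writes into the slice it is given, and the returned slice shares the same backing array, so the caller should treat the original contents as overwritten. For example:
data := []byte("Hello, World! 42.")
data = strip(data) // data now reuses the same backing array
// string(data) == "Hello World 42"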
Depending on actual data used, one of the approaches in this answer may be faster than the approaches in the question.
Efficient way to remove all non-alphanumeric characters from large text.
In Go, "efficient way" means that we run Go testing package benchmarks. Your description of large text is vague. Let's assume that it began as text from a file or other byte slice.
A naive approach could have the overhead of a string([]byte) conversion, several make([]byte) allocations, and another string([]byte) conversion. You can use strings.Builder to reduce that to one string([]byte) conversion and several make([]byte) allocations. You can reduce it even further, to a single string([]byte) conversion, by starting with a clean([]byte) string function.
For example,
func clean(s []byte) string {
    j := 0
    for _, b := range s {
        if ('a' <= b && b <= 'z') ||
            ('A' <= b && b <= 'Z') ||
            ('0' <= b && b <= '9') ||
            b == ' ' {
            s[j] = b
            j++
        }
    }
    return string(s[:j])
}
For a large text, the complete works of Shakespeare as a []byte:
$ go fmt && go test strip_test.go -bench=. -benchmem
BenchmarkSendeckyMap-8 20 65988121 ns/op 11730958 B/op 2 allocs/op
BenchmarkSendeckyRegex-8 5 242834302 ns/op 40013144 B/op 130 allocs/op
BenchmarkThunder-8 100 21791532 ns/op 34682926 B/op 43 allocs/op
BenchmarkPeterSO-8 100 16172591 ns/op 5283840 B/op 1 allocs/op
$
strip_test.go:
package main

import (
    "io/ioutil"
    "regexp"
    "strings"
    "testing"
)

// Note: in this benchmark the map keeps characters found in chr (a keep-list).
func stripMap(str, chr string) string {
    return strings.Map(func(r rune) rune {
        if strings.IndexRune(chr, r) >= 0 {
            return r
        }
        return -1
    }, str)
}

var alphanum = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 "

func BenchmarkSendeckyMap(b *testing.B) {
    for N := 0; N < b.N; N++ {
        b.StopTimer()
        bytShakespeare := []byte(strShakespeare)
        b.StartTimer()
        // The string([]byte) conversion is timed along with the strip.
        strShakespeare = string(bytShakespeare)
        stripMap(strShakespeare, alphanum)
    }
}

func stripRegex(in string) string {
    reg, _ := regexp.Compile("[^a-zA-Z0-9 ]+")
    return reg.ReplaceAllString(in, "")
}

func BenchmarkSendeckyRegex(b *testing.B) {
    for N := 0; N < b.N; N++ {
        b.StopTimer()
        bytShakespeare := []byte(strShakespeare)
        b.StartTimer()
        strShakespeare = string(bytShakespeare)
        stripRegex(strShakespeare)
    }
}

func strip(s string) string {
    var result strings.Builder
    for i := 0; i < len(s); i++ {
        b := s[i]
        if ('a' <= b && b <= 'z') ||
            ('A' <= b && b <= 'Z') ||
            ('0' <= b && b <= '9') ||
            b == ' ' {
            result.WriteByte(b)
        }
    }
    return result.String()
}

func BenchmarkThunder(b *testing.B) {
    for N := 0; N < b.N; N++ {
        b.StopTimer()
        bytShakespeare := []byte(strShakespeare)
        b.StartTimer()
        strShakespeare = string(bytShakespeare)
        strip(strShakespeare)
    }
}

func clean(s []byte) string {
    j := 0
    for _, b := range s {
        if ('a' <= b && b <= 'z') ||
            ('A' <= b && b <= 'Z') ||
            ('0' <= b && b <= '9') ||
            b == ' ' {
            s[j] = b
            j++
        }
    }
    return string(s[:j])
}

func BenchmarkPeterSO(b *testing.B) {
    for N := 0; N < b.N; N++ {
        b.StopTimer()
        bytShakespeare := []byte(strShakespeare)
        b.StartTimer()
        // clean works directly on the []byte; only the final string
        // conversion inside clean allocates.
        clean(bytShakespeare)
    }
}

var strShakespeare = func() string {
    // The Complete Works of William Shakespeare by William Shakespeare
    // http://www.gutenberg.org/files/100/100-0.txt
    data, err := ioutil.ReadFile(`/home/peter/shakespeare.100-0.txt`)
    if err != nil {
        panic(err)
    }
    return string(data)
}()