Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient way to remove all non-alphanumeric characters from large text

Tags:

go

I need to process volumes of text and one of the steps is to remove all non-alphanumeric characters. I'm trying to find an efficient way to do it.

So far I have two functions:

func stripMap(str, chr string) string {
    return strings.Map(func(r rune) rune {
        if strings.IndexRune(chr, r) < 0 {
            return r
        }
        return -1
    }, str)
}

Here I actually have to feed a string of all non-alpha characters.

And plain old regex

func stripRegex(in string) string {
    reg, _ := regexp.Compile("[^a-zA-Z0-9 ]+")
    return reg.ReplaceAllString(in, "")
}

The regex one seems to be much slower

BenchmarkStripMap-8        30000         37907 ns/op        8192 B/op          2 allocs/op

BenchmarkStripRegex-8          10000        131449 ns/op       57552 B/op         35 allocs/op

Looking for suggestions. Any other better way to do it? Improve the above?

like image 397
r.sendecky Avatar asked Jan 31 '19 13:01

r.sendecky


2 Answers

Because the surviving runes are less than utf8.RuneSelf, this problem can be solved by operating on bytes. If any byte is not in [^a-zA-Z0-9 ], then the byte is part of a rune to be removed.

func strip(s string) string {
    var result strings.Builder
    for i := 0; i < len(s); i++ {
        b := s[i]
        if ('a' <= b && b <= 'z') ||
            ('A' <= b && b <= 'Z') ||
            ('0' <= b && b <= '9') ||
            b == ' ' {
            result.WriteByte(b)
        }
    }
    return result.String()
}

A variation on this function is to preallocate the result by calling result.Grow:

func strip(s string) string {
    var result strings.Builder
    result.Grow(len(s))
    ...

This ensures that the function makes one memory allocation, but that memory allocation may be significantly larger than needed if the ratio of surviving runes to source runes is low.

The strip function in this answer is written to work with string argument and result types because those are the types used in the question.

If the application is working a []byte source text and that source text can be modified, then it will be more efficient to update the []byte in place. To do this, copy the surviving bytes to the beginning of the slice and reslice when done. This avoids memory allocations and overhead in strings.Builder. This variation is similar to one in peterSO's answer to this question.

func strip(s []byte) []byte {
    n := 0
    for _, b := range s {
        if ('a' <= b && b <= 'z') ||
            ('A' <= b && b <= 'Z') ||
            ('0' <= b && b <= '9') ||
            b == ' ' {
            s[n] = b
            n++
        }
    }
    return s[:n]
}

Depending on actual data used, one of the approaches in this answer may be faster than the approaches in the question.

like image 99
Bayta Darell Avatar answered Oct 19 '22 20:10

Bayta Darell


Efficient way to remove all non-alphanumeric characters from large text.


In Go, "Efficient way" means that we run Go testing package benchmarks.

Your descripion of large text is vague. Let's assume that it began as text from a file or other byte slice.

You could have overhead for string([]byte), several make([]byte), and string([]byte).

You can use strings.Builder to reduce the overhead to string([]byte) and several make([]byte).

You can reduce that even further to string([]byte) by starting with a clean([]byte) string function.

For example,

func clean(s []byte) string {
    j := 0
    for _, b := range s {
        if ('a' <= b && b <= 'z') ||
            ('A' <= b && b <= 'Z') ||
            ('0' <= b && b <= '9') ||
            b == ' ' {
            s[j] = b
            j++
        }
    }
    return string(s[:j])
}

For a large text, the complete works of Shakespeare as a []byte,

$ go fmt && go test strip_test.go -bench=. -benchmem
BenchmarkSendeckyMap-8       20     65988121 ns/op    11730958 B/op      2 allocs/op
BenchmarkSendeckyRegex-8      5    242834302 ns/op    40013144 B/op    130 allocs/op
BenchmarkThunder-8          100     21791532 ns/op    34682926 B/op     43 allocs/op
BenchmarkPeterSO-8          100     16172591 ns/op     5283840 B/op      1 allocs/op
$

strip_test.go:

package main

import (
    "io/ioutil"
    "regexp"
    "strings"
    "testing"
)

func stripMap(str, chr string) string {
    return strings.Map(func(r rune) rune {
        if strings.IndexRune(chr, r) >= 0 {
            return r
        }
        return -1
    }, str)
}

var alphanum = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 "

func BenchmarkSendeckyMap(b *testing.B) {
    for N := 0; N < b.N; N++ {
        b.StopTimer()
        bytShakespeare := []byte(strShakespeare)
        b.StartTimer()
        strShakespeare = string(bytShakespeare)
        stripMap(strShakespeare, alphanum)
    }
}

func stripRegex(in string) string {
    reg, _ := regexp.Compile("[^a-zA-Z0-9 ]+")
    return reg.ReplaceAllString(in, "")
}

func BenchmarkSendeckyRegex(b *testing.B) {
    for N := 0; N < b.N; N++ {
        b.StopTimer()
        bytShakespeare := []byte(strShakespeare)
        b.StartTimer()
        strShakespeare = string(bytShakespeare)
        stripRegex(strShakespeare)
    }
}

func strip(s string) string {
    var result strings.Builder
    for i := 0; i < len(s); i++ {
        b := s[i]
        if ('a' <= b && b <= 'z') ||
            ('A' <= b && b <= 'Z') ||
            ('0' <= b && b <= '9') ||
            b == ' ' {
            result.WriteByte(b)
        }
    }
    return result.String()
}

func BenchmarkThunder(b *testing.B) {
    for N := 0; N < b.N; N++ {
        b.StopTimer()
        bytShakespeare := []byte(strShakespeare)
        b.StartTimer()
        strShakespeare = string(bytShakespeare)
        strip(strShakespeare)
    }
}

func clean(s []byte) string {
    j := 0
    for _, b := range s {
        if ('a' <= b && b <= 'z') ||
            ('A' <= b && b <= 'Z') ||
            ('0' <= b && b <= '9') ||
            b == ' ' {
            s[j] = b
            j++
        }
    }
    return string(s[:j])
}

func BenchmarkPeterSO(b *testing.B) {
    for N := 0; N < b.N; N++ {
        b.StopTimer()
        bytShakespeare := []byte(strShakespeare)
        b.StartTimer()
        clean(bytShakespeare)
    }
}

var strShakespeare = func() string {
    // The Complete Works of William Shakespeare by William Shakespeare
    // http://www.gutenberg.org/files/100/100-0.txt
    data, err := ioutil.ReadFile(`/home/peter/shakespeare.100-0.txt`)
    if err != nil {
        panic(err)
    }
    return string(data)
}()
like image 6
peterSO Avatar answered Oct 19 '22 21:10

peterSO