Removing punctuation from strings effectively

Tags:

string

julia

I read in a text file and want to remove all punctuation from it. My first idea was:

data = readlines("text.txt")
data = lowercase.(data)
data = replace.(data, [','], [""])
data = replace.(data, ['.'], [""])
data = replace.(data, ['?'], [""])
data = replace.(data, [';'], [""])
data = replace.(data, ['!'], [""])
data = replace.(data, [':'], [""])
data = replace.(data, ['('], [""])
data = replace.(data, [')'], [""])

This gets annoying quite fast. I did not find a way to combine them all into one statement. With replace.(data, [".", ";"], ["", ""]) I get a DimensionMismatch.

Any ideas?

Hamlet asked Dec 14 '22 18:12


1 Answer

When broadcasting, a collection (such as an array or a tuple) is iterated over element by element. If you do not want that, wrap it in a one-element array (the example removes only two characters, , and ;, but the list can be longer):

julia> data = ["a,b;c","x,y;z"]
2-element Array{String,1}:
 "a,b;c"
 "x,y;z"

julia> replace.(data, [[',',';']], "")
2-element Array{String,1}:
 "abc"
 "xyz"

The key part is [[',',';']], which wraps the array of characters to remove in a one-element array, so broadcasting passes the whole inner array to every call of replace.
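The wrapping rule is not specific to replace. Here is a minimal sketch of the same idea using in, assuming Julia 1.0 or later; the names chars and punct are made up for illustration, and Ref is an alternative wrapper that the answer itself does not use:

```julia
# Hypothetical data: three characters to test, two punctuation marks
chars = ['a', ',', 'b']
punct = [',', ';']

# Unwrapped: broadcasting tries to pair chars[i] with punct[i];
# the lengths 3 and 2 do not match, so this throws
err = try
    in.(chars, punct)
catch e
    e
end

# Wrapped in a one-element array: every character is tested against all of punct
wrapped = in.(chars, [punct])

# Ref(punct) has the same effect without allocating an outer array
with_ref = in.(chars, Ref(punct))
```

Here err should be a DimensionMismatch, the same kind of error the question ran into, while wrapped and with_ref both hold one Bool per character.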

Another approach would be to use a regular expression:

julia> replace.(data, r"[,;]", "")
2-element Array{String,1}:
 "abc"
 "xyz"

Here the pattern r"[,;]" does not need to be wrapped, because a regular expression is treated as a single value when broadcasting.

If you care about performance, the first version with [[',',';']] is a bit faster, but a regular expression is more flexible, as it lets you match more complex patterns.

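EDIT

In Julia 1.0 and later, replace takes a pattern => replacement pair, and the pair does not need to be wrapped for broadcasting. Now it would be: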

julia> replace.(data, [',',';'] => "")
2-element Array{String,1}:
 "abc"
 "xyz"

or

julia> replace.(data, r"[,;]" => "")
2-element Array{String,1}:
 "abc"
 "xyz"
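
Applied to the question, the whole cleanup can be done in one pass per line. This is a sketch assuming Julia 1.0 or later; the sample lines stand in for readlines("text.txt"), and the names punctuation and clean are made up for illustration:

```julia
# Hypothetical sample lines standing in for readlines("text.txt")
data = ["Hello, World!", "(Is it; so?)"]

# The characters the question removes one call at a time
punctuation = [',', '.', '?', ';', '!', ':', '(', ')']

# Lowercase a line and strip the listed characters in a single replace call
clean(line) = replace(lowercase(line), punctuation => "")
cleaned = clean.(data)

# Alternatives that cover every Unicode punctuation character
# instead of a hand-written list
cleaned_regex = replace.(lowercase.(data), r"\p{P}" => "")
cleaned_pred = filter.(!ispunct, lowercase.(data))
```

For these sample lines all three variants give "hello world" and "is it so". The last two rely on the Unicode punctuation category, so characters classed as symbols, such as $ or +, are left in place.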

Bogumił Kamiński answered Dec 24 '22 16:12