
How can I remove non word characters from a text?

Tags: regex, ruby

I want 'This Is A 101 Test' to be 'This Is A Test', but I can't get the syntax right.

src = 'This Is A 101 Test'
puts "A) " + src                       # base => "This Is A 101 Test"
puts "B) " + src[/([a-z]+)/]           # only does first word => "his"
puts "C) " + src.gsub!(/\D/, "")       # Does digits, I want alphabetic => "101"
puts "D) " + src.gsub!(/\W///g)        # Nothing. => ""
puts "E) " + src.gsub(/(\W|\d)/, "")   # Nothing. => ""
Michael Durrant asked Feb 02 '12 15:02





2 Answers

First off, you need to be careful with gsub and gsub!. The latter is a "dangerous" (bang) method: it modifies src in place. If you're executing these statements in order, be aware that a.gsub!(/a/, "b") and a = a.gsub(/a/, "b") both leave a changed, although gsub! itself returns nil when nothing matches. Part of the issue with your code is that src is being modified between steps.
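As a minimal sketch of that difference (the helper name gsub_demo and the sample string are made up for this illustration, they are not from the question):

```ruby
# Hypothetical helper, named only for this illustration.
def gsub_demo
  a = "banana".dup               # dup so the string is safely mutable
  copy = a.gsub(/a/, "o")        # non-destructive: returns a new string
  before = a.dup                 # a is still "banana" at this point
  bang_result = a.gsub!(/a/, "o") # destructive: a itself is changed
  no_match = a.gsub!(/x/, "")    # nothing matched: returns nil, a untouched
  { copy: copy, before: before, after: a,
    bang_result: bang_result, no_match: no_match }
end

p gsub_demo
```

The nil from the last call is why chaining another method directly onto gsub! is risky.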

The B method returns "his" but makes no changes to src:

src[/([a-z]+)/]     # => "his"
src                 # => "This Is A 101 Test"

The C method removes all characters that aren't numbers:

src.gsub!(/\D/, "") # => "101"
src                 # => "101"

The D method doesn't work because the syntax is wrong. Ruby regex literals have no trailing /g flag (gsub already replaces every match), and gsub takes a pattern (a regular expression or string) followed by a replacement string. If you try it in IRB, it will act as though you need another / somewhere.
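For reference, a syntactically valid form of D might look like the sketch below. The helper name strip_non_word is hypothetical, and the non-destructive gsub is used so the input is left alone:

```ruby
# Hypothetical helper name, used only for this illustration.
def strip_non_word(text)
  # gsub replaces every match on its own, so no /g flag is needed.
  text.gsub(/\W/, "")
end

puts strip_non_word('This Is A 101 Test')  # prints ThisIsA101Test
```

Note that \W matches spaces as well, so the words run together, and the digits stay because they count as word characters.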

The E method removes all non-word characters and all digits:

src.gsub(/(\W|\d)/, "") # => "ThisIsATest" (spaces are non-word characters, so they go too)
src                     # => "This Is A 101 Test"

You point out that it's returning "". What's actually happening is that C and D as listed (with the syntax issue fixed) are destructive changes, so after C runs, src is "101". (Also, if run on "101", D will actually return nil, as no substitutions are performed.) So E is just being run on "101", and since you're replacing all non-word characters and all digits with "", the result is "".
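Replaying C, D (with its syntax fixed) and E in order on a single string shows the effect. This is only a sketch: the method name replay_attempts and the hash keys are invented for the illustration.

```ruby
# Hypothetical helper that runs attempts C, D and E in sequence.
def replay_attempts
  src = 'This Is A 101 Test'.dup
  results = {}
  results[:c] = src.gsub!(/\D/, "").dup    # src is now "101"
  results[:d] = src.gsub!(/\W/, "")        # no match on "101", so nil
  results[:e] = src.gsub(/(\W|\d)/, "")    # every digit removed: ""
  results[:src] = src                      # still "101" after E
  results
end

p replay_attempts
```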


The answer you're looking for would be something like:

src.gsub!(/\d\s?/, "") # => "This Is A Test"
src                    # => "This Is A Test"

And my favorite for dealing with all scenarios of double spaces (squeeze collapses runs of the same character, strip removes leading and trailing whitespace, and the non-bang versions are used because the ! versions return nil if they make no changes):

src = src.gsub(/\d+/, "").squeeze(" ").strip
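Wrapped in a small method, the same chain can be tried on a few inputs. The name remove_digits and the extra sample strings are assumptions for this sketch:

```ruby
# Hypothetical wrapper around the chain above.
def remove_digits(text)
  text.gsub(/\d+/, "")   # drop each run of digits
      .squeeze(" ")      # collapse repeated spaces into one
      .strip             # trim leading and trailing whitespace
end

puts remove_digits('This Is A 101 Test')  # prints This Is A Test
puts remove_digits('101 Test')            # prints Test
puts remove_digits('no digits here')      # prints no digits here
```

One side effect worth knowing: squeeze(" ") also collapses double spaces that were already in the input, not just the ones left behind by the removed digits.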
brymck answered Sep 19 '22 14:09



To remove all "non-word characters", you can instead keep only the characters you want (here, ASCII letters and spaces) and then collapse the leftover runs of spaces.

src = 'This Is A 101 Test'
src.gsub(/[^a-zA-Z ]/,'').gsub(/ +/,' ')
=> "This Is A Test"
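The same keep-only approach also drops punctuation, which can leave a space at the end of the string. A possible variant, assuming a trailing strip is acceptable (the method name keep_letters and the second sample string are not from the original answer):

```ruby
# Hypothetical helper: keep ASCII letters and spaces, then tidy the spacing.
def keep_letters(text)
  text.gsub(/[^a-zA-Z ]/, '')  # remove everything except letters and spaces
      .gsub(/ +/, ' ')         # collapse runs of spaces
      .strip                   # assumption: edge whitespace is unwanted
end

puts keep_letters('This Is A 101 Test')  # prints This Is A Test
puts keep_letters('Hello, World! 42')    # prints Hello World
```

Without the strip, the second input would come back as "Hello World " with a trailing space.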

I recommend Rubular for trying out Ruby regular expressions.

Jonas Elfström answered Sep 20 '22 14:09
