Ruby: Extracting Words From String

I'm trying to parse the words out of a string and put them into an array. I've tried the following:

   @string1 = "oriented design, decomposition, encapsulation, and testing. Uses "
   puts @string1.scan(/\s([^\,\.\s]*)/)

It seems to do the trick, but it's a bit shaky (for example, I would need to list more special characters). Is there a better way to do this in Ruby?

Optional: I have a CS course description. I intend to extract all the words from it into an array of strings, remove the most common words in the English language from that array, and then use the remaining words as tags that users can search on to find CS courses.

sybohy asked Oct 01 '11 19:10


2 Answers

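Use the split method:

   words = @string1.split(/\W+/) 

This splits the string into an array, using a regular expression as the delimiter. \W matches any "non-word" character, and the "+" makes a run of one or more consecutive delimiters count as a single separator. Note that \W is ASCII-only in Ruby (it is the complement of [a-zA-Z0-9_]), so accented letters are treated as delimiters too.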
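As a rough illustration of the behaviour described above, here is a minimal sketch using the sample text from the question. The variable names are only illustrative. It also shows one edge case worth knowing: when the string starts with a delimiter, split returns an empty first element, while scan(/\w+/) collects the word runs directly.

```ruby
# Sample text from the question (note the trailing space).
text = "oriented design, decomposition, encapsulation, and testing. Uses "

# split treats each run of non-word characters as one delimiter;
# trailing empty strings are dropped by default.
words = text.split(/\W+/)
# words => ["oriented", "design", "decomposition", "encapsulation", "and", "testing", "Uses"]

# A leading delimiter yields an empty first element with split...
leading = " hello world".split(/\W+/)   # => ["", "hello", "world"]

# ...whereas scan returns only the matched word runs.
scanned = " hello world".scan(/\w+/)    # => ["hello", "world"]
```

If the input may begin with whitespace or punctuation, scan(/\w+/) is probably the simpler choice of the two.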

David Nehme answered Sep 20 '22 15:09


For me, the best way to split a sentence into words is:

line.split(/[^[[:word:]]]+/) 

It works even with multilingual words and punctuation marks:

line = 'English words, Polski Żurek!!! crème fraîche...'
line.split(/[^[[:word:]]]+/)
=> ["English", "words", "Polski", "Żurek", "crème", "fraîche"]
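
For the tagging use case described in the question, the same character class can be combined with a stop-word filter. The following is only a sketch under stated assumptions: STOP_WORDS is a tiny hypothetical list (a real one would be far longer), extract_tags is an illustrative helper name, and lowercasing and de-duplicating the tags is a design choice the question does not specify.

```ruby
# Hypothetical stop-word list for illustration; a real list would be much longer.
STOP_WORDS = %w[a an and the of to in uses].freeze

# Returns unique, lowercased tags from a description, skipping stop words.
# (extract_tags is an illustrative name, not part of any library.)
def extract_tags(description, stop_words = STOP_WORDS)
  description
    .scan(/[[:word:]]+/)                    # runs of Unicode word characters
    .map(&:downcase)                        # normalise case so "Uses" matches "uses"
    .reject { |w| stop_words.include?(w) }  # drop common words
    .uniq                                   # keep each tag once
end

tags = extract_tags("oriented design, decomposition, encapsulation, and testing. Uses ")
# tags => ["oriented", "design", "decomposition", "encapsulation", "testing"]
```

Using scan here instead of split avoids the empty leading element that split produces when the text starts with punctuation or whitespace.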
lazzy.developer answered Sep 20 '22 15:09