
Ruby extract substring from an array of strings

Tags: substring, ruby

I have an array of strings.

irb(main):009:0* str_arr
=> ["hello how are you?", "I am fine.What are you doing?", "Hey, I am having a haircut. See you at Hotel KingsMen at 10 am."]

I am trying to extract some information from it: the name of the hotel and the time.

irb(main):010:0> q = str_arr[2].scan(/(.*)Hotel(.*)at(.*)\./)
=> [["Hey, I am having a haircut. See you at ", " KingsMen ", " 10 am"]]

The problem is that I cannot hard-code the index 2. I need something like this:

irb(main):023:0> str_arr.each { |str| $res = str.scan(/(.*)Hotel(.*)at(.*)\./) }
=> ["hello how are you?", "I am fine.What are you doing?", "Hey, I am having a haircut. See you at Hotel KingsMen at 10 am."]
irb(main):024:0> $res
=> [["Hey, I am having a haircut. See you at ", " KingsMen ", " 10 am"]]

But I don't want to use a global variable. Any suggestions to improve my code?

0aslam0 asked Mar 16 '23 22:03


2 Answers

s = ["hello how are you?", "I am fine.What are you doing?", "Hey, I am having a haircut. See you at Hotel KingsMen at 10 am."]
s.join.scan(/Hotel\s(.+)?\sat\s(.+)?\./).flatten
#=> ["KingsMen", "10 am"]

Regex description:

  1. \s - any whitespace character,

  2. . - any character, .+ - one or more of any character, () - capture everything inside, so (.+) - capture one or more characters

  3. a? means zero or one of a, so (.+)? makes each capture group optional
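If joining the strings is not desirable, the same pattern can be applied to each element in turn without a global variable. The following is only a sketch: the names pattern, matches, hotel_name and meeting_time are made up for illustration, and it assumes that at most one string in the array mentions the hotel.

```ruby
str_arr = ["hello how are you?",
           "I am fine.What are you doing?",
           "Hey, I am having a haircut. See you at Hotel KingsMen at 10 am."]

# Same regex as in the answer above.
pattern = /Hotel\s(.+)?\sat\s(.+)?\./

# scan returns [] for strings that do not match, so flat_map
# collects only the capture arrays from the strings that do.
matches = str_arr.flat_map { |str| str.scan(pattern) }
#=> [["KingsMen", "10 am"]]

# Destructure the first (here: only) match into local variables.
hotel_name, meeting_time = matches.first
```

Because flat_map returns a new array, the result lives in an ordinary local variable and the index of the matching string never has to be known.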

Rustam A. Gasanov answered Mar 19 '23 11:03


Here is your array:

arr = ["hello how are you?",
       "I am fine. What are you doing?",
       "Hey, I am having a haircut. See you at Hotel KingsMen at 10 am."]

The first step is to join the elements into a string. I've chosen to use a space for the separator, but you could use something else:

str = arr.join(' ')
  #=> "hello how...doing? Hey,...haircut. See you at Hotel KingsMen at 10 am." 

Without loss of generality, let's suppose this string were one of the following:

str1 = "See you at Hotel KingsMen at 10 am."  
str2 = "See you at 10:15am at Kingsmen hotel on Bloor Street."  

Which hotel?

Let's first look at how to get the name of the hotel. We want a method that will work with both of these strings. We assume that the name of the hotel is just two words, with one of those words being "hotel", but we don't know which of the two words comes first, and we allow "hotel" to begin with a capital or lowercase letter.

We see in str1 that it could be "at Hotel" or "Hotel KingsMen", and in str2 it could be "Kingsmen hotel" or "hotel on". The correct result is obtained by making the reasonable assumption that the word other than "hotel" is capitalized.

Here's one way to do it:

def hotel(str)
  str[/\b[hH]otel\s+\K[A-Z][a-zA-Z]*|[A-Z][a-zA-Z]*(?=\s[Hh]otel\b)/]
end

hotel(str1) #=> "KingsMen" 
hotel(str2) #=> "Kingsmen" 

Here:

  • \b is a (zero-width) word boundary
  • \K means match what comes before but do not include it in the match that is returned.
  • | means match what comes before or what comes after.
  • (?=\s[Hh]otel\b) is a ("zero-width") positive lookahead, which indicates what must immediately follow what comes before, but is not part of the match.
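
To see what \K and the lookahead each contribute, the two halves of the alternation can be tried on their own. This is just an illustrative sketch; the variable names are invented here and the sample strings repeat str1 and str2 from above.

```ruby
sample1 = "See you at Hotel KingsMen at 10 am."
sample2 = "See you at 10:15am at Kingsmen hotel on Bloor Street."

# With \K, "Hotel " must be present but is dropped from the result.
after_hotel = sample1[/\b[hH]otel\s+\K[A-Z][a-zA-Z]*/]
#=> "KingsMen"

# Without \K, the prefix stays in the returned match.
with_prefix = sample1[/\b[hH]otel\s+[A-Z][a-zA-Z]*/]
#=> "Hotel KingsMen"

# The lookahead requires " hotel" to follow but does not consume it.
before_hotel = sample2[/[A-Z][a-zA-Z]*(?=\s[Hh]otel\b)/]
#=> "Kingsmen"

# In sample2 the word after "hotel" is lowercase ("on"), so the
# first half of the alternation finds nothing there.
no_match = sample2[/\b[hH]otel\s+\K[A-Z][a-zA-Z]*/]
#=> nil
```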

What time?

Here we must make an assumption about the way time is expressed. Should "noon", "1100 hours" and "14:21" be possibilities? OK, this is just an exercise, so let's assume that it's a 12-hour clock with hours and possibly minutes, but no seconds.

We could use the following regex to extract that information:

def time(str)
  str[/\b(?:1[012]|[1-9])(?::[0-5]\d)?\s?(?:[ap]m?)/i]
end

time(str1) #=> "10 am" 
time(str2) #=> "10:15am" 

Here:

  • (?:...) is a non-capture group, which is part of the match.
  • 1[012]|[1-9] says to match a) 1 followed by a 0, 1 or 2, or (|) b) one digit between 1 and 9.
  • the second colon in (?::...) is a literal colon: the non-capture group (?:...) must begin by matching ":".
  • [0-5]\d means to match two digits, the first between 0 and 5 and the second between 0 and 9, which covers the minutes 00 through 59.
  • i in /i means to disregard case.
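
A quick way to check a time pattern like this is to run it against a few strings, including ones that should not match. The sketch below uses a slightly stricter variant that is an assumption of this example, not part of the answer: minutes are matched with [0-5]\d so that values up to 59 are accepted, and a full "am" or "pm" followed by a word boundary is required. TIME_PATTERN and extract_time are hypothetical names.

```ruby
# Variant of the time regex (assumption: "am"/"pm" must be complete).
TIME_PATTERN = /\b(?:1[012]|[1-9])(?::[0-5]\d)?\s?[ap]m\b/i

# Returns the first 12-hour time found in str, or nil.
def extract_time(str)
  str[TIME_PATTERN]
end

extract_time("See you at Hotel KingsMen at 10 am.")
#=> "10 am"
extract_time("See you at 10:15am at Kingsmen hotel on Bloor Street.")
#=> "10:15am"
extract_time("Checkout is at 11:59 PM sharp.")
#=> "11:59 PM"
extract_time("Room 9 people")
#=> nil
```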

Suppose now we had:

str3 = "I'm leaving at 9:30 am, so I'll see you at Hotel KingsMen at 10 am."  

We want to select "10 am" rather than "9:30 am". For that we need additional assumptions. For example, we might assume that the time is preceded by the word "at" and that "at" appears immediately after the name of the hotel:

Hotel KingsMen at 10am

or

Kingsmen hotel at 10:15 am

We could use a fairly complex regex to extract the time here, or we could first find the hotel name and its location in the string, then look for the time immediately after.
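
The second option might look roughly like the following. Treat it as a sketch under the assumptions stated above (a capitalized one-word name next to "hotel", and "at" plus a 12-hour time right after it); HOTEL_NAME, TIME_AFTER_HOTEL and hotel_and_time are names invented for this example, and the minutes are matched with [0-5]\d.

```ruby
HOTEL_NAME = /\b[hH]otel\s+\K[A-Z][a-zA-Z]*|[A-Z][a-zA-Z]*(?=\s[Hh]otel\b)/

# Anchored with \A so the time must come directly after the hotel name
# (optionally skipping a trailing " hotel"), then " at ", then the time.
TIME_AFTER_HOTEL =
  /\A(?:\s+hotel)?\s+at\s+((?:1[012]|[1-9])(?::[0-5]\d)?\s?[ap]m)\b/i

# Returns [hotel, time]; time is nil if none follows the hotel name.
# Returns nil if no hotel name is found at all.
def hotel_and_time(str)
  hotel_match = str.match(HOTEL_NAME)
  return nil unless hotel_match

  # post_match is the part of the string after the hotel name.
  time_match = hotel_match.post_match.match(TIME_AFTER_HOTEL)
  [hotel_match[0], time_match && time_match[1]]
end

str3 = "I'm leaving at 9:30 am, so I'll see you at Hotel KingsMen at 10 am."
hotel_and_time(str3)
#=> ["KingsMen", "10 am"]
hotel_and_time("Meet me at Kingsmen hotel at 10:15 am")
#=> ["Kingsmen", "10:15 am"]
```

Searching only the text after the hotel name is what keeps "9:30 am" from being picked up in str3.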

Cary Swoveland answered Mar 19 '23 10:03