
Regex to match hashtags in a sentence using ruby

I am trying to extract hashtags for a simple college project using Ruby on Rails. I am having trouble with tags that consist only of digits and with tags that are not separated from the surrounding text by a space.

text = "Pack my #box with #5 dozen liquor.#jugs link.com/liquor#jugs #2good #first#second"

The regex I have is /(?:^|\s)#(\w+)/i (source)

This regex returns ["box", "5", "2good", "first"]

How can I make sure it returns only ["box", "2good"] and ignores the rest, since they are not 'real' hashtags?
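
The result quoted above can be reproduced in plain Ruby, without Rails. This is only a minimal sketch; the method name `naive_hashtags` is made up for illustration and does not come from the question.

```ruby
# Hypothetical helper (the name is an assumption, not from the question).
# Applies the question's original regex and returns the captured tag names.
def naive_hashtags(text)
  # With a capture group, String#scan returns one array per match,
  # so flatten turns [["box"], ["5"]] into ["box", "5"].
  text.scan(/(?:^|\s)#(\w+)/i).flatten
end

text = "Pack my #box with #5 dozen liquor.#jugs link.com/liquor#jugs #2good #first#second"
p naive_hashtags(text)
# prints ["box", "5", "2good", "first"]
```

"#5" is accepted because \w also matches digits, and "#first" is accepted because nothing in the pattern checks what follows the tag.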

gkolan asked Aug 24 '12 03:08


3 Answers

Can you try this regex:

/(?:^|\s)(?:(?:#\d+?)|(#\w+?))\s/i

UPDATE 1:
There are a few cases that the above regex does not handle, such as #blah23blah and #23blah23, so I modified the regex to take care of all cases.

Regex:

/(?:\s|^)(?:#(?!\d+(?:\s|$)))(\w+)(?=\s|$)/i

Breakdown:

  • (?:\s|^) --Matches the preceding whitespace or the start of a line, without capturing it.
  • # --Matches the hash character, without capturing it.
  • (?!\d+(?:\s|$)) --Negative lookahead that rejects a tag made up ONLY of digits between the # and the next whitespace (or end of line).
  • (\w+) --Matches and captures the word characters of the tag.
  • (?=\s|$) --Positive lookahead that requires whitespace or end of line after the tag. Because a lookahead does not consume that whitespace, it stays available as the leading whitespace of the next tag, which is what lets adjacent valid hashtags match.

Sample text modified to capture most cases:

#blah Pack my #box with #5 dozen #good2 #3good liquor.#jugs link.com/liquor#jugs #mkvef214asdwq sd #3e4 flsd #2good #first#second #3

Matches:

Match 1: blah
Match 2: box
Match 3: good2
Match 4: 3good
Match 5: mkvef214asdwq
Match 6: 3e4
Match 7: 2good
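
As a rough check, the UPDATE 1 regex can be run against the sample text with String#scan. This is a sketch only; the constant `HASHTAG_RE` and the method `extract_hashtags` are names invented here, not part of the answer.

```ruby
# Regex from UPDATE 1, copied verbatim.
# HASHTAG_RE and extract_hashtags are hypothetical names used for illustration.
HASHTAG_RE = /(?:\s|^)(?:#(?!\d+(?:\s|$)))(\w+)(?=\s|$)/i

def extract_hashtags(text)
  # scan yields one single-element array per match because the pattern
  # has exactly one capture group; flatten unwraps them.
  text.scan(HASHTAG_RE).flatten
end

sample = "#blah Pack my #box with #5 dozen #good2 #3good liquor.#jugs " \
         "link.com/liquor#jugs #mkvef214asdwq sd #3e4 flsd #2good #first#second #3"
p extract_hashtags(sample)
# prints ["blah", "box", "good2", "3good", "mkvef214asdwq", "3e4", "2good"]
```

On the text from the question, the same method returns ["box", "2good"].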

Rubular link

UPDATE 2:

To exclude words starting or ending with underscore, just include your exclusions in the negative lookahead like this:

/(?:\s|^)(?:#(?!(?:\d+|\w+?_|_\w+?)(?:\s|$)))(\w+)(?=\s|$)/i

The sample, regex and matches are recorded in this Rubular link
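
A small sketch of how the UPDATE 2 regex behaves, assuming it is used with String#scan. The constant name, the method name and the sample string below are all made up for illustration; they are not the contents of the Rubular link.

```ruby
# Regex from UPDATE 2, copied verbatim.
# The names and the sample string are assumptions for illustration only.
NO_EDGE_UNDERSCORE_RE = /(?:\s|^)(?:#(?!(?:\d+|\w+?_|_\w+?)(?:\s|$)))(\w+)(?=\s|$)/i

def extract_hashtags_no_edge_underscore(text)
  text.scan(NO_EDGE_UNDERSCORE_RE).flatten
end

sample = "#_lead #trail_ #mid_dle #ok #42"
p extract_hashtags_no_edge_underscore(sample)
# prints ["mid_dle", "ok"]
```

An underscore in the middle of a tag is still allowed; only tags that start or end with one (and all-digit tags) are rejected.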

Kash answered Oct 22 '22 23:10


I'd go about it this way:

text.scan(/ #[[:digit:]]?[[:alpha:]]+ /).map{ |s| s.strip[1..-1] }

which returns:

[
    [0] "box",
    [1] "2good"
]

I don't try to do everything in a regex. I prefer to keep them as simple as possible, then filter and mutilate once I've gotten the basic data captured. My reasoning is that regex are more difficult to maintain the more complex they become. I'd rather spend my time doing something else than maintaining patterns.
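
One way to read the "keep the pattern simple, then filter" idea is sketched below. It is not the code from this answer: it swaps the single scan for String#split plus two small checks, and the method name `filter_hashtags` is hypothetical.

```ruby
# Hypothetical helper illustrating "capture loosely, then filter".
# Assumption: a tag is a whitespace-delimited token that starts with '#',
# contains only word characters after it, and is not made of digits alone.
def filter_hashtags(text)
  text.split                                  # whitespace-delimited tokens
      .select { |word| word.start_with?("#") } # candidates only
      .map    { |word| word[1..-1] }           # drop the leading '#'
      .select { |tag| tag.match?(/\A\w+\z/) }  # word characters only
      .reject { |tag| tag.match?(/\A\d+\z/) }  # discard all-digit tags
end

text = "Pack my #box with #5 dozen liquor.#jugs link.com/liquor#jugs #2good #first#second"
p filter_hashtags(text)
# prints ["box", "2good"]
```

Each step uses a pattern short enough to read at a glance, at the cost of a few more passes over the tokens.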

the Tin Man answered Oct 22 '22 22:10


Try this:

/\s#([[\d]]?[[a-z]]+\s)/i

Output:

1.9.3-p194 :010 > text = "Pack my #box with #5 dozen liquor.#jugs link.com/liquor#jugs #2good #first#second"
 => "Pack my #box with #5 dozen liquor.#jugs link.com/liquor#jugs #2good #first#second" 
1.9.3-p194 :011 > puts text.scan /\s#([[\d]]?[[a-z]]+\s)/i 
box 
2good 
 => nil
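
Note that the capture group in the regex above includes the trailing \s, so each captured tag carries a trailing space (visible in the puts output). The sketch below compares the posted pattern with one possible adjustment that checks the trailing whitespace with a lookahead instead. The adjusted pattern and both method names are assumptions added for illustration, not part of the answer.

```ruby
# Pattern exactly as posted: the trailing \s is inside the capture group.
POSTED_RE = /\s#([[\d]]?[[a-z]]+\s)/i

# Assumed variant: same idea, but the trailing whitespace is only looked at,
# not captured or consumed.
ADJUSTED_RE = /\s#(\d?[a-z]+)(?=\s)/i

def posted_tags(text)
  text.scan(POSTED_RE).flatten
end

def adjusted_tags(text)
  text.scan(ADJUSTED_RE).flatten
end

text = "Pack my #box with #5 dozen liquor.#jugs link.com/liquor#jugs #2good #first#second"
p posted_tags(text)    # prints ["box ", "2good "]
p adjusted_tags(text)  # prints ["box", "2good"]
```

Both patterns require whitespace on each side of the tag, so a hashtag at the very start or end of the string is not matched by either.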

Sayuj answered Oct 23 '22 00:10