I am trying to extract hashtags for a simple college project using Ruby on Rails. I am having trouble with tags that contain only digits and with tags that are not separated from the surrounding text by a space.
text = "Pack my #box with #5 dozen liquor.#jugs link.com/liquor#jugs #2good #first#second"
The regex I have is /(?:^|\s)#(\w+)/i
(source)
This regex returns ["box", "5", "2good", "first"]
How can I make sure it returns only ["box", "2good"]
and ignores the rest, as they are not 'real' hashtags?
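For reference, a minimal Ruby sketch that reproduces the behaviour described in the question. `\w` matches digits as well as letters, so `#5` is captured, and nothing in the pattern checks what follows the tag, so `#first` is captured even though `#second` is glued to it. The constant names are illustrative only.

```ruby
QUESTION_TEXT = "Pack my #box with #5 dozen liquor.#jugs link.com/liquor#jugs #2good #first#second"
QUESTION_PATTERN = /(?:^|\s)#(\w+)/i

# String#scan returns one array per match when the pattern has a capture
# group, so flatten turns [["box"], ["5"], ...] into a flat list of tags.
QUESTION_TAGS = QUESTION_TEXT.scan(QUESTION_PATTERN).flatten

p QUESTION_TAGS # => ["box", "5", "2good", "first"]
```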
Can you try this regex:
/(?:^|\s)(?:(?:#\d+?)|(#\w+?))\s/i
UPDATE 1:
There are a few cases that the above regex will not match, such as #blah23blah and #23blah23.
I have therefore modified the regex to take care of all cases.
Regex:
/(?:\s|^)(?:#(?!\d+(?:\s|$)))(\w+)(?=\s|$)/i
Breakdown:
(?:\s|^)
--Matches the preceding space or start of line. Does not capture the match.
#
--Matches hash but does not capture.
(?!\d+(?:\s|$))
--Negative lookahead to avoid ALL numeric characters between # and space (or end of line)
(\w+)
--Matches and captures all word characters
(?=\s|$)
--Positive lookahead to ensure following space or end of line. This is required to ensure it matches adjacent valid hash tags.
Sample text modified to capture most cases:
#blah Pack my #box with #5 dozen #good2 #3good liquor.#jugs link.com/liquor#jugs #mkvef214asdwq sd #3e4 flsd #2good #first#second #3
Matches:
Match 1: blah
Match 2: box
Match 3: good2
Match 4: 3good
Match 5: mkvef214asdwq
Match 6: 3e4
Match 7: 2good
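The matches listed above can be checked with a short Ruby sketch. It applies the Update 1 regex to the modified sample text using `String#scan`; the constant names are assumptions made for this example.

```ruby
UPDATE1_SAMPLE = "#blah Pack my #box with #5 dozen #good2 #3good liquor.#jugs " \
                 "link.com/liquor#jugs #mkvef214asdwq sd #3e4 flsd #2good #first#second #3"

# Leading space or line start, a "#", a negative lookahead that rejects
# all-digit tags, the captured tag, then a lookahead for space or line end.
UPDATE1_PATTERN = /(?:\s|^)(?:#(?!\d+(?:\s|$)))(\w+)(?=\s|$)/i

# scan yields one single-element array per match; flatten gives the tags.
UPDATE1_TAGS = UPDATE1_SAMPLE.scan(UPDATE1_PATTERN).flatten

p UPDATE1_TAGS
# => ["blah", "box", "good2", "3good", "mkvef214asdwq", "3e4", "2good"]
```

Because the trailing `(?=\s|$)` is a lookahead, the space after one tag is not consumed and is still available as the leading space of the next tag.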
UPDATE 2:
To exclude words starting or ending with underscore, just include your exclusions in the negative lookahead like this:
/(?:\s|^)(?:#(?!(?:\d+|\w+?_|_\w+?)(?:\s|$)))(\w+)(?=\s|$)/i
The sample, regex and matches are recorded in this Rubular link
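As a rough illustration of the Update 2 regex, the sketch below runs it against a small sample. The sample string and constant names are made up for this example and are not taken from the Rubular link.

```ruby
# Hypothetical sample: leading underscore, trailing underscore,
# inner underscore, plain tag, and an all-digit tag.
UPDATE2_SAMPLE = "#_foo #foo_ #fo_o #bar #12"

# Same shape as the Update 1 regex, with two extra alternatives in the
# negative lookahead: \w+?_ (ends with underscore) and _\w+? (starts with one).
UPDATE2_PATTERN = /(?:\s|^)(?:#(?!(?:\d+|\w+?_|_\w+?)(?:\s|$)))(\w+)(?=\s|$)/i

UPDATE2_TAGS = UPDATE2_SAMPLE.scan(UPDATE2_PATTERN).flatten

p UPDATE2_TAGS # => ["fo_o", "bar"]
```

Tags with an underscore in the middle are still accepted; only a leading or trailing underscore causes the tag to be skipped.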
I'd go about it this way:
text.scan(/ #[[:digit:]]?[[:alpha:]]+ /).map{ |s| s.strip[1..-1] }
which returns:
[
[0] "box",
[1] "2good"
]
I don't try to do everything in a regex. I prefer to keep them as simple as possible, then filter and mutilate once I've gotten the basic data captured. My reasoning is that regex are more difficult to maintain the more complex they become. I'd rather spend my time doing something else than maintaining patterns.
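One way to sketch the "capture loosely, then filter" idea in Ruby is shown below. This is an assumption-laden illustration rather than the code from the answer: `extract_hashtags` is a hypothetical helper, and its rules (split on whitespace, keep tokens that are exactly `#` plus word characters, drop purely numeric tags) are slightly more permissive than the pattern above, since tags such as `#good2` would also be accepted.

```ruby
# Hypothetical helper: each step is a small, separately readable rule.
def extract_hashtags(text)
  text.split                                       # whitespace-delimited tokens
      .select { |token| token.match?(/\A#\w+\z/) } # exactly one "#" followed by word characters
      .map    { |token| token.delete_prefix("#") } # strip the leading "#"
      .reject { |tag| tag.match?(/\A\d+\z/) }      # drop tags made only of digits
end

text = "Pack my #box with #5 dozen liquor.#jugs link.com/liquor#jugs #2good #first#second"
p extract_hashtags(text) # => ["box", "2good"]
```

Tokens such as `liquor.#jugs` and `#first#second` are rejected by the anchored `select` step, so no lookarounds are needed.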
Try this:
/\s#([[\d]]?[[a-z]]+\s)/i
Output:
1.9.3-p194 :010 > text = "Pack my #box with #5 dozen liquor.#jugs link.com/liquor#jugs #2good #first#second"
=> "Pack my #box with #5 dozen liquor.#jugs link.com/liquor#jugs #2good #first#second"
1.9.3-p194 :011 > puts text.scan /\s#([[\d]]?[[a-z]]+\s)/i
box
2good
=> nil
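Note that the capture group in the pattern above includes the trailing `\s`, so each captured tag ends with a space (not visible in the `puts` output). A possible variant, sketched here with illustrative constant names, moves the `\s` outside the group and drops the nested brackets:

```ruby
IRB_TEXT = "Pack my #box with #5 dozen liquor.#jugs link.com/liquor#jugs #2good #first#second"

# Pattern as posted: the trailing whitespace is part of the capture.
POSTED_TAGS = IRB_TEXT.scan(/\s#([[\d]]?[[a-z]]+\s)/i).flatten
p POSTED_TAGS # => ["box ", "2good "]

# Variant: whitespace is still required after the tag but is not captured.
TRIMMED_PATTERN = /\s#(\d?[a-z]+)\s/i
TRIMMED_TAGS = IRB_TEXT.scan(TRIMMED_PATTERN).flatten
p TRIMMED_TAGS # => ["box", "2good"]
```

Both versions require whitespace on either side of the tag, so a hashtag at the very start or end of the string would not be matched.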