
Regex for a (twitter-like) hashtag that allows non-ASCII characters

I want a regex that matches a simple hashtag like those on Twitter (e.g. #someword). It should also recognize non-ASCII characters (like those in Spanish, Hebrew or Chinese).

This was my initial regex: (^|\s|\b)(#(\w+))\b
but it doesn't recognize non-ASCII characters.
Then, I tried using XRegExp.js, which worked, but ran too slowly.
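
To make the failure with the initial regex concrete, here is a minimal sketch. The sample strings are made up for illustration. In JavaScript, \w only covers [A-Za-z0-9_], so a tag is either cut off at the first non-ASCII letter or not matched at all:

```javascript
// The initial pattern from the question, unchanged.
const initial = /(^|\s|\b)(#(\w+))\b/;

// ASCII-only tag: captured in full.
const ascii = "foo #someword bar".match(initial);

// Tag containing a non-ASCII letter (U+00F1): \w+ stops right before it.
const accented = "#espa\u00f1ol".match(initial);

// Tag made only of non-ASCII letters (illustrative CJK sample): no match.
const cjk = "foo #\u4e2d\u6587 bar".match(initial);

console.log(ascii[3]);    // "someword"
console.log(accented[3]); // "espa"
console.log(cjk);         // null
```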

Any suggestions for how to do it?

limlim asked Jun 05 '13 13:06


2 Answers

Eventually I found twitter-text.js, which is basically how Twitter solves this problem.

limlim answered Nov 07 '22 01:11


With native JS regexes, whose \w and \b don't handle Unicode letters, your only option is to explicitly enumerate the characters that can end the tag and match everything up to them, for example:

> s = "foo #הַתִּקְוָה. bar"
"foo #הַתִּקְוָה. bar"
> s.match(/#(.+?)(?=[\s.,:]|$)/)
["#הַתִּקְוָה", "הַתִּקְוָה"]

The [\s.,:] class should include spaces, punctuation and whatever else can be considered a terminating symbol.
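
If you need every hashtag in a string rather than just the first, the same idea can be sketched as below. Note that this is a variant, not the exact pattern above: it uses a negated character class instead of a lazy match with a lookahead, which also avoids matching a "#" that is followed directly by a space. The names TERMINATORS and extractHashtags are hypothetical, and the terminator set is only an assumption that you would extend for your own data.

```javascript
// Characters treated as ending a hashtag. This set is an assumption taken
// from the example above; add more punctuation as needed.
const TERMINATORS = "\\s.,:";

// Hypothetical helper: returns the text of every hashtag, without the "#".
function extractHashtags(text) {
  // Negated class: one or more characters that are NOT terminators.
  const pattern = new RegExp("#([^" + TERMINATORS + "]+)", "g");
  const tags = [];
  let m;
  // exec() with the "g" flag resumes after the previous match each time.
  while ((m = pattern.exec(text)) !== null) {
    tags.push(m[1]);
  }
  return tags;
}

// Illustrative input with Spanish, Chinese and ASCII tags.
const sample = "foo #espa\u00f1ol. bar #\u4e2d\u6587, baz #tag";
console.log(extractHashtags(sample));
```

Because the tag is defined by what ends it rather than by what it may contain, combining marks such as Hebrew vowel points stay inside the match. In engines that support ES2018 Unicode property escapes, a pattern like /#[\p{L}\p{M}\p{N}_]+/gu is an alternative that needs no terminator list.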

georg answered Nov 07 '22 01:11