I've spent what I consider an unreasonable amount of time trying to find the actual format for hashtags.
As far as my searching can tell, Twitter has not published one.
I know that many people have come up with regexes to parse them; however, your lib's regex is not my lib's regex, and maybe I don't like yours anyway.
So I'm asking: is there an actual official spec? I don't want a regex answer; I want a BNF or something similar. Or, at minimum, a complete list of delimiters.
Additional difficulty points: grabbing them from arbitrary Unicode (non-English) message text is important too.
Note: I'm quite aware of entities, and they aren't applicable to my case (months of Twitter messages stored in a database).
On Twitter, adding a “#” to the beginning of an unbroken word or phrase creates a hashtag.
Hashtags will not work with letters or numbers in front of the # symbol. The # symbol must have a space directly in front of it in order for it to show correctly in searches.
Unfortunately, Twitter doesn't support searching tweets with regular expressions, which means that you do have to post-process. There isn't actually any official documentation from Twitter to that effect, but everyone who uses the Twitter search API post-processes their tweets using regex (including me).
Hashtags can be used on just about any social media platform, but they're most popular on Twitter and Instagram. If you are using social media to market your brand, then you should use hashtags. Hashtags can help boost your brand's social media reach and engagement.
Starting from Twitter's support documentation, the basic rules seem to be that hashtags must be preceded by a space and stop at any whitespace or punctuation.
Quote from Twitter's support:
Check your hashtags for the following:
Therefore, the initial token is # preceded by a space, and the terminator is any whitespace or punctuation. The "etc" in their list of punctuation (" , . ; ' ? ! etc.") is annoying, but I'll keep digging and see if I can find something authoritative on what else counts as punctuation.
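Read literally, that rule can be sketched in a few lines of JavaScript. This is only an illustration of the support article's wording, not Twitter's implementation: the name extractNaiveHashtags is made up, and treating "punctuation" as Unicode category P (minus the underscore) is my assumption, since the article's own list trails off in "etc.".

```javascript
// Naive reading of the support article: '#' must follow whitespace (or start
// the text), and the tag runs until whitespace or punctuation.
// Assumption: "punctuation" means Unicode general category P (\p{P}),
// except that '_' is allowed inside a tag.
function extractNaiveHashtags(text) {
  const pattern = /(?<=^|\s)#((?:_|[^\s\p{P}])+)/gu;
  return Array.from(text.matchAll(pattern), (m) => m[1]);
}

// "text.#nope" is skipped because the '#' follows a period, not whitespace.
console.log(extractNaiveHashtags("loving #unicode, #日本語 and text.#nope"));
// → [ 'unicode', '日本語' ]
```

Symbols such as $ or + are not in category P, so this sketch would happily include them in a tag; that is one of the gaps the "etc." leaves open.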
After digging around a while, I found some interesting blog articles by Terence Eden (Hashtags and Implicit Knowledge, Hashtag Standards) that provide evidence that Twitter doesn't even have a standard, given that the software it develops on different platforms seems to have different rules of what constitutes a hashtag.
It also provided a link to the Twitter Conformance Library, which has twitter / twitter-text-conformance / autolink.yml. The hashtag section in autolink.yml has many cases matching the above rules, but also some that violate them and are still supposed to be autolinked. Some examples:
- description: "DO NOT Autolink all-numeric hashtags"
  text: "text #1234"
  expected: "text #1234"
- description: "Autolink hashtag preceded by a period"
  text: "text.#hashtag"
  expected: "text.<a href=\"http://twitter.com/search?q=%23hashtag\" title=\"#hashtag\" class=\"tweet-url hashtag\">#hashtag</a>"
- description: "Autolink hashtag with full-width hash (U+FF03)"
  text: "＃hashtag"
  expected: "<a href=\"http://twitter.com/search?q=%23hashtag\" title=\"#hashtag\" class=\"tweet-url hashtag\">＃hashtag</a>"
Those are just a few examples that don't match the basic rules given in the first support article, and unfortunately the yml is full of other examples as well.
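To make the gap concrete, here is a rough JavaScript sketch that satisfies only the three conformance cases quoted above. It is not a port of twitter-text; the names HASHTAG and extractHashtags are made up, and the character classes are my assumptions (tag characters are Unicode letters, marks, decimal digits, and underscore; a tag needs at least one character that is not a digit; the # may not directly follow a tag character).

```javascript
// Assumed tag alphabet: letters, combining marks, decimal digits, underscore.
// The middle class forces at least one non-digit, so "#1234" is rejected.
// The lookbehind rejects "abc#def" but accepts "text.#hashtag".
const HASHTAG =
  /(?<![\p{L}\p{M}\p{Nd}_])[#＃]([\p{L}\p{M}\p{Nd}_]*[\p{L}\p{M}_][\p{L}\p{M}\p{Nd}_]*)/gu;

function extractHashtags(text) {
  return Array.from(text.matchAll(HASHTAG), (m) => m[1]);
}

console.log(extractHashtags("text #1234"));     // → []
console.log(extractHashtags("text.#hashtag"));  // → [ 'hashtag' ]
console.log(extractHashtags("＃hashtag"));       // → [ 'hashtag' ]
```

Anything beyond these three cases (URLs containing #, HTML entities, script-specific rules) would need to be checked against the rest of autolink.yml.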
There is in fact an official specification for hashtags. Twitter accepts only a subset of Unicode characters in hashtag syntax. Here is the regular expression that recognizes all valid hashtags on Twitter (pulled from their own source code).
To see how it's generated, see the source code of twitter-text.
/(#|＃)([a-z0-9_\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u00ff\u0100-\u024f\u0253-\u0254\u0256-\u0257\u0300-\u036f\u1e00-\u1eff\u0400-\u04ff\u0500-\u0527\u2de0-\u2dff\ua640-\ua69f\u0591-\u05bf\u05c1-\u05c2\u05c4-\u05c5\u05d0-\u05ea\u05f0-\u05f4\ufb12-\ufb28\ufb2a-\ufb36\ufb38-\ufb3c\ufb40-\ufb41\ufb43-\ufb44\ufb46-\ufb4f\u0610-\u061a\u0620-\u065f\u066e-\u06d3\u06d5-\u06dc\u06de-\u06e8\u06ea-\u06ef\u06fa-\u06fc\u0750-\u077f\u08a2-\u08ac\u08e4-\u08fe\ufb50-\ufbb1\ufbd3-\ufd3d\ufd50-\ufd8f\ufd92-\ufdc7\ufdf0-\ufdfb\ufe70-\ufe74\ufe76-\ufefc\u200c-\u200c\u0e01-\u0e3a\u0e40-\u0e4e\u1100-\u11ff\u3130-\u3185\ua960-\ua97f\uac00-\ud7af\ud7b0-\ud7ff\uffa1-\uffdc\u30a1-\u30fa\u30fc-\u30fe\uff66-\uff9f\uff10-\uff19\uff21-\uff3a\uff41-\uff5a\u3041-\u3096\u3099-\u309e\u3400-\u4dbf\u4e00-\u9fff\u20000-\u2a6df\u2a700-\u2b73f\u2b740-\u2b81f\u2f800-\u2fa1f]*[a-z_\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u00ff\u0100-\u024f\u0253-\u0254\u0256-\u0257\u0300-\u036f\u1e00-\u1eff\u0400-\u04ff\u0500-\u0527\u2de0-\u2dff\ua640-\ua69f\u0591-\u05bf\u05c1-\u05c2\u05c4-\u05c5\u05d0-\u05ea\u05f0-\u05f4\ufb12-\ufb28\ufb2a-\ufb36\ufb38-\ufb3c\ufb40-\ufb41\ufb43-\ufb44\ufb46-\ufb4f\u0610-\u061a\u0620-\u065f\u066e-\u06d3\u06d5-\u06dc\u06de-\u06e8\u06ea-\u06ef\u06fa-\u06fc\u0750-\u077f\u08a2-\u08ac\u08e4-\u08fe\ufb50-\ufbb1\ufbd3-\ufd3d\ufd50-\ufd8f\ufd92-\ufdc7\ufdf0-\ufdfb\ufe70-\ufe74\ufe76-\ufefc\u200c-\u200c\u0e01-\u0e3a\u0e40-\u0e4e\u1100-\u11ff\u3130-\u3185\ua960-\ua97f\uac00-\ud7af\ud7b0-\ud7ff\uffa1-\uffdc\u30a1-\u30fa\u30fc-\u30fe\uff66-\uff9f\uff10-\uff19\uff21-\uff3a\uff41-\uff5a\u3041-\u3096\u3099-\u309e\u3400-\u4dbf\u4e00-\u9fff\u20000-\u2a6df\u2a700-\u2b73f\u2b740-\u2b81f\u2f800-\u2fa1f][a-z0-9_\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u00ff\u0100-\u024f\u0253-\u0254\u0256-\u0257\u0300-\u036f\u1e00-\u1eff\u0400-\u04ff\u0500-\u0527\u2de0-\u2dff\ua640-\ua69f\u0591-\u05bf\u05c1-\u05c2\u05c4-\u05c5\u05d0-\u05ea\u05f0-\u05f4\ufb12-\ufb28\ufb2a-\ufb36\ufb38-\ufb3c\ufb40-\ufb41\ufb43-\ufb44\ufb46-\ufb4f\u0610-\u061a\u0620-\u065f\u066e-\u06d3\u06d5-\u06dc\u06de-\u06e8\u06ea-\u06ef\u06fa-\u06fc\u0750-\u077f\u08a2-\u08ac\u08e4-\u08fe\ufb50-\ufbb1\ufbd3-\ufd3d\ufd50-\ufd8f\ufd92-\ufdc7\ufdf0-\ufdfb\ufe70-\ufe74\ufe76-\ufefc\u200c-\u200c\u0e01-\u0e3a\u0e40-\u0e4e\u1100-\u11ff\u3130-\u3185\ua960-\ua97f\uac00-\ud7af\ud7b0-\ud7ff\uffa1-\uffdc\u30a1-\u30fa\u30fc-\u30fe\uff66-\uff9f\uff10-\uff19\uff21-\uff3a\uff41-\uff5a\u3041-\u3096\u3099-\u309e\u3400-\u4dbf\u4e00-\u9fff\u20000-\u2a6df\u2a700-\u2b73f\u2b740-\u2b81f\u2f800-\u2fa1f]*)/gi
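One caveat if you paste that regex into JavaScript as written: without the u flag, \u takes exactly four hex digits, so the five-digit escapes at the end of each character class (for example \u20000-\u2a6df) do not mean what they appear to mean. The small check below shows the effect on one of those ranges in isolation; the variable names are mine, and the u-flag version is just one possible way to express the intended range.

```javascript
// Without the u flag this class is read as: U+2000, the range '0'..U+2A6D,
// and the letter 'f'. That range swallows ASCII punctuation such as ':'.
const legacyClass = /[\u20000-\u2a6df]/;

// With the u flag, \u{...} names code points above U+FFFF directly.
const astralClass = /[\u{20000}-\u{2a6df}]/u;

console.log(legacyClass.test(":"));          // → true  (not intended)
console.log(legacyClass.test("\u{20000}"));  // → false (the intended character is missed)
console.log(astralClass.test(":"));          // → false
console.log(astralClass.test("\u{20000}"));  // → true
```

So the pattern as posted will accept far more than CJK Extension B characters inside a tag unless those ranges are rewritten.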
I found this on dev.twitter.com, under "Need help parsing tweet text?":
Take a look on the Twitter text processing library we’re using for auto linking and extraction of usernames, lists & hashtags.
(There are Ruby, Java, and JavaScript libraries.)
They are quite large, as Twitter must take into account every possible case.
The Twitter entity parsing libraries are available here: https://github.com/twitter/twitter-text
This is what I use; it's the closest I've gotten:
/#(\w*[0-9a-zA-Z]+\w*[0-9a-zA-Z])/g
link of the hashtag Regex to test
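For anyone weighing this shorter pattern against the question's requirements, here is a quick check of what it does and does not pick up. The helper name matchSimple and the sample strings are mine. Note that \w in JavaScript covers only ASCII letters, digits, and underscore, so non-English tags are missed.

```javascript
const simpleHashtag = /#(\w*[0-9a-zA-Z]+\w*[0-9a-zA-Z])/g;

function matchSimple(text) {
  return Array.from(text.matchAll(simpleHashtag), (m) => m[1]);
}

console.log(matchSimple("#hello_world")); // → [ 'hello_world' ]
console.log(matchSimple("#1234"));        // → [ '1234' ]  (all-numeric is accepted)
console.log(matchSimple("abc#def"));      // → [ 'def' ]   (no check on what precedes '#')
console.log(matchSimple("#a"));           // → []          (needs at least two characters)
console.log(matchSimple("#日本語"));       // → []          (\w is ASCII-only)
```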