What is the official encoding for Twitter's streaming API? My best guess is UTF-8 based on what I've seen, but I would like to avoid making assumptions.
The only part of the Twitter site I've seen where they even hint at what they use as their official encoding is here:
Twitter does not want to penalize a user for the fact we use UTF-8 or for the fact that the API client in question used the longer representation
https://dev.twitter.com/docs/counting-characters
Does anyone have a more "official" answer? I'm writing a state-machine tokenizer for the streaming API which makes certain assumptions. The last thing I want is to encounter something like UTF-16.
Thanks! :D
Twitter Character Encoding: All Twitter attributes accept UTF-8 encoded text via the API.
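For a tokenizer that reads the stream in arbitrary chunks, one way to avoid silently assuming UTF-8 is to decode strictly and let invalid input fail loudly. The sketch below is only an illustration, not anything Twitter documents: `decode_chunks` is a hypothetical helper name, and it assumes the stream arrives as an iterable of `bytes` objects. It uses Python's standard incremental decoder so that a multi-byte character split across two chunks is still decoded correctly.

```python
import codecs


def decode_chunks(chunks):
    """Yield decoded text from an iterable of byte chunks (hypothetical helper).

    Assumes the stream is UTF-8. Any byte sequence that is not valid UTF-8
    (for example UTF-16 data) raises UnicodeDecodeError instead of being
    decoded into garbage.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    for chunk in chunks:
        # The decoder buffers an incomplete multi-byte sequence until the
        # remaining bytes arrive in a later chunk.
        text = decoder.decode(chunk)
        if text:
            yield text
    # final=True raises if the stream ended in the middle of a character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail
```

With this approach the tokenizer itself only ever sees `str` values, and an unexpected encoding shows up as an exception at the decoding boundary rather than as a confused state machine.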
The Twitter API allows you to stream public Tweets from the platform in real-time so that you can display them and basic metrics about them.
Browsers will typically use the value of the XML encoding declaration, or default to UTF-8 if there is none. Second, if there is a UTF-8 BOM on the document, and the XML encoding declaration is either UTF-8 or not included, the document will be interpreted as UTF-8, regardless of the charset used in the Content-Type.
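The lookup order described above can be sketched roughly as follows. This is a simplified illustration, not a description of what any particular browser does: `sniff_xml_encoding` is a hypothetical name, the regular expression only handles a declaration at the very start of the document, and a UTF-8 BOM is treated as decisive even if the declaration disagrees.

```python
import codecs
import re

# Matches the encoding pseudo-attribute of an XML declaration at the start
# of the document, e.g. <?xml version="1.0" encoding="UTF-8"?>
_XML_DECL = re.compile(rb"^<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z0-9._-]+)[\"']")


def sniff_xml_encoding(data):
    """Guess the encoding of an XML document held in bytes (hypothetical helper)."""
    # 1. A UTF-8 byte order mark wins (simplification, see the note above).
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # 2. Otherwise use the encoding named in the XML declaration, if any.
    match = _XML_DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. With neither present, fall back to UTF-8.
    return "utf-8"
```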
Unlike Twitter's Search API where you are polling data from tweets that have already happened, Twitter's Streaming API is a push of data as tweets happen in near real-time. With Twitter's Streaming API, users register a set of criteria (keywords, usernames, locations, named places, etc.)
One indicator is that the JSON format, which Twitter uses for virtually everything, dictates (or at least defaults to) UTF-8. They should also set an appropriate HTTP Content-Type header denoting the encoding (but I haven't confirmed this). If you're using XML instead, the XML declaration at the top of the document explicitly names the encoding, which is UTF-8.
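If you would rather check than assume, you can read the charset from the response's Content-Type header and only fall back to UTF-8 when none is given. The following is a minimal sketch under that assumption. The function names are made up for illustration, and whether Twitter actually sends a `charset` parameter is not confirmed here.

```python
import json


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset parameter of a Content-Type value, or the default."""
    # Skip the media type itself and inspect the ";"-separated parameters.
    for part in header_value.split(";")[1:]:
        name, _, value = part.strip().partition("=")
        if name.strip().lower() == "charset" and value:
            return value.strip().strip('"').lower()
    return default


def decode_json_body(body, content_type):
    """Decode a JSON response body using the charset the server declared."""
    encoding = charset_from_content_type(content_type)
    # errors="strict" makes a mismatch between header and payload visible.
    text = body.decode(encoding, errors="strict")
    return json.loads(text)
```

For example, `charset_from_content_type("application/json; charset=UTF-8")` returns `"utf-8"`, and a header with no charset parameter falls back to the UTF-8 default.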