I'm getting some data from the Telegram Bot API through the python-telegram-bot library. The data is returned as UTF-8 encoded JSON. Example (snippet):
{'message': {'text': '👨\u200d👩\u200d👦\u200d👦http://google.com/æøå', 'entities': [{'type': 'url', 'length': 21, 'offset': 11}], 'message_id': 2655}}
It can be seen that 'entities' contains a single entity of type url and it has a length and an offset. Now say I wanted to extract the url of the link in the 'text' attribute:
data = {'message': {'text': '👨\u200d👩\u200d👦\u200d👦http://google.com/æøå', 'entities': [{'type': 'url', 'length': 21, 'offset': 11}], 'message_id': 2655}}
text = data['message']['text']
entities = data['message']['entities']
for entity in entities:
    start = entity['offset']
    end = start + entity['length']
    print('Url: ', text[start:end])
The code above, however, returns: '://google.com/æøå'
which is clearly not the actual url.
The reason for this is that the offset and length are counted in UTF-16 code units, not in code points. So my question is: Is there any way to work with UTF-16 code units in Python? I don't need more than to be able to count them.
I've already tried:
text.encode('utf-8').decode('utf-16')
But that gives the error: UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0xa5 in position 48: truncated data
Any help would be greatly appreciated. I'm using Python 3.5, but since it's for a unified library it would be lovely to get it to work in Python 2.x too.
Python has already correctly decoded the UTF-8 encoded JSON data to Python (Unicode) strings, so there is no need to handle UTF-8 here.
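As a rough illustration of that point, the sketch below decodes a made-up UTF-8 payload (the `raw` bytes are an assumption for the example, not real Telegram output) and shows that the result is an ordinary Python 3 string whose length is counted in characters rather than in UTF-8 bytes:

```python
import json

# Hypothetical raw payload: UTF-8 bytes as they might arrive over HTTP.
raw = '{"text": "http://google.com/æøå"}'.encode('utf-8')

# Decoding the bytes and parsing the JSON yields a normal Unicode string.
decoded = json.loads(raw.decode('utf-8'))
text = decoded['text']

print(type(text).__name__)        # str
print(len(text))                  # 21 characters
print(len(text.encode('utf-8')))  # 24 bytes, since æ, ø and å take 2 bytes each
```

Once the data is a string, the UTF-8 byte layout no longer matters; only the choice of unit used for counting does.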
You'd have to encode to UTF-16, take the length of the encoded data, and divide by two. I'd encode to either utf-16-le or utf-16-be to prevent a BOM from being added:
>>> len(text.encode('utf-16-le')) // 2
32
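The division by two works because UTF-16 stores each code point as one or two 16-bit code units: characters in the Basic Multilingual Plane take one unit, while characters outside it, such as the emoji here, take a surrogate pair. A minimal sketch of the difference on Python 3, where `utf16_len` is a hypothetical helper name and not part of python-telegram-bot:

```python
def utf16_len(text):
    """Return the number of UTF-16 code units needed to encode text."""
    # utf-16-le writes no byte order mark, so every 2 bytes is one code unit.
    return len(text.encode('utf-16-le')) // 2


text = '👨\u200d👩\u200d👦\u200d👦http://google.com/æøå'

# Python 3 counts code points: 4 emoji + 3 zero-width joiners + 21 URL characters.
print(len(text))        # 28
# Each emoji needs two code units, so the total grows by 4.
print(utf16_len(text))  # 32
```

The gap of four between the two counts is exactly the amount by which the plain slice in the question was shifted.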
To use the entity offsets, you can encode to UTF-16, slice on doubled offsets, then decode again:
text_utf16 = text.encode('utf-16-le')
for entity in entities:
    start = entity['offset']
    end = start + entity['length']
    entity_text = text_utf16[start * 2:end * 2].decode('utf-16-le')
    print('Url: ', entity_text)
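The same idea can be wrapped in a small reusable function. This is only a sketch: `entity_text` is a hypothetical helper name, and it assumes the entity dictionary carries `offset` and `length` in UTF-16 code units as in the example data from the question. It is written for Python 3; on Python 2 the same body should work as long as `text` is a `unicode` object.

```python
def entity_text(text, entity):
    """Return the part of text that a Telegram entity refers to.

    Assumes entity['offset'] and entity['length'] are UTF-16 code units.
    """
    encoded = text.encode('utf-16-le')
    start = entity['offset'] * 2        # 2 bytes per UTF-16 code unit
    end = start + entity['length'] * 2
    return encoded[start:end].decode('utf-16-le')


data = {'message': {'text': '👨\u200d👩\u200d👦\u200d👦http://google.com/æøå',
                    'entities': [{'type': 'url', 'length': 21, 'offset': 11}],
                    'message_id': 2655}}

message = data['message']
for entity in message['entities']:
    print('Url: ' + entity_text(message['text'], entity))
    # Url: http://google.com/æøå
```

Encoding the text once per message, rather than once per entity, would be a reasonable optimisation if messages carry many entities.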