Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python requests response encoded in utf-8 but cannot be decoded

I am trying to scrape my messenger.com (facebook messenger) chats using python and i have used google chromes developer tools to see the POST request for the chat history and i have copied the entire header and body into a format that requests can use.

I get HTTP code 200 implying the request at least got something but and i can print res.encoding to get the encoding it returned in which it says is utf-8. But i cannot decode it!

here is the function:

def download_thread(self, limit, offset, message_timestamp):
    """Download the specified number of messages from the
    provided thread, with an optional offset
    """
    data = request_data(self.thread, offset=offset,
                        limit=limit, group=self.group,
                        timestamp=message_timestamp)

    res = self.ses.post(url_thread, data=data, headers=headers)

    print(res.content)

    thread_contents = json.loads(res.content)
    print(thread_contents)
    return thread_contents

yields

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 0: invalid start byte

when it attempts to json.load (or loads) the data

But res.encoding does return utf-8.

I tried unzipping with gzip but that says it is is not gzipped content.

If i just try to do print(res.content) i get

Traceback (most recent call last):
  File "FBChatScraper.py", line 200, in <module>
    main()
  File "FBChatScraper.py", line 134, in main
    fbms.run()
0f\x82\x048\xbb\xb9=\x87\xebK0.\xff\x90\xdd\xeb\xfa\x16\xc6\xbbz\x8b\x82)\xe8\xaaV\x01^\xda\x8b\xbd\x15d-\xb1\x10@\x17\\\xd43\xa8\x92w\xe8\xc0\xcdU\xc4\xff\xc7\xfa\x90\xb2\xb3\xf5\x84\x11u\x0b\t\x8f\x83r\xf3}\xe5!y$\xe6\xf6c0\xf0\xb4\x98\xcat_\x0c\x08\xb5\xdd\x8ctx\x91\xa9\x95\rB%\xe2\x93\xa52\x85_\xa6\x10\xc2\xc9\xa3\xee4SDb\xa5\x18QJ\x83X\x19)\xaa$\xf4\xb4\xb7\x0b\x84\x15&\x88\x08L\xc9iP\xa2\xb9\xf2\xaf\x96\x96N\xd8\xcf=\x05\xc1\x18\x8d\xa0\xf2Y\x8e\n\xcf\xc8\x0fE4\xd6)\xa1\xd4\xb7D\xd6{i\xc8P\x96R\x11HC\xac\xbcKyT#~}\x93\xf7@K\xc7r/\x82\xb0\xe4\xefX\xf9j\x08\xa6Hp\xfcn\x06\xfdo\x9a\xd0wJ\xb4fJ(\x89+\x1c\xf6\x0eOI\x90\xac\x9eDD\xfd,\xa5\xe9\x89\x1blh\x86Z\x98\x05\xdd9\xc7\xf4\x80\xfcY\x8e\xad\xee\x99!\x15\x13+\x9b\x07\xe8Fdj\xfc\x11\xfc\xfe7\x06h\x02\x00@>]W\x92\xc9\x02\xb1c3\x82\xcd\xa4\xefN9\x90\xe6\x81y\x9c\x84er\xd4\xc3\x06\x1c\x06\x14\xcf\xc7\x07hj\xbfH\xdc\xf5~\xf7z\x18Ce\xaf^\x8c\xab \xdfV\xce\xb8\x11\xf8\x06\x03'

Traceback (most recent call last):
  File "FBChatScraper.py", line 200, in <module>
    main()
  File "FBChatScraper.py", line 134, in main
    fbms.run()
  File "FBChatScraper.py", line 43, in run
    thread_contents = self.download_thread(limit, offset, message_timestamp)
  File "FBChatScraper.py", line 74, in download_thread
    thread_contents = json.loads(res.content)
  File "/Users/silman/anaconda/lib/python3.6/json/__init__.py", line 349, in loads
    s = s.decode(detect_encoding(s), 'surrogatepass')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 0: invalid start byte

oddly printing the content in the middle of the traceback leading me to think there are some invisible characters pushing it down.

I am unable to get the response loaded into a json format because no matter how i handle the response content it isn't properly formatted for json library to interpret.

Moreover if i just do print(res.text) i get garbage:

Traceback (most recent call last):
  File "FBChatScraper.py", line 200, in <module>
    main()
  File "FBChatScraper.py", line 134, in main
    fbms.run()
}sP���c���f�u0���\� QZed�C��� M$x�Ҹ�H�����eǘ�]���5���^�*�ӄaM�Y��b���/ڶ�JW/���>H6z�\��l4����t=i��%Ҳu�x��%�x�
       F    <���{1i�#%;�rɲ=Rχm��1B�Z(+�(S-���#��\v�{b��
                                                           �    f/V�i̴��_��83�  �_����*��O��
                                                                                            ������Z��i-�TVeaG54�!v�a?ǯ|gu-g��.���"J$�L`&�tΊ#s)�H����s���q���^׷0��[)���j�ॽ�T���U���J�ЁwW���!eg�#j ��r��$y���3�4��4.��M�@Kb�AX�SDb�QJ�X)�,���a�   "Sp�h�����sOA0Vé|�������:%�rKdKC���@ M��.�^
�       �g���SWQHӳ.��BӄG�,����@E��������
                                        nras��L�/��ch@>]W���c3�ͤ�N9��y��er����hj�H��~�zCe�^�� �Vθ�

Traceback (most recent call last):
  File "FBChatScraper.py", line 200, in <module>
    main()
  File "FBChatScraper.py", line 134, in main
    fbms.run()
  File "FBChatScraper.py", line 43, in run
    thread_contents = self.download_thread(limit, offset, message_timestamp)
  File "FBChatScraper.py", line 74, in download_thread
    thread_contents = json.loads(res.content)
  File "/Users/silman/anaconda/lib/python3.6/json/__init__.py", line 349, in loads
    s = s.decode(detect_encoding(s), 'surrogatepass')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 0: invalid start byte

EDIT:

MWE as best i can, not sure what data from my post request is private so i left some out

using this data

url_thread = "https://www.messenger.com/api/graphqlbatch/"


request_data = {
  "batch_name": "MessengerGraphQLThreadFetcher",
  "__user": "<user_id>",
  "__a": "1",
  "__dyn": "<dyn>",
  "__req": "9",
  '__be'      : '-1',
  '__pc'      : 'PHASED:messengerdotcom_pkg',
  "fb_dtsg": "AQFni7TU2nes:AQGSC8FSDqyw",
  "ttstamp": "265817254666710077746711957586581715370521181008510710777",
  "__rev": "3791607",
  "jazoest": "<jazoest>",
  "queries": '<queries>'
  }

headers = {
  "authority": "www.messenger.com",
  "method": "POST",
  "path": "/api/graphqlbatch/",
  "scheme": "https",
  "accept": "*/*",
  "accept-encoding": "gzip, deflate, br",
  "accept-language": "en-US,en;q=0.9",
  "cache-control": "no-cache",
  "content-length": "754",
  "content-type" : "application/x-www-form-urlencoded",
  "cookie": "<cookies>",
  "origin": "https://www.messenger.com",
  "pragma": "no-cache",
  "referer": "https://www.messenger.com/t/<chatID>",
  "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"
}

You can get all the <items> by using chrome developer tools and lookng on the network tab for a POST request to Request URL: https://www.messenger.com/api/graphqlbatch/.

Its easy to find if you scroll up to reload old messages while chrome dev tools is recording.

Then put together a simple request with python

import requests as rq
import time

ses = rq.Session()
thread = <ID of thread found in URL of messenger.com>

conversation_type = <'thread_fbids' if group chat else 'user_ids'>

data = request_data
data['messages[{}][{}][offset]'.format(conversation_type, thread)] = 0
data['messages[{}][{}][timestamp]'.format(conversation_type, thread)] = int(time.time())
data['messages[{}][{}][limit]'.format(conversation_type, thread)] = 2000

res = ses.post(url_thread, data=data, headers=headers)

print(res.content)
thread_contents = json.loads(res.content)
print(thread_contents)

As what my dev tools got back you can see the start of the json here

like image 720
Taako Avatar asked Apr 06 '18 23:04

Taako


People also ask

Which Python request returns improperly decoded text instead of UTF-8?

python requests.get () returns improperly decoded text instead of UTF-8? Bookmark this question. Show activity on this post. When the content-type of the server is 'Content-Type:text/html', requests.get () returns improperly encoded data.

How do HTTP requests know the encoding of a response?

When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text.

How to use response encoding in GitHub API?

To illustrate use of response.encoding, let’s ping API of Github. To run this script, you need to have Python and requests installed on your PC. Check that utf-8 at the start of the output, it shows that string is encoded and decoded using “utf-8”.

What is the best way to get UTF-8 encoding for HTTP requests?

For response header Content-Type: text/html; charset=utf-8 the result is UTF-8. Luckily for us, requests uses chardet library and that usually works quite well (attribute requests.Response.apparent_encoding ), so you usually want to do:


1 Answers

The problem is this line in your request headers:

"accept-encoding": "gzip, deflate, br",

That br requests Brotli compression, a new-ish compression standard (see RFC 7932) that Google is pushing to replace gzip on the web. Chrome is asking for Brotli because recent versions of Chrome understand it natively. You're asking for Brotli because you copied the headers from Chrome. But requests doesn't understand Brotli natively.

You can pip install brotli and register the decompresser or just call it manually on res.content. But a simpler solution is to just remove the br:

"accept-encoding": "gzip, deflate",

… and then you should get gzip, which you and requests already know how to handle.

like image 72
abarnert Avatar answered Sep 21 '22 11:09

abarnert