
Python: Convert Raw String to Bytes String without adding escape characters

I have a string:

'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'

And I want:

b'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'

But I keep getting:

b'BZh91AY&SYA\\xaf\\x82\\r\\x00\\x00\\x01\\x01\\x80\\x02\\xc0\\x02\\x00 \\x00!\\x9ah3M\\x07<]\\xc9\\x14\\xe1BA\\x06\\xbe\\x084'

Context

I scraped a string off of a webpage and stored it in the variable un. Now I want to decompress it using BZip2:

bz2.decompress(un)

However, since un is a str object, I get this error:

TypeError: a bytes-like object is required, not 'str'

Therefore, I need to convert un to a bytes-like object without changing the single backslash to an escaped backslash.

Edit 1: Thank you for all the help! @wim I understand what you mean now, but I am at a loss as to how I can retrieve a bytes-like object from my webscraping method:

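import re

import requests
from lxml import html

# Fetch the page; r.content is the raw response body as bytes
r = requests.get('http://www.pythonchallenge.com/pc/def/integrity.html')

# Parse the HTML and take lines 1-2 of the first comment (these are str objects)
doc = html.fromstring(r.content)
comment = doc.xpath('//comment()')[0].text.split('\n')[1:3]

# Capture the quoted value that follows a two-letter key such as "un"
pattern = re.compile("[a-z]{2}: '(.+)'")

un = re.search(pattern, comment[0]).group(1)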

The packages that I am using are requests, lxml.html, re, and bz2.

Once again, my goal is to decompress un using bz2, but I am having difficulty getting a bytes-like object from my webscraping process.

Any pointers?

Bryan Yao asked Nov 08 '22 04:11


1 Answer

Your bug is introduced earlier, in the scraping code. The only acceptable solution is to change the scraping code so that it returns a bytes object and not a text object. Do not try to "convert" your string un into bytes; it cannot be done reliably.

Do NOT do this:

>>> import bz2
>>> un = 'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'
>>> bz2.decompress(un.encode('raw_unicode_escape'))
b'huge'

The "raw_unicode_escape" codec is essentially Latin-1 with a built-in fallback for characters outside that range: code points above U+00FF are written as \uXXXX or \UXXXXXXXX escape sequences. Existing backslashes are not escaped in any way. It is used in the Python pickle protocol. If your text contains any character that cannot be represented as a single \xXX byte, the encoded result will no longer match the original bytes and your data will be corrupted.
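To make that behaviour concrete, here is a small sketch of what the codec does to different kinds of characters. The last part assumes a hypothetical upstream step that decoded the byte 0x80 as cp1252; that step is an illustration only and is not taken from the question's code.

```python
# Code points below U+0100 come out as the single matching Latin-1 byte.
latin1_range = "\xaf".encode("raw_unicode_escape")

# Anything above U+00FF becomes a textual escape made of ASCII bytes.
bmp = "\u20ac".encode("raw_unicode_escape")
astral = "\U0001f600".encode("raw_unicode_escape")

# A backslash that is already in the text passes through unescaped.
backslash = "\\".encode("raw_unicode_escape")

# Assumed corruption scenario: one original byte was decoded as cp1252
# somewhere upstream, so it turned into the euro sign (U+20AC).
original = b"\x80"
text = original.decode("cp1252")
round_tripped = text.encode("raw_unicode_escape")

print(latin1_range, bmp, astral, backslash)
print(original, "->", round_tripped)
```

In the assumed scenario, one byte of compressed data comes back as six ASCII bytes, and bz2 has no way of telling that anything went wrong until decompression fails.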

The web scraping code has no business returning bz2-compressed bytes as a str, so that is where you need to address the cause of the problem, rather than attempting to deal with the symptoms.
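As a rough sketch of what "bytes all the way through" could look like, the example below runs a bytes regex over a bytes body, so the captured group is already a bytes object. The names body and payload are hypothetical stand-ins for r.content and the embedded data, and the sketch assumes the compressed payload sits in the response as raw bytes. If the page instead spells the data out as textual escape sequences, this alone will not be enough.

```python
import bz2
import re

# Hypothetical stand-in for r.content: a bytes body that is never decoded to str.
payload = bz2.compress(b"huge")
body = b"<!--\nun: '" + payload + b"'\n-->"

# A bytes pattern applied to bytes input yields bytes match groups.
# DOTALL lets "." match newline bytes that may occur inside compressed data.
pattern = re.compile(rb"[a-z]{2}: '(.+)'", re.DOTALL)
un = pattern.search(body).group(1)

result = bz2.decompress(un)
print(type(un).__name__, result)
```

In the question's code, r.content is already bytes; the data only becomes text at the html.fromstring(...) / .text step, so that is the step to replace or bypass.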

wim answered Nov 14 '22 21:11