I have a string:
'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'
And I want:
b'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'
But I keep getting:
b'BZh91AY&SYA\\xaf\\x82\\r\\x00\\x00\\x01\\x01\\x80\\x02\\xc0\\x02\\x00 \\x00!\\x9ah3M\\x07<]\\xc9\\x14\\xe1BA\\x06\\xbe\\x084'
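To show that these really are different byte sequences rather than two displays of the same data, here is a small illustrative snippet comparing just the first few bytes of each:
>>> want = b'BZh91AY&SYA\xaf'    # first 12 bytes of what I want
>>> got = b'BZh91AY&SYA\\xaf'    # first 15 bytes of what I keep getting
>>> want[11], got[11:]           # one byte with value 0xAF versus the four characters \, x, a, f
(175, b'\\xaf')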
Context
I scraped a string off of a webpage and stored it in the variable un. Now I want to decompress it using BZip2:
bz2.decompress(un)
However, since un is a str object, I get this error:
TypeError: a bytes-like object is required, not 'str'
Therefore, I need to convert un to a bytes-like object without the single backslashes being turned into escaped backslashes.
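For reference, once the variable actually holds these byte values, bz2.decompress accepts it directly; the b'huge' output below is the result shown in the answer further down:
>>> import bz2
>>> data = b'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'
>>> bz2.decompress(data)
b'huge'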
Edit 1: Thank you for all the help! @wim I understand what you mean now, but I am at a loss as to how I can retrieve a bytes-like object from my webscraping method:
import re
import requests
from lxml import html

r = requests.get('http://www.pythonchallenge.com/pc/def/integrity.html')
doc = html.fromstring(r.content)
comment = doc.xpath('//comment()')[0].text.split('\n')[1:3]   # the comment text comes back as a str
pattern = re.compile("[a-z]{2}: '(.+)'")
un = re.search(pattern, comment[0]).group(1)                  # so un ends up as a str, not bytes
The packages that I am using are requests, lxml.html, re, and bz2.
Once again, my goal is to decompress un using bz2, but I am having difficulty getting a bytes-like object from my webscraping process.
Any pointers?
Your bug exists earlier. The only acceptable solution is to change the scraping code so that it returns a bytes object rather than a text object. Do not try to "convert" your string un into bytes; it cannot be done reliably.
Do NOT do this:
>>> un = 'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'
>>> bz2.decompress(un.encode('raw_unicode_escape'))
b'huge'
The "raw_unicode_escape" codec is just Latin-1 with a built-in fallback for characters outside that range: any other code point is written out as a literal \uXXXX or \UXXXXXXXX escape, and existing backslashes are not escaped in any way. It is used in the Python pickle protocol. If your data contains any character that cannot be represented as a single \xXX byte, it will be corrupted.
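As a quick illustration (U+03A9 here is just an arbitrary character outside Latin-1):
>>> 'A\xaf'.encode('raw_unicode_escape')     # within Latin-1: one byte per character
b'A\xaf'
>>> 'A\u03a9'.encode('raw_unicode_escape')   # outside Latin-1: six literal bytes, backslash-u-0-3-a-9
b'A\\u03a9'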
The web scraping code has no business returning bz2-encoded bytes as a str, so that's where you need to address the cause of the problem, rather than attempting to deal with the symptoms.
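One possible direction (a rough sketch; the bytes pattern below mirrors your text pattern and is an assumption, as is the idea that the payload can be matched straight out of r.content) is to keep the extraction in the bytes domain. r.content is already bytes, so searching it with a bytes regex means the data never passes through a str at all:
import re
import requests

r = requests.get('http://www.pythonchallenge.com/pc/def/integrity.html')

# r.content is bytes; a bytes pattern keeps the payload out of the text domain
match = re.search(rb"un: '(.+)'", r.content)
un = match.group(1)        # a bytes object, not a str

print(type(un))            # <class 'bytes'>
Whether those bytes are the compressed stream itself or an escaped rendering of it depends on how the page stores them, but any remaining fix is then a bytes-to-bytes step rather than a lossy str-to-bytes conversion.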