Encode a raw string so it can be decoded as json

I am throwing in the towel here. I'm trying to convert a string scraped from the source code of a website with Scrapy (injected JavaScript) to JSON so I can easily access the data. The problem comes down to a decode error. I have tried all kinds of encoding, decoding, escaping, codecs, regular expressions, and string manipulations, and nothing works. I'm using Python 3.

I narrowed the culprit down to this string (or at least part of it):

scraped = '{"propertyNotes": [{"title": "Local Description", "text": "\u003Cp\u003EAPPS\u003C/p\u003E\n\n\u003Cp\u003EBig Island Revealed (comes as app or as a printed book)\u003C/p\u003E\n\n\u003Cp\u003EAloha Big Island\u003C/p\u003E\n\n\u003Cp\u003EBig Island\u003C/p\u003E\n\n\u003Cp\u003EBig Island Smart Maps (I like this one a lot)\u003C/p\u003E\n\n\u003Cp\u003EBig Island Adventures (includes videos)\u003C/p\u003E\n\n\u003Cp\u003EThe descriptions of beaches are helpful.  Suitability for swimming, ease of access, etc. is included.  Some beaches are great for picnics and scenic views, while others are suitable for swimming and snorkeling. Check before you go.\u003C/p\u003E"}]}'

scraped_raw = r'{"propertyNotes": [{"title": "Local Description", "text": "\u003Cp\u003EAPPS\u003C/p\u003E\n\n\u003Cp\u003EBig Island Revealed (comes as app or as a printed book)\u003C/p\u003E\n\n\u003Cp\u003EAloha Big Island\u003C/p\u003E\n\n\u003Cp\u003EBig Island\u003C/p\u003E\n\n\u003Cp\u003EBig Island Smart Maps (I like this one a lot)\u003C/p\u003E\n\n\u003Cp\u003EBig Island Adventures (includes videos)\u003C/p\u003E\n\n\u003Cp\u003EThe descriptions of beaches are helpful.  Suitability for swimming, ease of access, etc. is included.  Some beaches are great for picnics and scenic views, while others are suitable for swimming and snorkeling. Check before you go.\u003C/p\u003E"}]}'

import json

data = json.loads(scraped_raw)  # works: the raw literal keeps \n and \u003C as escapes
print(data["propertyNotes"])

failed = json.loads(scraped)  # fails: raises json.JSONDecodeError
print(failed["propertyNotes"])

Unfortunately, I cannot find a way for Scrapy/Splash to return the string as raw. So, somehow, I need to have Python interpret the string as raw while it is loading the JSON. Please help.

Update:

What worked for that string was json.loads(str(scraped.encode('unicode_escape'), 'utf-8')). However, it didn't work with the larger string. The error I get there is JSONDecodeError: Invalid \escape.

Dennis Pitt asked Jun 29 '26 14:06



1 Answer

The problem exists because the string you're getting contains escaped control characters (such as \n), which Python turns into actual control characters when it interprets a normal, non-raw string literal. That is not necessarily bad in itself, but JSON does not expect raw control characters inside a string value, so json.loads rejects them. Similar to Turn's answer, you need to hand the string to the JSON parser without the escaped values having been interpreted, which is done using

json.loads(scraped.encode('unicode_escape'))

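This works because the unicode_escape codec encodes the string as ASCII bytes in which control characters, such as a real newline, are written back out as backslash escapes (\n), which json.loads accepts. Printable ASCII characters, such as the < that Python produced from \u003C, are left as they are.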
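
To make this concrete, here is a minimal sketch of what the unicode_escape step does. The sample string is a shortened, made-up stand-in for the scraped value in the question, not the real data.

```python
import json

# Hypothetical sample: a non-raw literal, so Python turns \n into a real newline
# and \u003C / \u003E into "<" / ">" before json ever sees the text.
sample = '{"text": "\u003Cp\u003EAPPS\u003C/p\u003E\n\n\u003Cp\u003EBig Island\u003C/p\u003E"}'

# Each real newline is written back out as the two characters backslash + n.
encoded = sample.encode('unicode_escape')
print(encoded)

# json.loads accepts bytes and decodes the \n escapes itself.
parsed = json.loads(encoded)
print(repr(parsed["text"]))
```

Note that after json.loads the parsed text contains real newlines again, so for this sample nothing is lost once the JSON has been decoded.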

If my understanding is correct, however, you may not want this, because the encoded string no longer contains the real control characters (they become backslash escapes), so it is not the same as the original string.

You can see this in action by noticing that the control characters disappear after converting the encoded string back to a normal Python string:

scraped.encode('unicode_escape').decode('utf-8')

If you want to keep the control characters, you're going to have to escape them in the string before loading it.
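
As a rough sketch of that idea, the following escapes only the raw control characters and leaves everything else alone. The helper name escape_control_chars and the sample string are made up for illustration, and the approach assumes control characters only occur inside string values (it would break pretty-printed JSON that has newlines between tokens). The standard library's json.loads also accepts strict=False, which allows control characters inside strings and may be simpler. The last part shows one possible cause of the Invalid \escape error from the question: unicode_escape writes non-ASCII characters such as é as \xe9, which is not a valid JSON escape. The larger string isn't shown, so that part is only a guess.

```python
import json
import re

# Raw control characters (U+0000 to U+001F) are not allowed inside JSON strings.
CONTROL_CHARS = re.compile(r'[\x00-\x1f]')

def escape_control_chars(text):
    # Hypothetical helper: rewrite each raw control character as a \uXXXX escape.
    return CONTROL_CHARS.sub(lambda m: '\\u{:04x}'.format(ord(m.group())), text)

# Made-up sample: non-raw literal with a non-ASCII character, newlines and a tab.
sample = '{"text": "caf\u00e9\n\nBig Island\ttips"}'

parsed = json.loads(escape_control_chars(sample))
lenient = json.loads(sample, strict=False)  # standard-library alternative
print(parsed == lenient)

# unicode_escape turns the non-ASCII character into \xe9, which JSON rejects.
try:
    json.loads(sample.encode('unicode_escape'))
    failed_with_unicode_escape = False
except json.JSONDecodeError:
    failed_with_unicode_escape = True
print(failed_with_unicode_escape)
```

Both approaches hand json.loads the text unchanged apart from the control characters, so non-ASCII characters survive.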

jaitaiwan answered Jul 01 '26 05:07
