
Pandas/Python ruining JSON data in DataFrames

I'm interacting with an API and getting JSON data back. At the top level of the JSON object I have 'regular' data but some fields have more advanced structures:

{
    "foo": 1,
    "bar": "string",
    "spam": {
                "egg":"green",
                "ham":"yum",
                "ran": {
                        "out_of":"fake_words"
                       }
            }
}

I need to preserve these advanced structures (such as "spam") as valid JSON.

I'm getting the data using Requests and loading this into a Pandas DataFrame like so:

import json
import pandas as pd
api_result = api.get_data().json()     # the JSON body of the Requests response
df = pd.read_json(json.dumps(api_result))

This gives me a nice DataFrame with three columns, as I expect (this is also what I want). The problem comes with the strings:

foo    bar        spam
1      'string'   {'egg':'green','ham':'yum','ran':{'out_of':'fake_words'}}

Pandas or Python has changed all the strings in my data to single quotes (') instead of valid JSON double quotes ("). This breaks all the downstream processing, which expects valid JSON objects.

EDIT--> My program writes out a csv that is ingested into a database table expecting valid JSON in many of the fields. This table is used by many other processes for further analysis and data preparation. <--EDIT

Is there any way to tell Pandas/Python to stop changing my strings from double to single quotes? I know the general consensus is that single quotes are more Pythonic, but now they're ruining everything for me.

Thanks!

jbarney asked Feb 01 '26 12:02


1 Answer

If you want to generate valid JSON in Python, the best route is the built-in json module. Its dumps function creates a valid JSON string from a Python dictionary:

>>> import json
>>> data = {'egg':'green','ham':'yum','ran':{'out_of':'fake_words'}}
>>> json.dumps(data)
'{"egg": "green", "ham": "yum", "ran": {"out_of": "fake_words"}}'

Edited answer based on edited question:

The problem is that when you read JSON into a Pandas dataframe, everything is converted into Python objects. In your case, the nested JSON objects are converted into Python dicts, and when you print the dataframe you see Python's own string representation (repr) of those dicts. This representation looks almost like JSON, but it is not JSON: Pandas has not altered your data, you are just looking at dicts rather than JSON text.
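To make the difference concrete, here is a minimal sketch using only the standard library. The variable names are mine, and the printed output in the comments assumes Python 3.7 or later, where dicts keep insertion order.

```python
import json

# Same nested structure as the "spam" field in the question.
spam = {"egg": "green", "ham": "yum", "ran": {"out_of": "fake_words"}}

as_repr = str(spam)         # what you see when a dict (or a DataFrame cell) is printed
as_json = json.dumps(spam)  # an actual JSON document

print(as_repr)  # {'egg': 'green', 'ham': 'yum', 'ran': {'out_of': 'fake_words'}}
print(as_json)  # {"egg": "green", "ham": "yum", "ran": {"out_of": "fake_words"}}

# The repr is not parseable as JSON; the json.dumps output is.
try:
    json.loads(as_repr)
    repr_is_valid_json = True
except json.JSONDecodeError:
    repr_is_valid_json = False

round_trips = json.loads(as_json) == spam
print(repr_is_valid_json, round_trips)  # False True
```

So the single quotes only appear when a dict is turned into text by Python itself; the dict holds no quotes of either kind.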

What you need to do is to convert the dicts in your dataframe to valid JSON strings. To do this conversion in the "spam" column, you could use an apply() method call, e.g.

df['spam'] = df['spam'].apply(json.dumps)

Now the column contains JSON strings rather than Python dicts.
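Since the question mentions writing a CSV, here is a rough end-to-end sketch of the apply() step followed by a CSV round trip. It rests on a few assumptions: api_result is a hypothetical stand-in for your API response, the DataFrame is built directly from it rather than with read_json, to_json_if_nested is a helper name I made up, and the CSV goes to an in-memory buffer instead of a file.

```python
import csv
import io
import json

import pandas as pd

# Hypothetical stand-in for the API response in the question.
api_result = {
    "foo": 1,
    "bar": "string",
    "spam": {"egg": "green", "ham": "yum", "ran": {"out_of": "fake_words"}},
}

# One row per record; the nested "spam" value is stored as a Python dict.
df = pd.DataFrame([api_result])


def to_json_if_nested(value):
    """Serialize dicts and lists to JSON text; leave scalar values untouched."""
    return json.dumps(value) if isinstance(value, (dict, list)) else value


df["spam"] = df["spam"].apply(to_json_if_nested)

# Write the CSV to an in-memory buffer (a file path works the same way).
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Read it back with a CSV parser and check that the field is valid JSON.
rows = list(csv.DictReader(io.StringIO(csv_text)))
recovered = json.loads(rows[0]["spam"])
print(recovered == api_result["spam"])
```

One thing to keep in mind: in the raw CSV text the JSON field is wrapped in quotes and its inner double quotes are doubled, which is standard CSV escaping. Whatever loads the file into the database has to parse it as CSV (as csv.DictReader does here) to get the original JSON string back.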

jakevdp answered Feb 03 '26 09:02


