I have a very large dataset in CSV format in which one column is a JSON string. I want to read this information into a flat Pandas data frame. How can I achieve this efficiently?
Input CSV:
col1,col2,col3,col4
1,Programming,"{""col3_1"":null,""col3_2"":""Java""}",11
2,Sport,"{""col3_1"":null,""col3_2"":""Soccer""}",22
3,Food,"{""col3_1"":null,""col3_2"":""Pizza""}",33
Expected DataFrame:
+------+-------------+--------+--------+------+
| col1 | col2        | col3_1 | col3_2 | col4 |
+------+-------------+--------+--------+------+
| 1    | Programming | None   | Java   | 11   |
| 2    | Sport       | None   | Soccer | 22   |
| 3    | Food        | None   | Pizza  | 33   |
+------+-------------+--------+--------+------+
I can currently get the expected output using the following code. I just want to know if there is a more efficient way to achieve the same.
import json
import pandas
dataset = pandas.read_csv('/dataset.csv')
dataset['col3'] = dataset['col3'].apply(json.loads)
dataset['col3_1'] = dataset['col3'].apply(lambda row: row['col3_1'])
dataset['col3_2'] = dataset['col3'].apply(lambda row: row['col3_2'])
dataset = dataset.drop(columns=['col3'])
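You can parse the JSON in a Pandas column with json.loads() and expand the resulting dicts into separate columns with pd.Series(). Here df is the DataFrame returned by read_csv, and pandas is imported as pd: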
In [85]: df.join(df.pop('col3').apply(lambda x: pd.Series(json.loads(x))))
Out[85]:
   col1         col2  col4 col3_1  col3_2
0     1  Programming    11   None    Java
1     2        Sport    22   None  Soccer
2     3         Food    33   None   Pizza
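Since the question is about efficiency on a very large file, note that building one pd.Series per row tends to be slow. A variant that is usually faster is to parse each JSON string once and construct a single DataFrame from the resulting list of dicts. The following is a minimal sketch, not a benchmarked solution: it assumes every col3 value is a valid, flat JSON object, and it reads the sample rows from an in-memory string (csv_text is only a stand-in for the real file).

```python
import io
import json

import pandas as pd

# Sample data from the question; csv_text stands in for the real CSV file.
csv_text = '''col1,col2,col3,col4
1,Programming,"{""col3_1"":null,""col3_2"":""Java""}",11
2,Sport,"{""col3_1"":null,""col3_2"":""Soccer""}",22
3,Food,"{""col3_1"":null,""col3_2"":""Pizza""}",33
'''

df = pd.read_csv(io.StringIO(csv_text))

# Parse every JSON string exactly once; pop() also removes col3 from df.
parsed = df.pop('col3').map(json.loads)

# Build all new columns in one step from the list of dicts,
# reusing df's index so the join below lines up row by row.
expanded = pd.DataFrame(parsed.tolist(), index=df.index)

flat = df.join(expanded)
print(flat)
```

If the JSON objects were nested, pd.json_normalize(parsed.tolist()) could take the place of the pd.DataFrame(...) call, at some extra cost.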