I am trying to read a large CSV which includes JSON features (location here). For the first, say 100 lines, the file looks like this:
Time,location,labelA,labelB
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},nan,nan
I followed this question to parse the location column. The solution basically defines a helper as:

def CustomParser(data):
    import json
    j1 = json.loads(data)
    return j1

and then:

df = pd.read_csv('data.csv', nrows=100, converters={'location': CustomParser}, header=0)
I get the following error which is related to JSON format:
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Q1: How can I parse the location feature into new columns?

Q2 (general case): For nrows > 100, the last features (labelA and labelB) also contain JSON with different keys and values. How can I read the entire CSV while parsing every feature that contains JSON (even partially)?
The location column contains a dict, whose key-value pairs are themselves separated by commas, so the first step is to change the CSV delimiter from , to |. The header row can use a plain .replace(',', '|'); in the data rows, only the commas outside of {} may be replaced.
Time,location,labelA,labelB
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},{"ack":123,"bar":456},{"foo":123,"bar":456}
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},nan,nan
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},{"ack":123,"bar":456},{"foo":123,"bar":456}
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},nan,nan
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},{"ack":123,"bar":456},{"foo":123,"bar":456}
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},nan,nan
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},{"ack":123,"bar":456},{"foo":123,"bar":456}
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},nan,nan
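The data-row replacement relies on a regex that only matches commas outside of {}. A minimal single-line sketch, using a shortened version of the sample row:

```python
import re

# Only commas outside {...} match: the lookahead requires either a '{'
# ahead of any '}' (a comma before a JSON blob) or no braces at all up
# to the end of the line (a comma after the last JSON blob).
line = '2019-09-10,{"lng":12.9,"alt":413.0,"lat":17.8},nan,nan'
fixed = re.sub(r',(?=(((?!\}).)*\{)|[^\{\}]*$)', '|', line)
print(fixed)
# 2019-09-10|{"lng":12.9,"alt":413.0,"lat":17.8}|nan|nan
```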
Path.cwd() assumes the current working directory; if that is not the case, something like Path('c:/some_path_to_my_file') / 'file_name.poo' can be used.

import re
from pathlib import Path
p = Path.cwd() / 'test.csv'
p2 = Path.cwd() / 'test2.csv'
with p.open('r') as f:
    with p2.open('w') as f2:
        for cnt, line in enumerate(f):
            if cnt == 0:
                line = line.replace(',', '|')
            else:
                line = re.sub(r',(?=(((?!\}).)*\{)|[^\{\}]*$)', '|', line)
            f2.write(line)

test2.csv then looks like:
Time|location|labelA|labelB
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|{"ack":123,"bar":456}|{"foo":123,"bar":456}
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|nan|nan
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|{"ack":123,"bar":456}|{"foo":123,"bar":456}
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|nan|nan
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|{"ack":123,"bar":456}|{"foo":123,"bar":456}
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|nan|nan
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|{"ack":123,"bar":456}|{"foo":123,"bar":456}
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|nan|nan
After .read_csv, the location, labelA and labelB columns are str; use ast.literal_eval to convert them to dict. literal_eval won't work on nan, so replace nan with '{}' first.
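A minimal sketch of that nan issue (the sample values are mine): literal_eval chokes on the float NaN that pandas produces for empty cells, but parses cleanly after fillna('{}'):

```python
import pandas as pd
from ast import literal_eval

# Replace the float NaN with the string '{}' before parsing, so every
# cell is a valid Python literal.
s = pd.Series(['{"ack": 123, "bar": 456}', float('nan')])
s = s.fillna('{}')
parsed = s.apply(literal_eval)
print(parsed[0])  # {'ack': 123, 'bar': 456}
print(parsed[1])  # {}
```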
for col in df.columns[1:]: loops through each of the columns and:
- the try-except will catch any column that is not properly formed
- converts the str to a dict
- expands the dict keys into columns
- concats the new columns to the existing dataframe
- drops the old column

import pandas as pd
from ast import literal_eval
df = pd.read_csv('test2.csv', sep='|')
print(df)
Time location labelA labelB
2019-09-10 {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8} {"ack":123,"bar":456} {"foo":123,"bar":456}
2019-09-10 {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8} NaN NaN
2019-09-10 {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8} {"ack":123,"bar":456} {"foo":123,"bar":456}
2019-09-10 {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8} NaN NaN
2019-09-10 {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8} {"ack":123,"bar":456} {"foo":123,"bar":456}
2019-09-10 {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8} NaN NaN
2019-09-10 {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8} {"ack":123,"bar":456} {"foo":123,"bar":456}
2019-09-10 {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8} NaN NaN
for col in df.columns[1:]:
    try:
        df[col].fillna('{}', inplace=True)
        df[col] = df[col].apply(literal_eval)
        df = pd.concat([df, df[col].apply(pd.Series)], axis=1)
        df.drop(columns=[col], inplace=True)
    except (SyntaxError, ValueError) as e:
        print(f'{col}: {e}')
print(df)
Time lng alt time error lat ack bar foo bar
2019-09-10 12.9 413.0 2019-09-10 7.0 17.8 123.0 456.0 123.0 456.0
2019-09-10 12.9 413.0 2019-09-10 7.0 17.8 NaN NaN NaN NaN
2019-09-10 12.9 413.0 2019-09-10 7.0 17.8 123.0 456.0 123.0 456.0
2019-09-10 12.9 413.0 2019-09-10 7.0 17.8 NaN NaN NaN NaN
2019-09-10 12.9 413.0 2019-09-10 7.0 17.8 123.0 456.0 123.0 456.0
2019-09-10 12.9 413.0 2019-09-10 7.0 17.8 NaN NaN NaN NaN
2019-09-10 12.9 413.0 2019-09-10 7.0 17.8 123.0 456.0 123.0 456.0
2019-09-10 12.9 413.0 2019-09-10 7.0 17.8 NaN NaN NaN NaN
Two notes:
- .read_csv doesn't interpret containers (e.g. dict or list) well; they come back as a str unless you specify the converters parameter (e.g. pd.read_csv('test3.csv', sep='|', converters={'a': literal_eval})).
- literal_eval will not work on a column comprised of both containers and strings or NaN, unless the string is only numeric (e.g. '8654'). That is why nan is replaced with {}, so literal_eval won't raise an error.

For example, given a column_a that mixes dicts and plain strings:

column_a
{"ack":123,"bar":456}
some string
{"ack":123,"bar":456}
some string
{"ack":123,"bar":456}
some string
On a column like that, literal_eval will throw ValueError: malformed node or string. It will work on the location column, though, since it is all dicts. Use the following code:

df['location'] = df['location'].apply(literal_eval)
df = pd.concat([df, df['location'].apply(pd.Series)], axis=1)
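For a genuinely mixed column like column_a above, one possible workaround, which is not part of the original answer (the helper name safe_eval is mine), is to fall back to the raw value whenever parsing fails:

```python
from ast import literal_eval
import pandas as pd

def safe_eval(value):
    # Hypothetical helper: parse with literal_eval, but keep the
    # original value when it isn't a valid Python literal.
    try:
        return literal_eval(value)
    except (SyntaxError, ValueError):
        return value

col = pd.Series(['{"ack":123,"bar":456}', 'some string'])
parsed = col.apply(safe_eval)
print(parsed[0])  # {'ack': 123, 'bar': 456}
print(parsed[1])  # some string
```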
If the location column is not formed properly, e.g.:

'{"lng":12.9975201,alt:413.0,"time:""2019-09-10T12:09:58Z""",error:7.0,lat:47.8258582}'

instead of:

'{"lng":12.9975201,"alt":413.0,"time":"2019-09-10T12:09:58Z","error":7.0,"lat":47.8258582}'

fix the location column first (the column is named Position in the real data):

def fix_pos(x):
    word_dict = {'alt': '"alt"',
                 '"time:"': '"time":',
                 '"",error:': ',"error":',
                 'lat': '"lat"'}
    for k, v in word_dict.items():
        x = x.replace(k, v)
    return x
df.Position = df.Position.apply(lambda x: fix_pos(x))
Zeit, device, Text & Type don't need to be processed; Position is at index 4:

for col in df.columns[4:]:
    try:
        df[col].fillna('{}', inplace=True)
        df[col] = df[col].apply(literal_eval)
        df = pd.concat([df, df[col].apply(pd.Series)], axis=1)
        df.drop(columns=[col], inplace=True)
    except (SyntaxError, ValueError) as e:
        print(f'{col}: {e}')
The loop applying literal_eval to all the columns has been updated with a try-except; on an exception, the column name and the error message are printed out. For the real .csv file this produces:

device: unexpected EOF while parsing (<unknown>, line 1)
Text: malformed node or string: <_ast.Name object at 0x00000203B8473C08>
Typ: malformed node or string: <_ast.Name object at 0x00000203BE217E08>
Data: unexpected EOF while parsing (<unknown>, line 1)
Data1: invalid syntax (<unknown>, line 1)
Data2: invalid syntax (<unknown>, line 1)
Unnamed: 8: invalid syntax (<unknown>, line 1)
Unnamed: 9: unexpected EOF while parsing (<unknown>, line 1)
Unnamed: 10: invalid syntax (<unknown>, line 1)
Unnamed: 11: unexpected EOF while parsing (<unknown>, line 1)
Unnamed: 12: invalid syntax (<unknown>, line 1)
Unnamed: 13: invalid syntax (<unknown>, line 1)
Unnamed: 14: invalid syntax (<unknown>, line 1)
Unnamed: 15: invalid syntax (<unknown>, line 1)
Unnamed: 16: invalid syntax (<unknown>, line 1)
Unnamed: 17: invalid syntax (<unknown>, line 1)
Unnamed: 18: invalid syntax (<unknown>, line 1)
Unnamed: 19: invalid syntax (<unknown>, line 1)
Unnamed: 20: invalid syntax (<unknown>, line 1)
Unnamed: 21: unexpected EOF while parsing (<unknown>, line 1)
Unnamed: 22: invalid syntax (<unknown>, line 1)
Unnamed: 23: invalid syntax (<unknown>, line 1)
Unnamed: 24: invalid syntax (<unknown>, line 1)
Unnamed: 25: invalid syntax (<unknown>, line 1)
Unnamed: 26: invalid syntax (<unknown>, line 1)
Unnamed: 27: invalid syntax (<unknown>, line 1)
The problem here is that the commas inside your json
string are being treated as delimiters. You should modify the input data (if you don't have direct access to the file, you can always read the contents into a list of strings using open
first).
Here are a few modification options that you can try:
Option 1: Quote json
string with single quote
Use a single quote (or another character that doesn't otherwise appear in your data) as a quote character for your json
string.
>> cat data.csv
Time,location,labelA,labelB
2019-09-10,'{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}',nan,nan
Then use quotechar="'"
when you read the data:
import pandas as pd
import json
df=pd.read_csv('data.csv', converters={'location':json.loads}, header=0, quotechar="'")
Option 2: Quote json
string with double quote and escape
If the single quote can't be used, you can actually use the double quote as the quotechar
, as long as you escape the quotes inside the json
string:
>> cat data.csv
Time,location,labelA,labelB
2019-09-10,"{""lng"":12.9,""alt"":413.0,""time"":""2019-09-10"",""error"":7.0,""lat"":17.8}",nan,nan
Notice that this now matches the format of the question you linked.
df=pd.read_csv('data.csv', converters={'location':json.loads}, header=0, quotechar='"')
Option 3: Change the delimiter
Use a different character, for example the |
as the delimiter
>> cat data.csv
Time|location|labelA|labelB
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|nan|nan
Now use the sep
argument to specify the new delimiter:
df=pd.read_csv('data.csv', converters={'location':json.loads}, header=0, sep="|")
Each of these methods produces the same output:
print(df)
# Time location labelA labelB
#0 2019-09-10 {u'lat': 17.8, u'lng': 12.9, u'error': 7.0, u'... NaN NaN
Once you have that, you can expand the location
column using one of the methods described in Flatten JSON column in a Pandas DataFrame
new_df = df.join(pd.io.json.json_normalize(df["location"])).drop(["location"], axis=1)
print(new_df)
# Time labelA labelB alt error lat lng time
#0 2019-09-10 NaN NaN 413.0 7.0 17.8 12.9 2019-09-10
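If you'd rather not write a modified file to disk, the "read the contents into strings first" idea can also be done entirely in memory. A sketch combining it with Option 1, assuming the JSON objects are never nested (so a simple {...} regex can quote them):

```python
import io
import json
import re
import pandas as pd

raw = '''Time,location,labelA,labelB
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},nan,nan
'''

# Wrap each non-nested {...} blob in single quotes so pandas sees it as
# one quoted field, then parse from memory with quotechar="'".
quoted = re.sub(r'(\{[^{}]*\})', r"'\1'", raw)
df = pd.read_csv(io.StringIO(quoted),
                 converters={'location': json.loads},
                 quotechar="'")
print(df.loc[0, 'location']['lat'])  # 17.8
```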