I am trying to read a large CSV in which some features contain JSON (location here). For the first, say, 100 lines, the file looks like this:
Time,location,labelA,labelB
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},nan,nan
I followed this question to parse the location column. The solution basically defines a helper as:
def CustomParser(data):
    import json
    j1 = json.loads(data)
    return j1
and then
df = pd.read_csv('data.csv', nrows=100, converters={'location': CustomParser}, header=0)
I get the following error which is related to JSON format:
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Q1: How can I parse the feature location onto new columns?
Q2 (general case): Beyond the first 100 rows (nrows > 100), the last features (labelA and labelB) also contain JSON, with different keys and values. How can I read the entire CSV while parsing every feature that includes JSON (even partially)?
The location, labelA and labelB columns contain dicts, whose key-value pairs are themselves separated by commas. The first step is to change the file's delimiter from , to | with .replace(',', '|'), but only for the commas outside of {}. Given a test file like:
Time,location,labelA,labelB
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},{"ack":123,"bar":456},{"foo":123,"bar":456}
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},nan,nan
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},{"ack":123,"bar":456},{"foo":123,"bar":456}
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},nan,nan
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},{"ack":123,"bar":456},{"foo":123,"bar":456}
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},nan,nan
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},{"ack":123,"bar":456},{"foo":123,"bar":456}
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},nan,nan
Path.cwd() assumes the file is in the current working directory; if that is not the case, Path('c:/some_path_to_my_file') / 'file_name.poo' can be used instead.
import re
from pathlib import Path
p = Path.cwd() / 'test.csv'
p2 = Path.cwd() / 'test2.csv'
with p.open('r') as f:
    with p2.open('w') as f2:
        for cnt, line in enumerate(f):
            if cnt == 0:
                # header row contains no dicts, so every comma is a delimiter
                line = line.replace(',', '|')
            else:
                # replace only the commas outside {...} with |
                line = re.sub(r',(?=(((?!\}).)*\{)|[^\{\}]*$)', '|', line)
            f2.write(line)
The converted file, test2.csv, now looks like:
Time|location|labelA|labelB
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|{"ack":123,"bar":456}|{"foo":123,"bar":456}
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|nan|nan
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|{"ack":123,"bar":456}|{"foo":123,"bar":456}
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|nan|nan
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|{"ack":123,"bar":456}|{"foo":123,"bar":456}
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|nan|nan
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|{"ack":123,"bar":456}|{"foo":123,"bar":456}
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|nan|nan
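To see what the regex actually does, here is a standalone check on one line (the sample string is condensed from the data above):
import re

line = '2019-09-10,{"lng":12.9,"alt":413.0},{"ack":123,"bar":456},nan'
# only the commas outside {...} satisfy the lookahead and get replaced
print(re.sub(r',(?=(((?!\}).)*\{)|[^\{\}]*$)', '|', line))
# 2019-09-10|{"lng":12.9,"alt":413.0}|{"ack":123,"bar":456}|nan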
Read the converted file with .read_csv and sep='|'. The location, labelA and labelB columns come back as str; use ast.literal_eval to convert them to dict. literal_eval won't work on nan, so replace nan with '{}' first. for col in df.columns[1:]: loops through each of those columns and:
- uses try-except to catch any column that is not properly formed
- converts the str to dict
- expands the dict keys into columns
- concats the new columns to the existing dataframe
- drops the old column
import pandas as pd
from ast import literal_eval
df = pd.read_csv('test2.csv', sep='|')
print(df)
       Time                                                             location                 labelA                 labelB
 2019-09-10  {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}  {"ack":123,"bar":456}  {"foo":123,"bar":456}
 2019-09-10  {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}                    NaN                    NaN
 2019-09-10  {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}  {"ack":123,"bar":456}  {"foo":123,"bar":456}
 2019-09-10  {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}                    NaN                    NaN
 2019-09-10  {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}  {"ack":123,"bar":456}  {"foo":123,"bar":456}
 2019-09-10  {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}                    NaN                    NaN
 2019-09-10  {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}  {"ack":123,"bar":456}  {"foo":123,"bar":456}
 2019-09-10  {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}                    NaN                    NaN
for col in df.columns[1:]:
    try:
        df[col].fillna('{}', inplace=True)              # literal_eval can't parse NaN
        df[col] = df[col].apply(literal_eval)           # str -> dict
        df = pd.concat([df, df[col].apply(pd.Series)], axis=1)  # dict keys -> columns
        df.drop(columns=[col], inplace=True)            # remove the original column
    except (SyntaxError, ValueError) as e:
        print(f'{col}: {e}')                            # report columns that fail to parse
print(df)
       Time   lng    alt        time  error   lat    ack    bar    foo    bar
 2019-09-10  12.9  413.0  2019-09-10    7.0  17.8  123.0  456.0  123.0  456.0
 2019-09-10  12.9  413.0  2019-09-10    7.0  17.8    NaN    NaN    NaN    NaN
 2019-09-10  12.9  413.0  2019-09-10    7.0  17.8  123.0  456.0  123.0  456.0
 2019-09-10  12.9  413.0  2019-09-10    7.0  17.8    NaN    NaN    NaN    NaN
 2019-09-10  12.9  413.0  2019-09-10    7.0  17.8  123.0  456.0  123.0  456.0
 2019-09-10  12.9  413.0  2019-09-10    7.0  17.8    NaN    NaN    NaN    NaN
 2019-09-10  12.9  413.0  2019-09-10    7.0  17.8  123.0  456.0  123.0  456.0
 2019-09-10  12.9  413.0  2019-09-10    7.0  17.8    NaN    NaN    NaN    NaN
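Note that labelA and labelB both contain a bar key, so the frame above ends up with two bar columns. If that matters, a small variation of the loop (my addition, not part of the original answer) prefixes each expanded key with its source column name:
for col in df.columns[1:]:
    df[col] = df[col].fillna('{}')
    df[col] = df[col].apply(literal_eval)
    # prefix expanded keys with the source column, e.g. labelA_bar vs labelB_bar
    df = pd.concat([df, df[col].apply(pd.Series).add_prefix(f'{col}_')], axis=1)
    df = df.drop(columns=[col])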
Notes:
- .read_csv doesn't interpret containers (e.g. dict or list) well; they come in as str unless you specify the converters parameter, e.g. pd.read_csv('test3.csv', sep='|', converters={'a': literal_eval}).
- literal_eval will not work on a column comprised of both containers and strings or NaN, unless the string is only numeric (e.g. '8654'). That is why nan is replaced with '{}' above, so literal_eval doesn't raise an error.
Given a column such as the following, column_a:
{"ack":123,"bar":456}
some string
{"ack":123,"bar":456}
some string
{"ack":123,"bar":456}
some string
literal_eval will throw ValueError: malformed node or string on the plain-string values.
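A possible workaround for such a mixed column (my sketch, not from the original answer) is to convert only the values that look like dict literals and leave the rest alone:
import pandas as pd
from ast import literal_eval

s = pd.Series(['{"ack":123,"bar":456}', 'some string'] * 3, name='column_a')
# convert only values that start with '{'; plain strings pass through unchanged
parsed = s.apply(lambda v: literal_eval(v) if isinstance(v, str) and v.startswith('{') else v)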
You can apply literal_eval directly to the location column if it is all dicts:
df['location'] = df['location'].apply(literal_eval)
df = pd.concat([df, df['location'].apply(pd.Series)], axis=1)
In the real data, however, the location column is not formed properly:
'{"lng":12.9975201,alt:413.0,"time:""2019-09-10T12:09:58Z""",error:7.0,lat:47.8258582}''{"lng":12.9975201,"alt":413.0,"time":"2019-09-10T12:09:58Z","error":7.0,"lat":47.8258582}'location column:location column is Position in the real datadef fix_pos(x):
    word_dict = {'alt': '"alt"',
                 '"time:"': '"time":',
                 '"",error:': ',"error":',
                 'lat': '"lat"'}
    for k, v in word_dict.items():
        x = x.replace(k, v)
    return x
df.Position = df.Position.apply(fix_pos)
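A quick sanity check on a single value (assuming fix_pos from the block above):
from ast import literal_eval

bad = '{"lng":12.9975201,alt:413.0,"time:""2019-09-10T12:09:58Z""",error:7.0,lat:47.8258582}'
print(literal_eval(fix_pos(bad)))
# {'lng': 12.9975201, 'alt': 413.0, 'time': '2019-09-10T12:09:58Z', 'error': 7.0, 'lat': 47.8258582}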
Zeit, device, Text & Typ don't need to be processed; Position is at index 4, so the loop starts there:
for col in df.columns[4:]:
    try:
        df[col].fillna('{}', inplace=True)
        df[col] = df[col].apply(literal_eval)
        df = pd.concat([df, df[col].apply(pd.Series)], axis=1)
        df.drop(columns=[col], inplace=True)
    except (SyntaxError, ValueError) as e:
        print(f'{col}: {e}')
The loop applying literal_eval to all columns has been updated with a try-except; upon an exception, the column name and error message are printed out. For the real .csv file this prints:
device: unexpected EOF while parsing (<unknown>, line 1)
Text: malformed node or string: <_ast.Name object at 0x00000203B8473C08>
Typ: malformed node or string: <_ast.Name object at 0x00000203BE217E08>
Data: unexpected EOF while parsing (<unknown>, line 1)
Data1: invalid syntax (<unknown>, line 1)
Data2: invalid syntax (<unknown>, line 1)
Unnamed: 8: invalid syntax (<unknown>, line 1)
Unnamed: 9: unexpected EOF while parsing (<unknown>, line 1)
Unnamed: 10: invalid syntax (<unknown>, line 1)
Unnamed: 11: unexpected EOF while parsing (<unknown>, line 1)
Unnamed: 12: invalid syntax (<unknown>, line 1)
Unnamed: 13: invalid syntax (<unknown>, line 1)
Unnamed: 14: invalid syntax (<unknown>, line 1)
Unnamed: 15: invalid syntax (<unknown>, line 1)
Unnamed: 16: invalid syntax (<unknown>, line 1)
Unnamed: 17: invalid syntax (<unknown>, line 1)
Unnamed: 18: invalid syntax (<unknown>, line 1)
Unnamed: 19: invalid syntax (<unknown>, line 1)
Unnamed: 20: invalid syntax (<unknown>, line 1)
Unnamed: 21: unexpected EOF while parsing (<unknown>, line 1)
Unnamed: 22: invalid syntax (<unknown>, line 1)
Unnamed: 23: invalid syntax (<unknown>, line 1)
Unnamed: 24: invalid syntax (<unknown>, line 1)
Unnamed: 25: invalid syntax (<unknown>, line 1)
Unnamed: 26: invalid syntax (<unknown>, line 1)
Unnamed: 27: invalid syntax (<unknown>, line 1)
The problem here is that the commas inside your JSON string are being treated as delimiters. You should modify the input data (if you don't have direct access to the file, you can always read the contents into a list of strings using open first).
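You can reproduce the failure outside pandas: the line is split on every comma first, so the converter receives only a fragment of the JSON object (the exact fragment below is my assumption for illustration):
import json

fragment = '{"lng":12.9'      # roughly what the 'location' converter receives
try:
    json.loads(fragment)
except json.JSONDecodeError as e:
    print(e)                  # the exact message depends on where the split happened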
Here are a few modification options that you can try:
Option 1: Quote json string with single quote
Use a single quote (or another character that doesn't otherwise appear in your data) as a quote character for your json string.
>> cat data.csv
Time,location,labelA,labelB
2019-09-10,'{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}',nan,nan
Then use quotechar="'" when you read the data:
import pandas as pd
import json
df = pd.read_csv('data.csv', converters={'location': json.loads}, header=0, quotechar="'")
Option 2: Quote json string with double quote and escape
If the single quote can't be used, you can actually use the double quote as the quotechar, as long as you escape the quotes inside the JSON string:
>> cat data.csv
Time,location,labelA,labelB
2019-09-10,"{""lng"":12.9,""alt"":413.0,""time"":""2019-09-10"",""error"":7.0,""lat"":17.8}",nan,nan
Notice that this now matches the format of the question you linked.
df = pd.read_csv('data.csv', converters={'location': json.loads}, header=0, quotechar='"')
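Incidentally, if you are generating the file yourself, Python's csv module produces exactly this doubled-quote format by default (a sketch, with json.dumps building the JSON string):
import csv
import json

with open('data.csv', 'w', newline='') as f:
    w = csv.writer(f)  # the default dialect quotes fields containing commas and doubles embedded quotes
    w.writerow(['Time', 'location', 'labelA', 'labelB'])
    w.writerow(['2019-09-10', json.dumps({'lng': 12.9, 'alt': 413.0}), 'nan', 'nan'])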
Option 3: Change the delimiter
Use a different character, for example |, as the delimiter:
>> cat data.csv
Time|location|labelA|labelB
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|nan|nan
Now use the sep argument to specify the new delimiter:
df = pd.read_csv('data.csv', converters={'location': json.loads}, header=0, sep='|')
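If you can't modify the file on disk, the delimiter change can also be done in memory and fed to read_csv through io.StringIO (a sketch reusing the comma-outside-braces regex from the other answer):
import io
import re
import json
import pandas as pd

with open('data.csv') as f:
    # replace only the commas outside {...} with |
    fixed = ''.join(re.sub(r',(?=(((?!\}).)*\{)|[^\{\}]*$)', '|', line) for line in f)

df = pd.read_csv(io.StringIO(fixed), sep='|', header=0, converters={'location': json.loads})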
Each of these methods produces the same output:
print(df)
#   Time        location                                            labelA  labelB
#0  2019-09-10  {u'lat': 17.8, u'lng': 12.9, u'error': 7.0, u'...   NaN     NaN
Once you have that, you can expand the location column using one of the methods described in Flatten JSON column in a Pandas DataFrame
new_df = df.join(pd.io.json.json_normalize(df["location"])).drop(["location"], axis=1)
print(new_df)
#   Time        labelA  labelB  alt    error  lat   lng   time
#0  2019-09-10  NaN     NaN     413.0  7.0    17.8  12.9  2019-09-10
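To cover Q2 as well (labelA and labelB may also hold JSON), the same idea extends to every JSON column. A sketch, assuming the pipe-delimited file from Option 3 and treating nan or empty cells as empty dicts:
import json
import pandas as pd

def loads_or_empty(value):
    """Parse a JSON cell; fall back to an empty dict for 'nan' or empty cells."""
    try:
        return json.loads(value)
    except ValueError:
        return {}

json_cols = ['location', 'labelA', 'labelB']
df = pd.read_csv('data.csv', sep='|', header=0,
                 converters={c: loads_or_empty for c in json_cols})
for col in json_cols:
    # expand each dict column; prefix the new columns to avoid key clashes
    df = df.join(pd.io.json.json_normalize(df[col].tolist()).add_prefix(col + '_'))
    df = df.drop(columns=[col])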