Read CSV with JSON feature

I am trying to read a large CSV that includes JSON features (here, the location column). For the first, say, 100 lines, the file looks like this:

Time,location,labelA,labelB
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},nan,nan

I followed this question to parse the location column. The solution basically defines a helper as:

def CustomParser(data):
    import json
    j1 = json.loads(data)
    return j1

and then

df = pd.read_csv('data.csv', nrows=100, converters={'location': CustomParser}, header=0)

I get the following error, which is related to the JSON format:

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Q1: How can I parse the location feature into new columns?

Q2 (general case): For nrows > 100, the last features (labelA and labelB) also contain JSON, with different keys and values. How can I read the entire CSV while parsing every feature that includes JSON (even partially)?

asked Sep 17 '19 by physiker



2 Answers

Fix the file:

  • Unfortunately, the file is difficult to read because each row contains a dict whose key-value pairs are separated by commas.
  • The easiest way to resolve the issue is to change the separators outside of each dict from , to |.
  • The following code reads the existing file:
    • It assumes the first row is the header and uses .replace(',', '|') on it.
    • The remaining rows use a regular expression to replace each , outside of {} with |.
    • Each line is written to a new file.

Code:

Data:

Time,location,labelA,labelB
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},{"ack":123,"bar":456},{"foo":123,"bar":456}
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},nan,nan
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},{"ack":123,"bar":456},{"foo":123,"bar":456}
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},nan,nan
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},{"ack":123,"bar":456},{"foo":123,"bar":456}
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},nan,nan
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},{"ack":123,"bar":456},{"foo":123,"bar":456}
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},nan,nan
  • Path.cwd() assumes the current working directory; if this is not the case:
    • Path('c:/some_path_to_my_file') / 'file_name.poo' can be used
  • pathlib is part of the standard library
  • Python 3's pathlib Module: Taming the File System

File repair:

import re
from pathlib import Path

p = Path.cwd() / 'test.csv'    # original file
p2 = Path.cwd() / 'test2.csv'  # repaired file

with p.open('r') as f:
    with p2.open('w') as f2:
        for cnt, line in enumerate(f):
            if cnt == 0:
                # the header contains no dicts, so every comma is a separator
                line = line.replace(',', '|')
            else:
                # replace only the commas that sit outside of {...}
                line = re.sub(r',(?=(((?!\}).)*\{)|[^\{\}]*$)', '|', line)
            f2.write(line)
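
As a quick sanity check (a minimal sketch using one sample row), the regex targets only the commas outside of {}:

import re

line = '2019-09-10,{"lng":12.9,"alt":413.0},{"ack":123,"bar":456},nan'
print(re.sub(r',(?=(((?!\}).)*\{)|[^\{\}]*$)', '|', line))
# 2019-09-10|{"lng":12.9,"alt":413.0}|{"ack":123,"bar":456}|nan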

New file:

Time|location|labelA|labelB
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|{"ack":123,"bar":456}|{"foo":123,"bar":456}
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|nan|nan
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|{"ack":123,"bar":456}|{"foo":123,"bar":456}
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|nan|nan
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|{"ack":123,"bar":456}|{"foo":123,"bar":456}
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|nan|nan
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|{"ack":123,"bar":456}|{"foo":123,"bar":456}
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|nan|nan

Parse the new file:

  • Now the columns will be properly separated by .read_csv
  • However, the location, labelA and labelB columns are str
    • Use ast.literal_eval to convert to dict
    • literal_eval won't work on nan, so replace nan with {}
  • for col in df.columns[1:]: loops through each of the columns and:
    • try-except will catch any columns that are not properly formed
    • converts them from str to dict
    • separates the keys into columns
    • concats the columns to the existing dataframe
    • drops the old column
import pandas as pd
from ast import literal_eval

df = pd.read_csv('test2.csv', sep='|')
print(df)

       Time                                                             location                 labelA                 labelB
 2019-09-10  {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}  {"ack":123,"bar":456}  {"foo":123,"bar":456}
 2019-09-10  {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}                    NaN                    NaN
 2019-09-10  {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}  {"ack":123,"bar":456}  {"foo":123,"bar":456}
 2019-09-10  {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}                    NaN                    NaN
 2019-09-10  {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}  {"ack":123,"bar":456}  {"foo":123,"bar":456}
 2019-09-10  {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}                    NaN                    NaN
 2019-09-10  {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}  {"ack":123,"bar":456}  {"foo":123,"bar":456}
 2019-09-10  {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}                    NaN                    NaN


for col in df.columns[1:]:
    try:
        df[col] = df[col].fillna('{}')         # literal_eval can't parse NaN
        df[col] = df[col].apply(literal_eval)  # str -> dict
        df = pd.concat([df, df[col].apply(pd.Series)], axis=1)  # expand dict keys into columns
        df.drop(columns=[col], inplace=True)   # remove the original str column
    except (SyntaxError, ValueError) as e:
        print(f'{col}: {e}')


print(df)

       Time   lng    alt        time  error   lat    ack    bar    foo    bar
 2019-09-10  12.9  413.0  2019-09-10    7.0  17.8  123.0  456.0  123.0  456.0
 2019-09-10  12.9  413.0  2019-09-10    7.0  17.8    NaN    NaN    NaN    NaN
 2019-09-10  12.9  413.0  2019-09-10    7.0  17.8  123.0  456.0  123.0  456.0
 2019-09-10  12.9  413.0  2019-09-10    7.0  17.8    NaN    NaN    NaN    NaN
 2019-09-10  12.9  413.0  2019-09-10    7.0  17.8  123.0  456.0  123.0  456.0
 2019-09-10  12.9  413.0  2019-09-10    7.0  17.8    NaN    NaN    NaN    NaN
 2019-09-10  12.9  413.0  2019-09-10    7.0  17.8  123.0  456.0  123.0  456.0
 2019-09-10  12.9  413.0  2019-09-10    7.0  17.8    NaN    NaN    NaN    NaN

Literal Eval Notes:

  • Pandas has methods for importing data in many forms, such as dict or list.
  • However, read_csv doesn't interpret containers (e.g. dict) well; they are read in as strings, unless you specify the converters parameter (e.g. pd.read_csv('test3.csv', sep='|', converters={'a': literal_eval})).
  • literal_eval will not work on a column comprised of both containers and strings or NaN, unless the string is only numeric (e.g. '8654').
  • Part of the code above first replaced all nan with {} so literal_eval wouldn't raise an error.
  • Given the following mixed column example:
column_a
{"ack":123,"bar":456}
some string
{"ack":123,"bar":456}
some string
{"ack":123,"bar":456}
some string
  • literal_eval will throw ValueError: malformed node or string:
    • The difference between the two solutions is that the other solution fixes one column, whereas this solution fixes all the columns and removes the necessity of reading only the first 100 rows.
    • You can forgo the loop that fixes all the columns and just fix the location column, if it is all dicts. Use the following code:
df['location'] = df['location'].apply(literal_eval)
df = pd.concat([df, df['location'].apply(pd.Series)], axis=1)

Note about the actual data:

  • the location column is not formed properly
    • '{"lng":12.9975201,alt:413.0,"time:""2019-09-10T12:09:58Z""",error:7.0,lat:47.8258582}'
  • Here is the expected form:
    • '{"lng":12.9975201,"alt":413.0,"time":"2019-09-10T12:09:58Z","error":7.0,"lat":47.8258582}'

Fix the location column:

  • In the real data, the location column is named Position.
def fix_pos(x):
    # map each malformed fragment to its corrected JSON spelling
    word_dict = {'alt': '"alt"',
                 '"time:"': '"time":',
                 '"",error:': ',"error":',
                 'lat': '"lat"'}
    for k, v in word_dict.items():
        x = x.replace(k, v)
    return x

df.Position = df.Position.apply(fix_pos)
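
Applying fix_pos to the malformed sample above yields the expected form:

raw = '{"lng":12.9975201,alt:413.0,"time:""2019-09-10T12:09:58Z""",error:7.0,lat:47.8258582}'
print(fix_pos(raw))
# {"lng":12.9975201,"alt":413.0,"time":"2019-09-10T12:09:58Z","error":7.0,"lat":47.8258582}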
  • Use the following loop with the real data file.
  • Zeit, device, Text & Typ don't need to be processed.
  • Position is at column index 4, which is why the loop starts at df.columns[4:].
for col in df.columns[4:]:
    try:
        df[col] = df[col].fillna('{}')
        df[col] = df[col].apply(literal_eval)
        df = pd.concat([df, df[col].apply(pd.Series)], axis=1)
        df.drop(columns=[col], inplace=True)
    except (SyntaxError, ValueError) as e:
        print(f'{col}: {e}')
  • The loop that applies literal_eval to all columns has been updated with try-except:
    • If there's an exception, the column name and error message will be printed out.
    • There are a total of 64 columns in the real data; most of them are terrible.

Errors:

  • These are the errors for all the columns in the supplied csv file.
device: unexpected EOF while parsing (<unknown>, line 1)
Text: malformed node or string: <_ast.Name object at 0x00000203B8473C08>
Typ: malformed node or string: <_ast.Name object at 0x00000203BE217E08>
Data: unexpected EOF while parsing (<unknown>, line 1)
Data1: invalid syntax (<unknown>, line 1)
Data2: invalid syntax (<unknown>, line 1)
Unnamed: 8: invalid syntax (<unknown>, line 1)
Unnamed: 9: unexpected EOF while parsing (<unknown>, line 1)
Unnamed: 10: invalid syntax (<unknown>, line 1)
Unnamed: 11: unexpected EOF while parsing (<unknown>, line 1)
Unnamed: 12: invalid syntax (<unknown>, line 1)
Unnamed: 13: invalid syntax (<unknown>, line 1)
Unnamed: 14: invalid syntax (<unknown>, line 1)
Unnamed: 15: invalid syntax (<unknown>, line 1)
Unnamed: 16: invalid syntax (<unknown>, line 1)
Unnamed: 17: invalid syntax (<unknown>, line 1)
Unnamed: 18: invalid syntax (<unknown>, line 1)
Unnamed: 19: invalid syntax (<unknown>, line 1)
Unnamed: 20: invalid syntax (<unknown>, line 1)
Unnamed: 21: unexpected EOF while parsing (<unknown>, line 1)
Unnamed: 22: invalid syntax (<unknown>, line 1)
Unnamed: 23: invalid syntax (<unknown>, line 1)
Unnamed: 24: invalid syntax (<unknown>, line 1)
Unnamed: 25: invalid syntax (<unknown>, line 1)
Unnamed: 26: invalid syntax (<unknown>, line 1)
Unnamed: 27: invalid syntax (<unknown>, line 1)
answered Sep 30 '22 by Trenton McKinney

The problem here is that the commas inside your json string are being treated as delimiters. You should modify the input data (if you don't have direct access to the file, you can always read the contents into a list of strings using open first).
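
That is also why the CustomParser in the question raises JSONDecodeError: once the row is split on every comma, the location converter only ever receives a fragment of the dict (which fragment depends on how pandas aligns the extra fields), and a fragment is not valid JSON:

import json

json.loads('{"lng":12.9')  # raises json.JSONDecodeError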

Here are a few modification options that you can try:

Option 1: Quote json string with single quote

Use a single quote (or another character that doesn't otherwise appear in your data) as a quote character for your json string.

>> cat data.csv
Time,location,labelA,labelB
2019-09-10,'{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}',nan,nan

Then use quotechar="'" when you read the data:

import pandas as pd
import json

df = pd.read_csv('data.csv', converters={'location': json.loads}, header=0, quotechar="'")

Option 2: Quote json string with double quote and escape

If the single quote can't be used, you can actually use the double quote as the quotechar, as long as you escape the quotes inside the json string:

>> cat data.csv
Time,location,labelA,labelB
2019-09-10,"{""lng"":12.9,""alt"":413.0,""time"":""2019-09-10"",""error"":7.0,""lat"":17.8}",nan,nan

Notice that this now matches the format of the question you linked.

df = pd.read_csv('data.csv', converters={'location': json.loads}, header=0, quotechar='"')

Option 3: Change the delimiter

Use a different character, for example |, as the delimiter.

>> cat data.csv
Time|location|labelA|labelB
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|nan|nan

Now use the sep argument to specify the new delimiter:

df = pd.read_csv('data.csv', converters={'location': json.loads}, header=0, sep="|")

Each of these methods produces the same output:

print(df)
#   Time        location                                            labelA  labelB
#0  2019-09-10  {u'lat': 17.8, u'lng': 12.9, u'error': 7.0, u'...   NaN     NaN

Once you have that, you can expand the location column using one of the methods described in Flatten JSON column in a Pandas DataFrame

# note: in pandas >= 1.0, json_normalize is also available top-level as pd.json_normalize
new_df = df.join(pd.io.json.json_normalize(df["location"])).drop(["location"], axis=1)
print(new_df)
#   Time        labelA  labelB  alt    error  lat   lng   time
#0  2019-09-10  NaN     NaN     413.0  7.0    17.8  12.9  2019-09-10
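
For Q2 (parsing every JSON-bearing column at once), here is a minimal sketch. It assumes the file has already been rewritten with | separators as in Option 3, and it uses a hypothetical maybe_json helper so that nan cells don't break the converter:

import json
import pandas as pd

def maybe_json(s):
    # hypothetical helper: parse a cell as JSON if possible, otherwise
    # return an empty dict so every row can be normalized uniformly
    try:
        return json.loads(s)
    except (TypeError, ValueError):
        return {}

json_cols = ['location', 'labelA', 'labelB']
df = pd.read_csv('data.csv', sep='|', converters={c: maybe_json for c in json_cols})
for col in json_cols:
    df = df.join(pd.json_normalize(df[col].tolist()).add_prefix(f'{col}.')).drop(columns=col)

The add_prefix call also avoids the duplicate column names (labelA.bar vs labelB.bar) that show up in the first answer's output when both label dicts share a key.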
answered Sep 30 '22 by pault