I need to convert pandas data frame to JSONL format. I couldn't find a good package to do it and tried to implement myself, but it looks a bit ugly and not efficient.
For example, given a pandas df:
label pattern
0 DRUG aspirin
1 DRUG trazodone
2 DRUG citalopram
I need to convert to txt file of the form:
{"label":"DRUG","pattern":[{"lower":"aspirin"}]}
{"label":"DRUG","pattern":[{"lower":"trazodone"}]}
{"label":"DRUG","pattern":[{"lower":"citalopram"}]}
I tried with to_dict('records')
, but I'm missing [ ]
and nested 'lower' key.
df.to_dict('record')
creates:
[{'label': 'DRUG', 'pattern': 'aspirin'},
{'label': 'DRUG', 'pattern': 'trazodone'},
{'label': 'DRUG', 'pattern': 'citalopram'}]
I thought about converting the 'pattern' columns and include nested 'lower'?
UPD
So far, I succeeded to convert 'pattern' into list:
df_new = pd.concat((df[['label']], df[['pattern']].apply(lambda x: x.tolist(), axis=1)), axis=1)
df_new.columns = ['label', 'pattern']
df_new.head()
The result:
label pattern
0 DRUG [aspirin]
1 DRUG [trazodone]
2 DRUG [citalopram]
and then:
df_new.to_dict(orient='records')
[{'label': 'DRUG', 'pattern': ['aspirin']},
{'label': 'DRUG', 'pattern': ['trazodone']},
{'label': 'DRUG', 'pattern': ['citalopram']}]
UPD 2
Eventually, I managed to get what I want, but in the most non-pythonic way.
df_1 = pd.DataFrame(df[['pattern']].apply(lambda x: {'lower': x[0]}, axis=1))
df_1.columns = ['pattern']
df_fin = pd.concat((df[['label']], df_1[['pattern']].apply(lambda x: x.tolist(), axis=1)), axis=1)
df_fin.columns = ['label', 'pattern']
df_fin.to_json(orient='records')
'{'label': 'DRUG', 'pattern': [{'lower': 'aspirin'}]}
{'label': 'DRUG', 'pattern': [{'lower': 'trazodone'}]}
{'label': 'DRUG', 'pattern': [{'lower': 'citalopram'}]}'
Any chance you can show a neat solution?
You can convert pandas DataFrame to JSON string by using DataFrame. to_json() method. This method takes a very important param orient which accepts values ' columns ', ' records ', ' index ', ' split ', ' table ', and ' values '.
You can convert JSON to Pandas DataFrame by simply using read_json() . Just pass JSON string to the function. It takes multiple parameters, for our case I am using orient that specifies the format of JSON string. This function is also used to read JSON files into pandas DataFrame.
JSONL is a text-based format using the . jsonl file extension that is basically the same as JSON format but implemented using newline characters to separate JSON values. It is also known as JSON Lines. Manifold can import and link JSONL files.
If you want to change the data type for all columns in the DataFrame to the string type, you can use df. applymap(str) or df. astype(str) methods.
In versions of Pandas > 0.19.0
, DataFrame.to_json
has a parameter, lines
, that will write out JSONL format.
Given that, a more succinct version of your solution might look like this:
import pandas as pd
data = [{'label': 'DRUG', 'pattern': 'aspirin'},
{'label': 'DRUG', 'pattern': 'trazodone'},
{'label': 'DRUG', 'pattern': 'citalopram'}]
df = pd.DataFrame(data)
# Wrap pattern column in a dictionary
df["pattern"] = df.pattern.apply(lambda x: {"lower": x})
# Output in JSONL format
print(df.to_json(orient='records', lines=True))
Output:
{"label":"DRUG","pattern":{"lower":"aspirin"}}
{"label":"DRUG","pattern":{"lower":"trazodone"}}
{"label":"DRUG","pattern":{"lower":"citalopram"}}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With