I want to use polars to read a CSV and add a column (e.g. called json_per_row) whose value for each row is the JSON representation of that entire row. I also want to select only a subset of the original columns to appear alongside the json_per_row column.
Ideally I don't want to hardcode the number or names of the input columns, but to illustrate I've provided a simple example below:
import json
import polars as pl

# Input: csv with columns time, var1, var2,...
s1 = pl.Series("time", [100, 200, 300])
s2 = pl.Series("var1", [1,2,3])
s3 = pl.Series("var2", [4,5,6])
# I want to add this column with polars somehow
output_col = pl.Series("json_per_row", [
    json.dumps({"time": 100, "var1": 1, "var2": 4}),
    json.dumps({"time": 200, "var1": 2, "var2": 5}),
    json.dumps({"time": 300, "var1": 3, "var2": 6}),
])
# Desired output
df = pl.DataFrame([s1, output_col])
print(df)
So is there a way to do this with the functions in the polars library? I'd rather not use json.dumps if it isn't needed since, as the docs say, bringing in external / user-defined functions can hurt performance. Thanks
read_csv() to read your csv data, but here I'll just use the Series data you provided.
struct() to combine all the columns into one struct column.
struct.json_encode() to convert to json.
(
    pl.DataFrame([s1, s2, s3])
    .select(
        pl.col.time,
        json_per_row = pl.struct(pl.all()).struct.json_encode()
    )
)
┌──────┬────────────────────────────────┐
│ time ┆ json_per_row                   │
│ ---  ┆ ---                            │
│ i64  ┆ str                            │
╞══════╪════════════════════════════════╡
│ 100  ┆ {"time":100,"var1":1,"var2":4} │
│ 200  ┆ {"time":200,"var1":2,"var2":5} │
│ 300  ┆ {"time":300,"var1":3,"var2":6} │
└──────┴────────────────────────────────┘