I loaded two datasets from CSV files and then merged them using a leftjoin:
using CSV
using DataFrames
using CodecZstd
df1 = CSV.read(joinpath(root, "data", "raw", "df1.csv"), DataFrame)
df2 = CSV.read(joinpath(root, "data", "raw", "df2.csv"), DataFrame)
merged = leftjoin(df1, df2, on=:id)
Now I want to write the merged dataframe to disk as a .zst compressed file (Zstandard compression).
I managed to do this by first writing to .csv, reading that file back in, and writing it out again as .zst. Is there a way to convert a DataFrame directly into an array of bytes that I can save to disk?
To do exactly what you ask, you can wrap the output file in a Zstandard compressor stream and write the CSV straight into it:
using CSV, DataFrames, CodecZstd
fout = ZstdCompressorStream(open("z.zst","w"))
df = DataFrame(a='a':'h', b=1:8)
CSV.write(fout, df)  # destination first, then the table
close(fout)
The file can then be read back as:
julia> CSV.read(ZstdDecompressorStream(open("z.zst")), DataFrame)
8×2 DataFrame
 Row │ a        b
     │ String1  Int64
─────┼────────────────
   1 │ a            1
   2 │ b            2
   3 │ c            3
   4 │ d            4
   5 │ e            5
   6 │ f            6
   7 │ g            7
   8 │ h            8
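If you literally want an array of bytes in memory before anything touches the disk, a minimal sketch is to write the CSV into an IOBuffer and compress the resulting bytes with transcode. The file name z_bytes.zst is only a placeholder for this illustration, and the approach assumes the whole table fits comfortably in memory.

```julia
using CSV, DataFrames, CodecZstd

df = DataFrame(a='a':'h', b=1:8)

# Serialize the DataFrame as CSV into an in-memory buffer
buf = IOBuffer()
CSV.write(buf, df)
raw = take!(buf)                                # Vector{UInt8} holding the CSV text

# Compress the bytes with Zstandard and save them
compressed = transcode(ZstdCompressor, raw)     # Vector{UInt8}
write("z_bytes.zst", compressed)                # placeholder file name

# Round trip: read the bytes, decompress, and parse the CSV again
restored = CSV.read(transcode(ZstdDecompressor, read("z_bytes.zst")), DataFrame)
```

The streaming version above is usually preferable for large tables, because it never holds the full uncompressed CSV in memory.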
Another reasonable option is to write the DataFrame with Apache Arrow instead of CSV. The compression composes in the same way as above.
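As a rough sketch of the Arrow route: rather than wrapping a stream, this example uses the compress keyword of Arrow.write from Arrow.jl, which is a different mechanism from the stream composition shown for CSV. The file name z.arrow is a placeholder, and the first column is built from strings rather than characters just to keep the example simple.

```julia
using Arrow, DataFrames

# Strings instead of Chars in column a, purely to keep the sketch simple
df = DataFrame(a=string.('a':'h'), b=1:8)

# Write an Arrow file whose record batches are Zstandard-compressed
Arrow.write("z.arrow", df; compress=:zstd)      # placeholder file name

# Read it back; Arrow.Table decompresses the buffers it finds
restored = DataFrame(Arrow.Table("z.arrow"))
```

Unlike CSV, Arrow keeps the column types, so nothing has to be re-inferred when the file is read back.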