
Arrow file size is the same as CSV?

I am trying to save a dataframe in the .arrow format, mainly to get a smaller file than CSV, so that I can use that file with Vega-Lite.

I am using Python:

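import pandas
import pyarrow as pa

# Paths renamed so that `csv` does not shadow the standard-library module name
csv_path = "C:/Users/mimoune.djouallah/data.csv"
arrow_path = "C:/Users/mimoune.djouallah/file.arrow"

# Load the CSV into a DataFrame and convert it to an Arrow table
dataset = pandas.read_csv(csv_path)
table = pa.Table.from_pandas(dataset)

# Write the table in the Arrow file format
writer = pa.RecordBatchFileWriter(arrow_path, table.schema)
writer.write(table)
writer.close()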

I was expecting the Arrow file to be smaller than the CSV, but for now the Arrow file is slightly bigger.

I also tried exporting to Parquet, and the results are as expected:

original csv: 4.4 MB
arrow: 4.9 MB
parquet: 1.6 MB
PowerBI (just for reference): 1.7 MB

Mim asked Dec 13 '25 01:12


1 Answer

The Arrow format does not aim to optimise storage size but access performance. In contrast to CSV, the data is stored in binary form to remove the overhead of parsing it. But as performance is critical, the data is by default neither compressed nor encoded, so a fixed-width binary value can easily take more bytes than its text representation in a CSV.
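To make the size argument concrete, here is a minimal sketch using only the Python standard library. It is not how Arrow lays out a file internally; it only compares a text encoding of small integers with a fixed-width 64-bit binary encoding of the same values, and then shows what a general-purpose compressor does to the binary form. The sample column `values` is made up for illustration.

```python
import struct
import zlib

# Hypothetical sample column: 10,000 small integers, as they might appear in a CSV.
values = [i % 100 for i in range(10_000)]

# Text form: one number per line, like a single-column CSV.
text_bytes = "\n".join(str(v) for v in values).encode("ascii")

# Fixed-width binary form: every value stored as a little-endian 64-bit integer,
# which is 8 bytes per value regardless of how small the number is.
binary_bytes = struct.pack(f"<{len(values)}q", *values)

# The same binary data after general-purpose compression (zlib is used here
# only as a stand-in for the idea of compressing on top of a binary layout).
compressed_bytes = zlib.compress(binary_bytes)

print("text bytes:      ", len(text_bytes))
print("binary bytes:    ", len(binary_bytes))
print("compressed bytes:", len(compressed_bytes))
```

With short numbers like these, the uncompressed binary form comes out larger than the text form, which matches the observation that the .arrow file is slightly bigger than the CSV.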

If you want to store data efficiently but with a smaller file size, you should have a look at Apache Parquet. The data is stored in a columnar fashion similar to Arrow, but with encoding and compression applied on top to decrease storage size.

Uwe L. Korn answered Dec 14 '25 15:12


