I want to store processed data from a pandas DataFrame to Azure Blob Storage in Parquet format. Currently I have to write the Parquet file to local disk first and then upload it. Instead, I want to write a pyarrow.Table into an in-memory pyarrow NativeFile and upload that directly. Can anyone help me with this? The code below works fine:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

battery_pq = pd.read_csv('test.csv')
# ... some data processing ...
battery_pq = pa.Table.from_pandas(battery_pq)

# Write the table to local disk, then upload the file to the blob container
pq.write_table(battery_pq, 'example.parquet')
block_blob_service.create_blob_from_path(container_name, 'example.parquet', 'example.parquet')
I need to create the file in memory (a file-like object) and then upload it to the blob.
You can either use io.BytesIO for this, or alternatively Apache Arrow provides its own implementation, BufferOutputStream. The benefit of the latter is that it writes to the stream without the overhead of going through Python, so fewer copies are made and the GIL is released.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = ...  # some pandas.DataFrame
table = pa.Table.from_pandas(df)

# Write the Parquet data into an in-memory Arrow buffer instead of a file on disk
buf = pa.BufferOutputStream()
pq.write_table(table, buf)

# getvalue() returns a pyarrow.Buffer; to_pybytes() converts it to Python bytes
block_blob_service.create_blob_from_bytes(
    container,
    "example.parquet",
    buf.getvalue().to_pybytes()
)
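If you would rather use the standard library's io.BytesIO mentioned above instead of Arrow's BufferOutputStream, a minimal sketch (assuming the same block_blob_service and container variables as in the question) could look like this:

import io

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = ...  # some pandas.DataFrame
table = pa.Table.from_pandas(df)

# pq.write_table accepts a file-like object, so we can write straight into a BytesIO buffer
buf = io.BytesIO()
pq.write_table(table, buf)

# getvalue() already returns Python bytes, so no conversion is needed here
block_blob_service.create_blob_from_bytes(
    container,
    "example.parquet",
    buf.getvalue()
)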
There's a new Python SDK version; create_blob_from_bytes is now legacy.
import pandas as pd
from azure.storage.blob import BlobServiceClient
from io import BytesIO

blob_service_client = BlobServiceClient.from_connection_string(blob_store_conn_str)
blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_path)

# Write the DataFrame to an in-memory buffer as Parquet
parquet_file = BytesIO()
df.to_parquet(parquet_file, engine='pyarrow')
parquet_file.seek(0)  # change the stream position back to the beginning after writing

blob_client.upload_blob(
    data=parquet_file
)
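If you want to keep the pyarrow BufferOutputStream from the first answer but upload with the newer SDK, a sketch along these lines should also work (blob_client is the one created above; overwrite=True is my assumption about how existing blobs should be handled):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_pandas(df)

# Serialize the table into an in-memory Arrow buffer
buf = pa.BufferOutputStream()
pq.write_table(table, buf)

# getvalue() returns a pyarrow.Buffer; upload_blob accepts bytes, so convert it
blob_client.upload_blob(data=buf.getvalue().to_pybytes(), overwrite=True)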