
Can I [de]serialize a dictionary of dataframes in the arrow/js implementation?

I want to use Apache Arrow to send data from a Django backend to an Angular frontend. I want to use a dictionary of dataframes/tables as the payload in messages. This is possible with pyarrow between Python microservices, but I can't find a way to do it with the JavaScript implementation of Arrow.

Is there a way to serialize/deserialize a dictionary with strings as keys and dataframes/tables as values on the JavaScript side with Arrow?

asked Oct 17 '22 by gabomgp


1 Answer

Yes, a variant of this is possible using the RecordBatchReader and RecordBatchWriter IPC primitives in both pyarrow and ArrowJS.

On the python side, you can serialize a Table to a buffer like this:

import pyarrow as pa

def serialize_table(table):
    # write the table as an Arrow IPC stream into an in-memory buffer
    sink = pa.BufferOutputStream()
    writer = pa.RecordBatchStreamWriter(sink, table.schema)
    writer.write_table(table)
    writer.close()
    # return the raw bytes so they can be sent over the wire
    return sink.getvalue().to_pybytes()

# ...later, in your route handler:
bytes = serialize_table(create_your_arrow_table())

Then you can send the bytes in the response body. If you have multiple tables, you can concatenate the buffers from each as one large payload.
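For example, here is a minimal sketch of doing that in a Django view, reusing the `serialize_table` helper above (the view name and the `create_users_table`/`create_orders_table` helpers are hypothetical):

```python
from django.http import HttpResponse

def tables_view(request):
    # hypothetical helpers that each build a pyarrow Table
    tables = [create_users_table(), create_orders_table()]
    # each table is its own Arrow IPC stream; concatenating them gives one payload
    payload = b"".join(serialize_table(t) for t in tables)
    return HttpResponse(payload, content_type="application/octet-stream")
```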

I'm not sure what functionality exists in Python for writing multipart responses (e.g. multipart/mixed), but that's probably the cleanest way to craft the response if you want the tables to be sent with their names (or any other metadata you wish to include).
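If you'd rather avoid multipart, one alternative (a sketch, not part of the original answer) is to stash each dictionary key in that table's schema metadata before serializing, via pyarrow's `Table.replace_schema_metadata`; the dictionary of tables below is hypothetical:

```python
def serialize_named_tables(named_tables):
    buffers = []
    for name, table in named_tables.items():
        # carry the dictionary key along with the data as schema metadata
        existing = table.schema.metadata or {}
        tagged = table.replace_schema_metadata({**existing, b"name": name.encode()})
        buffers.append(serialize_table(tagged))
    return b"".join(buffers)

# payload = serialize_named_tables({"users": users_table, "orders": orders_table})
```

On the JavaScript side each table's name should then be recoverable from its `schema.metadata` map as you read tables off the stream, letting you rebuild the dictionary.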

On the JavaScript side, you can read the response either with Table.from() (if you have just one table), or with the RecordBatchReader if you have more than one, or if you want to read each RecordBatch in a streaming fashion:

import { Table, RecordBatchReader } from 'apache-arrow'

// easy if you want to read the first (or only) table in the response
const table = await Table.from(fetch('/table'))

// or for multiple tables on the same stream, or to read in a streaming fashion:
for await (const reader of RecordBatchReader.readAll(fetch('/table'))) {
    // Buffer all batches into a table
    const table = await Table.from(reader)
    // Or process each batch as it's downloaded
    for await (const batch of reader) {
        // ...do something with each RecordBatch here
    }
}

You can see more examples of this in our tests for ArrowJS here: https://github.com/apache/arrow/blob/3eb07b7ed173e2ecf41d689b0780dd103df63a00/js/test/unit/ipc/writer/stream-writer-tests.ts#L40

You can also see some examples in a little fastify plugin I wrote for consuming and producing Arrow payloads in node: https://github.com/trxcllnt/fastify-arrow

answered Oct 19 '22 by ptaylor