How does Apache Arrow facilitate "No overhead for cross-system communication"?


I've been very interested in Apache Arrow for a while now due to the promises of "zero-copy reads", "zero serde", and "no overhead for cross-system communication". My understanding of the project (through the lens of pyarrow) is that it specifies the in-memory layout and format of data, so that multiple tasks can read it like a treasure map and all find their way to the same data (without copying). I think I can see how this works within Python/pandas in a single process; it's pretty easy to create an Arrow array, pass it around to different objects, and observe the whole "zero-copy" thing in action.
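
For example, here is a tiny sketch (with made-up values) of the in-process behavior I mean: slicing an Arrow array produces a new object that points into the same buffer rather than copying it.

import pyarrow as pa

# The five values live in a single contiguous Arrow buffer.
arr = pa.array([1, 2, 3, 4, 5], type=pa.int64())

# Slicing is zero-copy: the slice is a new Array that references the
# same underlying buffer with a different offset and length.
sliced = arr.slice(1, 3)

# Both arrays report the same data-buffer address, so nothing was copied.
print(arr.buffers()[1].address == sliced.buffers()[1].address)  # True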

However, when we talk about cross-system communication without overhead, I am almost entirely lost. For example, how does PySpark convert Java objects to the Arrow format and then pass that to Python/pandas? I've attempted to look at the code here, but to a non-Java/Scala guy it just looks like it's converting Spark rows to Arrow objects and then to byte arrays (line 124), which doesn't seem like zero-copy or zero-overhead to me.

Likewise, if I wanted to try to pass an Arrow array from Python/pyarrow to, say, Rust (using Rust's Arrow API), I can't wrap my mind around how I'd do that, particularly considering that this approach to calling a Rust function from Python doesn't seem to work with Arrow primitives. Is there a way to point both Rust and Python at the same memory address(es)? Do I have to send the Arrow data as a byte array somehow?

// lib.rs
#[macro_use]
extern crate cpython;

use cpython::{PyResult, Python};
use arrow::array::Int64Array;
use arrow::compute::array_ops::sum;

// Try to receive an Arrow array directly from Python and sum it in Rust.
fn sum_col(_py: Python, val: Int64Array) -> PyResult<i64> {
    let total = sum(&val).unwrap();
    Ok(total)
}

py_module_initializer!(rust_arrow_2, initrust_arrow_2, PyInit_rust_arrow_2, |py, m| {
    m.add(py, "__doc__", "This module is implemented in Rust.")?;
    // This is the line the compiler rejects: cpython has no conversion
    // from a Python object to an arrow Int64Array.
    m.add(py, "sum_col", py_fn!(py, sum_col(val: Int64Array)))?;
    Ok(())
});
$ cargo build --release
...
error[E0277]: the trait bound `arrow::array::array::PrimitiveArray<arrow::datatypes::Int64Type>: cpython::FromPyObject<'_>` is not satisfied
  --> src/lib.rs:15:26
   |
15 |     m.add(py, "sum_col", py_fn!(py, sum_col(val: Int64Array)))?;
   |                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ the trait `cpython::FromPyObject<'_>` is not implemented for `arrow::array::array::PrimitiveArray<arrow::datatypes::Int64Type>`
   |
   = note: required by `cpython::FromPyObject::extract`
   = note: this error originates in a macro outside of the current crate (in Nightly builds, run with -Z external-macro-backtrace for more info)
asked Sep 17 '19 by kemri

1 Answer

There are a few questions here:

  1. How does Spark share data with Python?

    This is done over a socket using the Arrow IPC format, so it isn't quite zero-copy, but it is still much faster than the alternatives (see the first sketch after this list).

  2. How is zero-copy achieved in general?

    The approaches that I'm aware of involve passing pointer addresses between implementations. For instance, the Gandiva module in Arrow does this via JNI (https://github.com/apache/arrow/blob/master/java/gandiva/src/main/java/org/apache/arrow/gandiva/evaluator/JniWrapper.java#L65) by passing data-buffer addresses and reassembling them into a RecordBatch.

    A second approach, specifically for Python/Java interop, is JPype, although the implementation isn't 100% complete.

    You could potentially do something similar in pyarrow by creating buffers from pointers and assembling them into arrays (see the second sketch after this list).

  3. How can it be done in Rust?

    I don't have expertise in Rust, but you can e-mail the Arrow users@ or dev@ mailing list to see what others have done or if there is an opportunity to contribute something to standardize this.
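
To make point 1 concrete, here is a minimal pyarrow sketch of the IPC stream format (the data and names are made up for illustration). Note that the payload is the Arrow buffers themselves plus a small metadata header, not a per-value serialization:

import pyarrow as pa

# Build a record batch, the unit that Spark ships to the Python worker.
batch = pa.RecordBatch.from_arrays(
    [pa.array([1, 2, 3], type=pa.int64())], ["x"])

# Write it in the Arrow IPC stream format. In Spark the sink would be a
# socket to the Python worker; an in-memory buffer stands in for it here.
sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, batch.schema)
writer.write_batch(batch)
writer.close()

# The receiving side reads the stream back; the reconstructed batch
# references the received bytes directly instead of re-parsing values.
reader = pa.ipc.open_stream(sink.getvalue())
received = reader.read_next_batch()
print(received.to_pydict())  # {'x': [1, 2, 3]}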
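
And here is a sketch of the pointer-passing idea from point 2, done within a single Python process just to show the mechanics. In a real cross-language setup the address would arrive via JNI/FFI, and the producing side must keep the memory alive for as long as the consumer uses it:

import pyarrow as pa

# "Producer" side: expose the raw address and size of the data buffer.
arr = pa.array([1, 2, 3, 4], type=pa.int64())
data_buf = arr.buffers()[1]  # buffers()[0] is the validity bitmap (None here)
address, size = data_buf.address, data_buf.size

# "Consumer" side: wrap the same memory without copying anything.
# foreign_buffer does not take ownership, so `arr` must stay alive.
foreign = pa.foreign_buffer(address, size)
rebuilt = pa.Array.from_buffers(pa.int64(), len(arr), [None, foreign])

print(rebuilt.equals(arr))  # True: same values, same underlying memory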

answered Oct 08 '22 by Micah Kornfield