I'm trying to create a pyarrow.StructArray with missing values.
It works fine when I use pyarrow.array and pass tuples representing my records:
>>> pyarrow.array(
...     [
...         None,
...         (1, "foo"),
...     ],
...     type=pyarrow.struct(
...         [pyarrow.field('col1', pyarrow.int64()), pyarrow.field("col2", pyarrow.string())]
...     )
... )
-- is_valid:
[
false,
true
]
-- child 0 type: int64
[
0,
1
]
-- child 1 type: string
[
"",
"foo"
]
But I want to use StructArray.from_arrays, and as far as I can tell there is no way to provide a mask for missing values:
pyarrow.StructArray.from_arrays(
    [
        [None, 1],
        [None, "foo"]
    ],
    fields=[pyarrow.field('col1', pyarrow.int64()), pyarrow.field("col2", pyarrow.string())]
)
-- is_valid: all not null
-- child 0 type: int64
[
null,
1
]
-- child 1 type: string
[
null,
"foo"
]
Is there a way to create a StructArray from arrays while specifying a mask of missing values? Or is there a way to apply the mask afterwards?
It would indeed be nice to make this possible by passing a mask to StructArray.from_arrays (see https://issues.apache.org/jira/browse/ARROW-12677, thanks for opening the issue).
But for now, a possible workaround is to use the lower-level StructArray.from_buffers:
import pyarrow

struct_type = pyarrow.struct(
    [pyarrow.field('col1', pyarrow.int64()), pyarrow.field("col2", pyarrow.string())]
)
# Child arrays, one per struct field
col1 = pyarrow.array([None, 1])
col2 = pyarrow.array([None, "foo"])
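Create a boolean pyarrow array from the mask so that its data buffer can serve as the validity buffer:
import numpy as np

# True marks a missing record; Arrow validity uses the opposite convention
mask = np.array([True, False])
validity_mask = pyarrow.array(~mask)
# buffers()[0] is the boolean array's own validity bitmap, buffers()[1] holds the bit-packed values
validity_bitmask = validity_mask.buffers()[1]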
And then we can use this as the first buffer in from_buffers to indicate the missing values in the StructArray:
>>> pyarrow.StructArray.from_buffers(struct_type, len(col1), [validity_bitmask], children=[col1, col2])
<pyarrow.lib.StructArray object at 0x7f8b560fa2e0>
-- is_valid:
[
false,
true
]
-- child 0 type: int64
[
null,
1
]
-- child 1 type: string
[
null,
"foo"
]