Call StructArray.from_arrays specifying a missing value mask

I'm trying to create a pyarrow.StructArray with missing values.

It works fine when I use pyarrow.array, passing tuples that represent my records:

>>> pyarrow.array(
    [
        None,
        (1, "foo"),
    ],
    type=pyarrow.struct(
        [pyarrow.field('col1', pyarrow.int64()), pyarrow.field("col2", pyarrow.string())]
    )
)
-- is_valid:
  [
    false,
    true
  ]
-- child 0 type: int64
  [
    0,
    1
  ]
-- child 1 type: string
  [
    "",
    "foo"
  ]

But I want to use StructArray.from_arrays, and as far as I can tell there's no way to provide a mask for missing values. Nulls in the child arrays do not make the struct itself null:

>>> pyarrow.StructArray.from_arrays(
    [
        [None, 1],
        [None, "foo"]
    ],
    fields=[pyarrow.field('col1', pyarrow.int64()), pyarrow.field("col2", pyarrow.string())]
)
-- is_valid: all not null
-- child 0 type: int64
  [
    null,
    1
  ]
-- child 1 type: string
  [
    null,
    "foo"
  ]

Is there a way to create a StructArray from arrays while specifying a mask of missing values? Or is there a way to apply the mask afterwards?

0x26res asked Mar 09 '26 16:03


1 Answer

It would indeed be nice to make this possible by passing a mask to StructArray.from_arrays (see https://issues.apache.org/jira/browse/ARROW-12677; thanks for opening the issue).

But for now, a possible workaround is to use the lower-level StructArray.from_buffers. First, define the struct type and the child arrays:

import pyarrow

struct_type = pyarrow.struct(
    [pyarrow.field('col1', pyarrow.int64()), pyarrow.field("col2", pyarrow.string())]
)
col1 = pyarrow.array([None, 1])
col2 = pyarrow.array([None, "foo"])

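Next, create a boolean pyarrow array from the mask and take its data buffer as the validity bitmap. The mask marks missing rows with True, while Arrow stores validity (True means present), so the mask is inverted first: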

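import numpy as np

mask = np.array([True, False])  # True marks a missing struct row
validity_mask = pyarrow.array(~mask)  # invert: Arrow stores validity, not missingness
validity_bitmask = validity_mask.buffers()[1]  # bit-packed data buffer of the boolean array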
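
To make the contents of that buffer more concrete, here is a pure-Python sketch of the bit layout that the Arrow columnar format uses for validity bitmaps: one bit per row, least-significant bit first within each byte, with a set bit meaning "valid". The `pack_validity` helper is hypothetical and written only for illustration; it is not a pyarrow function.

```python
def pack_validity(is_valid):
    """Pack booleans into an Arrow-style validity bitmap (LSB-first, 1 = valid)."""
    bitmap = bytearray((len(is_valid) + 7) // 8)  # one bit per row, rounded up to whole bytes
    for i, valid in enumerate(is_valid):
        if valid:
            bitmap[i // 8] |= 1 << (i % 8)  # row i lives in byte i // 8, bit i % 8
    return bytes(bitmap)


mask = [True, False]              # True marks a missing row, as in the answer
is_valid = [not m for m in mask]  # invert to get validity
bitmap = pack_validity(is_valid)
print(bitmap)  # b'\x02': only bit 1 (the second row) is set
```

The `validity_bitmask` buffer built with pyarrow is expected to carry the same bit pattern for these two rows.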

And then we can use this as the first buffer in from_buffers to indicate the missing values in the StructArray:

>>> pyarrow.StructArray.from_buffers(struct_type, len(col1), [validity_bitmask], children=[col1, col2])
<pyarrow.lib.StructArray object at 0x7f8b560fa2e0>
-- is_valid:
  [
    false,
    true
  ]
-- child 0 type: int64
  [
    null,
    1
  ]
-- child 1 type: string
  [
    null,
    "foo"
  ]
joris answered Mar 11 '26 05:03
