
Does Polars support UUID?

I have a time series of string-formatted UUIDs, and I would like Polars to translate them into u128 numbers for better storage and querying.

Similar to what we do with dates:

....str.to_datetime("%Y-%m-%dT%H:%M:%S.%fZ", strict=False)

Is this supported, or do I need to handle it on the Python side?

Also, I don't see a u128 type, but there's a Decimal that seems to be an i128. If I were to do my own translation, which type should I use?

P.S. I noticed a GitHub ticket in the Polars repository about supporting the Rust crate Uuid, but UUID conversion could in principle be implemented without that crate, so I am not sure whether it is supported.

asked Feb 10 '26 12:02 by Jeremy Chone


2 Answers

Polars doesn't support a u128 dtype. If you can accept the loss of information, you can store them as u64; otherwise, store them as a Utf8 column.

We don't have support for this yet, but we will also get FixedSizeBinary in the future, which could also fit this.
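To make "the loss" in the u64 option concrete, here is a minimal sketch using only the standard library. It assumes the truncation keeps the low 64 bits of the UUID; the answer does not say which bits to keep, and `uuid_to_u64` and `MASK_64` are hypothetical names used only for illustration.

```python
from uuid import UUID

# Assumption: "u64" here means keeping the low 64 bits of the 128-bit value.
MASK_64 = 2**64 - 1


def uuid_to_u64(u: UUID) -> int:
    """Keep only the low 64 bits of the UUID; the high 64 bits are discarded."""
    return u.int & MASK_64


# Two different UUIDs that share the same low 64 bits.
a = UUID("00000000-0000-0001-89ab-cdef01234567")
b = UUID("ffffffff-ffff-4fff-89ab-cdef01234567")

# The result always fits in an unsigned 64-bit integer...
assert 0 <= uuid_to_u64(a) <= MASK_64
# ...but distinct UUIDs can collapse to the same value, so there is no round trip.
assert a != b and uuid_to_u64(a) == uuid_to_u64(b)
```

Because the mapping is not reversible, this only suits cases where the column is used as an approximate key and the original UUID is never needed again.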

answered Feb 15 '26 20:02 by ritchie46


Polars now has pl.Int128, just not pl.UInt128. One workaround is to first take the constant difference between max(uint128) and max(int128), which is (2^128 - 1) - (2^127 - 1) = 2^127 = 170141183460469231731687303715884105728. Then, instead of storing the UUID's int, store the UUID's int - CONSTANT. To round-trip back to a UUID, add that constant back.

Alternatively, you could store the UUIDs as bytes in pl.Binary.
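As a quick sanity check of that arithmetic, the sketch below uses only the standard library to confirm that shifting by the constant maps the full UUID range exactly onto the signed 128-bit range. The names `to_signed`, `from_signed`, `I128_MIN` and `I128_MAX` are hypothetical and exist only for this illustration.

```python
from uuid import UUID

# Difference between max(uint128) and max(int128).
CONSTANT = (2**128 - 1) - (2**127 - 1)
I128_MIN, I128_MAX = -(2**127), 2**127 - 1


def to_signed(u: UUID) -> int:
    """Shift an unsigned 128-bit UUID value into the signed 128-bit range."""
    return u.int - CONSTANT


def from_signed(value: int) -> UUID:
    """Undo the shift and rebuild the UUID."""
    return UUID(int=value + CONSTANT)


smallest = UUID(int=0)
largest = UUID(int=2**128 - 1)

# The constant is exactly 2**127.
assert CONSTANT == 2**127
# The extremes of the UUID range land exactly on the signed 128-bit bounds.
assert to_signed(smallest) == I128_MIN
assert to_signed(largest) == I128_MAX
# The shift is reversible.
assert from_signed(to_signed(largest)) == largest
```

Since the shift is monotonic, ordering by the shifted integer matches ordering by the original unsigned value.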

Here's a comparison:


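from uuid import UUID, uuid4

import polars as pl

# Offset between max(uint128) and max(int128), i.e. 2**127.
CONSTANT = 2**127


def uuid_to_signed(uuid: UUID | str | int) -> int: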
    if isinstance(uuid, str):
        uuid = UUID(uuid)
    if isinstance(uuid, UUID):
        uuid = uuid.int
    return uuid - CONSTANT


def signed_to_uuid(uuid_int: int) -> UUID:
    corrected = uuid_int + CONSTANT
    return UUID(int=corrected)

def uuid_to_binary(uuid: UUID | str | int) -> bytes:
    if isinstance(uuid, str):
        uuid = UUID(uuid)
    if isinstance(uuid, UUID):
        uuid = uuid.int
    return uuid.to_bytes(16, "little")


def binary_to_uuid(uuid_bytes: bytes) -> UUID:
    return UUID(int=int.from_bytes(uuid_bytes, "little"))


n = 100_000
uuid_list = [uuid4() for _ in range(n)]
df = pl.DataFrame(
    [
        pl.Series("id_signed", [uuid_to_signed(x) for x in uuid_list]),
        pl.Series("id_binary", [uuid_to_binary(x) for x in uuid_list]),
        pl.Series("id_str", [str(x) for x in uuid_list]),
    ]
)


assert uuid_list == [signed_to_uuid(x) for x in df["id_signed"]]
assert uuid_list == [binary_to_uuid(x) for x in df["id_binary"]]
assert uuid_list == [UUID(x) for x in df["id_str"]]

print(pl.DataFrame(df['id_signed']).estimated_size('mb'))
print(pl.DataFrame(df['id_binary']).estimated_size('mb'))
print(pl.DataFrame(df['id_str']).estimated_size('mb'))

This shows that the signed-int hack and Binary take the same amount of space, and that each is less than half the size of storing the UUIDs as strings. Since the int workaround has no upside relative to Binary, use Binary.
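The size gap follows from the raw payload: 16 bytes per UUID versus 36 characters for the canonical string. The standard-library sketch below checks that, and also illustrates one design choice in the comparison code, which packs the integer in "little" byte order. As an assumption worth verifying for your use case, big-endian bytes (what `UUID.bytes` returns) may be preferable if byte-wise ordering should match the string ordering.

```python
from uuid import UUID

u = UUID("12345678-1234-5678-1234-567812345678")

little = u.int.to_bytes(16, "little")  # layout used in the comparison code
big = u.bytes                          # stdlib layout: big-endian

# Raw payload per value: 16 bytes versus 32 hex digits plus 4 hyphens.
assert len(little) == len(big) == 16
assert len(str(u)) == 36

# The two layouts are byte-reversed copies of each other.
assert big == little[::-1]
# Big-endian bytes round-trip directly and match the hex text.
assert UUID(bytes=big) == u
assert big.hex() == str(u).replace("-", "")

# Byte-wise comparison in Python: big-endian agrees with string order,
# little-endian does not.
a, b = UUID(int=1), UUID(int=256)
assert str(a) < str(b)
assert a.bytes < b.bytes
assert a.int.to_bytes(16, "little") > b.int.to_bytes(16, "little")
```

If the column is only ever used for equality lookups, either byte order works as long as encoding and decoding agree.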

answered Feb 15 '26 19:02 by Dean MacGregor


