Does Polars support UUID?

Question

I have a time series of string-formatted UUIDs, and I would like Polars to translate them into u128 numbers for better storage and querying.

Similar to what we do with dates:

....str.to_datetime("%Y-%m-%dT%H:%M:%S.%fZ", strict=False)

Is this supported, or do I need to handle it on the Python side?

Also, I don't see a u128 type, but there's a Decimal that seems to be an i128. If I were to do my own translation, which type should I use?

P.S. I noticed a GitHub ticket in the Polars repository about supporting the Rust uuid crate, but this could in principle be implemented without that crate, so I am not sure whether UUIDs are supported yet.

ritchie46 · Accepted Answer

Polars doesn't support a u128 dtype. If you can accept the loss of half of the 128 bits, you can store them as u64; otherwise, store them as a Utf8 column.

We don't support this yet, but we will also get FixedSizeBinary in the future, which could fit this use case as well.
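To make "the loss" concrete, here is a minimal sketch using only the Python standard library. The helper name `uuid_to_u64` and the choice to keep the low 64 bits are assumptions for illustration; the answer does not say which bits to keep.

```python
from uuid import UUID

MASK_64 = (1 << 64) - 1  # keeps only the low 64 bits


def uuid_to_u64(u: UUID) -> int:
    """Hypothetical lossy mapping: drops the upper 64 bits of the UUID."""
    return u.int & MASK_64


# Two different UUIDs that share their low 64 bits.
a = UUID("00000000-0000-0001-0000-000000000abc")
b = UUID("00000000-0000-0002-0000-000000000abc")

assert a != b
# After truncation they are indistinguishable, so the mapping cannot be reversed.
assert uuid_to_u64(a) == uuid_to_u64(b) == 0xABC
```

Whether such collisions are acceptable depends on how the ids were generated and how many there are.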

Dean MacGregor · Answer

Polars now has pl.Int128, just not pl.UInt128. One workaround would be to, first, take the constant difference between max(uint128) and max(int128), which is (2^128 - 1) - (2^127 - 1) = 2^127 = 170141183460469231731687303715884105728. Next, instead of storing the UUID's int, store the UUID's int - CONSTANT. To round-trip back to a UUID, add that constant back.
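The arithmetic can be checked with plain Python integers. This is only a sketch of the offset idea; the names `INT128_MIN`, `INT128_MAX` and `UINT128_MAX` are introduced here for illustration.

```python
from uuid import UUID

INT128_MIN = -(2**127)
INT128_MAX = 2**127 - 1
UINT128_MAX = 2**128 - 1

# Difference between the two maxima, as described in the answer.
CONSTANT = UINT128_MAX - INT128_MAX
assert CONSTANT == 2**127 == 170141183460469231731687303715884105728

# The smallest and largest possible UUID values both land inside the
# signed 128-bit range after the shift, and both round-trip exactly.
for u in (UUID(int=0), UUID(int=UINT128_MAX)):
    shifted = u.int - CONSTANT
    assert INT128_MIN <= shifted <= INT128_MAX
    assert UUID(int=shifted + CONSTANT) == u
```

Because the shift is a constant, it also preserves the relative order of the original unsigned values.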

Alternatively, you could store the UUIDs as bytes in pl.Binary.

Here's a comparison:


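from uuid import UUID, uuid4

import polars as pl

# Offset between max(uint128) and max(int128), i.e. 2**127.
CONSTANT = 170141183460469231731687303715884105728


def uuid_to_signed(uuid: UUID | str | int) -> int:
    # Accept a UUID, its string form, or its unsigned integer value.
    if isinstance(uuid, str):
        uuid = UUID(uuid)
    if isinstance(uuid, UUID):
        uuid = uuid.int
    # Shift the unsigned value into the signed 128-bit range.
    return uuid - CONSTANT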


def signed_to_uuid(uuid_int: int) -> UUID:
    corrected = uuid_int + CONSTANT
    return UUID(int=corrected)

def uuid_to_binary(uuid: UUID | str | int) -> bytes:
    if isinstance(uuid, str):
        uuid = UUID(uuid)
    if isinstance(uuid, UUID):
        uuid = uuid.int
    return uuid.to_bytes(16, "little")


def binary_to_uuid(uuid_bytes: bytes) -> UUID:
    return UUID(int=int.from_bytes(uuid_bytes, "little"))


n = 100_000
uuid_list = [uuid4() for _ in range(n)]
df = pl.DataFrame(
    [
        pl.Series("id_signed", [uuid_to_signed(x) for x in uuid_list]),
        pl.Series("id_binary", [uuid_to_binary(x) for x in uuid_list]),
        pl.Series("id_str", [str(x) for x in uuid_list]),
    ]
)


assert uuid_list == [signed_to_uuid(x) for x in df["id_signed"]]
assert uuid_list == [binary_to_uuid(x) for x in df["id_binary"]]
assert uuid_list == [UUID(x) for x in df["id_str"]]

print(pl.DataFrame(df['id_signed']).estimated_size('mb'))
print(pl.DataFrame(df['id_binary']).estimated_size('mb'))
print(pl.DataFrame(df['id_str']).estimated_size('mb'))

This shows that the signed int hack and Binary are the same size, and that each of those is less than half the size of storing the UUIDs as strings. Since there's no upside in using the int workaround relative to Binary, use Binary.
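One design choice worth noting if you go with Binary: the byte order. The code above uses little-endian, which round-trips fine. If byte-wise ordering of the stored values matters to you (an assumption about your querying needs, not something the answer states), the standard library's `UUID.bytes` gives the big-endian form, whose byte-wise order matches the integer order. A minimal stdlib-only sketch with hypothetical helper names:

```python
from uuid import UUID


def uuid_to_sortable_bytes(u: UUID) -> bytes:
    # UUID.bytes is the 16-byte big-endian representation.
    return u.bytes


def sortable_bytes_to_uuid(b: bytes) -> UUID:
    return UUID(bytes=b)


# Already in ascending integer order.
ids = [UUID(int=1), UUID(int=256), UUID(int=2**100)]

as_big = [uuid_to_sortable_bytes(u) for u in ids]
as_little = [u.int.to_bytes(16, "little") for u in ids]

# Big-endian bytes sort in the same order as the integers...
assert sorted(as_big) == as_big
# ...while little-endian bytes do not.
assert sorted(as_little) != as_little

# Round trip is exact either way.
assert [sortable_bytes_to_uuid(b) for b in as_big] == ids
```

Both encodings take 16 bytes per value, so the size comparison above is unaffected.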

Tags:

python-polars

rust-polars

Jeremy Chone
