I have a time series of string-formatted UUIDs, and I would like Polars to translate them into u128 numbers for better storage and querying.
Similar to what we do with dates:
....str.to_datetime("%Y-%m-%dT%H:%M:%S.%fZ", strict=False)
Is this supported, or do I need to handle it on the Python side?
Also, I don't see a u128 type, but there's a Decimal that seems to be an i128. If I were to do my own translation, which type should I use?
P.S. I noticed a GitHub ticket in the Polars repository about supporting the Rust crate Uuid, but this could be implemented without that crate, so I am not sure whether it is already supported.
Polars doesn't support a u128 dtype. If you can accept the loss of information, you can store them as u64; otherwise, keep them as a Utf8 column.
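To illustrate what "the loss" means here, the following is a minimal sketch that assumes the u64 value is obtained by keeping only the low 64 bits of the UUID's integer. The answer does not say which truncation is meant, so this is just one possibility, and `uuid_to_u64` is a hypothetical helper name. Two different UUIDs can then map to the same u64, and the original cannot be recovered.

```python
from uuid import UUID

MASK_64 = 2**64 - 1  # bit mask that keeps only the low 64 bits


def uuid_to_u64(value: UUID) -> int:
    """Hypothetical helper: truncate a UUID to its low 64 bits (lossy)."""
    return value.int & MASK_64


# Two distinct UUIDs that differ only in their high 64 bits.
a = UUID(int=(1 << 64) | 5)
b = UUID(int=(2 << 64) | 5)

# After truncation they are indistinguishable.
collide = uuid_to_u64(a) == uuid_to_u64(b)
print(a != b, collide)
```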
We don't have support for this yet, but we will also get FixedSizeBinary in the future, which could fit this as well.
Polars now has pl.Int128, just not pl.UInt128. One workaround is to first take the constant difference between max(uint128) and max(int128), which is (2^128 - 1) - (2^127 - 1) = 2^127 = 170141183460469231731687303715884105728. Next, instead of storing the UUID's int, store the UUID's int - CONSTANT. To round-trip back to a UUID, add that constant back.
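As a quick sanity check of that arithmetic, this standard-library sketch confirms that the offset equals 2^127 and that the smallest and largest possible UUID integers land exactly on the signed 128-bit bounds after the shift. The names `U128_MAX`, `I128_MAX`, and `I128_MIN` are illustrative and are not Polars identifiers.

```python
from uuid import UUID

U128_MAX = 2**128 - 1   # largest unsigned 128-bit value
I128_MAX = 2**127 - 1   # largest signed 128-bit value
I128_MIN = -(2**127)    # smallest signed 128-bit value

# Offset between the two ranges.
CONSTANT = U128_MAX - I128_MAX

lowest = UUID(int=0)
highest = UUID(int=U128_MAX)

# Shift the extremes of the UUID range into the signed range.
shifted_lowest = lowest.int - CONSTANT
shifted_highest = highest.int - CONSTANT

# Adding the constant back restores the original UUID.
restored = UUID(int=shifted_highest + CONSTANT)
print(CONSTANT == 2**127, shifted_lowest == I128_MIN, shifted_highest == I128_MAX)
```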
Alternatively, you could store the UUIDs as bytes in pl.Binary.
Here's a comparison:
    from uuid import UUID, uuid4

    import polars as pl

    # Offset between the unsigned and signed 128-bit ranges (2**127).
    CONSTANT = 2**127

    def uuid_to_signed(uuid: UUID | str | int) -> int:
        if isinstance(uuid, str):
            uuid = UUID(uuid)
        if isinstance(uuid, UUID):
            uuid = uuid.int
        # Shift [0, 2**128 - 1] into the signed 128-bit range.
        return uuid - CONSTANT

    def signed_to_uuid(uuid_int: int) -> UUID:
        corrected = uuid_int + CONSTANT
        return UUID(int=corrected)

    def uuid_to_binary(uuid: UUID | str | int) -> bytes:
        if isinstance(uuid, str):
            uuid = UUID(uuid)
        if isinstance(uuid, UUID):
            uuid = uuid.int
        # 16 bytes, little-endian; binary_to_uuid must use the same byte order.
        return uuid.to_bytes(16, "little")

    def binary_to_uuid(uuid_bytes: bytes) -> UUID:
        return UUID(int=int.from_bytes(uuid_bytes, "little"))

    n = 100_000
    uuid_list = [uuid4() for _ in range(n)]
    df = pl.DataFrame(
        [
            # Explicit dtype: these values do not fit in Int64.
            pl.Series("id_signed", [uuid_to_signed(x) for x in uuid_list], dtype=pl.Int128),
            pl.Series("id_binary", [uuid_to_binary(x) for x in uuid_list]),
            pl.Series("id_str", [str(x) for x in uuid_list]),
        ]
    )

    # All three representations round-trip back to the original UUIDs.
    assert uuid_list == [signed_to_uuid(x) for x in df["id_signed"]]
    assert uuid_list == [binary_to_uuid(x) for x in df["id_binary"]]
    assert uuid_list == [UUID(x) for x in df["id_str"]]

    # Compare the estimated in-memory size of each column, in megabytes.
    print(pl.DataFrame(df["id_signed"]).estimated_size("mb"))
    print(pl.DataFrame(df["id_binary"]).estimated_size("mb"))
    print(pl.DataFrame(df["id_str"]).estimated_size("mb"))
This shows that the signed-int hack and Binary take the same amount of space, and that each is less than half the size of storing the UUIDs as strings. Since the int workaround has no upside over Binary, use Binary.
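The size ratio is consistent with the raw per-value payloads: a UUID is 16 bytes, while its canonical string form is 36 characters. The standard-library sketch below checks this, and also shows `UUID.bytes` as a shortcut for producing the 16 bytes. Note that `UUID.bytes` is big-endian, whereas the comparison code above uses little-endian, so the two encodings should not be mixed within one column.

```python
from uuid import UUID, uuid4

u = uuid4()

as_text = str(u)    # canonical hyphenated hex form
as_bytes = u.bytes  # 16 raw bytes, big-endian

text_len = len(as_text)
bytes_len = len(as_bytes)

# UUID.bytes matches the big-endian encoding of the 128-bit integer.
same_as_big_endian = as_bytes == u.int.to_bytes(16, "big")

# The bytes round-trip back to the same UUID.
round_trip = UUID(bytes=as_bytes)
print(text_len, bytes_len, round_trip == u)
```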