
Computing time delta between two columns in Polars

I am struggling to find a good way to compute a time delta in Polars. I have the following string data and would like to compute the duration in hours between start and end. The tricky part is handling times around midnight.

import polars as pl

df = pl.DataFrame(
    {
        'start_time': [
            '23:00',
            '00:00'
        ],
        'end_time': [
            '24:00',
            '01:00'
        ]
    }
)

(
    df
    .with_columns(
        start_time = pl.col('start_time').str.to_time('%H:%M'),
        # Time cannot represent 24:00, so map it to 00:00 before parsing
        end_time = pl.col('end_time').replace('24:00', '00:00').str.to_time('%H:%M')
    )
    .with_columns(
        duration = (
            # treat an end time of 00:00 as 24:00 (86400000000000 ns)
            pl.when(pl.col('end_time') == pl.time(0, 0, 0))
            .then(86400000000000)
            .otherwise(pl.col('end_time').cast(pl.Int64))
            - pl.col('start_time').cast(pl.Int64)
        ) / 3600000000000  # nanoseconds per hour
    )
)

I came up with a solution, but I am very unhappy with it because it feels unnecessarily complicated. Basically, I replace end times equal to 24:00 with 00:00, because Polars can't work with the value 24:00. Because of this, I then need a when/then/otherwise to replace 00:00 in the end times with the number of nanoseconds that corresponds to 24:00. After that I can cast the times to integers, compute the difference in nanoseconds, and convert back to hours by dividing by the number of nanoseconds per hour.
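For reference, the two large constants in the snippet above are plain unit conversions. The sketch below reproduces the same arithmetic for the first row in ordinary Python, without Polars; the helper name `to_ns` is made up for illustration.

```python
NS_PER_SECOND = 1_000_000_000
NS_PER_HOUR = 3_600 * NS_PER_SECOND  # 3_600_000_000_000
NS_PER_DAY = 24 * NS_PER_HOUR        # 86_400_000_000_000


def to_ns(hhmm: str) -> int:
    """Hypothetical helper: nanoseconds since midnight for an 'HH:MM' string."""
    hours, minutes = hhmm.split(":")
    return (int(hours) * 60 + int(minutes)) * 60 * NS_PER_SECOND


# Row 1 of the question: start 23:00, end 24:00 (stored as 00:00 after the replace).
start_ns = to_ns("23:00")
end_ns = to_ns("00:00")
if end_ns == 0:  # same substitution as the when/then/otherwise
    end_ns = NS_PER_DAY

duration_hours = (end_ns - start_ns) / NS_PER_HOUR
print(duration_hours)
```

For this row the result is 1.0, matching the expected one-hour duration.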

Philipp asked May 29 '26 05:05


2 Answers

An alternative is to add 24 hours if the time duration is negative. Although it doesn't avoid converting “24:00” to “00:00”, I think it's a bit more readable and more general (e.g. your solution would fail for the extra row I added to the DataFrame).

import polars as pl

df = pl.DataFrame(
    {"start_time": ["23:00", "00:00", "23:30"], 
     "end_time": ["24:00", "01:00", "00:35"]}
)

duration = pl.col("end_time") - pl.col("start_time")

res = (
    df.with_columns(
        start_time=pl.col("start_time").str.to_time("%H:%M"),
        end_time=pl.col("end_time").replace("24:00", "00:00").str.to_time("%H:%M"),
    )
    .with_columns(
        duration=pl.when(duration < 0)
        .then(pl.duration(hours=24) + duration)
        .otherwise(duration)
    )
)

print(res)

Output:

shape: (3, 3)
┌────────────┬──────────┬──────────────┐
│ start_time ┆ end_time ┆ duration     │
│ ---        ┆ ---      ┆ ---          │
│ time       ┆ time     ┆ duration[μs] │
╞════════════╪══════════╪══════════════╡
│ 23:00:00   ┆ 00:00:00 ┆ 1h           │
│ 00:00:00   ┆ 01:00:00 ┆ 1h           │
│ 23:30:00   ┆ 00:35:00 ┆ 1h 5m        │
└────────────┴──────────┴──────────────┘
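The wrap-around rule itself does not depend on Polars. As a rough illustration, here is the same idea using the standard library's `datetime.timedelta`. The function name `wrapped_duration` is invented for this sketch, and it assumes that an end time earlier than the start time always means "the next day".

```python
from datetime import timedelta


def wrapped_duration(start: str, end: str) -> timedelta:
    """Duration between two 'HH:MM' strings, treating end < start as next day."""

    def parse(hhmm: str) -> timedelta:
        hours, minutes = hhmm.split(":")
        return timedelta(hours=int(hours), minutes=int(minutes))

    delta = parse(end) - parse(start)
    if delta < timedelta(0):
        # Negative difference: the end time belongs to the following day.
        delta += timedelta(hours=24)
    return delta


rows = [("23:00", "24:00"), ("00:00", "01:00"), ("23:30", "00:35")]
durations = [wrapped_duration(s, e) for s, e in rows]
for (s, e), d in zip(rows, durations):
    print(s, e, d)
```

Because a `timedelta` can hold a full 24 hours, "24:00" needs no substitution in this version. The three rows come out as 1 hour, 1 hour and 1 hour 5 minutes, the same durations as in the table above.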
Rodalm answered May 31 '26 19:05


Let me try an answer.
I think the issue is that your time strings range from 00:00 to 24:00. Converting them with the string method to_time gives values of type Time, but a Time only allows values from 00:00:00 to 23:59:59.999999999, so 24:00 is outside that range. Look at the output of the following code.

import polars as pl

s_str = pl.Series(["00:00", "24:00"])
s_t = s_str.str.to_time("%H:%M", strict=False)
print(s_t)
print(s_t.dtype.min())
print(s_t.dtype.max())

The output of the above code is:

shape: (2,)
Series: '' [time]
[
    00:00:00
    null
]
00:00:00
23:59:59.999999999

With strict=False, to_time does not raise an error and instead returns null for the string 24:00.
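As a side illustration that this limit is not specific to Polars, Python's own `datetime.time` also rejects an hour of 24. This uses only the standard library and says nothing about Polars internals.

```python
from datetime import time

try:
    time(24, 0)
    accepted = True
except ValueError as exc:
    # The standard library only allows hours 0..23.
    accepted = False
    print(f"rejected: {exc}")

# The largest value the standard-library type can hold is just before midnight.
print(time.max)
```

Note that `datetime.time` has microsecond resolution, so its maximum is 23:59:59.999999 rather than the nanosecond value shown for Polars.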

What can you do? Perhaps the following.

df.with_columns(
    pl.col("end_time").replace("24:00", "23:59")
).with_columns(
    pl.col("start_time").str.to_time("%H:%M"),
    pl.col("end_time").str.to_time("%H:%M")
).with_columns(
    d1=(pl.col("end_time") - pl.col("start_time"))
).with_columns(
    duration=pl.duration(minutes=(pl.col("d1").dt.total_minutes() / 10).round(0) * 10)
)
  1. Replace the string 24:00 with the highest allowed time value string 23:59.
  2. Convert the strings to time values.
  3. Calculate the duration.
  4. Round the duration to the nearest 10 minutes, so that 59 minutes becomes 1 hour.
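
The rounding in the last step is plain arithmetic. A quick sketch in ordinary Python (the helper name `round_to_10_minutes` is made up here) shows why 23:00 to 23:59 comes out as a full hour:

```python
def round_to_10_minutes(total_minutes: int) -> int:
    """Mirror of (total_minutes / 10).round(0) * 10 from the expression above."""
    return round(total_minutes / 10) * 10


# 23:00 -> 23:59 is 59 minutes, which rounds up to 60.
first_row = round_to_10_minutes(59)
# 00:00 -> 01:00 is already a multiple of 10 and is left unchanged.
second_row = round_to_10_minutes(60)
print(first_row, second_row)
```

Durations ending in exactly 5 minutes are left out on purpose: Python's `round` resolves ties to the even neighbour, which may not match how Polars rounds, so treat this only as an illustration of the idea.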

Hope this helps.

sk8 118 answered May 31 '26 19:05