 

Transform column with seconds to human-readable duration

I have the following DataFrame in PySpark:

+--------------------+-------+
|Name                |Seconds|
+--------------------+-------+
|Enviar solicitud ...|   1415|
|Analizar mapa de ...|   1209|
|Modificar solicit...|    591|
|Entregar servicio...|  91049|
+--------------------+-------+

I wish to convert the Seconds column into a human-readable duration string (days, hours, minutes). I am trying to use the following function:

def to_date(seconds=0):
    dat = ''
    if seconds >= 86400:              # whole days
        days = seconds // 86400
        seconds -= days * 86400
        dat += f'{days}d '
    if seconds >= 3600:               # whole hours
        hours = seconds // 3600
        seconds -= hours * 3600
        dat += f'{hours}hr '
    if seconds >= 60:                 # whole minutes
        minutes = seconds // 60
        dat += f'{minutes}min'
    return dat.strip() or '0min'      # fall back when under a minute

But there is no easy equivalent of Pandas' .apply(to_date) in PySpark. Is there any way to achieve what I am trying to do?

EXPECTED OUTPUT:

Analizar mapa de comparacion de presupuestos         1209         20min
Crear mapa de comparacion de presupuestos           12155     3hr 22min
Entregar servicios de bienes                        91049  1d 1hr 17min
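
For reference, the closest analogue to Pandas' .apply(to_date) would be wrapping the function above in a Python udf; a minimal sketch, assuming the Seconds column holds integers (both answers below avoid a UDF because native column expressions are faster):

from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# Register the plain-Python function as a UDF returning a string
to_date_udf = udf(to_date, StringType())
df = df.withColumn("duration", to_date_udf(col("Seconds")))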
asked Sep 16 '25 by Jose Gutierrez


2 Answers

There is no built-in function for that in Spark, but it can be done without a UDF. You can simply calculate it using division and modulo operations to get the different parts (days, hours, ...), and concatenate them to get the desired formatting.
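
As a quick illustration of that arithmetic in plain Python (not the Spark code itself), using the 91049-second row from the question:

seconds = 91049
days, rem = divmod(seconds, 86400)    # 1 day,  4649 s remaining
hours, rem = divmod(rem, 3600)        # 1 hr,   1049 s remaining
minutes, secs = divmod(rem, 60)       # 17 min, 29 s remaining
print(f"{days}d {hours}hr {minutes}min {secs}sec")  # 1d 1hr 17min 29sec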

For Spark 2.4+, you can use the higher-order function zip_with together with array_join. First create a parts column that contains the number of days, hours, minutes and seconds from the Seconds column. Then zip it with the literal array of units array('d', 'hr', 'min', 'sec') to concatenate each part with its unit, and finally join all the elements with a comma delimiter.

from pyspark.sql.functions import array, array_join, col, expr, floor

# (divisor, modulus) pairs for days, hours, minutes, seconds
duration_parts = [(86400, 7), (3600, 24), (60, 60), (1, 60)]
# Pair each part with its unit; IF(...) drops zero parts by returning null,
# which array_join then skips
exp = "zip_with(parts, array('d', 'hr', 'min', 'sec'), (x, y) -> IF(x > 0, concat(x, y), null))"

df.withColumn("parts", array(*[(floor(col("Seconds") / d)) % m for d, m in duration_parts]))\
  .withColumn("duration", array_join(expr(exp), ", "))\
  .drop("parts")\
  .show(truncate=False)

#+--------------------------------------------+-------+---------------------+
#|Name                                        |Seconds|duration             |
#+--------------------------------------------+-------+---------------------+
#|Analizar mapa de comparacion de presupuestos|1209   |20min, 9sec          |
#|Crear mapa de comparacion de presupuestos   |12155  |3hr, 22min, 35sec    |
#|Entregar servicios de bienes                |91049  |1d, 1hr, 17min, 29sec|
#+--------------------------------------------+-------+---------------------+
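
If you are on Spark 3.1+, zip_with is also exposed directly in the Python API, so the SQL expression string can be avoided; a minimal sketch reusing the same duration_parts list as above:

from pyspark.sql.functions import array, array_join, col, concat, floor, lit, when, zip_with

parts = array(*[floor(col("Seconds") / d) % m for d, m in duration_parts])
units = array(lit("d"), lit("hr"), lit("min"), lit("sec"))

# when(...) without otherwise yields null for zero parts; array_join skips nulls
df.withColumn("duration",
              array_join(zip_with(parts, units,
                                  lambda x, y: when(x > 0, concat(x, y))),
                         ", "))\
  .show(truncate=False)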

Another way is using concat; if you don't want to show parts that are equal to 0, wrap each piece in a when expression (see the sketch after this code block):

df.withColumn("duration", concat(
            floor(col("Seconds") / 86400), lit("d, "),
            floor(col("Seconds") % 86400 / 3600), lit("hr, "),
            floor((col("Seconds") % 86400) % 3600 / 60), lit("min, "),
            floor(((col("Seconds") % 86400) % 3600) % 60), lit("sec "),
        )).show(truncate=False)
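
A sketch of that when-based variant; the part helper below is just an illustrative name, not a Spark built-in:

from pyspark.sql.functions import col, concat, floor, lit, trim, when

# Hypothetical helper: render "<value><unit> " only when the value is non-zero
def part(value, unit):
    return when(value > 0, concat(value, lit(unit + " "))).otherwise(lit(""))

# trim removes the trailing space; Seconds == 0 yields an empty string
df.withColumn("duration", trim(concat(
        part(floor(col("Seconds") / 86400), "d"),
        part(floor(col("Seconds") % 86400 / 3600), "hr"),
        part(floor(col("Seconds") % 3600 / 60), "min"),
        part(col("Seconds") % 60, "sec"))))\
  .show(truncate=False)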
answered Sep 19 '25 by blackbishop


I think this can be achieved without a UDF, and it will be much faster and more scalable for big data. Try this and let me know if there is a hole in my logic.

from pyspark.sql import functions as F

df.withColumn("Minutes", F.round(F.col("Seconds") / 60, 2))\
  .withColumn("Hours", F.floor(F.col("Minutes") / 60))\
  .withColumn("hourmin", F.floor(F.col("Minutes") - F.col("Hours").cast("int") * 60))\
  .withColumn("Days", F.floor(F.col("Hours") / 24))\
  .withColumn("Days2", F.col("Days") * 24)\
  .withColumn("Time",
      # under an hour: minutes only
      F.when((F.col("Hours") == 0) & (F.col("Days") == 0),
             F.concat(F.col("hourmin"), F.lit("min")))
      # under a day: hours and minutes
       .when((F.col("Hours") != 0) & (F.col("Days") == 0),
             F.concat(F.col("Hours"), F.lit("hr "), F.col("hourmin"), F.lit("min")))
      # a day or more: days, remaining hours, minutes
       .when(F.col("Days") != 0,
             F.concat(F.col("Days"), F.lit("d "),
                      F.col("Hours") - F.col("Days2"), F.lit("hr "),
                      F.col("hourmin"), F.lit("min"))))\
  .drop("Minutes", "Hours", "hourmin", "Days", "Days2")\
  .show()


+-----------------+-------+---------------+
|             Name|Seconds|           Time|
+-----------------+-------+---------------+
| Enviar solicitud|   1209|          20min|
| Analizar mapa de|  12155|      3hr 22min|
|Entregar servicio|  91049|   1d 1hr 17min|
|         example1|   1900|          31min|
|         example2|   2500|          41min|
|         example3|9282398|107d 10hr 26min|
+-----------------+-------+---------------+
answered Sep 19 '25 by murtihash