 

Transform column with seconds to human-readable duration

I have the following DataFrame in PySpark:

+--------------------+-------+
|Name                |Seconds|
+--------------------+-------+
|Enviar solicitud ...|   1415|
|Analizar mapa de ...|   1209|
|Modificar solicit...|    591|
|Entregar servicio...|  91049|
+--------------------+-------+

I wish to convert the Seconds column into a human-readable duration string (days, hours, minutes). I am trying to use the following function:

def to_date(seconds=0):
    dat = ''
    if seconds >= 86400:              # whole days
        days = seconds // 86400
        seconds -= days * 86400
        dat += f'{days}d '
    if seconds >= 3600:               # whole hours
        hours = seconds // 3600
        seconds -= hours * 3600
        dat += f'{hours}hr '
    if seconds >= 60:                 # whole minutes
        minutes = seconds // 60
        dat += f'{minutes}min'
    return dat.strip() or '0min'      # fall back when under a minute

But there is no easy equivalent of Pandas' .apply(to_date) in PySpark. Is there any way to achieve what I am trying to do?

EXPECTED OUTPUT:

Analizar mapa de comparacion de presupuestos         1209         20min
Crear mapa de comparacion de presupuestos           12155     3hr 22min
Entregar servicios de bienes                        91049  1d 1hr 17min
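
For reference, the closest analogue to Pandas' .apply(to_date) would be wrapping the function above in a Python udf; a minimal sketch, assuming the Seconds column holds integers (both answers below avoid a UDF because native column expressions are faster):

from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# Register the plain-Python function as a UDF returning a string
to_date_udf = udf(to_date, StringType())
df = df.withColumn("duration", to_date_udf(col("Seconds")))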
asked Sep 16 '25 by Jose Gutierrez


2 Answers

There is no built-in function for that in Spark, but it can be done without a UDF. You can simply calculate it using division and modulo operations to get the different parts (days, hours, ...), and concatenate them to get the desired formatting.
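
As a quick illustration of that arithmetic in plain Python (not the Spark code itself), using the 91049-second row from the question:

seconds = 91049
days, rem = divmod(seconds, 86400)    # 1 day,  4649 s remaining
hours, rem = divmod(rem, 3600)        # 1 hr,   1049 s remaining
minutes, secs = divmod(rem, 60)       # 17 min, 29 s remaining
print(f"{days}d {hours}hr {minutes}min {secs}sec")  # 1d 1hr 17min 29sec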

For Spark 2.4+, you can use the higher-order function zip_with together with array_join. First create a parts column that contains the number of days, hours, minutes and seconds from the Seconds column. Then zip it with the literal array of units array('d', 'hr', 'min', 'sec') to concatenate each part with its unit, and finally join all the elements with a comma delimiter.

from pyspark.sql.functions import array, array_join, col, expr, floor

# (divisor, modulus) pairs for days, hours, minutes, seconds
duration_parts = [(86400, 7), (3600, 24), (60, 60), (1, 60)]
# Pair each part with its unit; IF(...) drops zero parts by returning null,
# which array_join then skips
exp = "zip_with(parts, array('d', 'hr', 'min', 'sec'), (x, y) -> IF(x > 0, concat(x, y), null))"

df.withColumn("parts", array(*[(floor(col("Seconds") / d)) % m for d, m in duration_parts]))\
  .withColumn("duration", array_join(expr(exp), ", "))\
  .drop("parts")\
  .show(truncate=False)

#+--------------------------------------------+-------+---------------------+
#|Name                                        |Seconds|duration             |
#+--------------------------------------------+-------+---------------------+
#|Analizar mapa de comparacion de presupuestos|1209   |20min, 9sec          |
#|Crear mapa de comparacion de presupuestos   |12155  |3hr, 22min, 35sec    |
#|Entregar servicios de bienes                |91049  |1d, 1hr, 17min, 29sec|
#+--------------------------------------------+-------+---------------------+
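
If you are on Spark 3.1+, zip_with is also exposed directly in the Python API, so the SQL expression string can be avoided; a minimal sketch reusing the same duration_parts list as above:

from pyspark.sql.functions import array, array_join, col, concat, floor, lit, when, zip_with

parts = array(*[floor(col("Seconds") / d) % m for d, m in duration_parts])
units = array(lit("d"), lit("hr"), lit("min"), lit("sec"))

# when(...) without otherwise yields null for zero parts; array_join skips nulls
df.withColumn("duration",
              array_join(zip_with(parts, units,
                                  lambda x, y: when(x > 0, concat(x, y))),
                         ", "))\
  .show(truncate=False)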

Another way is using concat; if you don't want to show parts that are equal to 0, wrap each piece in a when expression (see the sketch after this code block):

df.withColumn("duration", concat(
            floor(col("Seconds") / 86400), lit("d, "),
            floor(col("Seconds") % 86400 / 3600), lit("hr, "),
            floor((col("Seconds") % 86400) % 3600 / 60), lit("min, "),
            floor(((col("Seconds") % 86400) % 3600) % 60), lit("sec "),
        )).show(truncate=False)
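
A sketch of that when-based variant; the part helper below is just an illustrative name, not a Spark built-in:

from pyspark.sql.functions import col, concat, floor, lit, trim, when

# Hypothetical helper: render "<value><unit> " only when the value is non-zero
def part(value, unit):
    return when(value > 0, concat(value, lit(unit + " "))).otherwise(lit(""))

# trim removes the trailing space; Seconds == 0 yields an empty string
df.withColumn("duration", trim(concat(
        part(floor(col("Seconds") / 86400), "d"),
        part(floor(col("Seconds") % 86400 / 3600), "hr"),
        part(floor(col("Seconds") % 3600 / 60), "min"),
        part(col("Seconds") % 60, "sec"))))\
  .show(truncate=False)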
answered Sep 19 '25 by blackbishop


I think this can be achieved without a UDF, and it will be much faster and more scalable for big data. Try this and let me know if there is a hole in my logic.

from pyspark.sql import functions as F

df.withColumn("Minutes", F.round(F.col("Seconds") / 60, 2))\
  .withColumn("Hours", F.floor(F.col("Minutes") / 60))\
  .withColumn("hourmin", F.floor(F.col("Minutes") - F.col("Hours").cast("int") * 60))\
  .withColumn("Days", F.floor(F.col("Hours") / 24))\
  .withColumn("Days2", F.col("Days") * 24)\
  .withColumn("Time",
      # under an hour: minutes only
      F.when((F.col("Hours") == 0) & (F.col("Days") == 0),
             F.concat(F.col("hourmin"), F.lit("min")))
      # under a day: hours and minutes
       .when((F.col("Hours") != 0) & (F.col("Days") == 0),
             F.concat(F.col("Hours"), F.lit("hr "), F.col("hourmin"), F.lit("min")))
      # a day or more: days, remaining hours, minutes
       .when(F.col("Days") != 0,
             F.concat(F.col("Days"), F.lit("d "),
                      F.col("Hours") - F.col("Days2"), F.lit("hr "),
                      F.col("hourmin"), F.lit("min"))))\
  .drop("Minutes", "Hours", "hourmin", "Days", "Days2")\
  .show()


+-----------------+-------+---------------+
|             Name|Seconds|           Time|
+-----------------+-------+---------------+
| Enviar solicitud|   1209|          20min|
| Analizar mapa de|  12155|      3hr 22min|
|Entregar servicio|  91049|   1d 1hr 17min|
|         example1|   1900|          31min|
|         example2|   2500|          41min|
|         example3|9282398|107d 10hr 26min|
+-----------------+-------+---------------+
answered Sep 19 '25 by murtihash