
How to add leading zeroes to a PySpark DataFrame column

Tags:

pyspark

I am trying to add leading zeroes to a column in my PySpark DataFrame.

Input:

ID 123

Expected output:

000000000123

Riddhi Krishna asked Sep 16 '19 15:09


2 Answers

PySpark has an lpad function, which left-pads a string column with a pad string up to a given width. Here the ID column is padded with '0' to 12 characters:

from pyspark.sql.functions import lpad
df.select(lpad(df.ID, 12, '0').alias('s')).collect()
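
To make the padding rule concrete without a Spark session, here is a minimal plain-Python sketch of how lpad is documented to behave. The helper name lpad_model is hypothetical and not part of PySpark; it only models the behaviour, including the detail that a value longer than the target width is shortened to that width rather than left alone.

```python
def lpad_model(value: str, length: int, pad: str) -> str:
    """Plain-Python model of Spark's lpad (hypothetical helper, not a PySpark API)."""
    if length <= 0:
        return ""
    if len(value) >= length:
        # Inputs longer than the target width are cut to the leftmost `length` characters.
        return value[:length]
    missing = length - len(value)
    # Repeat the pad string and trim it to exactly the number of missing characters.
    prefix = (pad * missing)[:missing]
    return prefix + value


padded = lpad_model("123", 12, "0")
truncated = lpad_model("1234567890123", 12, "0")
print(padded)     # 000000000123
print(truncated)  # 123456789012
```

If IDs may be longer than 12 characters, the truncation case is worth checking before relying on lpad.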
Serge Harnyk answered Jan 01 '23 18:01


Use the format_string function to pad the value with leading zeros.

from pyspark.sql.functions import col, format_string
df = spark.createDataFrame([('123',),('1234',)],['number',])
df.show()
+------+
|number|
+------+
|   123|
|  1234|
+------+

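If the column holds strings, cast it to an integer first, because the %d format specifier expects an integer argument. Values too large for a 32-bit int need cast('long') instead.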

df = df.withColumn('number_padded', format_string("%012d", col('number').cast('int')))
df.show()
+------+-------------+
|number|number_padded|
+------+-------------+
|   123| 000000000123|
|  1234| 000000001234|
+------+-------------+
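
The "%012d" specifier can be tried in plain Python as well. The sketch below uses Python's % operator as a stand-in, on the assumption that it agrees with Spark's printf-style format_string for this simple specifier and non-negative integers. The helper name zero_pad is hypothetical.

```python
def zero_pad(number: int, width: int = 12) -> str:
    """Format an integer with leading zeros using a printf-style specifier."""
    spec = "%0" + str(width) + "d"  # builds "%012d" for the default width
    return spec % number


for n in (123, 1234, 1234567890123):
    print(zero_pad(n))
# 000000000123
# 000000001234
# 1234567890123
```

The width is a minimum here: a number with more digits than the width is printed in full and not truncated.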
cph_sto answered Jan 01 '23 19:01