I have a PySpark DataFrame (the original DataFrame) with the data below (all columns are of string type):
id Value
1 103
2 1504
3 1
I need to create a new, modified DataFrame in which the Value column is padded so that every entry is 4 characters long. If a value is shorter than 4 characters, leading 0s should be added, as shown below:
id Value
1 0103
2 1504
3 0001
Can someone help me out? How can I achieve this with a PySpark DataFrame? Any help will be appreciated.
To add leading and trailing spaces to a column in PySpark, use the concat() function. concat() takes the column together with a " " (space) literal, wrapped in lit(), on either side.
Add leading zeros to a column in PySpark using the lpad() function. lpad() takes the "grad_score" column as its first argument, followed by 3, the total string length, and then "0", the character that is padded to the left of "grad_score". This adds leading zeros to the "grad_score" column until the string length becomes 3.
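To make the padding rule concrete, here is a plain-Python model of what lpad does to a single string. It is only a sketch of the semantics, not Spark code: the helper name lpad_model is hypothetical, and it assumes a non-empty pad string. It also shows a detail that is easy to miss: when the input is already longer than the target length, lpad shortens it to that length rather than leaving it unchanged.

```python
def lpad_model(value: str, length: int, pad: str) -> str:
    """Plain-Python model of lpad for one value (assumes pad is non-empty)."""
    if length <= 0:
        return ""
    if len(value) >= length:
        # Values that are already long enough are cut down to `length`.
        return value[:length]
    needed = length - len(value)
    # Repeat the pad string and trim it to exactly the number of missing characters.
    filler = (pad * needed)[:needed]
    return filler + value


for raw in ["103", "1504", "1", "12345"]:
    print(raw, "->", lpad_model(raw, 4, "0"))
```

With a target length of 4, the three values from the question become 0103, 1504 and 0001, while a five-character value such as 12345 is cut to 1234.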
Remove leading zeros from a column in PySpark: use the regexp_replace() function with the column name and a regular expression as arguments to remove consecutive leading zeros. The regular expression replaces all the leading zeros with '' (an empty string), and the result is stored in grad_score_new.
PySpark withColumn() is a transformation function of DataFrame which is used to change the value, convert the datatype of an existing column, create a new column, and many more.
You can use lpad from the functions module:
from pyspark.sql.functions import lpad
>>> df.select('id',lpad(df['value'],4,'0').alias('value')).show()
+---+-----+
| id|value|
+---+-----+
| 1| 0103|
| 2| 1504|
| 3| 0001|
+---+-----+
Using the PySpark lpad function in conjunction with withColumn:
import pyspark.sql.functions as F
dfNew = dfOrigin.withColumn('Value', F.lpad(dfOrigin['Value'], 4, '0'))