
Padding in a Pyspark Dataframe

I have a PySpark DataFrame (the original DataFrame) with the data below (all columns have string datatype):

  id           Value
   1             103
   2             1504
   3              1  

I need to create a new, modified DataFrame where the Value column is padded so that its length is 4 characters. If a value is shorter than 4 characters, leading zeros should be added as shown below:

  id             Value
   1             0103
   2             1504
   3             0001  

Can someone help me out? How can I achieve this using a PySpark DataFrame? Any help will be appreciated.

asked Jul 30 '17 by rupak das


People also ask

How do you add a space in a column in Pyspark?

To add a leading space and a trailing space to a column in PySpark, use the concat() function, passing the column together with a " " (space) literal on either side.
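
A minimal sketch of that approach, assuming the question's id/Value data; the space literals are supplied with lit():

from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, lit

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("1", "103"), ("2", "1504")], ["id", "Value"])

# surround Value with a leading and a trailing space
df_spaced = df.withColumn("Value", concat(lit(" "), df["Value"], lit(" ")))
df_spaced.show()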

How do you add leading zeros in Pyspark?

Add leading zeros to a column in PySpark using the lpad() function. lpad() takes the column (e.g. "grad_score") as its first argument, followed by the total string length (e.g. 3), followed by the pad string "0", which is padded to the left of "grad_score". This adds leading zeros to the "grad_score" column until the string length becomes 3.
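
A minimal sketch of that, assuming an active SparkSession named spark and a hypothetical string column grad_score:

from pyspark.sql.functions import lpad

df = spark.createDataFrame([("9",), ("85",), ("100",)], ["grad_score"])
# left-pad grad_score with "0" until each value is 3 characters long
df = df.withColumn("grad_score", lpad(df["grad_score"], 3, "0"))
# "9" -> "009", "85" -> "085", "100" -> "100"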

How do you remove leading zeros in Pyspark?

To remove leading zeros from a column in PySpark, use the regexp_replace() function with the column and a regular expression as arguments to strip consecutive leading zeros. The regular expression replaces the leading zeros with '' (an empty string), and the result is then stored in grad_score_new.
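
A short sketch of that approach, reusing the hypothetical grad_score DataFrame from the sketch above:

from pyspark.sql.functions import regexp_replace

# replace one or more leading zeros with '' (an empty string)
df = df.withColumn("grad_score_new", regexp_replace(df["grad_score"], r"^0+", ""))
# "009" -> "9", "085" -> "85", "100" stays "100"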

What is withColumn Pyspark?

PySpark withColumn() is a DataFrame transformation used to change the value of an existing column, convert its datatype, create a new column, and more.
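
A short sketch of those uses, assuming df is the question's DataFrame with string columns id and Value:

from pyspark.sql.functions import col, length, lpad

# change the value of an existing column (left-pad Value to 4 characters)
df = df.withColumn("Value", lpad(col("Value"), 4, "0"))
# convert the datatype of an existing column
df = df.withColumn("id", col("id").cast("int"))
# create a new column derived from an existing one
df = df.withColumn("Value_length", length(col("Value")))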


2 Answers

You can use lpad from the functions module:

>>> from pyspark.sql.functions import lpad
>>> df.select('id', lpad(df['value'], 4, '0').alias('value')).show()
+---+-----+
| id|value|
+---+-----+
|  1| 0103|
|  2| 1504|
|  3| 0001|
+---+-----+
answered by Suresh


Using the PySpark lpad function in conjunction with withColumn:

import pyspark.sql.functions as F
dfNew = dfOrigin.withColumn('Value', F.lpad(dfOrigin['Value'], 4, '0')) 
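
Note that lpad returns a string column; if a value is longer than the target length (4 here), it is truncated to that length. Since the original Value column is already a string, no cast is needed before padding.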
answered by ucsky