
Padding in a Pyspark Dataframe

I have a PySpark DataFrame (the original DataFrame) with the data below (all columns have string datatype):

  id           Value
   1             103
   2             1504
   3              1  

I need to create a new, modified DataFrame where the Value column is padded so that its length is 4 characters. If a value is shorter than 4 characters, leading zeros should be added as shown below:

  id             Value
   1             0103
   2             1504
   3             0001  

Can someone help me out? How can I achieve this using a PySpark DataFrame? Any help will be appreciated.

asked Jul 30 '17 by rupak das


People also ask

How do you add a space in a column in Pyspark?

To add a leading space and a trailing space to a column in PySpark, use the concat() function, passing the column together with a " " (space) literal on either side.
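
A minimal sketch of that approach, assuming the question's id/Value data; the space literals are supplied with lit():

from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, lit

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("1", "103"), ("2", "1504")], ["id", "Value"])

# surround Value with a leading and a trailing space
df_spaced = df.withColumn("Value", concat(lit(" "), df["Value"], lit(" ")))
df_spaced.show()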

How do you add leading zeros in Pyspark?

Add leading zeros to a column in PySpark using the lpad() function. lpad() takes the column (e.g. "grad_score") as its first argument, followed by the total string length (e.g. 3), followed by the pad string "0", which is padded to the left of "grad_score". This adds leading zeros to the "grad_score" column until the string length becomes 3.
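
A minimal sketch of that, assuming an active SparkSession named spark and a hypothetical string column grad_score:

from pyspark.sql.functions import lpad

df = spark.createDataFrame([("9",), ("85",), ("100",)], ["grad_score"])
# left-pad grad_score with "0" until each value is 3 characters long
df = df.withColumn("grad_score", lpad(df["grad_score"], 3, "0"))
# "9" -> "009", "85" -> "085", "100" -> "100"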

How do you remove leading zeros in Pyspark?

To remove leading zeros from a column in PySpark, use the regexp_replace() function with the column and a regular expression as arguments to strip consecutive leading zeros. The regular expression replaces the leading zeros with '' (an empty string), and the result is then stored in grad_score_new.
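
A short sketch of that approach, reusing the hypothetical grad_score DataFrame from the sketch above:

from pyspark.sql.functions import regexp_replace

# replace one or more leading zeros with '' (an empty string)
df = df.withColumn("grad_score_new", regexp_replace(df["grad_score"], r"^0+", ""))
# "009" -> "9", "085" -> "85", "100" stays "100"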

What is withColumn Pyspark?

PySpark withColumn() is a DataFrame transformation used to change the value of an existing column, convert its datatype, create a new column, and more.
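
A short sketch of those uses, assuming df is the question's DataFrame with string columns id and Value:

from pyspark.sql.functions import col, length, lpad

# change the value of an existing column (left-pad Value to 4 characters)
df = df.withColumn("Value", lpad(col("Value"), 4, "0"))
# convert the datatype of an existing column
df = df.withColumn("id", col("id").cast("int"))
# create a new column derived from an existing one
df = df.withColumn("Value_length", length(col("Value")))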


2 Answers

You can use lpad from the functions module:

>>> from pyspark.sql.functions import lpad
>>> df.select('id', lpad(df['value'], 4, '0').alias('value')).show()
+---+-----+
| id|value|
+---+-----+
|  1| 0103|
|  2| 1504|
|  3| 0001|
+---+-----+
answered by Suresh


Using the PySpark lpad function in conjunction with withColumn:

import pyspark.sql.functions as F
dfNew = dfOrigin.withColumn('Value', F.lpad(dfOrigin['Value'], 4, '0')) 
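
Note that lpad returns a string column; if a value is longer than the target length (4 here), it is truncated to that length. Since the original Value column is already a string, no cast is needed before padding.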
answered by ucsky