remove last few characters in PySpark dataframe column

Tags: python, substring, pyspark

I have a PySpark DataFrame. How can I remove the last 5 characters from the values in the name column below?

from pyspark.sql.functions import col, substring, length
valuesCol = [('rose_2012',),('jasmine_2013',),('lily_2014',),('daffodil_2017',),('sunflower_2016',)]
df = sqlContext.createDataFrame(valuesCol,['name'])
df.show()

+--------------+
|          name|
+--------------+
|     rose_2012|
|  jasmine_2013|
|     lily_2014|
| daffodil_2017|
|sunflower_2016|
+--------------+

I want to create two columns from it: flower and year.

Expected output:

+--------------+----+---------+
|          name|year|   flower|
+--------------+----+---------+
|     rose_2012|2012|     rose|
|  jasmine_2013|2013|  jasmine|
|     lily_2014|2014|     lily|
| daffodil_2017|2017| daffodil|
|sunflower_2016|2016|sunflower|
+--------------+----+---------+

I have already created the year column:

df = df.withColumn("year", substring(col("name"),-4,4))
df.show()
+--------------+----+
|          name|year|
+--------------+----+
|     rose_2012|2012|
|  jasmine_2013|2013|
|     lily_2014|2014|
| daffodil_2017|2017|
|sunflower_2016|2016|
+--------------+----+

I don't know how to chop off the last 5 characters so that only the flower name remains. I tried the following, using length, but it doesn't work:

df = df.withColumn("flower",substring(col("name"),0,length(col("name"))-5))

How can I create a flower column that contains only the flower names?

242

asked Nov 05 '18 11:11

cph_sto

2 Answers

You can use the expr function, which evaluates a SQL expression so that length(name) is computed per row:

>>> from pyspark.sql.functions import substring, length, col, expr
>>> df = df.withColumn("flower",expr("substring(name, 1, length(name)-5)"))
>>> df.show()
+--------------+----+---------+
|          name|year|   flower|
+--------------+----+---------+
|     rose_2012|2012|     rose|
|  jasmine_2013|2013|  jasmine|
|     lily_2014|2014|     lily|
| daffodil_2017|2017| daffodil|
|sunflower_2016|2016|sunflower|
+--------------+----+---------+

answered Oct 07 '22 18:10

Ali Yesilli

You can use the split function. This code does what you want:

import pyspark.sql.functions as f

newDF = df.withColumn("year", f.split(df['name'], '_')[1]).\
           withColumn("flower", f.split(df['name'], '_')[0])

newDF.show()

+--------------+----+---------+
|          name|year|   flower|
+--------------+----+---------+
|     rose_2012|2012|     rose|
|  jasmine_2013|2013|  jasmine|
|     lily_2014|2014|     lily|
| daffodil_2017|2017| daffodil|
|sunflower_2016|2016|sunflower|
+--------------+----+---------+

answered Oct 07 '22 18:10

Ali AzG
