Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

remove last few characters in PySpark dataframe column

I am having a PySpark DataFrame. How can I chop off/remove last 5 characters from the column name below -

from pyspark.sql.functions import substring, length
valuesCol = [('rose_2012',),('jasmine_2013',),('lily_2014',),('daffodil_2017',),('sunflower_2016',)]
df = sqlContext.createDataFrame(valuesCol,['name'])
df.show()

+--------------+
|          name|
+--------------+
|     rose_2012|
|  jasmine_2013|
|     lily_2014|
| daffodil_2017|
|sunflower_2016|
+--------------+

I want to create 2 columns, the flower and year column.

Expected output:

+--------------+----+---------+
|          name|year|   flower|
+--------------+----+---------+
|     rose_2012|2012|     rose|
|  jasmine_2013|2013|  jasmine|
|     lily_2014|2014|     lily|
| daffodil_2017|2017| daffodil|
|sunflower_2016|2016|subflower|
+--------------+----+---------+

year column I have created -

df = df.withColumn("year", substring(col("name"),-4,4))
df.show()
+--------------+----+
|          name|year|
+--------------+----+
|     rose_2012|2012|
|  jasmine_2013|2013|
|     lily_2014|2014|
| daffodil_2017|2017|
|sunflower_2016|2016|
+--------------+----+

I don't know how to chop last 5 characters, so that I only have the name of flowers. I tried something like this, by invoking length, but that doesn't work.

df = df.withColumn("flower",substring(col("name"),0,length(col("name"))-5))

How can I create flower column with only flower names?

like image 242
cph_sto Avatar asked Nov 05 '18 11:11

cph_sto


People also ask

How do you trim columns in Pyspark?

Remove both leading and trailing space of column in pyspark with trim() function – strip or trim space. To Remove both leading and trailing space of the column in pyspark we use trim() function. trim() Function takes column name and trims both left and right white space from that column.

How do you get the first character of a string in Pyspark?

First N character of column in pyspark is obtained using substr() function.

How do you break strings in Pyspark?

In order to split the strings of the column in pyspark we will be using split() function. split function takes the column name and delimiter as arguments.


2 Answers

You can use expr function

>>> from pyspark.sql.functions import substring, length, col, expr
>>> df = df.withColumn("flower",expr("substring(name, 1, length(name)-5)"))
>>> df.show()
+--------------+----+---------+
|          name|year|   flower|
+--------------+----+---------+
|     rose_2012|2012|     rose|
|  jasmine_2013|2013|  jasmine|
|     lily_2014|2014|     lily|
| daffodil_2017|2017| daffodil|
|sunflower_2016|2016|sunflower|
+--------------+----+---------+
like image 98
Ali Yesilli Avatar answered Oct 07 '22 18:10

Ali Yesilli


You can use split function. this code does what you want:

import pyspark.sql.functions as f

newDF = df.withColumn("year", f.split(df['name'], '\_')[1]).\
           withColumn("flower", f.split(df['name'], '\_')[0])

newDF.show()

+--------------+----+---------+
|          name|year|   flower|
+--------------+----+---------+
|     rose_2012|2012|     rose|
|  jasmine_2013|2013|  jasmine|
|     lily_2014|2014|     lily|
| daffodil_2017|2017| daffodil|
|sunflower_2016|2016|sunflower|
+--------------+----+---------+
like image 40
Ali AzG Avatar answered Oct 07 '22 18:10

Ali AzG