I am having a PySpark DataFrame. How can I chop off/remove last 5 characters from the column name
below -
from pyspark.sql.functions import substring, length
valuesCol = [('rose_2012',),('jasmine_2013',),('lily_2014',),('daffodil_2017',),('sunflower_2016',)]
df = sqlContext.createDataFrame(valuesCol,['name'])
df.show()
+--------------+
| name|
+--------------+
| rose_2012|
| jasmine_2013|
| lily_2014|
| daffodil_2017|
|sunflower_2016|
+--------------+
I want to create 2 columns, the flower
and year
column.
Expected output:
+--------------+----+---------+
| name|year| flower|
+--------------+----+---------+
| rose_2012|2012| rose|
| jasmine_2013|2013| jasmine|
| lily_2014|2014| lily|
| daffodil_2017|2017| daffodil|
|sunflower_2016|2016|subflower|
+--------------+----+---------+
year
column I have created -
df = df.withColumn("year", substring(col("name"),-4,4))
df.show()
+--------------+----+
| name|year|
+--------------+----+
| rose_2012|2012|
| jasmine_2013|2013|
| lily_2014|2014|
| daffodil_2017|2017|
|sunflower_2016|2016|
+--------------+----+
I don't know how to chop last 5 characters, so that I only have the name of flowers. I tried something like this, by invoking length
, but that doesn't work.
df = df.withColumn("flower",substring(col("name"),0,length(col("name"))-5))
How can I create flower
column with only flower names?
Remove both leading and trailing space of column in pyspark with trim() function – strip or trim space. To Remove both leading and trailing space of the column in pyspark we use trim() function. trim() Function takes column name and trims both left and right white space from that column.
First N character of column in pyspark is obtained using substr() function.
In order to split the strings of the column in pyspark we will be using split() function. split function takes the column name and delimiter as arguments.
You can use expr function
>>> from pyspark.sql.functions import substring, length, col, expr
>>> df = df.withColumn("flower",expr("substring(name, 1, length(name)-5)"))
>>> df.show()
+--------------+----+---------+
| name|year| flower|
+--------------+----+---------+
| rose_2012|2012| rose|
| jasmine_2013|2013| jasmine|
| lily_2014|2014| lily|
| daffodil_2017|2017| daffodil|
|sunflower_2016|2016|sunflower|
+--------------+----+---------+
You can use split
function. this code does what you want:
import pyspark.sql.functions as f
newDF = df.withColumn("year", f.split(df['name'], '\_')[1]).\
withColumn("flower", f.split(df['name'], '\_')[0])
newDF.show()
+--------------+----+---------+
| name|year| flower|
+--------------+----+---------+
| rose_2012|2012| rose|
| jasmine_2013|2013| jasmine|
| lily_2014|2014| lily|
| daffodil_2017|2017| daffodil|
|sunflower_2016|2016|sunflower|
+--------------+----+---------+
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With