
Substring multiple characters from the end of a PySpark string column using negative indexing

Closely related to "Spark Dataframe column with last character of other column", but I want to extract multiple characters counting back from the -1 index.


I have the following PySpark dataframe df:

+----------+----------+
|    number|event_type|
+----------+----------+
|0342224022|        11|
|0112964715|        11|
+----------+----------+

I want to extract the last 3 characters of the number column.

I tried the following:

from pyspark.sql.functions import substring 
df.select(substring(df['number'], -1, 3), 'event_type').show(2)

# which returns:

+----------------------+----------+
|substring(number,-1,3)|event_type|
+----------------------+----------+
|                     2|        11|
|                     5|        11|
+----------------------+----------+

Below is the expected output (I'm not sure how to interpret the output above):

+----------------------+----------+
|substring(number,-1,3)|event_type|
+----------------------+----------+
|                   022|        11|
|                   715|        11|
+----------------------+----------+

What am I doing wrong?

Note: Spark version 1.6.0

akilat90 asked Apr 12 '18 09:04




1 Answer

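substring takes a start position and a length. A negative position counts back from the end of the string, so -1 starts at the last character, where only one character is left to return. That is the output you are seeing. To get the last 3 characters, the position should be -3 and the length 3.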

pyspark.sql.functions.substring(str, pos, len)

You need to change your substring function call to:

from pyspark.sql.functions import substring
df.select(substring(df['number'], -3, 3), 'event_type').show(2)
#+------------------------+----------+
#|substring(number, -3, 3)|event_type|
#+------------------------+----------+
#|                     022|        11|
#|                     715|        11|
#+------------------------+----------+
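
To see why -1 returns a single character, here is a rough plain-Python model of the position arithmetic. This is only a sketch: sql_substring is a hypothetical helper written for illustration, not part of PySpark, and it assumes 1-based positive positions and negative positions that count back from the end, which matches the outputs shown above.

```python
def sql_substring(s, pos, length):
    """Plain-Python approximation of SQL-style substring(str, pos, len).

    Hypothetical helper for illustration only; not a PySpark function.
    """
    n = len(s)
    if pos > 0:
        start = pos - 1   # positive positions are 1-based
    elif pos < 0:
        start = n + pos   # negative positions count back from the end
    else:
        start = 0         # position 0 is treated as the first character
    end = start + length
    # Clamp at 0 so Python's own negative slicing never kicks in.
    return s[max(start, 0):max(end, 0)]


for number in ["0342224022", "0112964715"]:
    # pos=-1 starts at the last character, so at most one character remains.
    # pos=-3 starts three characters from the end, so all three are returned.
    print(number, sql_substring(number, -1, 3), sql_substring(number, -3, 3))
```

With pos=-1 the start lands on the last character, so asking for 3 characters still yields only one. With pos=-3 the start lands three characters from the end and the full length is available.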
pissall answered Oct 12 '22 16:10