
Python spark extract characters from dataframe

I have a dataframe in spark, something like this:

ID     | Column
------ | ----
1      | STRINGOFLETTERS
2      | SOMEOTHERCHARACTERS
3      | ANOTHERSTRING
4      | EXAMPLEEXAMPLE

What I would like to do is extract the first 5 characters from the column plus the 8th character and create a new column, something like this:

ID     | New Column
------ | ------
1      | STRIN_F
2      | SOMEO_E
3      | ANOTH_S
4      | EXAMP_E

I can't use the following code, because the values in the column differ, and I want to split at a fixed position (the 6th character) rather than on a specific character:

import pyspark
split_col = pyspark.sql.functions.split(DF['column'], ' ')
newDF = DF.withColumn('new_column', split_col.getItem(0))

Thanks all!

Asked by Amanda C, Dec 01 '16 17:12




1 Answer

Use something like this:

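from pyspark.sql.functions import concat, lit
# substr(startPos, length) is 1-based: characters 1-5, then the 8th character
df.withColumn('new_column', concat(df.Column.substr(1, 5),
                                   lit('_'),
                                   df.Column.substr(8, 1)))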

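This uses the Column method substr(startPos, length), which counts positions from 1, together with the functions concat and lit from pyspark.sql.functions. Note that withColumn returns a new DataFrame, so assign the result to a variable.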
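
If you want to sanity-check the positions without starting a Spark session, the same 1-based slicing can be mirrored in plain Python. This is only a sketch: spark_substr and build_new_column are hypothetical helper names made up for illustration, not part of PySpark, and the sample strings are the ones from the question.

```python
def spark_substr(value, start_pos, length):
    """Mimic the 1-based (startPos, length) slicing of Column.substr.

    Hypothetical helper for illustration only; not a PySpark function.
    """
    begin = start_pos - 1  # convert the 1-based position to a 0-based index
    return value[begin:begin + length]


def build_new_column(value):
    # First 5 characters, an underscore, then the 8th character.
    return spark_substr(value, 1, 5) + "_" + spark_substr(value, 8, 1)


samples = [
    "STRINGOFLETTERS",
    "SOMEOTHERCHARACTERS",
    "ANOTHERSTRING",
    "EXAMPLEEXAMPLE",
]
results = [build_new_column(s) for s in samples]
print(results)  # ['STRIN_F', 'SOMEO_E', 'ANOTH_S', 'EXAMP_E']
```

The results match the expected New Column values in the question, which confirms that substr(1, 5) and substr(8, 1) are the right arguments.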

Answered by Thiago Baldim, Oct 27 '22 14:10
