I have a dataframe in spark, something like this:
ID | Column
------ | ----
1 | STRINGOFLETTERS
2 | SOMEOTHERCHARACTERS
3 | ANOTHERSTRING
4 | EXAMPLEEXAMPLE
What I would like to do is extract the first 5 characters from the column plus the 8th character and create a new column, something like this:
ID | New Column
------ | ------
1 | STRIN_F
2 | SOMEO_E
3 | ANOTH_S
4 | EXAMP_E
I can't use the following codem, because the values in the columns differ, and I don't want to split on a specific character, but on the 6th character:
import pyspark
split_col = pyspark.sql.functions.split(DF['column'], ' ')
newDF = DF.withColumn('new_column', split_col.getItem(0))
Thanks all!
In PySpark, the substring() function is used to extract the substring from a DataFrame string column by providing the position and length of the string you wanted to extract.
The DataFrame. withColumn(colName, col) can be used for extracting substring from the column data by using pyspark's substring() function along with it. Parameters: colName: str, name of the new column.
In this method, we are first going to make a PySpark DataFrame using createDataFrame(). We will then use randomSplit() function to get two slices of the DataFrame while specifying the fractions of rows that will be present in both slices. The rows are split up RANDOMLY.
Use something like this:
df.withColumn('new_column', concat(df.Column.substr(1, 5),
lit('_'),
df.Column.substr(8, 1)))
This use the function substr and concat
Those functions will solve your problem.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With