
In pyspark, how do you add/concat a string to a column?


I would like to add a string to an existing column. For example, df['col1'] has values '1', '2', '3', etc., and I would like to concatenate the string '000' on the left of col1 so that I get a column (new, or replacing the old one; it doesn't matter) with values '0001', '0002', '0003'.

I thought I should use df.withColumn('col1', '000'+df['col1']), but of course it does not work, since PySpark DataFrames are immutable?

This should be an easy task, but I didn't find anything online. Hope someone can give me some help!

Thank you!

ASU_TY asked Mar 21 '18 04:03


People also ask

How do you add strings to a column in PySpark?

The concat() function from pyspark.sql.functions concatenates several string columns into one. It is typically used inside select(), a transformation that returns a new DataFrame with the selected columns. For example, concat() can combine three input string columns (firstname, middlename, lastname) into a single string column (FullName).

How do you add values to columns in PySpark?

In PySpark, to add a new column with a constant value to a DataFrame, use the lit() function, imported with from pyspark.sql.functions import lit. lit() takes the constant value you want to add and returns a Column; to add a NULL / None, use lit(None).

How do you add a prefix to a column in PySpark?

add_prefix() in the pandas API on Spark adds a prefix string to labels, not to the values stored in a column. Called on a DataFrame, it prefixes every column label. Called on a single column (a Series), it prefixes the row labels instead.

How do you make strings in PySpark?

To convert an array column to a string, PySpark SQL provides the built-in function concat_ws(), which takes a delimiter of your choice as its first argument and the array column (type Column) as its second. To use concat_ws(), import it from pyspark.sql.functions.


1 Answer

    from pyspark.sql.functions import concat, col, lit

    df.select(concat(col("firstname"), lit(" "), col("lastname"))).show(5)

    +------------------------------+
    |concat(firstname,  , lastname)|
    +------------------------------+
    |                Emanuel Panton|
    |              Eloisa Cayouette|
    |                   Cathi Prins|
    |             Mitchel Mozdzierz|
    |               Angla Hartzheim|
    +------------------------------+
    only showing top 5 rows

http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html#module-pyspark.sql.functions

Steven Black answered Sep 28 '22 04:09