Apply a function to a single column of a csv in Spark

Using Spark, I'm reading a CSV and want to apply a function to one of its columns. I have code that works, but it's very hacky. What is the proper way to do this?

My code

import sys

from pyspark import SparkContext
from pyspark.sql import Row, SparkSession

sc = SparkContext()
sc.addPyFile("myfile.py")  # ship the module containing myFunction to the workers
spark = SparkSession\
    .builder\
    .appName("myApp")\
    .getOrCreate()
from myfile import myFunction

df = spark.read.csv(sys.argv[1], header=True, mode="DROPMALFORMED")
a = df.rdd.map(lambda line: Row(id=line[0], user_id=line[1],
                                message_id=line[2],
                                message=myFunction(line[3]))).toDF()

I would like to be able to just call the function on the column name instead of mapping each row to line and then calling the function on line[index].

I'm using Spark version 2.0.1

Sal asked Dec 05 '16


1 Answer

You can simply use a user-defined function (udf) combined with withColumn:

from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

udf_myFunction = udf(myFunction, IntegerType())  # adjust the return type if the function doesn't return an int
df = df.withColumn("message", udf_myFunction("_3"))  # "_3" is the name of the column you want to transform

This adds a column to the dataframe df (replacing it if a column with that name already exists) containing the result of applying myFunction to each value of the chosen column.

Bernard Jesop answered Oct 13 '22