Apply a function to a single column of a csv in Spark

Using Spark, I'm reading a CSV file and want to apply a function to one of its columns. I have some code that works, but it's very hacky. What is the proper way to do this?

My code:

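import sys
from pyspark import SparkContext
from pyspark.sql import Row, SparkSession

# Ship myfile.py to the executors so myFunction can be imported there
SparkContext().addPyFile("myfile.py")
spark = SparkSession\
    .builder\
    .appName("myApp")\
    .getOrCreate()
from myfile import myFunction

df = spark.read.csv(sys.argv[1], header=True,
    mode="DROPMALFORMED",)
# Rebuild every row by hand just to transform the fourth column
a = df.rdd.map(lambda line: Row(id=line[0], user_id=line[1], message_id=line[2], message=myFunction(line[3]))).toDF()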

I would like to be able to just call the function on the column name instead of mapping each row to line and then calling the function on line[index].

I'm using Spark version 2.0.1.

asked Dec 05 '16 15:12

Sal

1 Answer

You can use a user-defined function (UDF) together with withColumn:

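from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

# Wrap the Python function as a UDF; IntegerType() assumes myFunction returns an int
# (pass a different type, e.g. StringType(), if it returns something else)
udf_myFunction = udf(myFunction, IntegerType())
# df.columns[3] is the name of the fourth column, i.e. line[3] in the question
df = df.withColumn("message", udf_myFunction(df.columns[3]))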

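This adds a column named message to the dataframe df, holding the result of myFunction applied to the fourth column (line[3] in the question). If df already has a column named message, withColumn replaces it instead of adding a second one.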
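As a rough end-to-end sketch of the same idea, the snippet below wraps the UDF step in a small helper. The names my_function, apply_to_column and demo are hypothetical and are not from the question. my_function only stands in for myFunction, assuming it takes a string and returns an int. The column names are the ones the question assigns in its Row(...) call. A small in-memory DataFrame replaces the CSV so the example does not depend on a file.

```python
def my_function(message):
    """Hypothetical stand-in for myFunction: returns the length of a message."""
    # Spark passes null cells to a Python UDF as None, so handle that case.
    return len(message) if message is not None else None


def apply_to_column(df, column, func, return_type):
    """Return a DataFrame where `column` is replaced by func applied to it."""
    # Imported here so the plain-Python function above works without Spark.
    from pyspark.sql.functions import col, udf

    wrapped = udf(func, return_type)
    # Reusing the same column name makes withColumn overwrite that column.
    return df.withColumn(column, wrapped(col(column)))


def demo():
    """Build a tiny DataFrame in place of the CSV and transform one column."""
    from pyspark.sql import SparkSession
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.appName("myApp").getOrCreate()
    df = spark.createDataFrame(
        [("1", "10", "100", "hello"), ("2", "20", "200", None)],
        ["id", "user_id", "message_id", "message"],
    )
    apply_to_column(df, "message", my_function, IntegerType()).show()
```

Calling demo() needs a working Spark installation. With a real CSV, the df returned by spark.read.csv(..., header=True) can be passed to apply_to_column directly, using whatever name the header gives the target column.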

answered Oct 13 '22 09:10

Bernard Jesop

Tags: apache-spark, pyspark, spark-dataframe
