How to add source file name to each row in Spark?

I'm new to Spark and am trying to add a column to each input row containing the name of the file it came from.

I've seen others ask a similar question, but all the answers used wholeTextFiles. I'm trying to do this for larger CSV files (read using the Spark-CSV library), JSON files, and Parquet files, not just small text files.

I can use the spark-shell to get a list of the filenames:

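// inputFileName is the Spark 1.5 name; later releases call it input_file_name
import org.apache.spark.sql.functions.inputFileName

val df = sqlContext.read.parquet("/blah/dir")
val names = df.select(inputFileName())
names.show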

but that is a separate DataFrame. I'm not sure how to add it as a column to each row (or whether the result is ordered the same as the initial data, though I assume it is), or how to do this as a general solution for all input types.

mcmcmc asked Oct 23 '15 01:10


1 Answer

Another solution I just found adds the file name as a column of the DataFrame using input_file_name():

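// input_file_name is available in org.apache.spark.sql.functions from Spark 1.6
import org.apache.spark.sql.functions.input_file_name

val df = sqlContext.read.parquet("/blah/dir")
val dfWithCol = df.withColumn("filename", input_file_name())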
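
Since the question asks for something that works across CSV, JSON, and Parquet, here is a minimal sketch of how the same call might be applied to each reader. It assumes Spark 2.0 or later (SparkSession and the built-in CSV reader rather than the Spark-CSV package); the helper name withSourceFile and the directory paths are hypothetical placeholders, not anything from the original post.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.input_file_name

object AddSourceFile {
  // Hypothetical helper: appends the path of the originating file to every row.
  def withSourceFile(df: DataFrame): DataFrame =
    df.withColumn("filename", input_file_name())

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("add-source-file")
      .master("local[*]") // assumption: local run for experimentation
      .getOrCreate()

    // Placeholder paths; substitute your own input directories.
    val csvDf     = withSourceFile(spark.read.option("header", "true").csv("/blah/csv-dir"))
    val jsonDf    = withSourceFile(spark.read.json("/blah/json-dir"))
    val parquetDf = withSourceFile(spark.read.parquet("/blah/dir"))

    // List the distinct source files seen in the Parquet input, untruncated.
    parquetDf.select("filename").distinct().show(false)

    spark.stop()
  }
}
```

Because withColumn evaluates input_file_name() row by row as each file is scanned, there is no need to join a separate list of names back onto the data, so the ordering concern in the question should not arise. It is probably safest to add the column immediately after reading, before any shuffle or join.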

Ref: spark load data and add filename as dataframe column

Dipankar answered Oct 22 '22 06:10