I am processing four directories of text files that grow every day. When somebody searches for an invoice number, I need to return the list of files that contain it.
I was able to map and reduce the values in the text files by loading them as an RDD. But how can I obtain the file name and other file attributes?
Since Spark 1.6 you can combine the text data source and the input_file_name function as follows:
Scala:
import org.apache.spark.sql.functions.input_file_name
import spark.implicits._ // for $"value" and .as

val inputPath: String = ???

spark.read.text(inputPath)
  .select(input_file_name(), $"value")
  .as[(String, String)] // optionally convert to a Dataset
  .rdd                  // or to an RDD
Python:
(versions before 2.x are buggy and may not preserve the file names when converted to an RDD):

from pyspark.sql.functions import input_file_name

(spark.read.text(input_path)
    .select(input_file_name(), "value")
    .rdd)
This can be used with other input formats as well.
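If it helps to see what input_file_name gives you without spinning up Spark, here is a minimal local sketch in plain Python that produces the same (file name, line) pairs; the helper name and the sample file contents are made up for illustration:

```python
import os
import tempfile

def lines_with_file_name(paths):
    """Pair every line of every file with the path it came from,
    mimicking the (input_file_name(), value) columns in Spark."""
    pairs = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                pairs.append((path, line.rstrip("\n")))
    return pairs

# Made-up sample data for the demo.
tmp = tempfile.mkdtemp()
sample = os.path.join(tmp, "a.txt")
with open(sample, "w", encoding="utf-8") as f:
    f.write("INV-001\nINV-002\n")

print(lines_with_file_name([sample]))
```

Each row carries its source path, so filtering on the line value immediately tells you which file matched.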
If you are using PySpark, you can try:

test = sc.wholeTextFiles("pathtofile")

This returns an RDD of (filepath, content) pairs: the first element is the file path and the second is the full file content.
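To connect this back to the original question, here is a hedged, Spark-free sketch of the overall task: scan several directories of text files and return the files containing a given invoice number. The directory layout, file names, and the invoice id "INV-12345" are made-up sample data, not anything from the question:

```python
import os
import tempfile

def files_containing(term, directories):
    """Return sorted paths of all text files under the given
    directories whose content contains `term`."""
    matches = []
    for d in directories:
        for root, _dirs, names in os.walk(d):
            for name in names:
                path = os.path.join(root, name)
                try:
                    with open(path, encoding="utf-8") as f:
                        if term in f.read():
                            matches.append(path)
                except (UnicodeDecodeError, OSError):
                    continue  # skip binary or unreadable files
    return sorted(matches)

# Demo with two made-up directories; only the first contains the term.
base = tempfile.mkdtemp()
dirs = []
for i, content in enumerate(["INV-12345 paid", "other text"]):
    d = os.path.join(base, f"dir{i}")
    os.makedirs(d)
    with open(os.path.join(d, "f.txt"), "w", encoding="utf-8") as f:
        f.write(content)
    dirs.append(d)

hits = files_containing("INV-12345", dirs)
print(len(hits))  # prints 1
```

For growing directories at real scale you would keep this logic in Spark (wholeTextFiles or read.text plus input_file_name) rather than a single-machine walk, but the shape of the answer is the same: a list of matching file paths.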