I am reading a directory of files using the following code:
val data = sc.textFile("/mySource/dir1/*")
Now my data RDD contains all rows of all files in the directory (right?). I want to add a column to each row with the source file's name. How can I do that?
The other option I tried is wholeTextFiles, but I keep getting out-of-memory exceptions. My cluster is 5 servers with 24 cores and 24 GB each (--executor-cores 5 --executor-memory 5G). Any ideas?
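For reference, this is roughly what I tried with wholeTextFiles (a sketch; the splitting logic is reconstructed). It returns one (path, content) pair per whole file, so every file must fit in executor memory as a single record, which is presumably where the OOMs come from:

// each element is (fullFilePath, entireFileContent); a large file becomes one huge record
val withNames = sc.wholeTextFiles("/mySource/dir1/*")
  .flatMap { case (path, content) =>
    content.split("\n").map(line => (path, line))
  }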
textFile is a method of the org.apache.spark.SparkContext class that reads a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and returns it as an RDD of Strings.
Separately, for files distributed to executors with SparkContext.addFile, use SparkFiles.get(fileName) inside Spark jobs to find the download location. A directory can be given if the recursive option is set to true; currently directories are only supported for Hadoop-supported filesystems.
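A minimal sketch of that addFile/SparkFiles pattern (the file name here is hypothetical):

import org.apache.spark.SparkFiles

sc.addFile("file:///home/user/lookup.txt")    // ship a side file to every executor
val localPath = SparkFiles.get("lookup.txt")  // absolute local path where it was downloaded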
You can use the code below; I have tested it with Spark 1.4 and 1.5. It gets the file name from the InputSplit and attaches it to each line via the iterator, using mapPartitionsWithInputSplit on the underlying NewHadoopRDD:
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.{FileSplit, TextInputFormat}
import org.apache.spark.rdd.NewHadoopRDD
import org.apache.spark.{SparkConf, SparkContext}

// an app name is required by SparkContext; any name will do
val sc = new SparkContext(new SparkConf().setAppName("lines-with-filenames").setMaster("local"))

val fc = classOf[TextInputFormat]
val kc = classOf[LongWritable]
val vc = classOf[Text]
val path: String = "file:///home/user/test"

// newAPIHadoopFile returns a NewHadoopRDD, whose partitions know their InputSplit
val text = sc.newAPIHadoopFile(path, fc, kc, vc, sc.hadoopConfiguration)

val linesWithFileNames = text.asInstanceOf[NewHadoopRDD[LongWritable, Text]]
  .mapPartitionsWithInputSplit((inputSplit, iterator) => {
    // for file-based input formats, each split is a FileSplit carrying the source path
    val file = inputSplit.asInstanceOf[FileSplit]
    // copy values out to Strings: Hadoop reuses the same Text object across records
    iterator.map(tup => (file.getPath.toString, tup._2.toString))
  })

linesWithFileNames.foreach(println)
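As a follow-up: if you can upgrade past 1.5, Spark 1.6 added the input_file_name SQL function, which makes this much shorter (a sketch, assuming a SQLContext named sqlContext):

import org.apache.spark.sql.functions.input_file_name

// each row gets the path of the file it came from as an extra column
val df = sqlContext.read.text("/mySource/dir1/*")
  .withColumn("source_file", input_file_name())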