
How to pass a list of paths to spark.read.load?

I can load multiple files at once by passing multiple paths to the load method, e.g.

spark.read
  .format("com.databricks.spark.avro")
  .load(
    "/data/src/entity1/2018-01-01",
    "/data/src/entity1/2018-01-12",
    "/data/src/entity1/2018-01-14")

I'd like to prepare a list of paths first and pass them to the load method, but I get the following compilation error:

val paths = Seq(
  "/data/src/entity1/2018-01-01",
  "/data/src/entity1/2018-01-12",
  "/data/src/entity1/2018-01-14")
spark.read.format("com.databricks.spark.avro").load(paths)

<console>:29: error: overloaded method value load with alternatives:
  (paths: String*)org.apache.spark.sql.DataFrame <and>
  (path: String)org.apache.spark.sql.DataFrame
 cannot be applied to (List[String])
       spark.read.format("com.databricks.spark.avro").load(paths)

Why does this fail, and how can I pass a list of paths to the load method?

Takeshi asked Jun 16 '18 17:06



2 Answers

You just need to apply the splat operator (_*) to the paths list:

spark.read.format("com.databricks.spark.avro").load(paths: _*)
Ramesh Maharjan answered Nov 03 '22 01:11


The load method accepts a varargs parameter (String*), not a list. You therefore have to expand the list into varargs explicitly by adding : _* in the call to load.

spark.read.format("com.databricks.spark.avro").load(paths: _*)
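To see the varargs rule in isolation, here is a minimal sketch that runs without Spark. The load function below is a hypothetical stand-in that only mirrors the (paths: String*) signature from the error message; it is not Spark's implementation, and the object name VarargsDemo is made up for illustration.

```scala
// Hypothetical stand-in for a method declared as load(paths: String*).
// It only joins its arguments, so the example runs without Spark.
object VarargsDemo {
  def load(paths: String*): String = paths.mkString(",")

  val paths: Seq[String] = Seq(
    "/data/src/entity1/2018-01-01",
    "/data/src/entity1/2018-01-12",
    "/data/src/entity1/2018-01-14")

  // load(paths) would not compile: a Seq[String] is neither a String
  // nor a sequence of separate String arguments.
  // `paths: _*` asks the compiler to expand the sequence into varargs.
  val expanded: String = load(paths: _*)

  // The same call with the arguments written out one by one.
  val explicit: String = load(
    "/data/src/entity1/2018-01-01",
    "/data/src/entity1/2018-01-12",
    "/data/src/entity1/2018-01-14")

  def main(args: Array[String]): Unit =
    println(expanded == explicit)
}
```

Both calls receive the same three arguments, so the ascription only changes how the compiler reads the argument list, not what the method sees.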
Kaushal answered Nov 02 '22 23:11