I want to read CSV files in Zeppelin and would like to use Databricks' spark-csv package: https://github.com/databricks/spark-csv
In the spark-shell, I can use spark-csv with
spark-shell --packages com.databricks:spark-csv_2.11:1.2.0
But how do I tell Zeppelin to use that package?
Thanks in advance!
The SparkSession can be used to read this CSV file as follows: Dataset<Row> csv = sparkSession.read().format("csv").
To read multiple CSV files in Spark, call the textFile() method on the SparkContext object and pass all file names as a single comma-separated string. For example, passing text01.csv and text02.csv reads both files into a single RDD.
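A minimal sketch of that call, assuming a live SparkContext named sc (as spark-shell and Zeppelin's Spark interpreter provide) and assuming both files exist in the working directory:

```scala
// Assumes `sc` is an existing SparkContext (provided by spark-shell or Zeppelin).
// Both paths go into one comma-separated string; textFile reads them into one RDD.
val lines = sc.textFile("text01.csv,text02.csv")

// Each element is one raw line of text. No CSV parsing or header handling happens here.
lines.take(5).foreach(println)
```

Note that this yields plain strings, not a DataFrame, so header rows and quoted fields are not handled the way spark-csv handles them.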
You need to add the Spark Packages repository to Zeppelin before you can use %dep on spark packages.
%dep
z.reset()
z.addRepo("Spark Packages Repo").url("http://dl.bintray.com/spark-packages/maven")
z.load("com.databricks:spark-csv_2.10:1.2.0")
Alternatively, if you want the package available in all your notebooks, add the --packages option to the spark-submit command setting in Zeppelin's interpreter configuration, then restart the interpreter. This should start a context with the package already loaded, just as with the spark-shell method.
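One way to set this, as a sketch: Zeppelin reads a SPARK_SUBMIT_OPTIONS variable from conf/zeppelin-env.sh and passes it to spark-submit when it launches the Spark interpreter (this generally requires SPARK_HOME to be set). The coordinates are the ones used in the %dep example; the _2.10 suffix is an assumption and should match the Scala version of your Spark build.

```shell
# conf/zeppelin-env.sh
# Extra arguments passed to the spark-submit command that starts the Spark interpreter.
export SPARK_SUBMIT_OPTIONS="--packages com.databricks:spark-csv_2.10:1.2.0"
```

Restart Zeppelin (or at least the Spark interpreter) afterwards so the new option is picked up.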
In the notebook, use something like:
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true") // Use first line of all files as header
  .option("inferSchema", "true") // Automatically infer data types
  .load("my_data.txt")
Update:
On the Zeppelin user mailing list, Moon Soo Lee (creator of Apache Zeppelin) stated in Nov. 2016 that users prefer to keep %dep.
The tendency is now to keep %dep, so it should not be considered deprecated at this time.
BEGIN-EDIT
%dep is deprecated in Zeppelin 0.6.0. Please refer to Paul-Armand Verhaegen's answer.
Read on in this answer only if you are using a Zeppelin version older than 0.6.0.
END-EDIT
You can load the spark-csv package using the %dep interpreter, like so:
%dep
z.reset()
// Add spark-csv package
z.load("com.databricks:spark-csv_2.10:1.2.0")
See the Dependency Loading section in https://zeppelin.incubator.apache.org/docs/interpreter/spark.html
If you've already initialized the Spark context, a quick solution is to restart Zeppelin, run a paragraph with the code above first, and then run your Spark code to read the CSV file.