 

Reading CSV files in Zeppelin using spark-csv

I want to read CSV files in Zeppelin and would like to use Databricks' spark-csv package: https://github.com/databricks/spark-csv

In the spark-shell, I can use spark-csv with

spark-shell --packages com.databricks:spark-csv_2.11:1.2.0

But how do I tell Zeppelin to use that package?

Thanks in advance!

asked Oct 06 '15 by fabsta


3 Answers

You need to add the Spark Packages repository to Zeppelin before you can use %dep to load Spark packages.

%dep
z.reset()
z.addRepo("Spark Packages Repo").url("http://dl.bintray.com/spark-packages/maven")
z.load("com.databricks:spark-csv_2.10:1.2.0")

Alternatively, if you want the package available in all your notebooks, you can add the --packages option to the spark-submit command setting in Zeppelin's interpreter config and then restart the interpreter. This starts a context with the package already loaded, just as the spark-shell method does.
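For example (a sketch, assuming a standard Zeppelin install where conf/zeppelin-env.sh is sourced at startup), the equivalent of the spark-shell command above would be:

# in conf/zeppelin-env.sh -- adjust the path and version for your install
export SPARK_SUBMIT_OPTIONS="--packages com.databricks:spark-csv_2.10:1.2.0"

After restarting the Spark interpreter, every notebook gets the package without needing a %dep paragraph.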

answered by Simon Elliston Ball


  1. Go to the Interpreter tab, click Repository Information, add a repo and set the URL to http://dl.bintray.com/spark-packages/maven
  2. Scroll down to the spark interpreter paragraph and click edit, scroll down a bit to the artifact field and add "com.databricks:spark-csv_2.10:1.2.0" or a newer version. Then restart the interpreter when asked.
  3. In the notebook, use something like:

    import org.apache.spark.sql.SQLContext
    
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read
        .format("com.databricks.spark.csv")
        .option("header", "true") // Use first line of all files as header
        .option("inferSchema", "true") // Automatically infer data types
        .load("my_data.txt")
    
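To sanity-check the load, you can inspect the result (df is the DataFrame created above; these are standard Spark DataFrame calls):

df.printSchema() // confirm the columns and the types inferSchema picked
df.show(5)       // preview the first five rows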

Update:

On the Zeppelin user mailing list, Moon Soo Lee (creator of Apache Zeppelin) stated (Nov. 2016) that users prefer to keep %dep because it allows:

  • self-documenting library requirements in the notebook;
  • per Note (and possible per User) library loading.

The tendency is now to keep %dep, so it should not be considered deprecated at this time.

answered by Paul-Armand Verhaegen


BEGIN-EDIT

%dep is deprecated in Zeppelin 0.6.0. Please refer to Paul-Armand Verhaegen's answer.

Read on in this answer only if you are using a Zeppelin version older than 0.6.0.

END-EDIT

You can load the spark-csv package using the %dep interpreter, like this:

%dep
z.reset()

// Add spark-csv package
z.load("com.databricks:spark-csv_2.10:1.2.0")

See the Dependency Loading section at https://zeppelin.incubator.apache.org/docs/interpreter/spark.html

If you've already initialized the Spark context, the quick solution is to restart Zeppelin, execute the Zeppelin paragraph with the above code first, and then execute your Spark code to read the CSV file.
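
For instance, after the %dep paragraph has run, a subsequent paragraph can read the file (a sketch; "cars.csv" is a placeholder path, and sc/sqlContext are the instances Zeppelin provides):

%spark
val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // treat the first line as a header
    .load("cars.csv")         // placeholder file name
df.show()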

answered by sag