 

How to add any new library like spark-csv in Apache Spark prebuilt version

I have built spark-csv and am able to use it from the pyspark shell using the following command:

bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3 

But when I try to save a DataFrame I get the following error:

>>> df_cat.save("k.csv","com.databricks.spark.csv")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/pyspark/sql/dataframe.py", line 209, in save
    self._jdf.save(source, jmode, joptions)
  File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError

Where should I place the jar file in my pre-built Spark setup so that I can also use spark-csv directly from the Python editor?

Abhishek Choudhary asked Jun 10 '15



2 Answers

At the time I used spark-csv, I also had to download the commons-csv jar (not sure it is still relevant). Both jars were in the spark distribution folder.

  1. I downloaded the jars as follows:

    wget http://search.maven.org/remotecontent?filepath=org/apache/commons/commons-csv/1.1/commons-csv-1.1.jar -O commons-csv-1.1.jar
    wget http://search.maven.org/remotecontent?filepath=com/databricks/spark-csv_2.10/1.0.0/spark-csv_2.10-1.0.0.jar -O spark-csv_2.10-1.0.0.jar
  2. then started the python spark shell with the arguments:

    ./bin/pyspark --jars "spark-csv_2.10-1.0.0.jar,commons-csv-1.1.jar" 
  3. and read a spark dataframe from a csv file:

    from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)
    df = sqlContext.load(source="com.databricks.spark.csv", path="/path/to/your/file.csv")
    df.show()
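Downloading the jars by hand is optional. As a sketch of an alternative (assuming the machine can reach Maven Central), the same `--packages` flag used with `spark-shell` in the question also works for `pyspark`, and pulls in transitive dependencies such as commons-csv automatically:

```shell
# Let Spark resolve the package (and its transitive dependencies,
# e.g. commons-csv) from Maven Central at launch time.
./bin/pyspark --packages com.databricks:spark-csv_2.10:1.0.3
```

This keeps the distribution folder clean and makes the dependency versions explicit on the command line.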
Yannick Marcon answered Sep 30 '22


Another option is to add the following to your spark-defaults.conf:

spark.jars.packages com.databricks:spark-csv_2.11:1.2.0 
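For context, a sketch assuming a standard layout where `$SPARK_HOME` is the root of the pre-built distribution: `spark-defaults.conf` is read from the `conf/` directory and can be created from the shipped template if it does not exist yet:

```shell
# spark-defaults.conf is read from $SPARK_HOME/conf by default
cd "$SPARK_HOME"
cp conf/spark-defaults.conf.template conf/spark-defaults.conf
echo "spark.jars.packages com.databricks:spark-csv_2.11:1.2.0" >> conf/spark-defaults.conf
```

After that, every `pyspark` or `spark-submit` launch resolves the package automatically, with no per-session `--packages` flag.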
kentt answered Sep 30 '22