I have built spark-csv and am able to use it from the pyspark shell using the following command:
bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3
The error I am getting:
>>> df_cat.save("k.csv","com.databricks.spark.csv")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/pyspark/sql/dataframe.py", line 209, in save
    self._jdf.save(source, jmode, joptions)
  File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError
Where should I place the jar file in my pre-built Spark setup so that I can access spark-csv directly from a Python editor as well?
csv("file_name") to read a file or directory of files in CSV format into Spark DataFrame, and dataframe. write(). csv("path") to write to a CSV file. Function option() can be used to customize the behavior of reading or writing, such as controlling behavior of the header, delimiter character, character set, and so on.
At the time I used spark-csv, I also had to download the commons-csv jar (not sure whether that is still required). Both jars were in the Spark distribution folder.
I downloaded the jars as follows:
wget "http://search.maven.org/remotecontent?filepath=org/apache/commons/commons-csv/1.1/commons-csv-1.1.jar" -O commons-csv-1.1.jar
wget "http://search.maven.org/remotecontent?filepath=com/databricks/spark-csv_2.10/1.0.0/spark-csv_2.10-1.0.0.jar" -O spark-csv_2.10-1.0.0.jar
Then I started the PySpark shell with the arguments:
./bin/pyspark --jars "spark-csv_2.10-1.0.0.jar,commons-csv-1.1.jar"
and read a Spark DataFrame from a CSV file:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.load(source="com.databricks.spark.csv", path="/path/to/your/file.csv")
df.show()
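With the jars on the classpath, the save call from the question works the same way; a minimal sketch on Spark 1.3.x (the output path is a placeholder):

# DataFrame.save(path, source) writes through the spark-csv data source;
# without the jars on the classpath, this is the call that raises Py4JJavaError.
df.save("/path/to/output.csv", "com.databricks.spark.csv")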
Another option is to add the following to your spark-defaults.conf:
spark.jars.packages com.databricks:spark-csv_2.11:1.2.0
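With that line in place, bin/pyspark (and spark-submit) resolves and downloads the package from Maven automatically at startup, so no --jars or --packages flag is needed. Make sure the Scala version in the artifact name (2.10 vs. 2.11) matches the one your Spark build was compiled against.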