Get CSV to Spark dataframe

Question

I'm using Python on Spark and would like to get a CSV file into a DataFrame.

The documentation for Spark SQL strangely does not provide explanations for CSV as a source.

I have found Spark-CSV; however, I have issues with two parts of the documentation:

"This package can be added to Spark using the --jars command line option. For example, to include it when starting the spark shell:

$ bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3"

Do I really need to add this argument every time I launch pyspark or spark-submit? It seems very inelegant. Isn't there a way to import it in Python rather than redownloading it each time?

df = sqlContext.load(source="com.databricks.spark.csv", header="true", path = "cars.csv")

Even if I do the above, this won't work. What does the "source" argument stand for in this line of code? How do I simply load a local file on Linux, say "/Spark_Hadoop/spark-1.3.1-bin-cdh4/cars.csv"?

mattsilver · Accepted Answer

With more recent versions of Spark this has become a lot easier. As of 1.4, the expression sqlContext.read gives you a DataFrameReader instance, and from Spark 2.0 on that reader has a built-in .csv() method:

df = sqlContext.read.csv("/path/to/your.csv")
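For completeness, here is a minimal sketch of the same call as a standalone script. It assumes Spark 2.0 or later, where a SparkSession exposes the same .read reader as sqlContext. The helper name read_csv, the app name, and main() are made up for illustration, and the path is the same placeholder as above.

```python
def read_csv(session, path):
    """Return a DataFrame for the CSV at `path` via the session's DataFrameReader."""
    # `session` can be a SparkSession or a SQLContext; both expose `.read`.
    # With no options, every column is read as a string and named _c0, _c1, ...
    return session.read.csv(path)


def main(path="/path/to/your.csv"):
    # Imported lazily so that read_csv() can be imported without PySpark installed.
    from pyspark.sql import SparkSession

    # "csv-to-dataframe" is an arbitrary, hypothetical application name.
    spark = SparkSession.builder.appName("csv-to-dataframe").getOrCreate()
    df = read_csv(spark, path)
    df.printSchema()
    spark.stop()

# Call main() from a script launched with spark-submit (or plain python).
```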

Note that you can also indicate that the CSV file has a header by adding the keyword argument header=True to the .csv() call. A handful of other options are available, and are described in the DataFrameReader documentation.
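As for the first part of the question: on Spark 1.4 through 1.6 the CSV source still comes from the external spark-csv package, so something has to pass --packages. One way to avoid typing it on every launch, as far as I know, is to set the PYSPARK_SUBMIT_ARGS environment variable before the SparkContext is created. The sketch below builds that value and reads a file with a header row through whichever reader API is present. The helper names are hypothetical, and the package coordinate is simply the one quoted in the question.

```python
# Maven coordinate quoted in the question; only needed before Spark 2.0.
SPARK_CSV_PACKAGE = "com.databricks:spark-csv_2.10:1.0.3"


def packages_submit_args(package=SPARK_CSV_PACKAGE):
    """Build a PYSPARK_SUBMIT_ARGS value that pulls in `package`."""
    # The trailing "pyspark-shell" token is expected when this variable is set.
    return "--packages {} pyspark-shell".format(package)


def read_csv_with_header(ctx, path):
    """Read `path`, treating its first line as column names."""
    reader = ctx.read
    if hasattr(reader, "csv"):
        # Spark 2.0+: built-in CSV source.
        return reader.csv(path, header=True)
    # Spark 1.4 to 1.6: name the external source explicitly. This string is
    # what the "source" argument in the question's sqlContext.load() selects.
    return (reader.format("com.databricks.spark.csv")
                  .options(header="true")
                  .load(path))
```

In a script you would run os.environ["PYSPARK_SUBMIT_ARGS"] = packages_submit_args() (after import os) before creating the SparkContext. As far as I know, packages resolved this way are cached locally, so the jar should not be downloaded again on each run. For a local file on a cluster whose default filesystem is HDFS, a file:// URI such as file:///Spark_Hadoop/spark-1.3.1-bin-cdh4/cars.csv should work as the path.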

Alexis Eggermont
