 

Using spark dataFrame to load data from HDFS

Can we use a DataFrame while reading data from HDFS? I have tab-separated data in HDFS.

I googled, but only saw examples of DataFrames being used with NoSQL data sources.

ToBeSparkShark asked Jun 05 '16


People also ask

Can we store data in HDFS from Spark?

Spark does not have its own distributed file system. For this reason, programmers often install Spark on top of Hadoop so that Spark's analytics applications can make use of data stored in the Hadoop Distributed File System (HDFS).
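To illustrate the storing direction, here is a minimal sketch of writing a DataFrame out from Spark, assuming Spark 2.x; the output path and the tiny sample DataFrame are hypothetical (with an HDFS default filesystem configured you would use an hdfs:/// path instead of file:///):

```scala
import org.apache.spark.sql.SparkSession

// local[*] runs Spark in-process for demonstration; on a cluster the
// master would come from spark-submit instead
val spark = SparkSession.builder
  .appName("WriteToHdfs")
  .master("local[*]")
  .getOrCreate()

// a tiny sample DataFrame: a single "id" column with values 0..4
val df = spark.range(5).toDF("id")

// write it out as Parquet; mode("overwrite") replaces any prior run.
// Against a real cluster this path would be e.g. "hdfs:///demo/output/ids.parquet"
df.write.mode("overwrite").parquet("file:///tmp/demo-output/ids.parquet")
```

Parquet is shown here because it is Spark's default columnar format, but the same `df.write` API supports `json`, `orc`, and (in Spark 2.0+) `csv`.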


1 Answer

A DataFrame is certainly not limited to NoSQL data sources. Parquet, ORC and JSON support is provided natively in Spark 1.4 to 1.6.1; tab- and comma-delimited text files are supported using the spark-csv package.

If your TSV file is in HDFS at /demo/data, then the following code will read the file into a DataFrame:

sqlContext.read.
  format("com.databricks.spark.csv").
  option("delimiter","\t").
  option("header","true").
  load("hdfs:///demo/data/tsvtest.tsv").show

To run the code from spark-shell, launch the shell with the spark-csv package on the classpath:

spark-shell --packages com.databricks:spark-csv_2.10:1.4.0

In Spark 2.0, CSV is supported natively, so you should be able to do something like this:

spark.read.
  option("delimiter","\t").
  option("header","true").
  csv("hdfs:///demo/data/tsvtest.tsv").show
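Note that with only the options above, every column comes back typed as a string. A minimal sketch of adding schema inference, assuming Spark 2.x; the local sample file is hypothetical and stands in for the HDFS path from the answer:

```scala
import org.apache.spark.sql.SparkSession
import java.nio.file.{Files, Paths}

val spark = SparkSession.builder
  .appName("ReadTsv")
  .master("local[*]")
  .getOrCreate()

// write a small tab-separated sample so the snippet is self-contained;
// in practice this would already exist at hdfs:///demo/data/tsvtest.tsv
Files.write(Paths.get("/tmp/tsvtest.tsv"),
  "name\tage\nalice\t30\nbob\t25\n".getBytes)

val df = spark.read
  .option("delimiter", "\t")
  .option("header", "true")
  .option("inferSchema", "true") // sample the data and cast numeric columns
  .csv("file:///tmp/tsvtest.tsv")

df.printSchema()
```

Schema inference costs an extra pass over the data, so for large files it is often better to pass an explicit schema via `spark.read.schema(...)` instead.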
Robin East answered Nov 08 '22