Can we use a DataFrame while reading data from HDFS? I have tab-separated data in HDFS.
I googled, but only saw examples that use it with NoSQL data.
Spark does not have its own system for organizing files in a distributed way (a file system). For this reason, programmers often install Spark on top of Hadoop so that Spark's advanced analytics applications can make use of data stored in the Hadoop Distributed File System (HDFS).
DataFrames are certainly not limited to NoSQL data sources. Parquet, ORC and JSON support is provided natively in Spark 1.4 through 1.6.1; tab- and comma-delimited text files are supported via the spark-csv package.
If you have your TSV file in HDFS at /demo/data, then the following code will read the file into a DataFrame:
sqlContext.read.
format("com.databricks.spark.csv").
option("delimiter","\t").
option("header","true").
load("hdfs:///demo/data/tsvtest.tsv").show
To run this code from spark-shell, launch the shell with the following option so the spark-csv package is available:
--packages com.databricks:spark-csv_2.10:1.4.0
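For example, the full spark-shell invocation might look like this (a sketch; the artifact coordinates match the answer's Scala 2.10 build, so adjust the suffix to your own Scala version):

```shell
# Launch spark-shell with the spark-csv package resolved from Maven Central.
# com.databricks:spark-csv_2.10:1.4.0 assumes a Scala 2.10 build of Spark;
# use spark-csv_2.11 if your Spark distribution was built against Scala 2.11.
spark-shell --packages com.databricks:spark-csv_2.10:1.4.0
```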
In Spark 2.0, CSV is natively supported, so you should be able to do something like this:
spark.read.
option("delimiter","\t").
option("header","true").
csv("hdfs:///demo/data/tsvtest.tsv").show