 

How to create a DataFrame from a text file in Spark


I have a text file on HDFS and I want to convert it to a DataFrame in Spark.

I am using the Spark context to load the file and then trying to generate individual columns from it.

val myFile = sc.textFile("file.txt")
val myFile1 = myFile.map(x => x.split(";"))

After doing this, I am trying the following operation.

myFile1.toDF() 

I am getting an issue, since the elements in the myFile1 RDD are now of array type.

How can I solve this issue?

asked Apr 21 '16 by Rahul

People also ask

How do I create a DataFrame from a text file in PySpark?

First, import the modules and create a SparkSession, then read the file with spark.read.format(). Then split the data from the text file into columns and convert the result into a DataFrame.

How do I read a text file into a DataFrame in Spark?

spark.read.text() is used to read a text file into a DataFrame. As with RDDs, this method can read multiple files at a time, read files matching a pattern, and read all files from a directory.

How do I read a text file with Spark?

text("file_name") to read a file or directory of text files into a Spark DataFrame, and dataframe. write(). text("path") to write to a text file. When reading a text file, each line becomes each row that has string “value” column by default.


2 Answers

Update - as of Spark 2.0, you can simply use the built-in csv data source:

import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder().getOrCreate() // create the Spark Session
val df = spark.read.csv("file.txt")

You can also use various options to control the CSV parsing, e.g.:

val df = spark.read.option("header", "false").csv("file.txt") 
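Since the file in the question is split on ";", you would presumably also set the separator (Spark's CSV reader takes a "sep" option):

val df = spark.read
  .option("sep", ";")      // semicolon-delimited, as in the question
  .option("header", "false")
  .csv("file.txt")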

For Spark versions < 2.0: the easiest way is to use spark-csv. Include it in your dependencies and follow the README; it allows setting a custom delimiter (;), can read CSV headers (if you have them), and can infer the schema types (at the cost of an extra scan of the data).
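For illustration, a sketch of what that looks like with spark-csv, assuming a semicolon-delimited file with no header and the com.databricks:spark-csv package on the classpath:

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("delimiter", ";")      // custom delimiter
  .option("header", "false")     // no header row in the file
  .option("inferSchema", "true") // extra pass over the data to infer types
  .load("file.txt")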

Alternatively, if you know the schema, you can create a case class that represents it and map your RDD elements into instances of this class before transforming into a DataFrame, e.g.:

case class Record(id: Int, name: String)

val myFile1 = myFile.map(x => x.split(";")).map {
  case Array(id, name) => Record(id.toInt, name)
}

myFile1.toDF() // DataFrame will have columns "id" and "name"
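One detail worth noting: toDF() on an RDD only compiles when the SQL implicits are in scope, e.g.:

import sqlContext.implicits._ // Spark 1.x
// or, on Spark 2.x+:
import spark.implicits._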
answered Sep 19 '22 by Tzach Zohar


Here are different ways to create a DataFrame from a text file.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName(appName).setMaster("local")
val sc = new SparkContext(conf)

raw text file

val file = sc.textFile("C:\\vikas\\spark\\Interview\\text.txt")
// note: .toDF needs the SQL implicits in scope, e.g. import sparkSess.implicits._
val fileToDf = file.map(_.split(",")).map { case Array(a, b, c) =>
  (a, b.toInt, c)
}.toDF("name", "age", "city")
fileToDf.foreach(println(_))

spark session without schema

import org.apache.spark.sql.SparkSession

val sparkSess = SparkSession.builder()
  .appName("SparkSessionZipsExample")
  .config(conf)
  .getOrCreate()

val df = sparkSess.read.option("header", "false")
  .csv("C:\\vikas\\spark\\Interview\\text.txt")
df.show()
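Without a schema, Spark names the CSV columns _c0, _c1, _c2, and so on; a quick way to give them readable names afterwards is toDF:

// rename the default _c0/_c1/_c2 columns
val named = df.toDF("name", "age", "city")
named.show()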

spark session with schema

import org.apache.spark.sql.types._

val schemaString = "name age city"
val fields = schemaString.split(" ")
  .map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)

val dfWithSchema = sparkSess.read.option("header", "false")
  .schema(schema)
  .csv("C:\\vikas\\spark\\Interview\\text.txt")
dfWithSchema.show()
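Since every field above is declared StringType, the age column stays a string; if typed columns are preferred, one alternative is letting Spark infer them (at the cost of an extra pass over the data):

val inferred = sparkSess.read
  .option("header", "false")
  .option("inferSchema", "true") // infer Int/String/etc. per column
  .csv("C:\\vikas\\spark\\Interview\\text.txt")
inferred.printSchema()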

using sql context

import org.apache.spark.sql.{Row, SQLContext}

val sqlCtx = new SQLContext(sc) // was undefined in the original snippet

val fileRdd = sc.textFile("C:\\vikas\\spark\\Interview\\text.txt")
  .map(_.split(","))
  .map(x => Row(x: _*))
val sqlDf = sqlCtx.createDataFrame(fileRdd, schema) // reuses the schema defined above
sqlDf.show()
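Note that createDataFrame(rowRDD, schema) expects each Row's values to match the schema's types; that works here because the schema above declares every field as StringType and split produces strings.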
answered Sep 18 '22 by Vikas Singh