
How to create a DataFrame from a text file in Spark

I have a text file on HDFS and I want to convert it to a DataFrame in Spark.

I am using the SparkContext to load the file and then trying to generate individual columns from it.

val myFile = sc.textFile("file.txt")
val myFile1 = myFile.map(x => x.split(";"))

After doing this, I am trying the following operation.

myFile1.toDF()

I am getting an error because the elements of the myFile1 RDD are now arrays.

How can I solve this issue?

asked Apr 21 '16 10:04

Rahul

2 Answers

Update: as of Spark 2.0, you can simply use the built-in CSV data source:

import org.apache.spark.sql.SparkSession
// Create (or reuse) the Spark session
val spark: SparkSession = SparkSession.builder().getOrCreate()
val df = spark.read.csv("file.txt")

You can also use various options to control the CSV parsing, e.g. the delimiter (the file in the question is separated by ;):

val df = spark.read.option("header", "false").option("delimiter", ";").csv("file.txt")

For Spark versions before 2.0: the easiest way is to use the spark-csv package. Include it in your dependencies and follow its README. It allows setting a custom delimiter (;), can read CSV headers (if you have them), and can infer the schema types (at the cost of an extra scan of the data).

Alternatively, if you know the schema, you can create a case class that represents it and map your RDD elements into instances of this class before converting the RDD into a DataFrame, e.g.:

case class Record(id: Int, name: String)

// toDF() needs the SQL implicits in scope, e.g. import sqlContext.implicits._
val myFile1 = myFile.map(x => x.split(";")).map {
  case Array(id, name) => Record(id.toInt, name)
}

myFile1.toDF() // DataFrame will have columns "id" and "name"
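The pattern match above throws a MatchError on any line that does not split into exactly two fields, and id.toInt throws on a non-numeric id. A minimal sketch of a more defensive parse step, in plain Scala without Spark, might look like the following. RecordParser and parseRecord are hypothetical names introduced here for illustration, and silently dropping malformed lines is an assumption that may not suit every data set.

```scala
import scala.util.Try

// Same shape as the Record case class in the answer above.
case class Record(id: Int, name: String)

// Hypothetical helper, not part of Spark or of the original answer.
object RecordParser {
  // Returns None for lines that do not have exactly two fields
  // or whose first field is not an integer.
  def parseRecord(line: String): Option[Record] =
    line.split(";", -1) match {
      case Array(id, name) => Try(id.trim.toInt).toOption.map(Record(_, name))
      case _               => None
    }
}
```

With the RDD from the question, this could presumably be used as myFile.flatMap(line => RecordParser.parseRecord(line)).toDF(), so that malformed lines are skipped instead of failing the job.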

answered Sep 19 '22 18:09

Tzach Zohar

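Here are several ways to create a DataFrame from a text file.

import org.apache.spark.{SparkConf, SparkContext}
// appName is assumed to be a String defined elsewhere
val conf = new SparkConf().setAppName(appName).setMaster("local")
val sc = new SparkContext(conf)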

Raw text file

// toDF on an RDD needs the SQL implicits in scope, e.g. import sqlContext.implicits._
val file = sc.textFile("C:\\vikas\\spark\\Interview\\text.txt")
val fileToDf = file
  .map(_.split(","))
  .map { case Array(a, b, c) => (a, b.toInt, c) }
  .toDF("name", "age", "city")
fileToDf.foreach(println(_))

Spark session without schema

import org.apache.spark.sql.SparkSession
val sparkSess = SparkSession.builder()
  .appName("SparkSessionZipsExample")
  .config(conf)
  .getOrCreate()
val df = sparkSess.read.option("header", "false").csv("C:\\vikas\\spark\\Interview\\text.txt")
df.show()

Spark session with schema

import org.apache.spark.sql.types._
val schemaString = "name age city"
val fields = schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)
val dfWithSchema = sparkSess.read.option("header", "false").schema(schema).csv("C:\\vikas\\spark\\Interview\\text.txt")
dfWithSchema.show()
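The schema above types every column as StringType, so age comes back as a string. If typed columns are wanted, one option is to spell the fields out explicitly. The following is a sketch under assumptions: the object name TypedSchemaExample, the local[*] master, and taking the file path from the first command-line argument are all illustrative choices, and it needs Spark 2.x or later on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical example object; the names here are not from the original answer.
object TypedSchemaExample {
  // Explicit schema: age is read as an integer instead of a string.
  val typedSchema: StructType = StructType(Seq(
    StructField("name", StringType, nullable = true),
    StructField("age", IntegerType, nullable = true),
    StructField("city", StringType, nullable = true)
  ))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TypedSchemaExample")
      .master("local[*]") // assumption: local run
      .getOrCreate()

    // args(0) is assumed to be the path to the comma-separated text file.
    val df = spark.read
      .option("header", "false")
      .schema(typedSchema)
      .csv(args(0))

    df.printSchema()
    spark.stop()
  }
}
```

Supplying the schema up front also avoids the extra pass over the data that schema inference would need.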

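Using SQL context

import org.apache.spark.sql.{Row, SQLContext}
// The original snippet never created sqlCtx; build it from the SparkContext
val sqlCtx = new SQLContext(sc)
val fileRdd = sc.textFile("C:\\vikas\\spark\\Interview\\text.txt")
  .map(_.split(","))
  .map(x => Row(x: _*))
// schema is the all-string StructType defined in the previous section
val sqlDf = sqlCtx.createDataFrame(fileRdd, schema)
sqlDf.show()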

answered Sep 18 '22 18:09

Vikas Singh
