
How to load data into spark dataframe from text file without knowing the schema of the data?

I have a text file in Hadoop, and I need to sort it by its second column using the Spark Java API. I am using a DataFrame, but I am not sure about its columns. The columns may be dynamic, meaning I don't know the exact number of columns.

How can I proceed? Please help me.

Thanks in advance.

A.N.Gupta asked Nov 28 '25 05:11


1 Answer

First, note that I'm giving a CSV example in Scala (not Java).

You can use the Spark CSV API to create DataFrames and sort on any column you want. If you have limitations that prevent this, see the approach below.
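For the CSV API route, here is a minimal sketch, assuming Spark 2.0 or later (where SparkSession and the built-in CSV reader are available). When no schema and no header are given, the reader names the columns _c0, _c1, ... and types them all as string, so the second column can be sorted without knowing how many columns there are. The object name, the sortBySecondColumn helper and the local master setting are my own choices for illustration. As far as I know, the column count is taken from the first row, so this suits files where every row has the same (unknown) width.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object SortBySecondColumn {
  // Hypothetical helper: read a headerless CSV without a schema and sort by column 2.
  def sortBySecondColumn(spark: SparkSession, path: String): DataFrame = {
    // No schema, no header: columns are named _c0, _c1, ... and are all strings.
    val df = spark.read.csv(path)
    // Cast to double so the sort is numeric rather than lexicographic.
    df.orderBy(col("_c1").cast("double"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sort-by-second-column")
      .master("local[*]") // assumption: local run for illustration only
      .getOrCreate()

    sortBySecondColumn(spark, "ebay.csv").show()
    spark.stop()
  }
}
```

If the second column is not numeric, drop the cast and sort on col("_c1") directly.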

Fixed number of columns:

Start with the example below, which assumes a fixed number of columns.

The data in ebay.csv looks like this:

8213034705,95,2.927373,jake7870,0,95,117.5,xbox,3

//  SQLContext entry point for working with structured data
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
// Import Spark SQL data types and Row.
import org.apache.spark.sql._

// Define the schema using a case class
case class Auction(auctionid: String, bid: Float, bidtime: Float, bidder: String, bidderrate: Integer, openbid: Float, price: Float, item: String, daystolive: Integer)

// Split each line on commas and map the fields onto the case class
val auction = sc.textFile("ebay.csv").map(_.split(",")).map(p =>
  Auction(p(0), p(1).toFloat, p(2).toFloat, p(3), p(4).toInt, p(5).toFloat, p(6).toFloat, p(7), p(8).toInt)).toDF()

// Display the top 20 rows of DataFrame 
auction.show()
// auctionid  bid   bidtime  bidder         bidderrate openbid price item daystolive
// 8213034705 95.0  2.927373 jake7870       0          95.0    117.5 xbox 3
// 8213034705 115.0 2.943484 davidbresler2  1          95.0    117.5 xbox 3 …


// Return the schema of this DataFrame
auction.printSchema()
root
 |-- auctionid: string (nullable = true)
 |-- bid: float (nullable = false)
 |-- bidtime: float (nullable = false)
 |-- bidder: string (nullable = true)
 |-- bidderrate: integer (nullable = true)
 |-- openbid: float (nullable = false)
 |-- price: float (nullable = false)
 |-- item: string (nullable = true)
 |-- daystolive: integer (nullable = true)

auction.sort("auctionid") // sorts by the first column, auctionid; use sort("bid") to sort by the second column

Variable number of columns (possible because a case class can have an array parameter):

You can use something like the code below, where the first 4 elements are fixed and all the remaining ones go into a variable-length array.

Since you are only interested in sorting on the second column, this will work, and all the other data stays in the array for that row for later use.

// The remaining columns are kept as a Seq[String]; a Map or another complex type would also work
case class Auction(auctionid: String, bid: Float, bidtime: Float, bidder: String, variablenumberofColumnsArray: Seq[String])

val auction = sc.textFile("ebay.csv").map(_.split(",")).map(p =>
  Auction(p(0), p(1).toFloat, p(2).toFloat, p(3), p.drop(4).toSeq)).toDF()

auction.sort("bid") // sorts by the second column, bid
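One thing to watch with variable-width rows: a line with fewer than four fields, or a non-numeric bid, makes the mapping throw. Below is a rough sketch of a guard for that. The parseLine helper and the VariableColumns object are hypothetical names of mine, not part of any Spark API, and silently dropping bad rows is just one possible policy.

```scala
import scala.util.Try

object VariableColumns {
  // Four fixed fields plus whatever columns remain on the line.
  case class Auction(auctionid: String, bid: Float, bidtime: Float, bidder: String,
                     variablenumberofColumnsArray: Seq[String])

  // Hypothetical helper: returns None for lines that cannot be parsed.
  def parseLine(line: String): Option[Auction] = {
    // limit = -1 keeps trailing empty fields, which split(",") would drop.
    val p = line.split(",", -1)
    if (p.length < 4) None // too short to hold the fixed columns
    else Try(
      Auction(p(0), p(1).toFloat, p(2).toFloat, p(3), p.drop(4).toSeq)
    ).toOption // None when bid or bidtime is not numeric
  }
}
```

With this in place, the RDD step could become sc.textFile("ebay.csv").flatMap(line => VariableColumns.parseLine(line)).toDF(), so malformed lines are skipped instead of failing the job.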
Ram Ghadiyaram answered Nov 30 '25 20:11