The data is as follows:
make,Model,MPG,Cylinders,Engine Disp,Horsepower,Weight,Accelerate,Year,Origin
amc,amc ambassador dpl,15,8,390,190,3850,8.5,70,Indian
amc,amc gremlin,21,6,199,90,2648,15,70,Indian
amc,amc hornet,18,6,199,97,2774,15.5,70,Indian
amc,amc rebel sst,16,8,304,150,3433,12,70,Indian
.............
.............
.............
The above is purely structured data, which I have processed happily with Spark and Scala, as shown below:
val rawData = sc.textFile("/hdfs/spark/cars2.txt")

case class cars(make: String, model: String, mpg: Integer, cylinders: Integer, engine_disp: Integer, horsepower: Integer, weight: Integer, accelerate: Double, year: Integer, origin: String)

// Skip the header row (if your file has one) so the numeric fields parse
val carsData = rawData.filter(!_.startsWith("make,")).map(x => x.split(",")).map(x => cars(x(0), x(1), x(2).toInt, x(3).toInt, x(4).toInt, x(5).toInt, x(6).toInt, x(7).toDouble, x(8).toInt, x(9)))
carsData.take(2)
carsData.cache()

// Count of records per origin
carsData.map(x => (x.origin, 1)).reduceByKey((x, y) => x + y).collect

val indianCars = carsData.filter(x => x.origin == "Indian")
indianCars.count()

// Per make: (sum of weights, record count), combined in one pass
val makeWeightSum = indianCars.map(x => (x.make, x.weight.toInt)).combineByKey(
  (x: Int) => (x, 1),
  (acc: (Int, Int), x: Int) => (acc._1 + x, acc._2 + 1),
  (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2))
makeWeightSum.collect()

// Integer average weight per make
val makeWeightAvg = makeWeightSum.map(x => (x._1, x._2._1 / x._2._2))
makeWeightAvg.collect()
makeWeightAvg.saveAsTextFile("carsMakeWeightAvg.txt")
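For illustration, on just the four amc rows shown above (weights 3850, 2648, 2774, 3433), makeWeightSum.collect() would return Array((amc,(12705,4))) and makeWeightAvg.collect() would return Array((amc,3176)), since the Int division truncates 12705/4; the full dataset will of course produce more makes and different numbers.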
Now, I can do this analysis in Hive as well, so why do I need Spark? (Spark might be fast, but who really wants to travel on a rocket?) So the question is: can Spark process multi-line, unstructured data like the following?
Data:
Brand:Nokia, Model:1112, price:100, time:201604091,
redirectDomain:xyz.com, type:online,status:completed,
tx:credit,country:in,
Brand:samsung, Model:s6, price:5000, time:2016045859,
redirectDomain:abc.com, type:online,status:completed,
.....thousands of records...
Yes, Spark can be used to do that.
A DataFrame is a distributed collection of data organized into named columns. Spark SQL supports operating on a variety of data sources through the DataFrame interface. You can manually specify data source options when loading such data, as sketched below.
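For example, for the structured cars file above, a minimal sketch of reading it through the DataFrame interface (this assumes Spark 1.x with the Databricks spark-csv package on the classpath; the path and option values are illustrative, not from the original post):

// Sketch only: requires the com.databricks:spark-csv package
val carsDF = sqlContext.read
  .format("com.databricks.spark.csv")  // manually specified data source
  .option("header", "true")            // first line is the header row
  .option("inferSchema", "true")       // infer column types instead of all strings
  .load("/hdfs/spark/cars2.txt")

// Same average-weight-per-make analysis, expressed on the DataFrame
carsDF.filter(carsDF("Origin") === "Indian").groupBy("make").avg("Weight").show()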
Refer: Spark DataFrames and multi-line input in Spark.
Note: your data is not that unstructured. It is more like a CSV file, and with a few basic transformations it can be converted to a Dataset/DataFrame; a rough sketch follows.
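One possible transformation, assuming every record starts at a Brand: key and fields are comma-separated key:value pairs (the path and the record/field delimiters here are assumptions inferred from the sample, not from the original post):

// Read whole files so records spanning several lines stay together
val txt = sc.wholeTextFiles("/hdfs/spark/orders.txt").map(_._2)

// Start a new record at every "Brand:" and turn its key:value pairs into a Map
val records = txt
  .flatMap(_.split("(?=Brand:)"))  // zero-width lookahead keeps "Brand:" in each record
  .map(_.trim)
  .filter(_.nonEmpty)
  .map { rec =>
    rec.split(",")
      .map(_.trim)
      .filter(_.contains(":"))
      .map { kv =>
        val Array(k, v) = kv.split(":", 2)  // split on the first ':' only
        (k, v)
      }
      .toMap
  }

records.map(m => (m("Brand"), m.getOrElse("price", "0").toInt)).collect()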
If you are just evaluating the various tools/frameworks that could be used for this, I would also like to suggest Apache Flink.
Spark usually reads input line by line, so your rawData.map will split each individual text line by ",", and multi-line records will fail to parse. If you have a multi-line CSV, you will need to read each file as a whole and use a CSV parser that can handle multi-line records.
The Learning Spark book from O'Reilly proposes the following approach:
import java.io.StringReader
import au.com.bytecode.opencsv.CSVReader   // opencsv, the parser used in the book's example
import scala.collection.JavaConversions._  // lets Scala map over the java.util.List from readAll()

val input = sc.wholeTextFiles(inputFile)   // RDD of (filename, entire file content) pairs
val result = input.flatMap { case (_, txt) =>
  val reader = new CSVReader(new StringReader(txt))
  reader.readAll().map(x => IndianCar(x(0), x(1), ...))  // one record per parsed row
}
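Note that wholeTextFiles returns each file as a single (filename, content) pair, so a file must fit in memory on one executor; this approach is best suited to many reasonably small files.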