I'm new to Spark. I want to perform some operations on particular data in a CSV record.
I'm trying to read a CSV file and convert it to an RDD. My further operations are based on the header provided in the CSV file.
(From comments) This is my code so far:
final JavaRDD<String> File = sc.textFile(Filename).cache();

final JavaRDD<String> lines = File.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public Iterable<String> call(String s) {
        return Arrays.asList(EOL.split(s));
    }
});

final String heading = lines.first().toString();
I can get the header values like this. I want to map this to each record in the CSV file.
final String[] header=heading.split(" ");
In Java I'm using CSVReader's record.getColumnValue(Column header) to get a particular value. I need to do something similar to that here.
There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
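For illustration, both options look roughly like this in Scala (a minimal sketch, assuming an existing SparkContext named sc as in the other snippets in this thread; the sample values and file name are placeholders):

// 1. Parallelize an existing collection in the driver program
val fromCollection = sc.parallelize(Seq("om,scala,120", "daniel,spark,80"))

// 2. Reference a dataset in external storage (local file, HDFS, S3, ...)
val fromFile = sc.textFile("file.csv")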
In this tutorial, I will explain how to load a CSV file into a Spark RDD using a Scala example. Using the textFile() method of the SparkContext class, we can read a single CSV file, multiple CSV files (based on pattern matching), or all files in a directory into an RDD[String] object.
Now read the data from the RDD using foreach. Since each element of the RDD is an array, we need to use an index to retrieve each field from the array. Note that the output also contains the header names from the CSV file, because the header row is treated as data like any other row in the RDD.
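Here is a minimal sketch of that approach, assuming an existing SparkContext named sc and the sample file.csv shown further below (the column positions are assumptions for illustration):

// Read every line of the CSV into an RDD[String]
val rdd = sc.textFile("file.csv")

// Split each line on commas into an RDD[Array[String]]
val fields = rdd.map(line => line.split(",").map(_.trim))

// Access fields by index; the header row prints like any other row
fields.foreach(row => println(row(0) + ", " + row(2)))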
A simplistic approach would be to have a way to preserve the header.
Let's say you have a file.csv like:
user, topic, hits
om, scala, 120
daniel, spark, 80
3754978, spark, 1
We can define a header class that uses a parsed version of the first row:
class SimpleCSVHeader(header: Array[String]) extends Serializable {
  val index = header.zipWithIndex.toMap
  def apply(array: Array[String], key: String): String = array(index(key))
}
We can then use that header to address the data further down the road:
val csv = sc.textFile("file.csv")  // original file
val data = csv.map(line => line.split(",").map(elem => elem.trim)) // lines in rows
val header = new SimpleCSVHeader(data.take(1)(0)) // we build our header with the first line
val rows = data.filter(line => header(line, "user") != "user") // filter the header out
val users = rows.map(row => header(row, "user"))
val usersByHits = rows.map(row => header(row, "user") -> header(row, "hits").toInt)
...
Note that the header is not much more than a simple map from a mnemonic to the array index; pretty much all of this could be done using the ordinal position of each element in the array, like user = row(0).
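For comparison, here is the same filtering and lookup done purely by ordinal index (a small sketch; the column order user, topic, hits is taken from the sample file above):

// Address columns by position instead of by name
val rowsByIndex = data.filter(row => row(0) != "user")                   // drop the header row
val usersByHitsByIndex = rowsByIndex.map(row => row(0) -> row(2).toInt)  // user -> hits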
PS: Welcome to Scala :-)
You can use the spark-csv library: https://github.com/databricks/spark-csv
This is directly from the documentation:
import java.util.HashMap;
import org.apache.spark.sql.SQLContext;

SQLContext sqlContext = new SQLContext(sc);

HashMap<String, String> options = new HashMap<String, String>();
options.put("header", "true");
options.put("path", "cars.csv");

DataFrame df = sqlContext.load("com.databricks.spark.csv", options);
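If you prefer Scala over Java, the equivalent call looks roughly like this (a sketch based on the same spark-csv documentation and the Spark 1.3-era SQLContext.load API; cars.csv is the same example file):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val df = sqlContext.load(
  "com.databricks.spark.csv",
  Map("path" -> "cars.csv", "header" -> "true"))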