Parse CSV as DataFrame/DataSet with Apache Spark and Java

I am new to Spark, and I want to use group-by and reduce to compute the following from a CSV file (one line per employee):

  Department, Designation, costToCompany, State
  Sales, Trainee, 12000, UP
  Sales, Lead, 32000, AP
  Sales, Lead, 32000, LA
  Sales, Lead, 32000, TN
  Sales, Lead, 32000, AP
  Sales, Lead, 32000, TN 
  Sales, Lead, 32000, LA
  Sales, Lead, 32000, LA
  Marketing, Associate, 18000, TN
  Marketing, Associate, 18000, TN
  HR, Manager, 58000, TN

I would like to summarize the above CSV by grouping on Department, Designation and State, with additional columns for sum(costToCompany) and TotalEmployeeCount.

The result should look like:

  Dept, Desg, state, empCount, totalCost
  Sales,Lead,AP,2,64000
  Sales,Lead,LA,3,96000  
  Sales,Lead,TN,2,64000

Is there any way to achieve this using transformations and actions, or should we go for RDD operations?

asked Aug 18 '14 12:08

mithra

2 Answers

A CSV file can be parsed with Spark's built-in CSV reader. It returns a DataFrame/Dataset on a successful read of the file. On top of the DataFrame/Dataset, you can easily apply SQL-like operations.

Using Spark 2.x (and above) with Java

Create a SparkSession object, aka `spark`

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession
    .builder()
    .appName("Java Spark SQL Example")
    .getOrCreate();

Create Schema for Row with `StructType`

import org.apache.spark.sql.types.StructType;

StructType schema = new StructType()
    .add("department", "string")
    .add("designation", "string")
    .add("ctc", "long")
    .add("state", "string");

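Create a DataFrame from the CSV file and apply the schema to it

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> df = spark.read()
    .option("mode", "DROPMALFORMED") // drop rows that cannot be parsed against the schema
    .schema(schema)
    .csv("hdfs://path/input.csv");

The CSV reader accepts further options, such as header, sep and inferSchema

Now we can aggregate the data in 2 ways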

1. SQL way

Register the DataFrame as a temporary view so that SQL can be run against it

df.createOrReplaceTempView("employee");

Run a SQL query on the registered view

Dataset<Row> sqlResult = spark.sql(
    "SELECT department, designation, state, SUM(ctc), COUNT(department)"
        + " FROM employee GROUP BY department, designation, state");

sqlResult.show(); // for testing

We can even execute SQL directly on the CSV file, without creating a table, with Spark SQL

2. Object chaining (programmatic, Java-like) way

Add the necessary imports for the SQL functions

import static org.apache.spark.sql.functions.count;
import static org.apache.spark.sql.functions.sum;

Use groupBy and agg on the DataFrame/Dataset to perform count and sum on the data

// After Spark 1.6, the columns named in groupBy are included in the result by default
Dataset<Row> dfResult = df.groupBy("department", "designation", "state")
    .agg(sum("ctc"), count("department"));

dfResult.show(); // for testing
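To see what a grouped sum and count produces without starting Spark, here is a minimal plain-Java sketch of the same aggregation using the streams API. It is only an analogue, not Spark code: the class GroupByAggSketch, the Employee holder and both method names are made up for this illustration, and the rows are formatted in the Dept, Desg, state, empCount, totalCost order asked for in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.LongSummaryStatistics;
import java.util.Map;
import java.util.stream.Collectors;

// Plain-Java analogue (no Spark). All names here are hypothetical.
class GroupByAggSketch {

    /** Simple holder with the same four columns as the schema above. */
    static final class Employee {
        final String department;
        final String designation;
        final long ctc;
        final String state;

        Employee(String department, String designation, long ctc, String state) {
            this.department = department;
            this.designation = designation;
            this.ctc = ctc;
            this.state = state;
        }
    }

    /** Groups by (department, designation, state) and collects sum and count of ctc. */
    static Map<List<String>, LongSummaryStatistics> groupAndAggregate(List<Employee> employees) {
        return employees.stream().collect(Collectors.groupingBy(
                (Employee e) -> Arrays.asList(e.department, e.designation, e.state),
                LinkedHashMap::new, // keeps groups in first-seen order
                Collectors.summarizingLong((Employee e) -> e.ctc)));
    }

    /** Formats each group as "dept,desg,state,empCount,totalCost". */
    static List<String> toCsvRows(Map<List<String>, LongSummaryStatistics> grouped) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<List<String>, LongSummaryStatistics> entry : grouped.entrySet()) {
            LongSummaryStatistics stats = entry.getValue();
            rows.add(String.join(",", entry.getKey())
                    + "," + stats.getCount() + "," + stats.getSum());
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Employee> employees = Arrays.asList(
                new Employee("Sales", "Lead", 32000L, "AP"),
                new Employee("Sales", "Lead", 32000L, "LA"),
                new Employee("Sales", "Lead", 32000L, "AP"));
        for (String row : toCsvRows(groupAndAggregate(employees))) {
            System.out.println(row);
        }
    }
}
```

In Spark the grouping runs distributed across partitions, but the result per group is the same pair of aggregates.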

Dependent libraries

"org.apache.spark" % "spark-core_2.11" % "2.0.0" 
"org.apache.spark" % "spark-sql_2.11" % "2.0.0"

answered Oct 14 '22 20:10

mrsrinivas

Procedure

Create a class (schema) to encapsulate your structure (it’s not required for approach B, but it makes the code easier to read if you are using Java)

import java.io.Serializable;

public class Record implements Serializable {
  String department;
  String designation;
  long costToCompany;
  String state;
  // constructor, getters and setters
}

Load the CSV (or JSON) file

JavaSparkContext sc; // assumed to be initialized elsewhere
JavaRDD<String> data = sc.textFile("path/input.csv");
//JavaSQLContext sqlContext = new JavaSQLContext(sc); // For previous versions
SQLContext sqlContext = new SQLContext(sc); // In Spark 1.3 the Java API and Scala API have been unified

// data is already an RDD of lines, so map over it directly
JavaRDD<Record> rdd_records = data.map(
  new Function<String, Record>() {
      public Record call(String line) throws Exception {
         // Here you can use JSON
         // Gson gson = new Gson();
         // gson.fromJson(line, Record.class);
         // Note: a header line, if present, must be filtered out before this step
         String[] fields = line.split(",");
         // Trim the fields (the sample has spaces after commas) and parse the cost as a long
         Record sd = new Record(fields[0].trim(), fields[1].trim(),
             Long.parseLong(fields[2].trim()), fields[3].trim());
         return sd;
      }
});
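The per-line parsing can be tried out without a cluster. Below is a minimal plain-Java sketch of that step, assuming the sample layout from the question (four comma-separated fields and a header row starting with "Department"). CsvLineParser, ParsedRecord, isHeader, parse and parseAll are hypothetical names for this illustration and are not part of Spark.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (no Spark): turns raw CSV lines into typed records.
class CsvLineParser {

    /** Plain holder mirroring the fields of the Record class above. */
    static final class ParsedRecord {
        final String department;
        final String designation;
        final long costToCompany;
        final String state;

        ParsedRecord(String department, String designation, long costToCompany, String state) {
            this.department = department;
            this.designation = designation;
            this.costToCompany = costToCompany;
            this.state = state;
        }
    }

    /** Assumption: the header row is the one that starts with "Department". */
    static boolean isHeader(String line) {
        return line.trim().startsWith("Department");
    }

    /** Splits on commas, trims every field and converts the cost to a long. */
    static ParsedRecord parse(String line) {
        String[] fields = line.split(",");
        if (fields.length != 4) {
            throw new IllegalArgumentException(
                    "Expected 4 fields but got " + fields.length + ": " + line);
        }
        return new ParsedRecord(fields[0].trim(), fields[1].trim(),
                Long.parseLong(fields[2].trim()), fields[3].trim());
    }

    /** Parses every line that is neither blank nor the header. */
    static List<ParsedRecord> parseAll(List<String> lines) {
        List<ParsedRecord> records = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty() || isHeader(line)) {
                continue;
            }
            records.add(parse(line));
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Department, Designation, costToCompany, State",
                "Sales, Trainee, 12000, UP",
                "Sales, Lead, 32000, TN ");
        for (ParsedRecord r : parseAll(lines)) {
            System.out.println(r.department + "|" + r.designation + "|"
                    + r.costToCompany + "|" + r.state);
        }
    }
}
```

In the Spark version, the same header check could be applied with a filter on the RDD of lines before the map.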

At this point you have 2 approaches:

A. SparkSQL

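Register a table (using your defined schema class)

// JavaSchemaRDD, applySchema and registerAsTable belong to the early (pre-1.3) Java API;
// from Spark 1.3 on, the counterparts are sqlContext.createDataFrame(...) and DataFrame.registerTempTable(...)
JavaSchemaRDD table = sqlContext.applySchema(rdd_records, Record.class);
table.registerAsTable("record_table");
table.printSchema();

Query the table with the desired group-by query

JavaSchemaRDD res = sqlContext.sql(
  "select department,designation,state,sum(costToCompany),count(*) "
  + "from record_table "
  + "group by department,designation,state");

Here you can also run any other query you need, using a SQL approach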

B. Spark

Mapping using a composite key: Department, Designation, State

JavaPairRDD<String, Tuple2<Long, Integer>> records_JPRDD =
rdd_records.mapToPair(new
  PairFunction<Record, String, Tuple2<Long, Integer>>(){
    public Tuple2<String, Tuple2<Long, Integer>> call(Record record){
      // Join the key parts with a delimiter so that different combinations cannot collide
      Tuple2<String, Tuple2<Long, Integer>> t2 =
      new Tuple2<String, Tuple2<Long,Integer>>(
        record.department + "," + record.designation + "," + record.state,
        new Tuple2<Long, Integer>(record.costToCompany,1)
      );
      return t2;
    }
});

reduceByKey using the composite key, summing costToCompany column, and accumulating the number of records by key

JavaPairRDD<String, Tuple2<Long, Integer>> final_rdd_records = 
 records_JPRDD.reduceByKey(new Function2<Tuple2<Long, Integer>, Tuple2<Long,
 Integer>, Tuple2<Long, Integer>>() {
    public Tuple2<Long, Integer> call(Tuple2<Long, Integer> v1,
    Tuple2<Long, Integer> v2) throws Exception {
        return new Tuple2<Long, Integer>(v1._1 + v2._1, v1._2+ v2._2);
    }
});
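The key-and-combine logic of the two steps above can be checked locally on the question's sample data. The following is a minimal plain-Java sketch, not Spark code: ReduceByKeySketch and its members are hypothetical names, each (sum, count) pair is held in a long[] instead of a Tuple2, and a comma is assumed as the key delimiter.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java illustration (no Spark) of reducing (sum, count) pairs by a composite key.
class ReduceByKeySketch {

    /** Sample rows from the question: {department, designation, costToCompany, state}. */
    static final String[][] SAMPLE = {
        {"Sales", "Trainee", "12000", "UP"},
        {"Sales", "Lead", "32000", "AP"},
        {"Sales", "Lead", "32000", "LA"},
        {"Sales", "Lead", "32000", "TN"},
        {"Sales", "Lead", "32000", "AP"},
        {"Sales", "Lead", "32000", "TN"},
        {"Sales", "Lead", "32000", "LA"},
        {"Sales", "Lead", "32000", "LA"},
        {"Marketing", "Associate", "18000", "TN"},
        {"Marketing", "Associate", "18000", "TN"},
        {"HR", "Manager", "58000", "TN"}
    };

    /** Combines two {sum, count} pairs, like the Function2 given to reduceByKey. */
    static long[] combine(long[] a, long[] b) {
        return new long[] {a[0] + b[0], a[1] + b[1]};
    }

    /** Maps each row to (key, {cost, 1}) and merges the values that share a key. */
    static Map<String, long[]> reduceByKey(String[][] rows) {
        Map<String, long[]> result = new LinkedHashMap<>();
        for (String[] row : rows) {
            String key = row[0] + "," + row[1] + "," + row[3];
            long[] value = {Long.parseLong(row[2]), 1L};
            result.merge(key, value, ReduceByKeySketch::combine);
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints one line per key as: department,designation,state,empCount,totalCost
        for (Map.Entry<String, long[]> entry : reduceByKey(SAMPLE).entrySet()) {
            System.out.println(entry.getKey() + "," + entry.getValue()[1]
                    + "," + entry.getValue()[0]);
        }
    }
}
```

For the three Sales/Lead groups this yields counts 2, 3 and 2 with totals 64000, 96000 and 64000, matching the result the question expects. Because combine is associative and commutative, the values can be merged in any order, which is what lets reduceByKey combine within partitions before shuffling.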

answered Oct 14 '22 21:10

emecas
