Using Spark for sequential row-by-row processing without map and reduce

I'm looking into Spark as a possible calculation tool, but I was unable to find examples for the use case I have in mind. What I want to do is somewhat the opposite of map and reduce (in the first step at least), which maybe makes Spark the wrong tool for the job. So before I disregard it, I wanted to check here whether someone has a good idea of whether this can be done.

Data flow would look like this:

  1. The idea is to have a huge tabular structure as input, which would then be split across a cluster of compute nodes (it can be loaded as a text file, or it can be in a DB)
  2. For each line in this input structure, there would be logic that classifies the content of the row (e.g. whether it's a mortgage, a current account or something else)
  3. After classification, initiate the calculation of installments for the given class. Now, here is the problem - I'm not sure whether Spark can perform this kind of calculation: one input line can result in several hundred resulting lines with e.g. 4 minimal columns: ID of the original row, date, amount1, amount2
  4. Save the output into a new table
  5. Then, combine the new table with several other tables and apply map and reduce to the results

Of course, the calculation is supposed to be done on the compute nodes so it can work in parallel, and I would like to move the data only once - which means that a single chunk of the input table will be processed by only a single node.

Is this feasible? If it's not, is there an alternative that can be used for this purpose and integrated with Spark?

asked Sep 25 '22 by mispp


2 Answers

Everything you are listing fits perfectly into a typical Spark flow; a couple of sketches follow the list below.

  1. You parallelize / partition your input. How:
    1. You can simply feed a Java List of elements to JavaSparkContext.parallelize(...), and the API will take care of the rest. Optionally, you can pass an additional parameter telling it into how many partitions to split the data.
    2. Use JavaSparkContext.textFile(...) to read and parallelize a file, producing an RDD of Strings. You can further split it into columns by doing an additional String.split(...) inside a JavaRDD.map(...).
    3. Other APIs, like JdbcRDD for database reads.
    4. Start with non-parallelized data and use JavaRDD.repartition(...) to split the data downstream.
  2. classify = JavaRDD.map(...).
  3. 1 row to x rows = JavaRDD.flatMap(...).
  4. Do a parallel, concurrent insert using JavaRDD.foreachPartition(...) (if your database can support it; Oracle does). Just make sure you do a batch insert, not x individual inserts (a batch is not the same as x inserts with one commit).
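As a rough illustration, here is a minimal Scala sketch of steps 1-3 (the answer links the Java API, but the same calls exist in Scala). The file path and the classify / computeInstallments functions are placeholders for the asker's own logic, not something taken from the question:

    // 1. read and parallelize the input file, splitting each line into columns
    val rows = sc.textFile("hdfs:///data/contracts.tsv").map(_.split("\t"))

    // 2. classify each row (classify is a placeholder function)
    val classified = rows.map(cols => (cols(0), classify(cols)))

    // 3. one input row becomes several hundred output rows
    val installments = classified.flatMap { case (id, cls) =>
      computeInstallments(id, cls)   // placeholder returning a Seq of (id, date, amount1, amount2)
    }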
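And a hedged sketch of the per-partition batch insert described in step 4, assuming a plain JDBC connection; the table name, JDBC URL and credentials are placeholders:

    // 4. one connection and one batch per partition, instead of one insert per row
    installments.foreachPartition { partition =>
      val conn = java.sql.DriverManager.getConnection(jdbcUrl, dbUser, dbPassword)  // placeholders
      conn.setAutoCommit(false)
      val stmt = conn.prepareStatement(
        "INSERT INTO installments (orig_id, due_date, amount1, amount2) VALUES (?, ?, ?, ?)")
      partition.foreach { case (id, date, amount1, amount2) =>
        stmt.setString(1, id)
        stmt.setString(2, date)
        stmt.setDouble(3, amount1)
        stmt.setDouble(4, amount2)
        stmt.addBatch()
      }
      stmt.executeBatch()   // a single batched round-trip per partition
      conn.commit()
      conn.close()
    }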

This is all very typical Spark coding that can be read from the Spark Programming Guide. You can switch the documentation between Java, Scala and Python.

I apologize for providing all the information with links to the JavaDoc. I didn't notice at first that your question was Python specific. However, the same still applies: the API has been fully mapped to Python (for the most part at least, and maybe with a few improvements).

If I can give you one good piece of advice: work in a decent IDE that provides context-sensitive help and auto-completion. It will definitely help you discover which methods can work for you.

answered Sep 29 '22 by YoYo


If I understand your question, in Spark it would be resolved like this:

1.- Read with spark-csv and set the delimiter property to "\t".

2.- Over the RDD, map to apply a function to every record.

3.- Use flatMap to multiply the results (one row becomes many).

4.- Save with SQLContext.

5.- Read the other tables with sqlContext and apply a join.

Then you can run map and reduce.

Example (the numbers stand for the steps above):

    val a = sc.textFile(1).map(2).flatMap(3)
    a.saveAs(4)
    a.join(otherRDD)...
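For a more concrete picture, here is a hedged Scala sketch of those five steps, assuming the input is a tab-separated file read with the spark-csv package (Spark 1.x style). The paths, column names and the classify / computeInstallments functions are placeholders, not something the answer specifies:

    // 1. read the tab-separated input with the spark-csv package
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val input = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("delimiter", "\t")
      .load("hdfs:///data/contracts.tsv")

    // 2-3. classify each record, then expand one row into many installment rows;
    //      classify and computeInstallments are placeholders for the asker's own logic,
    //      assumed here to yield (id, date, amount1, amount2) tuples
    val installments = input.rdd
      .map(row => (row.getString(0), classify(row)))
      .flatMap { case (id, cls) => computeInstallments(id, cls) }

    // 4. save the result as a new table (here again as a tab-separated file)
    import sqlContext.implicits._
    val installmentsDF = installments.toDF("id", "date", "amount1", "amount2")
    installmentsDF.write
      .format("com.databricks.spark.csv")
      .option("delimiter", "\t")
      .save("hdfs:///data/installments.tsv")

    // 5. read another table and join it before running further map/reduce steps
    val other = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("delimiter", "\t")
      .load("hdfs:///data/accounts.tsv")
    val joined = installmentsDF.join(other, installmentsDF("id") === other("C0"))  // join column name is a placeholder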

answered Sep 29 '22 by DanielVL