I'm looking into Spark as a possible calculation tool, but I was unable to find examples for the use case I have in mind. What I want to do is somewhat the opposite of map and reduce (in the first step at least), which maybe makes Spark the wrong tool for the job, so before I disregard it I wanted to check here whether someone has a good idea of how this can be done.
Data flow would look like this:
Of course, the calculation is supposed to be done on the compute nodes so it can work in parallel, and I would like to move the data only once - which means that a single chunk of the input table will be processed by only a single node.
Is this feasible? If it's not, is there an alternative that can be used for this purpose and integrated with Spark?
Everything you are listing fits perfectly into a typical Spark flow.
1. JavaSparkContext.parallelize(...), and the API will take care of the rest. Optionally you can pass an additional parameter telling it by how much you want to parallelize.
2. JavaRDD.map(...).
3. JavaRDD.repartition(...) to split the data downstream.
4. JavaRDD.map(...).
5. JavaRDD.flatMap(...).
6. JavaRDD.foreachPartition(...) (if your database can support it; Oracle does). Just make sure you do a batch insert, not x individual inserts (a batch is not the same as x inserts with one commit).
This is all very typical Spark coding and can be read in the Spark Programming Guide. You can switch the documentation between Java/Scala/Python.
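To make that list concrete on the Python side, here is a minimal PySpark sketch of the same flow. The calculate and expand functions, the partition counts, and the print standing in for the batch insert are made-up placeholders, not anything from your actual pipeline:

from pyspark import SparkContext

sc = SparkContext(appName="flow-sketch")

# Hypothetical per-record calculation: here it just squares the value.
def calculate(x):
    return x * x

# Hypothetical "one record in, many records out" step.
def expand(x):
    return [(x, i) for i in range(3)]

# Load the data; the second argument controls into how many
# partitions (i.e. by how much) you parallelize.
rdd = sc.parallelize(range(100), 8)

# map / repartition / flatMap, as in the list above.
results = rdd.map(calculate).repartition(16).flatMap(expand)

# Write each partition as one batch insert, not one insert per row.
def write_partition(rows):
    batch = list(rows)
    # Replace this with a single batched INSERT against your database.
    print("would batch-insert %d rows" % len(batch))

results.foreachPartition(write_partition)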
I apologize for providing all the information with links to the JavaDoc. I didn't notice at first that your question was Python specific. However, the same still applies: the API has been fully mapped to Python (for the most part at least, and maybe with a few improvements).
If I can give you one good piece of advice: work in a decent IDE that provides context-sensitive help and auto-completion. It will definitely help you discover which methods can work for you.
If I understand your question, in Spark it would be solved like this:
1.- Read the file with spark-csv and set the delimiter option to "\t".
2.- Over the RDD, use map to apply a function to every record.
3.- Use flatMap to multiply the results.
4.- Save with the SQLContext.
5.- Read the other tables with the sqlContext and apply a join.
Then you can run your mapReduce.
Example:
val a = sc.textFile(path).map(f).flatMap(g)   // steps 1, 2 and 3
a.saveAsTextFile(out)                         // step 4
a.join(otherRDD)                              // step 5
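For the Python side, a rough DataFrame-based sketch of the same steps could look like the following. It assumes the spark-csv package is on the classpath (on Spark 2+ you would use the built-in csv reader instead), and the file paths, column names and the toy flatMap function are invented for illustration:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="csv-flow-sketch")
sqlContext = SQLContext(sc)

# 1. Read a tab-delimited file with spark-csv.
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("delimiter", "\t")
      .option("header", "true")
      .load("/path/to/input.tsv"))

# 2./3. Apply a function to every record and multiply the results.
expanded = df.rdd.flatMap(lambda row: [(row[0], i) for i in range(3)])

# 4. Save the result via the SQLContext.
result = sqlContext.createDataFrame(expanded, ["key", "n"])
result.write.format("com.databricks.spark.csv").save("/path/to/output")

# 5. Read another table and join on a common column.
other = (sqlContext.read
         .format("com.databricks.spark.csv")
         .option("delimiter", "\t")
         .option("header", "true")
         .load("/path/to/other.tsv"))
joined = result.join(other, result["key"] == other["key"])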