Is my understanding of parallel operations in Spark correct?

I am a newbie to Spark and am trying to understand its concepts using Python. While developing applications for Spark in Python, I get a bit confused about how my data is processed in parallel.

1. Everyone says that I don't need to worry about which nodes, or how many nodes, will be involved in processing the data encapsulated in my RDD variables. Based on my best understanding, I believe that what a Spark cluster would do with the code below:

a = sc.textFile(filename)
b = a.filter(lambda x: len(x) > 0 and x.split("\t").count("9999-12-31") == 0)
c = b.collect()

could be described by the following steps:

(1) Variable a will be an RDD containing the contents of the text file.
(2) Different chunks of RDD a will be broadcast to different nodes in the cluster, and the filter method will be applied to each chunk on its node.
(3) When the collect action is invoked, the results will be returned to the master from the different nodes and saved as a local variable, c.

Is my description right? If not, what exactly is the procedure? If I am right, what is the point of the parallelize method? Does the following code go through the same steps as the one above?

a = sc.textFile(filename).collect()
b = sc.parallelize(a).filter(lambda x: len(x) > 0 and x.split("\t").count("9999-12-31") == 0)
c = b.collect()
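For reference, here is how I would try to compare the two approaches (just a sketch on my side; it assumes an active SparkContext sc and that filename points at a readable text file), by looking at the partition counts:

# Version 1: the file is read directly into a distributed RDD
a1 = sc.textFile(filename)
print(a1.getNumPartitions())      # partitions come from the input splits / HDFS blocks

# Version 2: the file is first pulled to the driver, then re-distributed
local_lines = sc.textFile(filename).collect()   # a plain Python list on the driver
b1 = sc.parallelize(local_lines)
print(b1.getNumPartitions())      # partitions chosen by parallelize (default parallelism)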

2. For the following code, would the SQL query be processed in parallel by dividing the defined table into many partitions?

from pyspark.sql import Row

a = sc.textFile(filename)
b = a.filter(lambda x: len(x) > 0 and x.split("\t").count("9999-12-31") == 0)
parts = b.map(lambda x: x.split("\t"))
records = parts.map(lambda x: Row(r0=str(x[0]), r1=x[1], r2=x[2]))
rTable = sqlContext.createDataFrame(records)
rTable.registerTempTable("rTable")
result = sqlContext.sql("select substr(r0,1,2), case when r1=1 then r1*100 else r1*10 end, r2 from rTable").collect()
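If it helps, this is how I would try to check whether the table is really split into several partitions (only a sketch, reusing the parts RDD and rTable DataFrame defined above):

print(parts.getNumPartitions())        # partitions of the filtered/split RDD
print(rTable.rdd.getNumPartitions())   # partitions of the DataFrame's underlying RDD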
asked Sep 28 '15 by Buddhainside


1 Answer

Your description of the first step is correct, but there is more to say about the second and third steps.

Second Step:

According to the Spark documentation:

def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]

The textFile method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 64MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value.

If you put your file in HDFS and pass its path to textFile, the partitions of RDD a are created based on the HDFS blocks. So in this case the amount of parallelization depends on the number of HDFS blocks. The data has also already been partitioned and distributed to the cluster machines by HDFS.
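As a quick illustration (just a sketch; the HDFS path below is made up), you can ask for more partitions than the block count and check what Spark actually created:

# One partition per HDFS block by default
rdd_default = sc.textFile("hdfs:///data/input.txt")
print(rdd_default.getNumPartitions())

# Ask for at least 16 partitions; Spark may create more, but never fewer
rdd_more = sc.textFile("hdfs:///data/input.txt", minPartitions=16)
print(rdd_more.getNumPartitions())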

If you use a path on the local file system (one that is available on all nodes) and do not specify minPartitions, the default parallelism (which depends on the number of cores in your cluster) is used. In this case you have to copy the file to every worker, or put it in shared storage that every worker can access.
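A minimal way to see the default parallelism at work (again only a sketch, assuming an active SparkContext sc):

# Derived from the total number of cores available to the application
print(sc.defaultParallelism)

# An RDD built without an explicit partition count uses that default
rdd = sc.parallelize(range(1000))
print(rdd.getNumPartitions())

# The number of partitions can also be set explicitly
rdd8 = sc.parallelize(range(1000), 8)
print(rdd8.getNumPartitions())   # 8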

In both cases, Spark avoids broadcasting any data and instead tries to use the blocks that already exist on each machine. So your second step is not entirely accurate.

Third Step:

According to the Spark documentation:

collect(): Array[T]
Return an array that contains all of the elements in this RDD.

In this step, the elements of RDD b are collected from the worker nodes into your driver program.
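In other words (a small sketch, reusing the RDDs from your question), after collect() the result is an ordinary Python list that lives only in the driver:

a = sc.textFile(filename)
b = a.filter(lambda x: len(x) > 0 and x.split("\t").count("9999-12-31") == 0)

c = b.collect()    # pulls every element of b from the executors to the driver
print(type(c))     # a plain local Python list, no longer distributed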

answered Sep 18 '22 by Farzad Nozarian