Is my understanding of parallel operations in Spark correct?

I am a newbie to Spark and am trying to understand its concepts using Python. While developing applications for Spark in Python, I get a bit confused about how my data is processed in parallel.

1. Everyone says that I don't need to worry about which nodes, or how many nodes, will be involved in processing the data encapsulated in my RDD variables. Based on my best understanding, I believe that what a Spark cluster would do with the code below:

a = sc.textFile(filename)
b = a.filter(lambda x: len(x) > 0 and x.split("\t").count("9999-12-31") == 0)
c = b.collect()

could be described by the following steps:

(1) Variable a will be an RDD containing the contents of the text file.
(2) Different chunks of RDD a will be broadcast to different nodes in the cluster, and the filter method will be applied to each chunk on its node.
(3) When the collect action is invoked, the results will be returned to the master from the different nodes and saved as a local variable, c.

Is my description right? If not, what exactly is the procedure? If I am right, what is the point of the parallelize method? Does the following code go through the same steps as the one above?

a = sc.textFile(filename).collect()
b = sc.parallelize(a).filter(lambda x: len(x) > 0 and x.split("\t").count("9999-12-31") == 0)
c = b.collect()
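For reference, here is how I would try to compare the two approaches (just a sketch on my side; it assumes an active SparkContext sc and that filename points at a readable text file), by looking at the partition counts:

# Version 1: the file is read directly into a distributed RDD
a1 = sc.textFile(filename)
print(a1.getNumPartitions())      # partitions come from the input splits / HDFS blocks

# Version 2: the file is first pulled to the driver, then re-distributed
local_lines = sc.textFile(filename).collect()   # a plain Python list on the driver
b1 = sc.parallelize(local_lines)
print(b1.getNumPartitions())      # partitions chosen by parallelize (default parallelism)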

2. For the following code, would the SQL query be processed in parallel by dividing the defined table into many partitions?

from pyspark.sql import Row

a = sc.textFile(filename)
b = a.filter(lambda x: len(x) > 0 and x.split("\t").count("9999-12-31") == 0)
parts = b.map(lambda x: x.split("\t"))
records = parts.map(lambda x: Row(r0=str(x[0]), r1=x[1], r2=x[2]))
rTable = sqlContext.createDataFrame(records)
rTable.registerTempTable("rTable")
result = sqlContext.sql("select substr(r0,1,2), case when r1=1 then r1*100 else r1*10 end, r2 from rTable").collect()
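If it helps, this is how I would try to check whether the table is really split into several partitions (only a sketch, reusing the parts RDD and rTable DataFrame defined above):

print(parts.getNumPartitions())        # partitions of the filtered/split RDD
print(rTable.rdd.getNumPartitions())   # partitions of the DataFrame's underlying RDD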
asked Sep 28 '15 by Buddhainside


1 Answer

Your description of the first step is correct, but there is more to say about the second and third steps.

Second Step:

According to the Spark documentation:

def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]

The textFile method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 64MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value.

If you put your file in HDFS and pass its path to textFile, the partitions of RDD a are created based on the HDFS blocks. So in this case the amount of parallelization depends on the number of HDFS blocks. The data has also already been partitioned and distributed to the cluster machines by HDFS.
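As a quick illustration (just a sketch; the HDFS path below is made up), you can ask for more partitions than the block count and check what Spark actually created:

# One partition per HDFS block by default
rdd_default = sc.textFile("hdfs:///data/input.txt")
print(rdd_default.getNumPartitions())

# Ask for at least 16 partitions; Spark may create more, but never fewer
rdd_more = sc.textFile("hdfs:///data/input.txt", minPartitions=16)
print(rdd_more.getNumPartitions())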

If you use a path on the local file system (one that is available on all nodes) and do not specify minPartitions, the default parallelism (which depends on the number of cores in your cluster) is used. In this case you have to copy the file to every worker, or put it in shared storage that every worker can access.
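A minimal way to see the default parallelism at work (again only a sketch, assuming an active SparkContext sc):

# Derived from the total number of cores available to the application
print(sc.defaultParallelism)

# An RDD built without an explicit partition count uses that default
rdd = sc.parallelize(range(1000))
print(rdd.getNumPartitions())

# The number of partitions can also be set explicitly
rdd8 = sc.parallelize(range(1000), 8)
print(rdd8.getNumPartitions())   # 8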

In both cases, Spark avoids broadcasting any data and instead tries to use the blocks that already exist on each machine. So your second step is not entirely accurate.

Third Step:

According to the Spark documentation:

collect(): Array[T]
Return an array that contains all of the elements in this RDD.

In this step, the elements of RDD b are collected from the worker nodes into your driver program.
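In other words (a small sketch, reusing the RDDs from your question), after collect() the result is an ordinary Python list that lives only in the driver:

a = sc.textFile(filename)
b = a.filter(lambda x: len(x) > 0 and x.split("\t").count("9999-12-31") == 0)

c = b.collect()    # pulls every element of b from the executors to the driver
print(type(c))     # a plain local Python list, no longer distributed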

answered Sep 18 '22 by Farzad Nozarian