How can I distribute my task to all worker nodes in GCP? I am using PySpark

I have created a cluster with 1 master (clus-m) and two worker nodes (clus-w-0, clus-w-1) in GCP Dataproc. Now, using a PySpark RDD, I want to distribute one task so that all the nodes get involved. Below is my code snippet.

import os

import numpy as np
from pyspark import SparkContext

def pair_dist(row):
    # Count the positions where two comma-separated rows differ,
    # and tag the result with the hostname of the node that ran it.
    dissimilarity = 0
    Z = row[0].split(',')
    X = row[1].split(',')

    for j in range(len(Z)):
        if Z[j] != X[j]:
            dissimilarity += 1

    return str(dissimilarity) + os.uname()[1]

sc = SparkContext.getOrCreate()
rdd = sc.textFile("input.csv")

rdd = sc.parallelize(rdd.take(5))
rdd = rdd.cartesian(rdd)
dist = rdd.map(pair_dist).collect()
dist = np.array(dist).reshape((5, 5))
print(dist)

sc.stop()

To check whether the work is being distributed properly, I appended the host name to each result. But I always get the host name clus-m in the results, never the worker nodes' host names.

Output (5x5): [0clus-m 2clus-m ...... 1clus-m 0clus-m .......]

What exactly do I need to do?

asked Dec 10 '25 by srimanta


1 Answer

To distribute work, your input dataset has to be sharded. Since you're using sc.textFile("input.csv"), you will have a single mapper reading the file.
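You can confirm this yourself (a minimal sketch, reusing the input.csv from your question):

rdd = sc.textFile("input.csv")
# For a small single file this is typically only 1-2 partitions,
# so there is almost nothing to spread across the workers.
print(rdd.getNumPartitions())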

If, for instance, the input dataset is substantially multiplied through transformations, you could use RDD.repartition to make subsequent operations better parallelized.
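For example (a sketch; the partition count of 8 is an arbitrary value, not a recommendation):

rdd = sc.textFile("input.csv")
# repartition() triggers a shuffle that redistributes the data,
# so subsequent stages can run with 8 parallel tasks.
rdd = rdd.repartition(8)
print(rdd.getNumPartitions())  # 8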

Your best bet will be to split the input into multiple files.
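One way to do that from Spark itself (a sketch, assuming a hypothetical output directory named input_parts):

# Write the data back out as several part files...
sc.textFile("input.csv").repartition(4).saveAsTextFile("input_parts")
# ...then read the whole directory back: each part file
# can now be read by a separate task.
rdd = sc.textFile("input_parts")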

The Spark programming guide has these points that are relevant to your question:

All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").

The textFile method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
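In code, that second argument looks like this (a sketch; the minimum of 4 is an arbitrary example value):

# Ask for at least 4 partitions up front; Spark may create more
# (one per block), but never fewer than the block count.
rdd = sc.textFile("input.csv", minPartitions=4)
print(rdd.getNumPartitions())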

answered Dec 13 '25 by tix


