 

Spark toDebugString not nice in python

This is what I get when I use toDebugString in scala:

scala> val a  = sc.parallelize(Array(1,2,3)).distinct
a: org.apache.spark.rdd.RDD[Int] = MappedRDD[3] at distinct at <console>:12

scala> a.toDebugString
res0: String = 
(4) MappedRDD[3] at distinct at <console>:12
 |  ShuffledRDD[2] at distinct at <console>:12
 +-(4) MappedRDD[1] at distinct at <console>:12
    |  ParallelCollectionRDD[0] at parallelize at <console>:12

This is the equivalent in python:

>>> a = sc.parallelize([1,2,3]).distinct()
>>> a.toDebugString()
'(4) PythonRDD[6] at RDD at PythonRDD.scala:43\n |  MappedRDD[5] at values at NativeMethodAccessorImpl.java:-2\n |  ShuffledRDD[4] at partitionBy at NativeMethodAccessorImpl.java:-2\n +-(4) PairwiseRDD[3] at RDD at PythonRDD.scala:261\n    |  PythonRDD[2] at RDD at PythonRDD.scala:43\n    |  ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:315'

As you can see, the output is not as nice in python as in scala. Is there any trick to have a nicer output of this function?

I am using Spark 1.1.0.

asked Oct 13 '14 by poiuytrez

People also ask

What does the toDebugString() method do?

The toDebugString method returns a description of the RDD's lineage graph. Indentation marks shuffle boundaries, and the number in round brackets, for example (2), shows the level of parallelism (the number of partitions) at each stage.
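As a sketch of reading that output: the stage parallelism can be pulled out of the debug string with a simple regex (the sample string below is the Scala output from the question; the pattern is an illustration, not a Spark API):

```python
import re

# Sample toDebugString output (from the question's Scala session).
debug = """(4) MappedRDD[3] at distinct at <console>:12
 |  ShuffledRDD[2] at distinct at <console>:12
 +-(4) MappedRDD[1] at distinct at <console>:12
    |  ParallelCollectionRDD[0] at parallelize at <console>:12"""

# The number in parentheses at the start of a stage is that stage's
# partition count (level of parallelism).
stages = re.findall(r"\((\d+)\)", debug)
print(stages)  # ['4', '4'] -- both stages run with 4 partitions
```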

What is spark toDebugString?

In Spark, the dependencies between RDDs are recorded as a graph. In simpler words, every step is part of the lineage. Calling the toDebugString method asks for this lineage graph, i.e. the chain of individual steps that produced the RDD (the type of each RDD created and the method used to create it), to be displayed.

Which is better spark or PySpark?

Spark is an awesome framework and the Scala and Python APIs are both great for most workflows. PySpark is more popular because Python is the most popular language in the data community. PySpark is a well supported, first class Spark API, and is a great choice for most organizations.

What is the use of RDD lineage?

Lineage is the mechanism an RDD uses to reconstruct lost partitions. Spark does not replicate the data in memory; if data is lost, the RDD uses its lineage to rebuild it. Each RDD remembers how it was built from other datasets.
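The idea can be illustrated with a toy model (this is a hypothetical `ToyRDD` class for illustration, not Spark's implementation): each derived dataset stores only its parent and the transformation used, so its contents can always be recomputed instead of replicated.

```python
# Toy model of RDD lineage: a derived dataset remembers its parent and
# the transformation that produced it, so "lost" data is recomputed
# on demand by walking the lineage chain.
class ToyRDD:
    def __init__(self, data=None, parent=None, fn=None):
        self._data = data      # materialized data (None if not cached)
        self.parent = parent   # lineage: the dataset this was derived from
        self.fn = fn           # transformation applied to the parent

    def map(self, fn):
        # Record the dependency instead of computing eagerly.
        return ToyRDD(parent=self, fn=fn)

    def collect(self):
        if self._data is not None:
            return self._data
        # Not materialized: recompute from the parent via the lineage.
        return [self.fn(x) for x in self.parent.collect()]

base = ToyRDD(data=[1, 2, 3])
doubled = base.map(lambda x: x * 2)
print(doubled.collect())  # [2, 4, 6], recomputed from base via lineage
```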


1 Answer

Try adding a print statement so that the debug string is actually printed, rather than displaying its __repr__:

>>> a = sc.parallelize([1,2,3]).distinct()
>>> print(a.toDebugString())
(8) PythonRDD[27] at RDD at PythonRDD.scala:44 [Serialized 1x Replicated]
 |  MappedRDD[26] at values at NativeMethodAccessorImpl.java:-2 [Serialized 1x Replicated]
 |  ShuffledRDD[25] at partitionBy at NativeMethodAccessorImpl.java:-2 [Serialized 1x Replicated]
 +-(8) PairwiseRDD[24] at distinct at <stdin>:1 [Serialized 1x Replicated]
    |  PythonRDD[23] at distinct at <stdin>:1 [Serialized 1x Replicated]
    |  ParallelCollectionRDD[21] at parallelize at PythonRDD.scala:358 [Serialized 1x Replicated]
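The difference is purely presentational and can be reproduced without Spark: toDebugString() returns an ordinary str containing newline characters, and at the interactive prompt a bare expression displays the repr(), which escapes them. A minimal sketch:

```python
# A string with an embedded newline, like the one toDebugString() returns.
s = "(4) PythonRDD[6] at RDD at PythonRDD.scala:43\n |  MappedRDD[5] at values"

print(repr(s))  # one line; the newline appears as the two characters \n
print(s)        # two lines, matching the Scala-style layout
```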
answered Sep 28 '22 by Josh Rosen