This is what I get when I use toDebugString in Scala:
scala> val a = sc.parallelize(Array(1,2,3)).distinct
a: org.apache.spark.rdd.RDD[Int] = MappedRDD[3] at distinct at <console>:12
scala> a.toDebugString
res0: String =
(4) MappedRDD[3] at distinct at <console>:12
| ShuffledRDD[2] at distinct at <console>:12
+-(4) MappedRDD[1] at distinct at <console>:12
| ParallelCollectionRDD[0] at parallelize at <console>:12
This is the equivalent in Python:
>>> a = sc.parallelize([1,2,3]).distinct()
>>> a.toDebugString()
'(4) PythonRDD[6] at RDD at PythonRDD.scala:43\n | MappedRDD[5] at values at NativeMethodAccessorImpl.java:-2\n | ShuffledRDD[4] at partitionBy at NativeMethodAccessorImpl.java:-2\n +-(4) PairwiseRDD[3] at RDD at PythonRDD.scala:261\n | PythonRDD[2] at RDD at PythonRDD.scala:43\n | ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:315'
As you can see, the output is not as nice in Python as in Scala. Is there any trick to get a nicer output from this function?
I am using Spark 1.1.0.
toDebugString method to get the RDD lineage graph in Spark
To indicate shuffle boundaries, the toDebugString method uses indentation. The number in round brackets shows the level of parallelism at each stage, for example (4) in the output above.
In Spark, the dependencies between RDDs are recorded as a graph. In simpler words, every step is part of the lineage. By calling the toDebugString method you are essentially asking for this lineage graph (i.e. the chain of every individual step that happened: the type of RDD created and the method used to create it) to be displayed.
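To make this "chain of steps" concrete, here is a minimal sketch (assuming an existing SparkContext named sc, as in the question, and the Python 2-style print used in the answer below); each transformation becomes one entry in the printed lineage, and reduceByKey adds a shuffle boundary shown as an indented "+-(N)" branch:

# A minimal sketch, assuming an existing SparkContext named sc.
# Each transformation adds one step to the lineage; reduceByKey
# introduces a shuffle boundary in the printed graph.
pairs = (sc.parallelize(range(10))
           .map(lambda x: (x % 3, x))
           .reduceByKey(lambda a, b: a + b))
print pairs.toDebugString()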
Spark is an awesome framework, and the Scala and Python APIs are both great for most workflows. PySpark is more popular because Python is the most popular language in the data community. PySpark is a well-supported, first-class Spark API and a great choice for most organizations.
Lineage is the mechanism an RDD uses to reconstruct lost partitions. Spark does not replicate the data in memory; if data is lost, the RDD uses its lineage to rebuild it. Each RDD remembers how it was built from other datasets.
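A hedged sketch of that idea (again assuming sc): an RDD marked for caching is not replicated, so if a cached partition is evicted or lost, Spark recomputes it from the recorded lineage. The storage level also shows up in the bracketed annotations that toDebugString prints, like the "[Serialized 1x Replicated]" tags in the answer output below.

# A minimal sketch: cache() keeps computed partitions in memory but does not
# replicate them; lost or evicted partitions are rebuilt from this lineage.
squares = sc.parallelize(range(100)).map(lambda x: x * x).cache()
squares.count()                # materializes (and caches) the partitions
print squares.toDebugString()  # lineage plus the RDD's storage-level annotation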
Try adding a print statement so that the debug string is actually printed, rather than displaying its __repr__:
>>> a = sc.parallelize([1,2,3]).distinct()
>>> print a.toDebugString()
(8) PythonRDD[27] at RDD at PythonRDD.scala:44 [Serialized 1x Replicated]
| MappedRDD[26] at values at NativeMethodAccessorImpl.java:-2 [Serialized 1x Replicated]
| ShuffledRDD[25] at partitionBy at NativeMethodAccessorImpl.java:-2 [Serialized 1x Replicated]
+-(8) PairwiseRDD[24] at distinct at <stdin>:1 [Serialized 1x Replicated]
| PythonRDD[23] at distinct at <stdin>:1 [Serialized 1x Replicated]
| ParallelCollectionRDD[21] at parallelize at PythonRDD.scala:358 [Serialized 1x Replicated]
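If you later move to Python 3 (where print is a function), the same trick applies; note that in some PySpark versions toDebugString() returns a bytes object, so you may need to decode it first, e.g.:

print(a.toDebugString().decode("utf-8"))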