I am new to Spark. When I use the toDF() function to convert an RDD to a DataFrame, it seems to execute the transformation functions such as map() that I wrote before it. I'm wondering whether toDF() in PySpark is a transformation or an action.
I created a simple RDD and used a simple function to print its value, just as a test, and called toDF() after map(). The result seems to run the function in map() partially. And when I show() the resulting DataFrame, toDF() acts like a transformation and outputs the result again.
>>> a = sc.parallelize([(1,),(2,),(3,)])
>>> def f(x):
...     print(x[0])
...     return (x[0] + 1, )
...
>>> b = a.map(f).toDF(["id"])
2
1
>>> b = a.map(f).toDF(["id"]).show()
2
1
1
2
3
+---+
| id|
+---+
| 2|
| 3|
| 4|
+---+
Could someone tell me why the toDF() function in PySpark acts both like an action and like a transformation? Thanks a lot.
PS: In Scala, toDF acts like a transformation in my case.
toDF() is a method in PySpark used to create a DataFrame from an RDD. Once the RDD has been converted into a DataFrame, the data becomes more organized and easier to analyze.
Converting a Spark RDD to a DataFrame can be done with toDF(), with createDataFrame(), or by transforming an RDD of Row objects into a DataFrame.
In Scala, toDF converts a typed Dataset into an untyped DataFrame and is documented as a basic action. Internally, the no-argument toDF creates a Dataset[Row] using the Dataset's SparkSession and QueryExecution, with RowEncoder as the encoder.
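As a quick illustration of these conversion paths (a minimal sketch; the RDD contents and the column names id and label are just placeholders):

from pyspark.sql import Row

rdd = sc.parallelize([(1, "a"), (2, "b")])

# 1. toDF() called on the RDD itself (column types are inferred from the data)
df1 = rdd.toDF(["id", "label"])

# 2. createDataFrame() called on the SparkSession
df2 = spark.createDataFrame(rdd, ["id", "label"])

# 3. an RDD of Row objects passed to createDataFrame()
df3 = spark.createDataFrame(rdd.map(lambda t: Row(id=t[0], label=t[1])))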
That's not strange. Since you didn't provide the schema, Spark has to infer it based on the data. If an RDD is the input, it will call SparkSession._createFromRDD and subsequently SparkSession._inferSchema, which, if samplingRatio is missing, will evaluate up to 100 rows:
first = rdd.first()
if not first:
    raise ValueError("The first row in RDD is empty, "
                     "can not infer schema")
if type(first) is dict:
    warnings.warn("Using RDD of dict to inferSchema is deprecated. "
                  "Use pyspark.sql.Row instead")

if samplingRatio is None:
    schema = _infer_schema(first, names=names)
    if _has_nulltype(schema):
        for row in rdd.take(100)[1:]:
            schema = _merge_type(schema, _infer_schema(row, names=names))
            if not _has_nulltype(schema):
                break
        else:
            raise ValueError("Some of types cannot be determined by the "
                             "first 100 rows, please try again with sampling")
Now the only puzzle left is why it doesn't evaluate exactly one record. After all, in your case first is not empty and doesn't contain None.
That's because first is implemented through take, which doesn't guarantee that an exact number of items will be evaluated. If the first partition doesn't yield the required number of items, it iteratively increases the number of partitions to scan. Please check the take implementation for details.
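You can make this visible with a hypothetical variation of your example: spread the three records over more partitions than there are records, so the first partition scanned by first() is empty and Spark scans several partitions in a follow-up job. Exactly how many records get evaluated depends on how the data lands in partitions, so treat the output as illustrative:

# 3 records over 4 partitions: the first partition is empty, so take(1)/first()
# launches a follow-up job over several partitions and f may print more than once
c = sc.parallelize([(1,), (2,), (3,)], 4).map(f)
c.first()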
If you want to avoid this, you should use createDataFrame and provide the schema, either as a DDL string:
spark.createDataFrame(a.map(f), "val: integer")
or as an equivalent StructType.
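For reference, a sketch of the equivalent StructType (the column name val just mirrors the DDL string above):

from pyspark.sql.types import StructType, StructField, IntegerType

schema = StructType([StructField("val", IntegerType(), True)])
spark.createDataFrame(a.map(f), schema)

Because the schema is supplied up front, nothing is evaluated until you call an action such as show().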
You won't find any similar behavior in the Scala counterpart, because it doesn't use schema inference in toDF. It either retrieves the corresponding schema from the Encoder (which is fetched using Scala reflection), or doesn't allow the conversion at all. The closest similar behavior is inference on an input source like CSV or JSON:
spark.read.json(Seq("""{"foo": "bar"}""").toDS.map(x => { println(x); x }))