Difference between na().drop() and filter(col.isNotNull) (Apache Spark)

Tags: apache-spark, apache-spark-sql

Is there any difference in semantics between df.na().drop() and df.filter(df.col("onlyColumnInOneColumnDataFrame").isNotNull() && !df.col("onlyColumnInOneColumnDataFrame").isNaN()), where df is an Apache Spark DataFrame?

Or should I consider it a bug if the result of the first one contains no null (an actual null value, not the string "null") in the column onlyColumnInOneColumnDataFrame, while the result of the second one does?

EDIT: added !isNaN() as well. onlyColumnInOneColumnDataFrame is the only column in the given DataFrame. Let's say its type is Integer.

asked Feb 18 '16 09:02

JiriS

1 Answer

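df.na.drop() drops every row that contains a null or NaN value in any column.

df.filter(df.col("onlyColumnInOneColumnDataFrame").isNotNull()) drops only the rows in which the column onlyColumnInOneColumnDataFrame is null. Nulls in other columns are not checked, and NaN values are kept unless you also add the !isNaN() condition.

To make na.drop look at that one column only, pass the column name as the subset: df.na.drop(Seq("onlyColumnInOneColumnDataFrame")) in Scala, or df.na.drop(subset=["onlyColumnInOneColumnDataFrame"]) in PySpark. Since your DataFrame has a single column and your filter also excludes NaN, both approaches should keep the same rows.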
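A minimal sketch of the comparison, assuming Scala, a local SparkSession, and a made-up three-row sample. The object name, the helper names, and the sample data are hypothetical. The column is declared as Double rather than the Integer from the question so that a NaN can actually appear in it.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

object NaDropVsFilter {
  // Column name taken from the question.
  val ColName = "onlyColumnInOneColumnDataFrame"

  // Local session for illustration only; master and app name are arbitrary choices.
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("na-drop-vs-filter")
    .getOrCreate()

  // Hypothetical sample: one nullable Double column holding a value, a null and a NaN.
  def sample(): DataFrame = {
    val schema = StructType(Seq(StructField(ColName, DoubleType, nullable = true)))
    val rows = java.util.Arrays.asList(Row(1.0), Row(null), Row(Double.NaN))
    spark.createDataFrame(rows, schema)
  }

  // Drops rows that have a null or NaN in any column.
  def viaNaDrop(df: DataFrame): DataFrame = df.na.drop()

  // Same check, but only the named column is inspected.
  def viaNaDropSubset(df: DataFrame): DataFrame = df.na.drop(Seq(ColName))

  // The filter from the question: keep rows that are neither null nor NaN in the column.
  def viaFilter(df: DataFrame): DataFrame = {
    val c = df.col(ColName)
    df.filter(c.isNotNull && !c.isNaN)
  }

  def main(args: Array[String]): Unit = {
    val df = sample()
    viaNaDrop(df).show()
    viaNaDropSubset(df).show()
    viaFilter(df).show()
    spark.stop()
  }
}
```

On this one-column sample, all three variants should return the same rows. If they do not in your setup, comparing the schema and the Spark version would be the next thing to check before treating it as a bug.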

answered Oct 11 '22 20:10

Daniel Zolnai
