Filtering a Spark DataFrame based on date

I have a dataframe of

date, string, string

I want to select only the rows with a date before a certain day. I have tried the following with no luck:

 data.filter(data("date") < new java.sql.Date(format.parse("2015-03-14").getTime))

I'm getting an error stating the following:

org.apache.spark.sql.AnalysisException: resolved attribute(s) date#75 missing from date#72,uid#73,iid#74 in operator !Filter (date#75 < 16508);

As far as I can guess, the query is incorrect. Can anyone show me how the query should be formatted?

I checked that all entries in the DataFrame have values; they do.

asked Aug 13 '15 by Steve

People also ask

How do I filter rows in a Spark DataFrame?

Spark's filter() or where() function filters rows from a DataFrame or Dataset based on one or more conditions or a SQL expression. You can use where() instead of filter() if you are coming from a SQL background; both functions behave identically.
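
For example, a minimal sketch in Scala (the names spark and df here are hypothetical, for illustration only):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("filter-demo").master("local[*]").getOrCreate()
import spark.implicits._

// hypothetical sample data
val df = Seq(("a1", "PENDING"), ("a2", "COMPLETE")).toDF("id", "status")

// filter() and where() are interchangeable
df.filter($"status" === "PENDING").show()
df.where($"status" === "PENDING").show()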

How do I sort my Spark DataFrame?

In Spark, we can use either the sort() or orderBy() function of a DataFrame/Dataset to sort in ascending or descending order on single or multiple columns; you can also sort using Spark SQL sorting functions such as asc_nulls_first(), asc_nulls_last(), desc_nulls_first(), and desc_nulls_last().
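
A minimal sketch, reusing the hypothetical df from the previous snippet:

import org.apache.spark.sql.functions.{asc, desc}

// sort() and orderBy() are aliases of each other
df.sort(desc("status")).show()
df.orderBy(asc("id"), desc("status")).show()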

What is the function of filter() in Spark?

In Spark, the filter function returns a new Dataset formed by selecting those elements of the source on which the given function returns true, so it retrieves only the elements that satisfy the condition.
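
On a typed Dataset, the filter can be a plain Scala predicate. A sketch, again reusing the hypothetical df (the Order case class is made up for illustration):

// needs the spark.implicits._ import from the first sketch for the encoder
case class Order(id: String, status: String)

val ds = df.as[Order]
// keep only the elements for which the predicate returns true
ds.filter(order => order.status == "PENDING").show()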


6 Answers

The following solutions are applicable since Spark 1.5:

For lower than:

// filter data where the date is earlier than 2015-03-14
data.filter(data("date").lt(lit("2015-03-14")))      

For greater than:

// filter data where the date is greater than 2015-03-14
data.filter(data("date").gt(lit("2015-03-14"))) 

For equality, you can use either equalTo or ===:

data.filter(data("date") === lit("2015-03-14"))

If your DataFrame date column is of type StringType, you can convert it using the to_date function:

// filter data where the date is greater than 2015-03-14
data.filter(to_date(data("date")).gt(lit("2015-03-14"))) 

You can also filter according to the year using the year function:

// filter data where the year is greater than or equal to 2016
data.filter(year($"date").geq(lit(2016))) 
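
For reference, the snippets above rely on these imports (spark here is your SparkSession):

import org.apache.spark.sql.functions.{lit, to_date, year}
import spark.implicits._ // provides the $"date" column syntax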
answered Oct 21 '22 by eliasah

Don't use this, as suggested in other answers:

.filter(f.col("dateColumn") < f.lit('2017-11-01'))

But use this instead:

.filter(f.col("dateColumn") < f.unix_timestamp(f.lit('2017-11-01 00:00:00')).cast('timestamp'))

This compares against a TimestampType rather than a StringType, which is more performant in some cases; Parquet predicate pushdown, for example, only works with the typed comparison.

Edit: Both snippets assume this import:

from pyspark.sql import functions as f
answered Oct 21 '22 by Ruurtjan Pul


I find the most readable way to express this is using a SQL expression:

df.filter("my_date < date'2015-01-01'")

We can verify this works correctly by looking at the physical plan from .explain().
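
The plan below can be produced by a call like this (a sketch, assuming a DataFrame df with a my_date column):

df.filter("my_date < date'2015-01-01'").explain()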

+- *(1) Filter (isnotnull(my_date#22) && (my_date#22 < 16436))
answered Oct 21 '22 by RobinL


In PySpark (Python), one of the options is to have the column in unix_timestamp format. We can convert a string to a unix_timestamp and specify the format, as shown below. Note that we need to import the unix_timestamp, to_date, and lit functions.

from pyspark.sql.functions import unix_timestamp, to_date, lit

df_cast = df.withColumn("tx_date", to_date(unix_timestamp(df["date"], "MM/dd/yyyy").cast("timestamp")))

Now we can apply the filters:

df_cast.filter(df_cast["tx_date"] >= lit('2017-01-01')) \
       .filter(df_cast["tx_date"] <= lit('2017-01-31')).show()
answered Oct 21 '22 by Prathap Kudupu


df = df.filter(df["columnname"] >= '2020-01-13')
answered Oct 21 '22 by Prastuti Srivastava


We can also use a SQL-style expression inside filter:

Note: here I am showing two conditions and a date range, for future reference:


ordersDf.filter("order_status = 'PENDING_PAYMENT' AND order_date BETWEEN '2013-07-01' AND '2013-07-31' ")
answered Oct 21 '22 by Abhishek Sengupta