Split Time Series pySpark data frame into test & train without using random split

Tags:

python

rdd

apache-spark-sql

pyspark

I have a Spark time series DataFrame that I would like to split 80-20 into train and test sets. Because the data is a time series, I don't want a random split. How can I split it so that the first 80% of the rows (in time order) go to train and the remaining 20% go to test?

973

asked Aug 09 '18 17:08

Rohit

1 Answer

You can use pyspark.sql.functions.percent_rank() to get the percentile ranking of each row in your DataFrame, ordered by the timestamp/date column. Then pick all the rows with a rank <= 0.8 as your training set and the rest as your test set.

For example, if you had the following DataFrame:

df.show(truncate=False)
#+---------------------+---+
#|date                 |x  |
#+---------------------+---+
#|2018-01-01 00:00:00.0|0  |
#|2018-01-02 00:00:00.0|1  |
#|2018-01-03 00:00:00.0|2  |
#|2018-01-04 00:00:00.0|3  |
#|2018-01-05 00:00:00.0|4  |
#+---------------------+---+

You'd want the first 4 rows in your training set and the last one in your test set. First add a column rank:

from pyspark.sql.functions import percent_rank
from pyspark.sql import Window

# An empty partitionBy() puts every row in a single window, ordered by date
df = df.withColumn("rank", percent_rank().over(Window.partitionBy().orderBy("date")))
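To see why a 0.8 cutoff works, it may help to look at what percent_rank() computes: (rank - 1) / (n - 1), where rank is the row's 1-based position in the ordering and tied values share a rank. Below is a minimal plain-Python sketch of that formula, with no Spark involved. The local percent_rank helper is a hypothetical stand-in written for illustration, not the PySpark function itself.

```python
from datetime import date

def percent_rank(values):
    """Illustrative stand-in: (rank - 1) / (n - 1), ties share the lowest rank."""
    n = len(values)
    if n < 2:
        return [0.0] * n
    first_pos = {}
    for pos, v in enumerate(sorted(values)):
        # 0-based position of the first occurrence equals rank - 1
        first_pos.setdefault(v, pos)
    return [first_pos[v] / (n - 1) for v in values]

# The five dates from the example DataFrame
dates = [date(2018, 1, d) for d in range(1, 6)]
ranks = percent_rank(dates)
print(ranks)  # [0.0, 0.25, 0.5, 0.75, 1.0]

# Same thresholds as the Spark filters
train = [d for d, r in zip(dates, ranks) if r <= 0.8]
test = [d for d, r in zip(dates, ranks) if r > 0.8]
```

Since tied timestamps get the same percent rank, rows that share a timestamp always end up on the same side of the split.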

Now use rank to split your data into train and test:

train_df = df.where("rank <= .8").drop("rank")
train_df.show()
#+---------------------+---+
#|date                 |x  |
#+---------------------+---+
#|2018-01-01 00:00:00.0|0  |
#|2018-01-02 00:00:00.0|1  |
#|2018-01-03 00:00:00.0|2  |
#|2018-01-04 00:00:00.0|3  |
#+---------------------+---+

test_df = df.where("rank > .8").drop("rank")
test_df.show()
#+---------------------+---+
#|date                 |x  |
#+---------------------+---+
#|2018-01-05 00:00:00.0|4  |
#+---------------------+---+
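One thing to keep in mind is that the split is only approximately 80-20, because the cutoff is applied to (rank - 1) / (n - 1) and not to a row count. Here is a rough plain-Python sketch that counts how many rows fall at or below the threshold for a few sizes, assuming every timestamp is distinct. The train_count helper is a hypothetical name used only for this illustration.

```python
def train_count(n_rows, threshold=0.8):
    """Rows whose percent rank is <= threshold, assuming distinct timestamps."""
    if n_rows < 2:
        return n_rows
    # Row i (0-based, in time order) has percent rank i / (n_rows - 1)
    return sum(1 for i in range(n_rows) if i / (n_rows - 1) <= threshold)

for n in (3, 5, 10, 100):
    k = train_count(n)
    print(f"{n} rows -> {k} train, {n - k} test")
```

For very small DataFrames the proportions can drift noticeably from 80-20 (3 rows gives 2 train and 1 test), while for larger ones they settle close to the target.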

140

answered Oct 13 '22 16:10

pault
