How to slice a pyspark dataframe in two row-wise

Tags: python, pyspark, spark-dataframe, databricks

I am working in Databricks.

I have a dataframe which contains 500 rows. I would like to create two dataframes, one containing 100 rows and the other containing the remaining 400 rows.

+--------------------+----------+
|              userid| eventdate|
+--------------------+----------+
|00518b128fc9459d9...|2017-10-09|
|00976c0b7f2c4c2ca...|2017-12-16|
|00a60fb81aa74f35a...|2017-12-04|
|00f9f7234e2c4bf78...|2017-05-09|
|0146fe6ad7a243c3b...|2017-11-21|
|016567f169c145ddb...|2017-10-16|
|01ccd278777946cb8...|2017-07-05|

I have tried the below, but I receive an error:

df1 = df[:99]
df2 = df[100:499]


TypeError: unexpected item type: <type 'slice'>

asked Feb 20 '18 12:02

Data_101

1 Answer

Initially I misunderstood and thought you wanted to slice the columns. If you want to select a subset of rows, one method is to create an index column using monotonically_increasing_id(). From the docs:

The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
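
To see why the IDs are not consecutive, here is a minimal pure-Python sketch of the layout the Spark docs describe for the current implementation: the partition ID goes in the upper 31 bits and the record number within each partition in the lower 33 bits. The helper names `make_id` and `split_id` are made up for illustration and are not part of PySpark, and the two partitions of 250 rows each are an assumption.

```python
# Assumption: layout from the Spark docs for the current implementation,
# partition ID in the upper 31 bits, record number in the lower 33 bits.
RECORD_BITS = 33


def make_id(partition_id, record_number):
    """Build an ID the way the documented layout describes (illustrative helper)."""
    return (partition_id << RECORD_BITS) | record_number


def split_id(generated_id):
    """Recover (partition_id, record_number) from a generated ID (illustrative helper)."""
    return generated_id >> RECORD_BITS, generated_id & ((1 << RECORD_BITS) - 1)


# Hypothetical case: 500 rows spread over 2 partitions of 250 rows each
ids = [make_id(p, r) for p in range(2) for r in range(250)]

print(ids[249], ids[250])    # 249 8589934592
print(split_id(8589934841))  # (1, 249)
```

The IDs increase and never repeat, but there is a large jump at every partition boundary, so the values cannot be used directly as row positions.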

You can use this ID to sort the dataframe and subset it using limit() to ensure you get exactly the rows you want.

For example:

import pyspark.sql.functions as f
import string

# create a dummy df with 500 rows and 2 columns
N = 500
numbers = [i%26 for i in range(N)]
letters = [string.ascii_uppercase[n] for n in numbers]

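# sqlCtx is an existing SQLContext; with a SparkSession, spark.createDataFrame works the same way
df = sqlCtx.createDataFrame(
    list(zip(numbers, letters)),
    ('numbers', 'letters')
)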

# add an index column
df = df.withColumn('index', f.monotonically_increasing_id())

# sort ascending and take first 100 rows for df1
df1 = df.sort('index').limit(100)

# sort descending and take the remaining N - 100 = 400 rows for df2
df2 = df.sort('index', ascending=False).limit(N - 100)

Just to verify that this did what you wanted:

df1.count()
#100
df2.count()
#400

Also we can verify that the index column doesn't overlap:

df1.select(f.min('index').alias('min'), f.max('index').alias('max')).show()
#+---+---+
#|min|max|
#+---+---+
#|  0| 99|
#+---+---+

df2.select(f.min('index').alias('min'), f.max('index').alias('max')).show()
#+---+----------+
#|min|       max|
#+---+----------+
#|100|8589934841|
#+---+----------+

answered Sep 20 '22 20:09

pault
