
How to slice a pyspark dataframe in two row-wise

I am working in Databricks.

I have a dataframe which contains 500 rows, and I would like to create two dataframes: one containing 100 rows and the other containing the remaining 400 rows.

+--------------------+----------+
|              userid| eventdate|
+--------------------+----------+
|00518b128fc9459d9...|2017-10-09|
|00976c0b7f2c4c2ca...|2017-12-16|
|00a60fb81aa74f35a...|2017-12-04|
|00f9f7234e2c4bf78...|2017-05-09|
|0146fe6ad7a243c3b...|2017-11-21|
|016567f169c145ddb...|2017-10-16|
|01ccd278777946cb8...|2017-07-05|

I have tried the below, but I receive an error:

df1 = df[:99]
df2 = df[100:499]


TypeError: unexpected item type: <type 'slice'>
asked Feb 20 '18 by Data_101

People also ask

How do you slice data in PySpark DataFrame?

In this method, we first make a PySpark DataFrame using createDataFrame(). We then use the randomSplit() function to get two slices of the DataFrame, specifying the fraction of rows that should land in each slice. The rows are split up randomly.
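A minimal sketch of that approach (assuming a dataframe named df; the counts are only approximate because rows are assigned to each slice at random):

# split into roughly 20% / 80%; exact row counts vary from run to run
df_small, df_large = df.randomSplit([0.2, 0.8], seed=42)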

How do you slice a DataFrame row?

Slicing rows and columns by index position: when slicing by index position in pandas, the start index is included in the output, but the stop index is one step beyond the last row you want to select. So the slice [0:2] returns row 0 and row 1, but does not return row 2. The second slice [:] indicates that all columns are required.
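That rule applies to pandas rather than Spark; a small illustration (hypothetical frame pdf, not the PySpark df from the question):

import pandas as pd

pdf = pd.DataFrame({'userid': ['a', 'b', 'c'], 'eventdate': ['2017-10-09', '2017-12-16', '2017-12-04']})

# start index 0 is included, stop index 2 is excluded: returns rows 0 and 1
first_two = pdf.iloc[0:2]

# the second slice [:] selects all columns
first_two_all_cols = pdf.iloc[0:2, :]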

How do you divide DataFrame into equal parts PySpark?

Example 1: Split a dataframe using DataFrame.limit(). The idea is to build 'n' equal dataframes by taking chunks with limit(), which limits the result count to the number of rows specified.
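A rough sketch of that idea for two chunks (assuming the rows are distinct, since subtract() behaves like EXCEPT DISTINCT and drops duplicates):

# take the first n rows, then remove them from the original to get the remainder
n = 100
part1 = df.limit(n)
part2 = df.subtract(part1)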


1 Answer

Initially I misunderstood and thought you wanted to slice the columns. If you want to select a subset of rows, one method is to create an index column using monotonically_increasing_id(). From the docs:

The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.

You can use this ID to sort the dataframe and subset it using limit() to ensure you get exactly the rows you want.

For example:

import pyspark.sql.functions as f
import string

# create a dummy df with 500 rows and 2 columns
N = 500
numbers = [i%26 for i in range(N)]
letters = [string.ascii_uppercase[n] for n in numbers]

# 'spark' is the SparkSession predefined in Databricks notebooks
df = spark.createDataFrame(
    list(zip(numbers, letters)),  # list() so the zip also works on Python 3
    ('numbers', 'letters')
)

# add an index column
df = df.withColumn('index', f.monotonically_increasing_id())

# sort ascending and take first 100 rows for df1
df1 = df.sort('index').limit(100)

# sort descending and take 400 rows for df2
# (these rows come back in descending index order)
df2 = df.sort('index', ascending=False).limit(400)

Just to verify that this did what you wanted:

df1.count()
#100
df2.count()
#400

Also we can verify that the index column doesn't overlap:

df1.select(f.min('index').alias('min'), f.max('index').alias('max')).show()
#+---+---+
#|min|max|
#+---+---+
#|  0| 99|
#+---+---+

df2.select(f.min('index').alias('min'), f.max('index').alias('max')).show()
#+---+----------+
#|min|       max|
#+---+----------+
#|100|8589934841|
#+---+----------+
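If you need consecutive 0-based indices instead (so the split can be done with a simple filter), one alternative sketch numbers the rows with row_number() over a window ordered by the generated ID. Note that the un-partitioned window pulls all rows onto a single partition, so it only suits smaller dataframes:

import pyspark.sql.functions as f
from pyspark.sql import Window

# number the rows consecutively, ordered by the monotonically increasing ID
w = Window.orderBy('index')
df_numbered = df.withColumn('row_num', f.row_number().over(w) - 1)

df1 = df_numbered.filter(f.col('row_num') < 100).drop('row_num')
df2 = df_numbered.filter(f.col('row_num') >= 100).drop('row_num')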
answered Sep 20 '22 by pault