
Take n rows from a spark dataframe and pass to toPandas()

I have this code:

l = [('Alice', 1), ('Jim', 2), ('Sandra', 3)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df.withColumn('age2', df.age + 2).toPandas()

Works fine, does what it needs to. Suppose though I only want to display the first n rows, and then call toPandas() to return a pandas dataframe. How do I do it? I can't call take(n) because that doesn't return a dataframe and thus I can't pass it to toPandas().

So to put it another way: how can I take the top n rows from a DataFrame and call toPandas() on the result? I can't think this is difficult, but I can't figure it out.

I'm using Spark 1.6.0.
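
For reference, here is what take(n) gives back with the same toy data (a minimal sketch): a plain Python list of Row objects, so there is no toPandas() to call on it:

rows = df.take(2)
print(type(rows))  # a plain Python list, not a DataFrame
print(rows)        # e.g. [Row(name='Alice', age=1), Row(name='Jim', age=2)]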

asked Nov 10 '16 by jamiet

People also ask

How do you select n rows in PySpark?

In Spark/PySpark you can use the show() action to display the top/first n rows (5, 10, 100, ...) of a DataFrame on the console or in a log. There are also several actions, such as take(), tail(), collect(), head(), and first(), that return the top or last n rows as a list of Rows (Array[Row] in Scala).
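
A minimal sketch of those actions, reusing the df from the question (note that tail() only exists in Spark 3.0+, so it is not available on the asker's Spark 1.6.0):

df.show(2)    # prints the first 2 rows to the console, returns None
df.take(2)    # first 2 rows as a list of Row objects
df.head(2)    # same as take(2)
df.first()    # the first Row
df.collect()  # all rows as a list of Row objects
# df.tail(2)  # last 2 rows, Spark 3.0+ only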

How do you filter rows in PySpark DataFrame?

The PySpark filter() function is used to filter rows from an RDD/DataFrame based on a given condition or SQL expression. You can also use the where() clause instead of filter() if you are coming from an SQL background; both functions behave exactly the same.

How do you filter rows in Spark?

Spark's filter() or where() function filters rows from a DataFrame or Dataset based on one or more conditions or a SQL expression. You can use the where() operator instead of filter() if you are coming from an SQL background; both functions behave exactly the same.
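
A minimal PySpark sketch of both forms, reusing the df from the question:

from pyspark.sql.functions import col

df.filter(col('age') > 1).show()  # keep rows where age > 1
df.where(col('age') > 1).show()   # where() is an alias for filter()
df.filter("age > 1").show()       # a SQL expression string works too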


2 Answers

You can use the limit(n) function, which (unlike take(n)) returns a new DataFrame:

l = [('Alice', 1), ('Jim', 2), ('Sandra', 3)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df.limit(2).withColumn('age2', df.age + 2).toPandas()

Or:

l = [('Alice', 1), ('Jim', 2), ('Sandra', 3)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df.withColumn('age2', df.age + 2).limit(2).toPandas()
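
Either way, limit(2) is a transformation that yields a two-row DataFrame, so toPandas() can be chained directly; calling limit() before withColumn() just means the derived column only needs to be computed for the rows that are kept.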
answered Oct 02 '22 by Neo



You could get the first rows of the Spark DataFrame with head(n), which returns a list of Row objects, and build a pandas DataFrame from that list:

import pandas as pd

l = [('Alice', 1), ('Jim', 2), ('Sandra', 3)]
df = sqlContext.createDataFrame(l, ['name', 'age'])

df_pandas = pd.DataFrame(df.head(3), columns=df.columns)

In [4]: df_pandas
Out[4]:
     name  age
0   Alice    1
1     Jim    2
2  Sandra    3
answered Oct 02 '22 by Anton Protopopov