Take n rows from a spark dataframe and pass to toPandas()

Tags:

python

apache-spark-sql

spark-dataframe

I have this code:

l = [('Alice', 1), ('Jim', 2), ('Sandra', 3)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df.withColumn('age2', df.age + 2).toPandas()

This works fine and does what it needs to. Suppose, though, that I only want the first n rows, and then want to call toPandas() to get a pandas dataframe. How do I do that? I can't call take(n), because that returns a list of rows rather than a dataframe, so I can't call toPandas() on the result.

So to put it another way, how can I take the top n rows from a dataframe and call toPandas() on the resulting dataframe? I can't imagine this is difficult, but I can't figure it out.

I'm using Spark 1.6.0.

737

asked Nov 10 '16 22:11

jamiet

2 Answers

You can use the limit(n) function:

l = [('Alice', 1), ('Jim', 2), ('Sandra', 3)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df.limit(2).withColumn('age2', df.age + 2).toPandas()

Or:

l = [('Alice', 1), ('Jim', 2), ('Sandra', 3)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df.withColumn('age2', df.age + 2).limit(2).toPandas()
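
As a rough sketch, the limit-then-convert pattern can be wrapped in a small helper. The name top_n_to_pandas is hypothetical and not part of Spark. It only assumes that df is a Spark DataFrame, whose limit(n) returns another DataFrame (unlike take(n), which returns a list of rows), so toPandas() can still be called on the result.

```python
def top_n_to_pandas(df, n):
    """Return the first n rows of a Spark DataFrame as a pandas DataFrame.

    Hypothetical helper for illustration only. It relies on two DataFrame
    methods: limit(n), which yields another Spark DataFrame, and
    toPandas(), which collects that DataFrame to the driver.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    # Limit first so that only n rows are collected to the driver.
    return df.limit(n).toPandas()


# Example call, assuming df is the DataFrame built in the answer above:
# top_n_to_pandas(df.withColumn('age2', df.age + 2), 2)
```

Because only the limited rows are collected, this should stay cheap even when the underlying DataFrame is large.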

103

answered Oct 02 '22 20:10

Neo

You could get the first rows of the Spark DataFrame with head and then create a pandas DataFrame from them:

import pandas as pd

l = [('Alice', 1), ('Jim', 2), ('Sandra', 3)]
df = sqlContext.createDataFrame(l, ['name', 'age'])

df_pandas = pd.DataFrame(df.head(3), columns=df.columns)

In [4]: df_pandas
Out[4]:
     name  age
0   Alice    1
1     Jim    2
2  Sandra    3

answered Oct 02 '22 21:10

Anton Protopopov
