PySpark DataFrames - way to enumerate without converting to Pandas?

Tags: python, apache-spark, rdd, pyspark, bigdata

I have a very large `pyspark.sql.dataframe.DataFrame` named `df`. I need some way of enumerating its records so that I can access a record by its index (or select a group of records by a range of indexes).

In pandas, I could just do:

```
indexes=[2,3,6,7]
df[indexes]
```

Here I want something similar (and without converting the DataFrame to pandas).

The closest I can get is:

- Enumerating all the objects in the original DataFrame with:
```
indexes=np.arange(df.count())
df_indexed=df.withColumn('index', indexes)
```
- Searching for the values I need using the `where()` function.

QUESTIONS:

1. Why doesn't it work, and how can I make it work? How do I add a row to a DataFrame?
2. Would it work later to do something like this?

   ```
   indexes=[2,3,6,7]
   df1.where("index in indexes").collect()
   ```

3. Is there a faster and simpler way to deal with it?

asked Sep 24 '15 12:09

Maria Koroliuk

1 Answer

It doesn't work because:

1. The second argument to `withColumn` should be a `Column`, not a collection. A NumPy array won't work here.
2. When you pass `"index in indexes"` as a SQL expression to `where`, `indexes` is out of scope and is not resolved as a valid identifier.
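
The second point can be seen without Spark at all: the string passed to `where` is plain text, so the Python list never reaches the SQL parser unless its values are written into the string. A minimal sketch in plain Python, where `in_clause` is a hypothetical helper name (not part of PySpark) and the values are assumed to be integers:

```python
def in_clause(column, values):
    """Build a SQL 'IN' predicate string from a list of integer values."""
    # int() raises ValueError for non-numeric text, so arbitrary strings
    # cannot be spliced into the expression by accident.
    rendered = ",".join(str(int(v)) for v in values)
    return "{0} in ({1})".format(column, rendered)


indexes = [2, 3, 6, 7]

# What the question passed: the word "indexes" is just text to the SQL parser.
broken = "index in indexes"

# What the parser needs: the actual values written out.
fixed = in_clause("index", indexes)
print(fixed)  # index in (2,3,6,7)
```

A string built this way can be passed to `where` in place of the original expression.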

PySpark >= 1.4.0

~~You can add row numbers using the respective window function and query using the `Column.isin` method or a properly formatted query string:~~

```
# rowNumber was renamed to row_number in Spark 1.6
from pyspark.sql.functions import col, rowNumber
from pyspark.sql.window import Window

indexes = [2, 3, 6, 7]  # row numbers to select, as in the question

w = Window.orderBy()
indexed = df.withColumn("index", rowNumber().over(w))

# Using DSL
indexed.where(col("index").isin(set(indexes)))

# Using SQL expression
indexed.where("index in ({0})".format(",".join(str(x) for x in indexes)))
```

It looks like window functions called without a `PARTITION BY` clause move all data to a single partition, so the above may not be the best solution after all.

> Any faster and simpler way to deal with it?

Not really. Spark DataFrames don't support random row access.

A pair RDD (an RDD of key-value pairs) can be accessed using the `lookup` method, which is relatively fast if the data is partitioned using a `HashPartitioner`. There is also the indexed-rdd project, which supports efficient lookups.
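
To illustrate why a known partitioner makes `lookup` cheap, here is a toy model in plain Python. It is not Spark's implementation: `partition_by_key` and `lookup` are hypothetical names, and Python's built-in `hash` stands in for the partitioner's hash function. The point is only that a lookup has to scan the single partition the key hashes to rather than the whole data set:

```python
def partition_by_key(pairs, num_partitions):
    """Distribute (key, value) pairs into buckets by the hash of the key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


def lookup(partitions, key):
    """Return all values stored under key, scanning only one partition."""
    target = partitions[hash(key) % len(partitions)]
    return [value for k, value in target if k == key]


# (index, record) pairs, i.e. records keyed by their position.
records = ["a", "b", "c", "d", "e", "f", "g", "h"]
pairs = [(i, rec) for i, rec in enumerate(records)]

partitions = partition_by_key(pairs, num_partitions=4)
print(lookup(partitions, 6))  # ['g']
```

Without a known partitioner there is no way to tell which partition holds a key, so every partition would have to be scanned.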

Edit:

Independent of the PySpark version, you can try something like this:

```
from pyspark.sql import Row
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, LongType

row = Row("char")
row_with_index = Row("char", "index")

# sc is an existing SparkContext
df = sc.parallelize(row(chr(x)) for x in range(97, 112)).toDF()
df.show(5)

## +----+
## |char|
## +----+
## |   a|
## |   b|
## |   c|
## |   d|
## |   e|
## +----+
## only showing top 5 rows

# This part is not tested but should work and save some work later
schema = StructType(
    df.schema.fields[:] + [StructField("index", LongType(), False)])

indexed = (df.rdd  # Extract the RDD
    .zipWithIndex()  # Add an index
    .map(lambda ri: row_with_index(*list(ri[0]) + [ri[1]]))  # Map to rows
    .toDF(schema))  # It will work without the schema but will be more expensive

indexes = [2, 3, 6, 7]  # indexes to select, as in the question

# On Spark < 1.5 use inSet instead of isin
indexed.where(col("index").isin(indexes))
```
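
As a rough mental model of what `zipWithIndex` does (a plain-Python sketch, not Spark's code; `zip_with_index` is a hypothetical name): indexes follow partition order first and the order of items within each partition second, so each partition's first index equals the number of items in all earlier partitions. That is also why the sizes of the earlier partitions must be known before indexes can be assigned.

```python
from itertools import accumulate


def zip_with_index(partitions):
    """Pair every item with a global index, partition by partition."""
    # Starting offset of each partition = number of items in earlier ones.
    sizes = [len(p) for p in partitions]
    offsets = [0] + list(accumulate(sizes))[:-1]
    return [
        [(item, offset + i) for i, item in enumerate(part)]
        for part, offset in zip(partitions, offsets)
    ]


# Three "partitions" of unequal size.
data = [["a", "b", "c"], ["d"], ["e", "f", "g", "h"]]
indexed = zip_with_index(data)

# Select records by index, as the where(...) filter does.
wanted = {2, 3, 6, 7}
selected = [item for part in indexed for item, idx in part if idx in wanted]
print(selected)  # ['c', 'd', 'g', 'h']
```

Unlike the window-function approach, this scheme never needs to bring all records into one place; only the per-partition counts are shared.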

answered Sep 22 '22 06:09

zero323
