PySpark: match the values of a DataFrame column against another DataFrame column

Tags: python, apache-spark, pyspark

In pandas, I can use the .isin() method to match the values of one column against the values of another column.

For example: suppose we have one DataFrame:

import pandas as pd

df_A = pd.DataFrame({'col1': ['A', 'B', 'C', 'B', 'C', 'D'], 
                     'col2': [1, 2, 3, 4, 5, 6]})
df_A

    col1  col2
0    A     1
1    B     2
2    C     3
3    B     4
4    C     5
5    D     6

and another DataFrame:

df_B = pd.DataFrame({'col1': ['C', 'E', 'D', 'C', 'F', 'G', 'H'], 
                     'col2': [10, 20, 30, 40, 50, 60, 70]})
df_B

    col1  col2
0    C    10
1    E    20
2    D    30
3    C    40
4    F    50
5    G    60
6    H    70

I can use .isin() to keep only the rows of df_B whose col1 value also appears in col1 of df_A, e.g.:

df_B[df_B['col1'].isin(df_A['col1'])]

yields:

    col1  col2
0    C    10
2    D    30
3    C    40

What's the equivalent operation for a PySpark DataFrame? This is what I tried:

df_A = pd.DataFrame({'col1': ['A', 'B', 'C', 'B', 'C', 'D'], 
                     'col2': [1, 2, 3, 4, 5, 6]})
df_A = sqlContext.createDataFrame(df_A)

df_B = pd.DataFrame({'col1': ['C', 'E', 'D', 'C', 'F', 'G', 'H'], 
                     'col2': [10, 20, 30, 40, 50, 60, 70]})
df_B = sqlContext.createDataFrame(df_B)


df_B[df_B['col1'].isin(df_A['col1'])]

The .isin() code above gives me an error message:

u'resolved attribute(s) col1#9007 missing from 
col1#9012,col2#9013L in operator !Filter col1#9012 IN 
(col1#9007);;\n!Filter col1#9012 IN (col1#9007)\n+- 
LogicalRDD [col1#9012, col2#9013L]\n'

asked Mar 02 '17 02:03

cwl

1 Answer

This kind of operation is called a left semi join in Spark:

df_B.join(df_A, ['col1'], 'leftsemi')
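
To see why a left semi join lines up with the pandas .isin() filter, here is a minimal plain-Python sketch of the semantics, using the rows from the question. It does not use Spark at all; left_semi_join and inner_join are hypothetical helper names made up for this illustration. The point is that a semi join keeps each left-hand row at most once and returns only the left-hand columns, while an inner join would repeat the 'C' rows because 'C' occurs twice in df_A.

```python
# Plain-Python illustration of left semi join semantics (no Spark involved).
# The helper names below are hypothetical, chosen only for this sketch.

rows_a = [("A", 1), ("B", 2), ("C", 3), ("B", 4), ("C", 5), ("D", 6)]
rows_b = [("C", 10), ("E", 20), ("D", 30), ("C", 40),
          ("F", 50), ("G", 60), ("H", 70)]


def left_semi_join(left, right, key=0):
    """Keep each left row whose key appears in right; left columns only."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] in right_keys]


def inner_join(left, right, key=0):
    """Pair every left row with every right row that has the same key."""
    return [(l, r) for l in left for r in right if l[key] == r[key]]


semi = left_semi_join(rows_b, rows_a)
inner = inner_join(rows_b, rows_a)

print(semi)        # [('C', 10), ('D', 30), ('C', 40)]
print(len(inner))  # 5, because each 'C' row in rows_b matches two rows in rows_a
```

The semi join result contains the same rows as the pandas output in the question. Note that Spark makes no guarantee about row order, so the rows returned by the join may come back in a different order than shown here.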

answered Sep 16 '22 20:09

Mariusz
