In Pandas DataFrame, I can use DataFrame.isin()
function to match the column values against another column.
For example: suppose we have one DataFrame:
df_A = pd.DataFrame({'col1': ['A', 'B', 'C', 'B', 'C', 'D'],
'col2': [1, 2, 3, 4, 5, 6]})
df_A
col1 col2
0 A 1
1 B 2
2 C 3
3 B 4
4 C 5
5 D 6
and another DataFrame:
df_B = pd.DataFrame({'col1': ['C', 'E', 'D', 'C', 'F', 'G', 'H'],
'col2': [10, 20, 30, 40, 50, 60, 70]})
df_B
col1 col2
0 C 10
1 E 20
2 D 30
3 C 40
4 F 50
5 G 60
6 H 70
I can use .isin()
function to match the column values of df_B
against the column values of df_A
E.g.:
df_B[df_B['col1'].isin(df_A['col1'])]
yields:
col1 col2
0 C 10
2 D 30
3 C 40
What's the equivalent operation in PySpark DataFrame?
df_A = pd.DataFrame({'col1': ['A', 'B', 'C', 'B', 'C', 'D'],
'col2': [1, 2, 3, 4, 5, 6]})
df_A = sqlContext.createDataFrame(df_A)
df_B = pd.DataFrame({'col1': ['C', 'E', 'D', 'C', 'F', 'G', 'H'],
'col2': [10, 20, 30, 40, 50, 60, 70]})
df_B = sqlContext.createDataFrame(df_B)
df_B[df_B['col1'].isin(df_A['col1'])]
The .isin()
code above gives me an error messages:
u'resolved attribute(s) col1#9007 missing from
col1#9012,col2#9013L in operator !Filter col1#9012 IN
(col1#9007);;\n!Filter col1#9012 IN (col1#9007)\n+-
LogicalRDD [col1#9012, col2#9013L]\n'
Timestamp difference in PySpark can be calculated by using 1) unix_timestamp() to get the Time in seconds and subtract with other time to get the seconds 2) Cast TimestampType column to LongType and subtract two long values to get the difference in seconds, divide it by 60 to get the minute difference and finally ...
In Spark use isin() function of Column class to check if a column value of DataFrame exists/contains in a list of string values.
Computes basic statistics for numeric and string columns.
regexp_replace is a string function that is used to replace part of a string (substring) value with another string on DataFrame column by using gular expression (regex). This function returns a org. apache. spark.
This kind of operation is called left semi join in spark:
df_B.join(df_A, ['col1'], 'leftsemi')
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With