Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

pandas: get rows by comparing two columns of dataframe to list of tuples

Tags:

python

pandas

Say I have a pandas DataFrame with four columns: A,B,C,D.

my_df = pd.DataFrame({'A': [0,1,4,9], 'B': [1,7,5,7],'C':[1,1,1,1],'D':[2,2,2,2]})

I also have a list of tuples:

my_tuples = [(0,1),(4,5),(9,9)]

I want to keep only the rows of the dataframe where the value of (my_df['A'],my_df['B']) is equal to one of the tuples in my_tuples.

In this example, this would be row#0 and row#2.

Is there a good way to do this? I'd appreciate any help.

like image 592
abra Avatar asked Mar 17 '20 10:03

abra


People also ask

How do I compare two rows in a DataFrame Pandas?

During data analysis, one might need to compute the difference between two rows for comparison purposes. This can be done using pandas. DataFrame. diff() function.

How do I combine two columns in tuple Pandas?

To combine two columns as a single column of tuples, use the DataFrame's apply(tuple) method.

How to select multiple columns in pandas Dataframe?

Pandas is one of those packages and makes importing and analyzing data much easier. Let’s discuss all different ways of selecting multiple columns in a pandas DataFrame. Given a dictionary which contains Employee entity as keys and list of those entity as values. Select Second to fourth column. Example 2: Select one to another columns.

How to compare the number of goals in a pandas Dataframe?

We can use the following code to compare the number of goals by row and output the winner of the match in a third column: The results of the comparison are shown in the new column called winner. Here are a few things to keep in mind when comparing two columns in a pandas DataFrame: The number of conditions and choices should be equal.

Can two DataFrames have the same column names but different rows?

As you can see, both data sets contain the same column names, but partly different rows. In this example, I’ll show how to compare two pandas DataFrames with different lengths and identify all different rows that are contained in only one of the two DataFrames.

How are the two pandas DataFrames visualized in tables 1 and 2?

After executing the previous Python programming code the two pandas DataFrames visualized in Tables 1 and 2 have been created. As you can see, both data sets contain the same column names, but partly different rows.


2 Answers

Use DataFrame.merge with DataFrame created by tuples, there is no on parameter for default interecton of all columns in both DataFrames, here A and B:

df = my_df.merge(pd.DataFrame(my_tuples, columns=['A','B']))
print (df)
   A  B  C  D
0  0  1  1  2
1  4  5  1  2

Or:

df = my_df[my_df.set_index(['A','B']).index.isin(my_tuples)]
print (df)
   A  B  C  D
0  0  1  1  2
2  4  5  1  2
like image 81
jezrael Avatar answered Oct 25 '22 19:10

jezrael


We can also use DataFrame.loc with map.

my_df.loc[list(map(lambda x: x in my_tuples, zip(my_df['A'], my_df['B']))),:]

#my_df.loc[[row in my_tuples for row in zip(my_df['A'], my_df['B'])],:]

Time comparison

%%timeit
my_df.loc[list(map(lambda x: x in my_tuples, zip(my_df['A'], my_df['B']))),:]
394 µs ± 24.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
df = my_df.merge(pd.DataFrame(my_tuples, columns=['A','B']))
3.56 ms ± 248 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


%%timeit
df = my_df[my_df.set_index(['A','B']).index.isin(my_tuples)]
3.99 ms ± 139 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
like image 43
ansev Avatar answered Oct 25 '22 20:10

ansev