Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

how to perform the equivalent of a correlated subquery in pandas

Tags:

python

pandas

I have a CSV file from the Kaggle Titanic competition as follows. The record format of this file is described by the following columns: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked. I want to analyze the data in this file and check whether passengers traveling in a group had a better survival rate. For this I assume that the value for Ticket will be the same for all passengers in a group.

I loaded the CSV in MS Access, and executed the following query to get the desired result set:

SELECT a.Ticket, a.PassengerId, a.Survived
FROM train a
WHERE 1 < (SELECT COUNT(*) FROM train b WHERE b.Ticket = a.Ticket)
ORDER BY a.Ticket

I am not being able to extract the same result set as above, without writing a loop.

like image 853
Sudipto Chatterjee Avatar asked Oct 17 '22 13:10

Sudipto Chatterjee


1 Answers

Let's see if this matches:

df.groupby(['Ticket']).filter(lambda x: x.Ticket.count()>1)[['Ticket','PassengerId','Survived']]

Or with Jezrael's suggestion:

df.groupby(['Ticket']).filter(lambda x: len(x)>1)[['Ticket','PassengerId','Survived']]

I am using groupby on Tickets then filtering my dataframe to those records where the count in that ticket group is greater than 1, using filter.

like image 177
Scott Boston Avatar answered Oct 20 '22 10:10

Scott Boston