I have a CSV file from the Kaggle Titanic competition as follows. The record format of this file is described by the following columns: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked. I want to analyze the data in this file and check whether passengers traveling in a group had a better survival rate. For this I assume that the value for Ticket will be the same for all passengers in a group.
I loaded the CSV in MS Access, and executed the following query to get the desired result set:
SELECT a.Ticket, a.PassengerId, a.Survived
FROM train a
WHERE 1 < (SELECT COUNT(*) FROM train b WHERE b.Ticket = a.Ticket)
ORDER BY a.Ticket
I am not being able to extract the same result set as above, without writing a loop.
Let's see if this matches:
df.groupby(['Ticket']).filter(lambda x: x.Ticket.count()>1)[['Ticket','PassengerId','Survived']]
Or with Jezrael's suggestion:
df.groupby(['Ticket']).filter(lambda x: len(x)>1)[['Ticket','PassengerId','Survived']]
I am using groupby
on Tickets then filtering my dataframe to those records where the count in that ticket group is greater than 1, using filter
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With