Apply function to a MultiIndex dataframe with pandas/python

Tags: python, pandas

I have the following DataFrame that I wish to apply some date range calculations to. I want to select rows in the data frame where the date difference between samples for a unique person (from sample_date) is less than 8 weeks, and keep the row with the oldest date (i.e. the first sample).

Here is an example dataset. The actual dataset can exceed 200,000 records.

labno   name    sex dob         id     location  sample_date
1       John A  M   12/07/1969  12345  A         12/05/2112
2       John B  M   10/01/1964  54321  B         6/12/2010
3       James   M   30/08/1958  87878  A         30/04/2012
4       James   M   30/08/1958  45454  B         29/04/2012
5       Peter   M   12/05/1935  33322  C         15/07/2011
6       John A  M   12/07/1969  12345  A         14/05/2012
7       Peter   M   12/05/1935  33322  A         23/03/2011
8       Jack    M   5/12/1921   65655  B         15/08/2011
9       Jill    F   6/08/1986   65459  A         16/02/2012
10      Julie   F   4/03/1992   41211  C         15/09/2011
11      Angela  F   1/10/1977   12345  A         23/10/2006
12      Mark A  M   1/06/1955   56465  C         4/04/2011
13      Mark A  M   1/06/1955   45456  C         3/04/2011
14      Mark B  M   9/12/1984   55544  A         13/09/2012
15      Mark B  M   9/12/1984   55544  A         1/01/2012

Unique persons are those with the same name and dob. For example John A, James, Mark A, and Mark B are unique persons. Mark A however has different id values.

I normally use R for this procedure: I generate a list of data frames based on the name/dob combination and sort each data frame by sample_date. I then use a list apply function to check whether the difference in date between the first and last index within each data frame is less than 8 weeks and, if so, return the oldest row. It takes forever.

I would welcome a few pointers as to how I might attempt this with python/pandas. I started by making a MultiIndex with name/dob/id. The structure looks like what I want. What I need to do is try applying some of the functions I use in R to select out the rows I need. I have tried selecting with df.xs() but I am not getting very far.
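For reference, the MultiIndex setup described above can be sketched roughly as follows. The three-row excerpt and the names indexed and james are assumptions for illustration only; the real index was built on the full dataset.

```python
import pandas as pd

# Three-row excerpt of the example dataset (illustrative only).
df = pd.DataFrame({
    'labno': [3, 4, 5],
    'name': ['James', 'James', 'Peter'],
    'dob': ['30/08/1958', '30/08/1958', '12/05/1935'],
    'id': [87878, 45454, 33322],
    'sample_date': ['30/04/2012', '29/04/2012', '15/07/2011'],
})

# Build the name/dob/id MultiIndex and sort it so lookups are efficient.
indexed = df.set_index(['name', 'dob', 'id']).sort_index()

# Cross-section: every row for one name, across all dob/id values.
# The 'name' level is dropped from the result, leaving dob/id as the index.
james = indexed.xs('James', level='name')
```

A cross-section like this picks out one person at a time, which is why it does not get very far on its own for a per-person calculation over the whole frame.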

Here is a dictionary of the data that can be loaded easily into pandas (albeit with different column order).

{'dob': {0: '12/07/1969', 1: '10/01/1964', 2: '30/08/1958', 3: '30/08/1958', 4: '12/05/1935', 5: '12/07/1969', 6: '12/05/1935', 7: '5/12/1921', 8: '6/08/1986', 9: '4/03/1992', 10: '1/10/1977', 11: '1/06/1955', 12: '1/06/1955', 13: '9/12/1984', 14: '9/12/1984'}, 'id': {0: 12345, 1: 54321, 2: 87878, 3: 45454,
4: 33322, 5: 12345, 6: 33322, 7: 65655, 8: 65459, 9: 41211, 10: 12345, 11: 56465, 12: 45456, 13: 55544, 14: 55544}, 'labno': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 10: 11, 11: 12, 12: 13, 13: 14, 14: 15}, 'location': {0: 'A', 1: 'B', 2: 'A', 3: 'B', 4: 'C', 5: 'A', 6: 'A', 7: 'B', 8: 'A', 9: 'C', 10: 'A', 11: 'C', 12: 'C', 13: 'A', 14: 'A'}, 'name': {0: 'John A', 1: 'John B', 2: 'James', 3: 'James', 4: 'Peter', 5: 'John A', 6: 'Peter', 7: 'Jack', 8: 'Jill', 9: 'Julie', 10: 'Angela', 11: 'Mark A',
12: 'Mark A', 13: 'Mark B', 14: 'Mark B'}, 'sample_date': {0: '12/05/2112', 1: '6/12/2010', 2: '30/04/2012', 3: '29/04/2012', 4: '15/07/2011', 5: '14/05/2012', 6: '23/03/2011', 7: '15/08/2011', 8: '16/02/2012', 9: '15/09/2011', 10: '23/10/2006', 11: '4/04/2011', 12: '3/04/2011', 13: '13/09/2012', 14: '1/01/2012'}, 'sex': {0: 'M', 1: 'M', 2: 'M', 3: 'M', 4: 'M', 5: 'M', 6: 'M', 7: 'M', 8: 'F', 9: 'F',
10: 'F', 11: 'M', 12: 'M', 13: 'M', 14: 'M'}}
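As a minimal sketch of loading a dictionary in this shape, the snippet below builds a DataFrame from a three-row excerpt and parses the day-first date strings. The excerpt and the variable names data and df are assumptions for illustration.

```python
import pandas as pd

# A three-row excerpt using the same nested-dict layout as the data above
# (column name -> {row index -> value}); fewer columns and records.
data = {
    'name': {0: 'James', 1: 'James', 2: 'Peter'},
    'dob': {0: '30/08/1958', 1: '30/08/1958', 2: '12/05/1935'},
    'sample_date': {0: '30/04/2012', 1: '29/04/2012', 2: '15/07/2011'},
}

df = pd.DataFrame(data)

# The dates are day/month/year strings, so give the format explicitly
# rather than relying on pandas to guess the day/month order.
for col in ['dob', 'sample_date']:
    df[col] = pd.to_datetime(df[col], format='%d/%m/%Y')
```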

John asked Aug 10 '13 02:08



1 Answer

I think what you might be looking for is

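import numpy as np
import pandas as pd

# diff() needs real datetimes; the strings in the question are day-first
df['sample_date'] = pd.to_datetime(df['sample_date'], dayfirst=True)

def differ(df):
    # gap between consecutive samples within one person's group
    delta = df.sample_date.diff().abs()  # only care about magnitude
    cond = delta.notnull() & (delta < np.timedelta64(8, 'W'))
    return df[cond].max()

delta = df.groupby(['dob', 'name']).apply(differ)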

Depending on whether or not you want to keep people who don't have more than 1 sample you can call delta.dropna(how='all') to remove them.

Note that I think you'll need numpy >= 1.7 for the timedelta64 comparison to work correctly, as there are a whole host of problems with timedelta64/datetime64 for numpy < 1.7.
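If the goal is read literally as "for each person whose samples all fall within 8 weeks, keep the earliest row", one possible vectorized sketch is shown below. It is an alternative reading, not the method above, and it assumes that people with a single sample are excluded and that the span is measured from a person's first to last sample. The five-row excerpt and the names keys, span, count, close and first_samples are illustrative.

```python
import pandas as pd

# Excerpt of the question's data; column names follow the original.
df = pd.DataFrame({
    'labno': [3, 4, 5, 7, 8],
    'name': ['James', 'James', 'Peter', 'Peter', 'Jack'],
    'dob': ['30/08/1958', '30/08/1958', '12/05/1935', '12/05/1935', '5/12/1921'],
    'sample_date': ['30/04/2012', '29/04/2012', '15/07/2011', '23/03/2011', '15/08/2011'],
})
df['sample_date'] = pd.to_datetime(df['sample_date'], format='%d/%m/%Y')

keys = ['name', 'dob']  # a "unique person" is a name/dob combination
grouped = df.groupby(keys)['sample_date']

# Broadcast each person's first-to-last span and sample count onto their rows.
span = grouped.transform('max') - grouped.transform('min')
count = grouped.transform('count')

# Assumption: only people with more than one sample, all within 8 weeks.
close = (count > 1) & (span < pd.Timedelta(weeks=8))

# Among the qualifying rows, keep the earliest sample per person.
first_samples = (
    df[close]
    .sort_values('sample_date')
    .drop_duplicates(subset=keys, keep='first')
)
```

In this excerpt only James qualifies (two samples one day apart), so first_samples holds labno 4. Because it relies on transform rather than a Python-level apply per group, this form may scale better to a couple of hundred thousand rows, though that is worth measuring on the real data.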

Phillip Cloud answered Oct 18 '22 19:10