Merge a list of DataFrames on a column? [duplicate]

I am having trouble combining an array of DataFrames into a single DataFrame, merged on a specific column.

I have a list of DataFrames called data, where each element data[i] looks like this:

     Rank  Name
2400    1 name1
2401    2 name2
2402    3 name3
2403    4 name4
2404    5 name5

Each DataFrame contains a Top 5 list for a given month, and the list contains the monthly results for a year.

I would like the final, merged DataFrame to look like this:

     Rank  Name_month1 Name_month2 Name_month3 ...
2400    1        name1       name1       name1 ...
2401    2        name2       name2       name2 ...
2402    3        name3       name3       name3 ...
2403    4        name4       name4       name4 ...
2404    5        name5       name5       name5 ...

where each column, after the first, corresponds to a monthly rank.

I have no problem merging 2 DataFrames from the list, data:

pandas.merge(data[0], data[1], on='Rank', suffixes=['_month1', '_month2'])

But when I try to use filter() or chain .merge() calls, I keep running into trouble.

Any thoughts? Thanks!

alokv28 asked Sep 16 '13 22:09


1 Answer

The problem is that the first merge renames the colliding columns by adding the suffixes, so on the second merge there is no name collision and the suffixes passed to it are never used. The solution is to rename the columns manually after each merge.
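To see the behavior in isolation, here is a minimal sketch. The frame `df` and its values are made up to mirror the question, and the throwaway suffix `'_x'` is only there for illustration. pandas applies `suffixes` only to column names that appear in both inputs, so the second merge leaves the incoming `Name` column untouched:

```python
import pandas as pd

# Hypothetical stand-in for one monthly Top 5 frame from the question.
df = pd.DataFrame({'Rank': [1, 2, 3], 'Name': ['name1', 'name2', 'name3']})

# First merge: both sides have a 'Name' column, so the suffixes are applied.
first = df.merge(df, on='Rank', suffixes=('_month1', '_month2'))
print(list(first.columns))   # ['Rank', 'Name_month1', 'Name_month2']

# Second merge: the left side no longer has a plain 'Name' column,
# so nothing overlaps and the suffixes are silently ignored.
second = first.merge(df, on='Rank', suffixes=('_x', '_month3'))
print(list(second.columns))  # ['Rank', 'Name_month1', 'Name_month2', 'Name']
```

The third column comes through as plain `Name`, which is why it has to be renamed by hand.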

In [2]: df
Out[2]:       Rank   Name
        2400     1  name1
        2401     2  name2
        2402     3  name3
        2403     4  name4
        2404     5  name5
In [3]: df.merge(
            df, on='Rank', suffixes=['_month1', '_month2']
        ).merge(df, on='Rank').rename(
            columns={'Name': 'Name_month3'}
        ).merge(df, on='Rank').rename(
            columns={'Name': 'Name_month4'}
        )
Out[3]:    Rank Name_month1 Name_month2 Name_month3 Name_month4
        0     1       name1       name1       name1       name1
        1     2       name2       name2       name2       name2
        2     3       name3       name3       name3       name3
        3     4       name4       name4       name4       name4
        4     5       name5       name5       name5       name5

If you have a list of DataFrames, just do:

In [4]: data = [df, df, df, df]
        current = data[0].rename(columns={'Name': 'Name_month1'})
        for i, frame in enumerate(data[1:], 2):
            current = current.merge(frame, on='Rank').rename(
                         columns={'Name': 'Name_month%d' % i})
        current
Out[4]:    Rank Name_month1 Name_month2 Name_month3 Name_month4
        0     1       name1       name1       name1       name1
        1     2       name2       name2       name2       name2
        2     3       name3       name3       name3       name3
        3     4       name4       name4       name4       name4
        4     5       name5       name5       name5       name5
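An alternative sketch, not part of the original answer: rename the `Name` column in every frame first, then fold the list with `functools.reduce`. Since each frame already has a unique column name, no suffixes are involved. The sample `df` and `data` below are assumed stand-ins for the monthly frames:

```python
from functools import reduce

import pandas as pd

# Hypothetical stand-ins for the monthly frames; the values are made up.
df = pd.DataFrame({'Rank': [1, 2, 3, 4, 5],
                   'Name': ['name1', 'name2', 'name3', 'name4', 'name5']})
data = [df, df, df, df]

# Give every frame a unique column name up front, so no suffixes are needed.
renamed = [
    frame.rename(columns={'Name': 'Name_month%d' % i})
    for i, frame in enumerate(data, 1)
]

# Fold the list into one frame, merging on 'Rank' at each step.
merged = reduce(lambda left, right: left.merge(right, on='Rank'), renamed)
print(merged)
```

This produces the same columns as the loop above (`Rank`, then `Name_month1` through `Name_month4`). Which form to use is mostly a matter of taste.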
Viktor Kerkez answered Sep 19 '22 00:09