
Retaining categorical dtype upon dataframe concatenation

I have two dataframes with identical column names and dtypes, similar to the following:

A             object
B             category
C             category

The categories are not identical in each of the dataframes.

When I concatenate them the usual way with pd.concat, the resulting dtypes are:

A             object
B             object
C             object

Which is the expected behaviour as per the documentation.
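
A minimal reproduction of this, as a sketch only: the frames `left` and `right` and their values are made up for illustration, and only the column name B is taken from the listing above. On the pandas versions I am aware of, the categorical dtype is dropped when the categories differ and kept when they match.

```python
import pandas as pd

# Hypothetical sample frames; their categories overlap but are not identical.
left = pd.DataFrame({"B": pd.Categorical(["dog", "cat"])})
right = pd.DataFrame({"B": pd.Categorical(["cat", "rat"])})

# Different categories: the result column is no longer categorical.
combined = pd.concat([left, right], ignore_index=True)
print(combined["B"].dtype)

# Identical categories: the categorical dtype survives concatenation.
same = pd.concat([left, left], ignore_index=True)
print(same["B"].dtype)
```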

However, I want to keep the columns categorical and take the union of their categories, so I tried applying union_categoricals to every column that is categorical in both dataframes. cdf and df are my two dataframes.

for column in df:
    if df[column].dtype.name == "category" and cdf[column].dtype.name == "category":
        print (column)
        union_categoricals([cdf[column], df[column]], ignore_order=True)

cdf = pd.concat([cdf,df])

This is still not providing me with a categorical output.

tom asked Aug 11 '17 16:08



2 Answers

I don't think this is completely obvious from the documentation, but you could do something like the following. Here's some sample data:

df1=pd.DataFrame({'x':pd.Categorical(['dog','cat'])})
df2=pd.DataFrame({'x':pd.Categorical(['cat','rat'])})

Use union_categoricals to get consistent categories across the dataframes. Check df1.x.cat.codes and df2.x.cat.codes if you need to convince yourself that this works.

from pandas.api.types import union_categoricals

uc = union_categoricals([df1.x,df2.x])
df1.x = pd.Categorical( df1.x, categories=uc.categories )
df2.x = pd.Categorical( df2.x, categories=uc.categories )

Concatenate and verify the dtype is categorical.

df3 = pd.concat([df1,df2])

df3.x.dtypes
category
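
To see why the alignment matters, here is a small sketch that compares the integer codes before and after recasting. It reuses the sample data above; the variable `codes_before` is introduced only for this illustration.

```python
import pandas as pd
from pandas.api.types import union_categoricals

df1 = pd.DataFrame({"x": pd.Categorical(["dog", "cat"])})
df2 = pd.DataFrame({"x": pd.Categorical(["cat", "rat"])})

# Before aligning, each frame numbers its own categories independently,
# so the same code can mean different values in df1 and df2.
codes_before = list(df2["x"].cat.codes)

# Recast both columns to the shared set of categories.
uc = union_categoricals([df1["x"], df2["x"]])
df1["x"] = pd.Categorical(df1["x"], categories=uc.categories)
df2["x"] = pd.Categorical(df2["x"], categories=uc.categories)

# After aligning, a given code refers to the same category in both frames.
print(list(uc.categories))
print(codes_before, list(df1["x"].cat.codes), list(df2["x"].cat.codes))
```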

As @C8H10N4O2 suggests, you could also just coerce from objects back to categoricals after concatenating. Honestly, for smaller datasets I think that's the best way to do it just because it's simpler. But for larger dataframes, using union_categoricals should be much more memory efficient.
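
The simpler route mentioned above can be sketched like this, again with the same sample data. Note that astype("category") rebuilds the categories from the concatenated values, so any unused categories or custom ordering in the inputs would be lost.

```python
import pandas as pd

df1 = pd.DataFrame({"x": pd.Categorical(["dog", "cat"])})
df2 = pd.DataFrame({"x": pd.Categorical(["cat", "rat"])})

# Concatenate first (the column loses its categorical dtype here) ...
df3 = pd.concat([df1, df2], ignore_index=True)

# ... then coerce the column back to a categorical.
df3["x"] = df3["x"].astype("category")

print(df3["x"].dtype)
print(list(df3["x"].cat.categories))
```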

JohnE answered Oct 17 '22 05:10


To complement JohnE's answer, here's a function that does the job: for every column that is categorical in all of the input dataframes, it recasts the column to the union of its categories and then concatenates the dataframes:

def concatenate(dfs):
    """Concatenate while preserving categorical columns.

    NB: We change the categories in-place for the input dataframes"""
    from pandas.api.types import union_categoricals
    import pandas as pd
    # Iterate on categorical columns common to all dfs
    for col in set.intersection(
        *[
            set(df.select_dtypes(include='category').columns)
            for df in dfs
        ]
    ):
        # Generate the union category across dfs for this column
        uc = union_categoricals([df[col] for df in dfs])
        # Change to union category for all dataframes
        for df in dfs:
            df[col] = pd.Categorical(df[col].values, categories=uc.categories)
    return pd.concat(dfs)

Note that the categorical columns of the dataframes in the input list are modified in place. Example usage:

df1=pd.DataFrame({'a': [1, 2],
                  'x':pd.Categorical(['dog','cat']),
                  'y': pd.Categorical(['banana', 'bread'])})
df2=pd.DataFrame({'x':pd.Categorical(['rat']),
                  'y': pd.Categorical(['apple'])})

concatenate([df1, df2]).dtypes
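
If modifying the inputs is not acceptable, one possible variant works on copies instead. This is only a sketch: the name `concat_preserving_categories` is made up here, and it trades extra memory for leaving the caller's dataframes untouched.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype, union_categoricals


def concat_preserving_categories(dfs):
    """Hypothetical non-mutating variant: operates on copies of the inputs."""
    dfs = [df.copy() for df in dfs]
    # Columns that are categorical in every input dataframe
    common = set.intersection(
        *[set(df.select_dtypes(include="category").columns) for df in dfs]
    )
    for col in common:
        # Shared dtype built from the union of categories for this column
        uc = union_categoricals([df[col] for df in dfs])
        dtype = CategoricalDtype(categories=uc.categories)
        for df in dfs:
            df[col] = df[col].astype(dtype)
    return pd.concat(dfs, ignore_index=True)


df1 = pd.DataFrame({"a": [1, 2],
                    "x": pd.Categorical(["dog", "cat"]),
                    "y": pd.Categorical(["banana", "bread"])})
df2 = pd.DataFrame({"x": pd.Categorical(["rat"]),
                    "y": pd.Categorical(["apple"])})

result = concat_preserving_categories([df1, df2])
print(result.dtypes)
```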

Tom Bug answered Oct 17 '22 05:10