
pandas - concat with columns of same categories turns to object

Tags: python, pandas

I want to concatenate two dataframes with category-type columns, by first adding the missing categories to each column.

import pandas as pd

df = pd.DataFrame({"a": pd.Categorical(["foo", "foo", "bar"]), "b": [1, 2, 1]})
df2 = pd.DataFrame({"a": pd.Categorical(["baz"]), "b": [1]})

df["a"] = df["a"].cat.add_categories("baz")
df2["a"] = df2["a"].cat.add_categories(["foo", "bar"])

In theory, the categories of both "a" columns are now the same, although they are listed in a different order:

In [33]: df.a.cat.categories
Out[33]: Index(['bar', 'foo', 'baz'], dtype='object')

In [34]: df2.a.cat.categories
Out[34]: Index(['baz', 'foo', 'bar'], dtype='object')
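A quick check makes the difference explicit. This is only a sketch built on the frames above; the names `cats1`, `cats2`, `same_set` and `same_order` are illustrative and not part of pandas:

```python
import pandas as pd

df = pd.DataFrame({"a": pd.Categorical(["foo", "foo", "bar"]), "b": [1, 2, 1]})
df2 = pd.DataFrame({"a": pd.Categorical(["baz"]), "b": [1]})

# Add the categories each column is missing, as in the question.
df["a"] = df["a"].cat.add_categories(["baz"])
df2["a"] = df2["a"].cat.add_categories(["foo", "bar"])

cats1 = list(df["a"].cat.categories)   # ['bar', 'foo', 'baz']
cats2 = list(df2["a"].cat.categories)  # ['baz', 'foo', 'bar']

same_set = set(cats1) == set(cats2)    # same labels, ignoring order
same_order = cats1 == cats2            # same labels in the same positions

print(same_set, same_order)  # True False
```

So the two columns agree on which categories exist, but not on how they are ordered.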

However, when concatenating the two dataframes, I get an object-type "a" column:

In [35]: pd.concat([df, df2]).info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 0
Data columns (total 2 columns):
a    4 non-null object
b    4 non-null int64
dtypes: int64(1), object(1)
memory usage: 96.0+ bytes

The documentation says that concatenating columns with the same categories should result in a category-type column. Does the order of the categories matter even though the categorical is unordered? I am using pandas 0.20.3.

paljenczy asked Aug 11 '17 12:08



1 Answer

Yes, the order matters here. With reorder_categories you can change the order of the categories, even though the categorical itself is unordered, so that df2 lists them in the same order as df:

df2["a"] = df2.a.cat.reorder_categories(df.a.cat.categories)

In [43]: pd.concat([df, df2]).info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 0
Data columns (total 2 columns):
a    4 non-null category
b    4 non-null int64
dtypes: category(1), int64(1)
memory usage: 172.0 bytes
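
For reference, here is the whole sequence as one self-contained sketch. It assumes the same frames as in the question; `result` is just an illustrative variable name, and the dtype is checked directly instead of through `.info()`:

```python
import pandas as pd

df = pd.DataFrame({"a": pd.Categorical(["foo", "foo", "bar"]), "b": [1, 2, 1]})
df2 = pd.DataFrame({"a": pd.Categorical(["baz"]), "b": [1]})

# Step 1: give both columns the same set of categories.
df["a"] = df["a"].cat.add_categories(["baz"])
df2["a"] = df2["a"].cat.add_categories(["foo", "bar"])

# Step 2: make df2 list its categories in the same order as df.
df2["a"] = df2["a"].cat.reorder_categories(df["a"].cat.categories)

# Step 3: concatenate; column "a" keeps its category dtype.
result = pd.concat([df, df2])
print(result["a"].dtype)  # category
```

The values themselves are untouched by `reorder_categories`; only the order of the category labels changes.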

paljenczy answered Nov 14 '22 23:11