I want to concatenate two dataframes with category-type columns, by first adding the missing categories to each column.
import pandas as pd
df = pd.DataFrame({"a": pd.Categorical(["foo", "foo", "bar"]), "b": [1, 2, 1]})
df2 = pd.DataFrame({"a": pd.Categorical(["baz"]), "b": [1]})
df["a"] = df["a"].cat.add_categories("baz")
df2["a"] = df2["a"].cat.add_categories(["foo", "bar"])
In theory the categories for both "a" columns are the same:
In [33]: df.a.cat.categories
Out[33]: Index(['bar', 'foo', 'baz'], dtype='object')
In [34]: df2.a.cat.categories
Out[34]: Index(['baz', 'foo', 'bar'], dtype='object')
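To make the difference between the two outputs explicit, here is a small check, assuming the same setup as above. The names cats1, cats2, same_members and same_order are only for illustration:

```python
import pandas as pd

df = pd.DataFrame({"a": pd.Categorical(["foo", "foo", "bar"]), "b": [1, 2, 1]})
df2 = pd.DataFrame({"a": pd.Categorical(["baz"]), "b": [1]})
df["a"] = df["a"].cat.add_categories("baz")
df2["a"] = df2["a"].cat.add_categories(["foo", "bar"])

cats1 = df["a"].cat.categories
cats2 = df2["a"].cat.categories

# Same members when compared as sets ...
same_members = set(cats1) == set(cats2)
# ... but not the same sequence, because add_categories appends at the end
same_order = list(cats1) == list(cats2)

print(same_members, same_order)
```

So the two columns hold the same set of categories, just in a different order.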
However, when concatenating the two dataframes, I get an object-type "a" column:
In [35]: pd.concat([df, df2]).info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 0
Data columns (total 2 columns):
a 4 non-null object
b 4 non-null int64
dtypes: int64(1), object(1)
memory usage: 96.0+ bytes
In the documentation it says that when categories are the same, it should result in a category-type column. Does the order of the categories matter even though the category is unordered? I am using pandas-0.20.3.
Yes. By using reorder_categories you can change the order of the categories, even though the categorical itself is unordered.
df2["a"] = df2.a.cat.reorder_categories(df.a.cat.categories)
In [43]: pd.concat([df, df2]).info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 0
Data columns (total 2 columns):
a 4 non-null category
b 4 non-null int64
dtypes: category(1), int64(1)
memory usage: 172.0 bytes
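As a variant (not what the answer above does), the add_categories and reorder_categories steps can probably be folded into one by giving both columns the same category list with set_categories. A minimal sketch, assuming the original df and df2 from the question; the name shared and the use of ignore_index=True are my own choices for illustration:

```python
import pandas as pd

df = pd.DataFrame({"a": pd.Categorical(["foo", "foo", "bar"]), "b": [1, 2, 1]})
df2 = pd.DataFrame({"a": pd.Categorical(["baz"]), "b": [1]})

# One shared category list: the union of both columns' categories, in a fixed order
shared = sorted(set(df["a"].cat.categories) | set(df2["a"].cat.categories))

# Give both columns exactly the same categories in exactly the same order
df["a"] = df["a"].cat.set_categories(shared)
df2["a"] = df2["a"].cat.set_categories(shared)

# ignore_index=True is optional; it just renumbers the rows 0..3
combined = pd.concat([df, df2], ignore_index=True)
print(combined["a"].dtype)
```

Since both columns end up with identical categories in identical order, the concatenated column should stay category-type, the same way it does after reorder_categories.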