multi-column factorize in pandas

Question

The pandas factorize function assigns each unique value in a series to a sequential, 0-based index, and calculates which index each series entry belongs to.
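For reference, the single-column behaviour described above can be sketched as follows. The series `s` and its values are made up for illustration and are not part of the question.

```python
import pandas as pd

# Hypothetical example data: three distinct values, two of them repeated
s = pd.Series(['b', 'a', 'b', 'c', 'a'])

# factorize returns the integer codes and the unique values,
# with the uniques listed in order of first appearance
codes, uniques = pd.factorize(s)

print(codes.tolist())   # [0, 1, 0, 2, 1]
print(list(uniques))    # ['b', 'a', 'c']
```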

I'd like to accomplish the equivalent of pandas.factorize on multiple columns:

import pandas as pd
df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y':[1, 2, 2, 2, 2, 1]})
pd.factorize(df)[0] # would like [0, 1, 2, 2, 1, 0]

That is, I want to determine each unique tuple of values in several columns of a data frame, assign a sequential index to each, and compute which index each row in the data frame belongs to.

Factorize only works on single columns. Is there a multi-column equivalent function in pandas?
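To pin down the intended semantics, here is a minimal pure-Python sketch of the mapping described above. The helper name `factorize_rows` is hypothetical; it is not a pandas function, and it is meant as a reference for the expected result rather than a fast implementation.

```python
import pandas as pd

def factorize_rows(frame, columns):
    """Label each row by the order in which its tuple of values first appears."""
    seen = {}    # maps a tuple of column values to its 0-based label
    labels = []
    for key in frame[columns].itertuples(index=False, name=None):
        # setdefault hands out the next free label the first time a tuple is seen
        labels.append(seen.setdefault(key, len(seen)))
    return labels, list(seen)

df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})
labels, uniques = factorize_rows(df, ['x', 'y'])

print(labels)    # [0, 1, 2, 2, 1, 0]
print(uniques)   # [(1, 1), (1, 2), (2, 2)]
```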

HYRY · Accepted Answer

You first need to create an ndarray of tuples, one tuple per row. pandas.lib.fast_zip can do this very quickly in a Cython loop.

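import pandas as pd
df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})
# fast_zip packs the columns into one object array of (x, y) tuples;
# factorize then labels each distinct tuple in order of first appearance
print(pd.factorize(pd.lib.fast_zip([df.x, df.y]))[0])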

The output is:

[0 1 2 2 1 0]
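pandas.lib is an internal module and may not be importable in every pandas release. As a hedged alternative that relies only on public API, the sketch below shows two other ways to get the same labels. Neither is the method from the answer above: the first assumes a pandas version that provides GroupBy.ngroup, and the second builds the row tuples in Python, so it is likely slower than a Cython loop on large frames.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})

# Option 1: number the groups; sort=False keeps order of first appearance
by_group = df.groupby(['x', 'y'], sort=False).ngroup()

# Option 2: build one tuple per row, then factorize the resulting Series
row_tuples = df[['x', 'y']].apply(tuple, axis=1)
by_tuple = pd.factorize(row_tuples)[0]

print(by_group.tolist())   # [0, 1, 2, 2, 1, 0]
print(by_tuple.tolist())   # [0, 1, 2, 2, 1, 0]
```

Without sort=False, groupby numbers the groups by sorted key rather than by first appearance. For this particular data the two orders happen to coincide, but in general they differ.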

Tags: python, enumeration, pandas, data-cleaning

ChrisB
