How to speed up LabelEncoder recoding a categorical variable into integers

I have a large csv with two strings per row in this form:

g,k
a,h
c,i
j,e
d,i
i,h
b,b
d,d
i,a
d,h

I read in the first two columns and recode the strings to integers as follows:

import pandas as pd
df = pd.read_csv("test.csv", usecols=[0,1], prefix="ID_", header=None)
from sklearn.preprocessing import LabelEncoder

# Initialize the LabelEncoder.
le = LabelEncoder()
le.fit(df.values.flat)

# Convert to digits.
df = df.apply(le.transform)

This code is from https://stackoverflow.com/a/39419342/2179021.

The code works very well but is slow when df is large. I timed each step and the result was surprising to me.

  • pd.read_csv takes about 40 seconds.
  • le.fit(df.values.flat) takes about 30 seconds.
  • df = df.apply(le.transform) takes about 250 seconds.

Is there any way to speed up this last step? It feels like it should be the fastest step of them all!


More timings for the recoding step on a computer with 4GB of RAM

The answer below by maxymoo is fast but doesn't give the right answer. Taking the example csv from the top of the question, it translates it to:

   0  1
0  4  6
1  0  4
2  2  5
3  6  3
4  3  5
5  5  4
6  1  1
7  3  2
8  5  0
9  3  4

Notice that 'd' is mapped to 3 in the first column but 2 in the second.
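
To make the mismatch concrete, here is a minimal sketch that reproduces it (the DataFrame is just the example csv from the top, entered by hand):

import pandas as pd

# The ten example rows from the top of the question.
df = pd.DataFrame({'ID_0': list('gacjdibdid'), 'ID_1': list('khieihbdah')})

# Each column is encoded against its own categories, so the same letter
# can receive different codes in different columns: 'd' -> 3 in ID_0
# but 'd' -> 2 in ID_1.
print(df.apply(lambda col: col.astype('category').cat.codes))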

I tried the solution from https://stackoverflow.com/a/39356398/2179021 and get the following.

import numpy as np
df = pd.DataFrame({'ID_0':np.random.randint(0,1000,1000000), 'ID_1':np.random.randint(0,1000,1000000)}).astype(str)
df.info()
memory usage: 7.6MB
%timeit x = (df.stack().astype('category').cat.rename_categories(np.arange(len(df.stack().unique()))).unstack())
1 loops, best of 3: 1.7 s per loop

Then I increased the dataframe size by a factor of 10.

df = pd.DataFrame({'ID_0':np.random.randint(0,1000,10000000), 'ID_1':np.random.randint(0,1000,10000000)}).astype(str) 
df.info()
memory usage: 76.3+ MB
%timeit x = (df.stack().astype('category').cat.rename_categories(np.arange(len(df.stack().unique()))).unstack())
MemoryError                               Traceback (most recent call last)

This method appears to use so much RAM trying to translate this relatively small dataframe that it crashes.
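
One thing worth noting (my observation, not from the linked answer): that expression calls df.stack() twice, so at least two stacked copies of the data are alive at once. A sketch that stacks only once, which may lower the peak memory somewhat:

import numpy as np

# Stack once and reuse the result instead of calling df.stack() twice.
stacked = df.stack().astype('category')
x = stacked.cat.rename_categories(np.arange(len(stacked.cat.categories))).unstack()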

I also timed LabelEncoder with the larger dataset of 10 million rows. It runs without crashing, but the fit line alone took 50 seconds and the df.apply(le.transform) step took about 80 seconds.

How can I:

  1. Get something with roughly the speed of maxymoo's answer and roughly the memory usage of LabelEncoder, but that gives the right answer when the dataframe has two columns?
  2. Store the mapping so that I can reuse it for different data (in the way LabelEncoder allows me to)?
asked Sep 13 '16 by graffe



2 Answers

It looks like it will be much faster to use the pandas category datatype; internally this uses a hash table, whereas LabelEncoder uses a sorted search:

In [87]: df = pd.DataFrame({'ID_0':np.random.randint(0,1000,1000000), 
                            'ID_1':np.random.randint(0,1000,1000000)}).astype(str)

In [88]: le.fit(df.values.flat) 
         %time x = df.apply(le.transform)
CPU times: user 6.28 s, sys: 48.9 ms, total: 6.33 s
Wall time: 6.37 s

In [89]: %time x = df.apply(lambda x: x.astype('category').cat.codes)
CPU times: user 301 ms, sys: 28.6 ms, total: 330 ms
Wall time: 331 ms

EDIT: Here is a custom transformer class that you could use (you probably won't see this in an official scikit-learn release, since the maintainers don't want to have pandas as a dependency):

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class PandasLabelEncoder(BaseEstimator, TransformerMixin):
    def fit(self, y):
        # Record the distinct values; their order fixes the integer codes.
        self.classes_ = pd.unique(np.asarray(y))
        return self

    def transform(self, y):
        # Encode against the fitted categories; unseen values become -1.
        dtype = pd.CategoricalDtype(categories=self.classes_)
        return pd.Series(y).astype(dtype).cat.codes
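
A usage sketch (my addition, not part of the original answer): fit once on all the values so both columns share a single mapping, then pickle the fitted encoder so the mapping can be reused on later data, which covers the second requirement in the question.

import pickle

le = PandasLabelEncoder()
le.fit(df.values.ravel())          # one shared mapping for both columns
encoded = df.apply(le.transform)   # 'd' gets the same code in every column

# The fitted encoder can be stored and reapplied to new data later.
with open('encoder.pkl', 'wb') as f:
    pickle.dump(le, f)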
answered by maxymoo


I tried this with the DataFrame:

In [xxx]: import string
In [xxx]: letters = np.array([c for c in string.ascii_lowercase])
In [249]: df = pd.DataFrame({'ID_0': np.random.choice(letters, 10000000), 'ID_1':np.random.choice(letters, 10000000)})

It looks like this:

In [261]: df.head()
Out[261]: 
  ID_0 ID_1
0    v    z
1    i    i
2    d    n
3    z    r
4    x    x

In [262]: df.shape
Out[262]: (10000000, 2)

So, 10 million rows. Locally, my timings are:

In [257]: %timeit le.fit(df.values.flat)
1 loops, best of 3: 17.2 s per loop

In [258]: %timeit df2 = df.apply(le.transform)
1 loops, best of 3: 30.2 s per loop

Then I made a dict mapping letters to numbers and used pandas.Series.map:

In [263]: d = dict(zip(letters, range(26)))

In [273]: %timeit for c in df.columns: df[c] = df[c].map(d)
1 loops, best of 3: 1.12 s per loop

In [274]: df.head()
Out[274]: 
   ID_0  ID_1
0    21    25
1     8     8
2     3    13
3    25    17
4    23    23

So that might be an option. The dict just needs to have all of the values that occur in the data.
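
When the values are not known in advance, the same kind of dict can be built from the data itself; sorting the uniques mirrors LabelEncoder's ordering (a sketch, my addition):

import numpy as np
import pandas as pd

# Build the mapping from the data; np.sort mimics LabelEncoder's ordering.
uniques = np.sort(pd.unique(df.values.ravel()))
d = dict(zip(uniques, range(len(uniques))))

# A plain dict is trivial to persist (e.g. with pickle) and reuse later.
for c in df.columns:
    df[c] = df[c].map(d)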

EDIT: The OP asked what timing I have for that second option, with categories. This is what I get:

In [40]: %timeit   x=df.stack().astype('category').cat.rename_categories(np.arange(len(df.stack().unique()))).unstack()
1 loops, best of 3: 13.5 s per loop

EDIT: per the 2nd comment:

In [45]: %timeit uniques = np.sort(pd.unique(df.values.ravel()))
1 loops, best of 3: 933 ms per loop

In [46]: %timeit  dfc = df.apply(lambda x: x.astype('category', categories=uniques))
1 loops, best of 3: 1.35 s per loop
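
Note that dfc here still holds categoricals rather than integers; the integer codes come from .cat.codes. Also, newer pandas removed the categories= keyword from astype, so an equivalent modern spelling would be (a sketch, assuming a recent pandas):

import pandas as pd

# Newer pandas: pass a CategoricalDtype instead of categories= to astype.
dtype = pd.CategoricalDtype(categories=uniques)
dfc = df.apply(lambda x: x.astype(dtype))

# Extract the integer codes from the categoricals.
codes = dfc.apply(lambda x: x.cat.codes)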
answered by Dthal