definition factorize: Map each unique object into a unique integer. Typically, the range of integers mapped to is from zero to the n - 1 where n is the number of unique objects. Two variations are typical as well. Type 1 is where the numbering occurs in the order in which unique objects are identified. Type 2 is where unique objects are first sorted and then the same process as in Type 1 is applied. The Setup Consider the list of tuples <code>tups</code> <pre class="prettyprint"><code>tups = [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)] </code></pre> I want to factorize this into <pre class="prettyprint"><code>[0, 1, 2, 3, 4, 1, 2] </code></pre> I know there are many ways to do this. However, I want to do this as efficiently as possible. <hr> What I've tried <code>pandas.factorize</code> and get an error... <pre class="prettyprint"><code>pd.factorize(tups)[0] --------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-84-c84947ac948c> in <module>() ----> 1 pd.factorize(tups)[0] //anaconda/envs/3.6/lib/python3.6/site-packages/pandas/core/algorithms.py in factorize(values, sort, order, na_sentinel, size_hint) 553 uniques = vec_klass() 554 check_nulls = not is_integer_dtype(original) --> 555 labels = table.get_labels(values, uniques, 0, na_sentinel, check_nulls) 556 557 labels = _ensure_platform_int(labels) pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_labels (pandas/_libs/hashtable.c:21804)() ValueError: Buffer has wrong number of dimensions (expected 1, got 2) </code></pre> <hr> Or <code>numpy.unique</code> and get incorrect result... <pre class="prettyprint"><code>np.unique(tups, return_inverse=1)[1] array([0, 1, 6, 7, 2, 3, 8, 4, 5, 9, 6, 7, 2, 3]) </code></pre> <hr> I could use either of these on the hashes of the tuples <pre class="prettyprint"><code>pd.factorize([hash(t) for t in tups])[0] array([0, 1, 2, 3, 4, 1, 2]) </code></pre> <hr> Yay! That's what I wanted... so what's the problem? First Problem Look at the performance drop from this technique <pre class="prettyprint"><code>lst = [10, 7, 4, 33, 1005, 7, 4] %timeit pd.factorize(lst * 1000)[0] 1000 loops, best of 3: 506 µs per loop %timeit pd.factorize([hash(i) for i in lst * 1000])[0] 1000 loops, best of 3: 937 µs per loop </code></pre> Second Problem Hashing is not guaranteed unique! <hr> Question What is a super fast way to factorize a list of tuples? <hr> Timing both axes are in log space <img src="https://i.stack.imgur.com/zIlh0.png" alt="enter image description here"> <code>code</code> <pre class="prettyprint"><code>from itertools import count def champ(tups): d = {} c = count() return np.array( [d[tup] if tup in d else d.setdefault(tup, next(c)) for tup in tups] ) def root(tups): return pd.Series(tups).factorize()[0] def iobe(tups): return np.unique(tups, return_inverse=True, axis=0)[1] def get_row_view(a): void_dt = np.dtype((np.void, a.dtype.itemsize * np.prod(a.shape[1:]))) a = np.ascontiguousarray(a) return a.reshape(a.shape[0], -1).view(void_dt).ravel() def diva(tups): return np.unique(get_row_view(np.array(tups)), return_inverse=1)[1] def gdib(tups): return pd.factorize([str(t) for t in tups])[0] from string import ascii_letters def tups_creator_1(size, len_of_str=3, num_ints_to_choose_from=1000, seed=None): c = len_of_str n = num_ints_to_choose_from np.random.seed(seed) d = pd.DataFrame(np.random.choice(list(ascii_letters), (size, c))).sum(1).tolist() i = np.random.randint(n, size=size) return list(zip(d, i)) results = pd.DataFrame( index=pd.Index([100, 1000, 5000, 10000, 20000, 30000, 40000, 50000], name='Size'), columns=pd.Index('champ root iobe diva gdib'.split(), name='Method') ) for i in results.index: tups = tups_creator_1(i, max(1, int(np.log10(i))), max(10, i // 10)) for j in results.columns: stmt = '{}(tups)'.format(j) setup = 'from __main__ import {}, tups'.format(j) results.set_value(i, j, timeit(stmt, setup, number=100) / 100) results.plot(title='Avg Seconds', logx=True, logy=True) </code></pre>

A simple way to do it is use a <code>dict</code> to hold previous visits: <pre class="prettyprint"><code>>>> d = {} >>> [d.setdefault(tup, i) for i, tup in enumerate(tups)] [0, 1, 2, 3, 4, 1, 2] </code></pre> If you need to keep the numbers sequential then a slight change: <pre class="prettyprint"><code>>>> from itertools import count >>> c = count() >>> [d[tup] if tup in d else d.setdefault(tup, next(c)) for tup in tups] [0, 1, 2, 3, 4, 1, 2, 5] </code></pre> Or alternatively written: <pre class="prettyprint"><code>>>> [d.get(tup) or d.setdefault(tup, next(c)) for tup in tups] [0, 1, 2, 3, 4, 1, 2, 5] </code></pre>

Initialize your list of tuples as a Series, then call <code>factorize</code>: <pre class="prettyprint"><code>pd.Series(tups).factorize()[0] [0 1 2 3 4 1 2] </code></pre>

How to I factorize a list of tuples?

Tags:

python

pandas

numpy

definition
factorize: Map each unique object into a unique integer. Typically, the range of integers mapped to is from zero to the n - 1 where n is the number of unique objects. Two variations are typical as well. Type 1 is where the numbering occurs in the order in which unique objects are identified. Type 2 is where unique objects are first sorted and then the same process as in Type 1 is applied.

The Setup
Consider the list of tuples tups

tups = [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)]

I want to factorize this into

[0, 1, 2, 3, 4, 1, 2]

I know there are many ways to do this. However, I want to do this as efficiently as possible.

What I've tried

pandas.factorize and get an error...

pd.factorize(tups)[0]

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-84-c84947ac948c> in <module>()
----> 1 pd.factorize(tups)[0]

//anaconda/envs/3.6/lib/python3.6/site-packages/pandas/core/algorithms.py in factorize(values, sort, order, na_sentinel, size_hint)
    553     uniques = vec_klass()
    554     check_nulls = not is_integer_dtype(original)
--> 555     labels = table.get_labels(values, uniques, 0, na_sentinel, check_nulls)
    556 
    557     labels = _ensure_platform_int(labels)

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_labels (pandas/_libs/hashtable.c:21804)()

ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

Or numpy.unique and get incorrect result...

np.unique(tups, return_inverse=1)[1]

array([0, 1, 6, 7, 2, 3, 8, 4, 5, 9, 6, 7, 2, 3])

I could use either of these on the hashes of the tuples

pd.factorize([hash(t) for t in tups])[0]

array([0, 1, 2, 3, 4, 1, 2])

Yay! That's what I wanted... so what's the problem?

First Problem
Look at the performance drop from this technique

lst = [10, 7, 4, 33, 1005, 7, 4]

%timeit pd.factorize(lst * 1000)[0]
1000 loops, best of 3: 506 µs per loop

%timeit pd.factorize([hash(i) for i in lst * 1000])[0]
1000 loops, best of 3: 937 µs per loop

Second Problem
Hashing is not guaranteed unique!

Question
What is a super fast way to factorize a list of tuples?

Timing
both axes are in log space

enter image description here

code

from itertools import count

def champ(tups):
    d = {}
    c = count()
    return np.array(
        [d[tup] if tup in d else d.setdefault(tup, next(c)) for tup in tups]
    )

def root(tups):
    return pd.Series(tups).factorize()[0]

def iobe(tups):
    return np.unique(tups, return_inverse=True, axis=0)[1]

def get_row_view(a):
    void_dt = np.dtype((np.void, a.dtype.itemsize * np.prod(a.shape[1:])))
    a = np.ascontiguousarray(a)
    return a.reshape(a.shape[0], -1).view(void_dt).ravel()

def diva(tups):
    return np.unique(get_row_view(np.array(tups)), return_inverse=1)[1]

def gdib(tups):
    return pd.factorize([str(t) for t in tups])[0]

from string import ascii_letters

def tups_creator_1(size, len_of_str=3, num_ints_to_choose_from=1000, seed=None):
    c = len_of_str
    n = num_ints_to_choose_from
    np.random.seed(seed)
    d = pd.DataFrame(np.random.choice(list(ascii_letters), (size, c))).sum(1).tolist()
    i = np.random.randint(n, size=size)
    return list(zip(d, i))

results = pd.DataFrame(
    index=pd.Index([100, 1000, 5000, 10000, 20000, 30000, 40000, 50000], name='Size'),
    columns=pd.Index('champ root iobe diva gdib'.split(), name='Method')
)

for i in results.index:
    tups = tups_creator_1(i, max(1, int(np.log10(i))), max(10, i // 10))
    for j in results.columns:
        stmt = '{}(tups)'.format(j)
        setup = 'from __main__ import {}, tups'.format(j)
        results.set_value(i, j, timeit(stmt, setup, number=100) / 100)

results.plot(title='Avg Seconds', logx=True, logy=True)

934

asked May 26 '17 18:05

piRSquared

2 Answers

A simple way to do it is use a dict to hold previous visits:

>>> d = {}
>>> [d.setdefault(tup, i) for i, tup in enumerate(tups)]
[0, 1, 2, 3, 4, 1, 2]

If you need to keep the numbers sequential then a slight change:

>>> from itertools import count
>>> c = count()
>>> [d[tup] if tup in d else d.setdefault(tup, next(c)) for tup in tups]
[0, 1, 2, 3, 4, 1, 2, 5]

Or alternatively written:

>>> [d.get(tup) or d.setdefault(tup, next(c)) for tup in tups]
[0, 1, 2, 3, 4, 1, 2, 5]

answered Sep 23 '22 11:09

AChampion

Initialize your list of tuples as a Series, then call factorize:

pd.Series(tups).factorize()[0]

[0 1 2 3 4 1 2]

answered Sep 23 '22 11:09

root

Related questions
                            
                                Writing a parallel loop
                            
                                Checking whether function has been called multiple times with different parameters
                            
                                Share Python code when handling multiple exceptions
                            
                                Subtract every column in dataframe with the mean of that column with Python
                            
                                Change size/alpha of markers in the legend box of matplotlib
                            
                                To_CSV unique values of a pandas column [duplicate]
                            
                                Best way to handle a keyerror in a dict
                            
                                Python - NameError: name itemgetter not defined
                            
                                How to decide the size of layers in Keras' Dense method?
                            
                                BeautifulSoup extract top-level tags only [duplicate]
                            
                                hackerrank new year chaos code optimization
                            
                                What does sys.exit really do with multiple threads?
                            
                                time complexity of random access in deque in Python [duplicate]
                            
                                Using pip on Windows installed with both python 2.7 and 3.5
                            
                                Python kernel dies for second run of PyQt5 GUI
                            
                                Use Numpy to convert rgb pixel array into grayscale [duplicate]
                            
                                Single instance of class in Python
                            
                                How to normalize a 4D numpy array?
                            
                                Color scatter plot points based on a value in third column?
                            
                                Pandas: what is a NDFrame object (and what is a non-NDFrame object)

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With