Let's start off with a random (reproducible) data array -
# Setup
In [11]: np.random.seed(0)
...: a = np.random.randint(0,9,(7,2))
...: a[2] = a[0]
...: a[4] = a[1]
...: a[6] = a[1]
# Check values
In [12]: a
Out[12]:
array([[5, 0],
[3, 3],
[5, 0],
[5, 2],
[3, 3],
[6, 8],
[3, 3]])
# Check its itemsize
In [13]: a.dtype.itemsize
Out[13]: 8
Let's view each row as a scalar using custom datatype that covers two elements. We will use void-dtype
for this purpose. As mentioned in the docs -
https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.dtypes.html#specifying-and-constructing-data-types, https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.interface.html#arrays-interface) and in stackoverflow Q&A, it seems that would be -
In [23]: np.dtype((np.void, 16)) # 8 is the itemsize, so 8x2=16
Out[23]: dtype('V16')
# Create new view of the input
In [14]: b = a.view('V16').ravel()
# Check new view array
In [15]: b
Out[15]:
array([b'\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
b'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00',
b'\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
b'\x05\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00',
b'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00',
b'\x06\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00',
b'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00'],
dtype='|V16')
# Use pandas.factorize on the new view
In [16]: pd.factorize(b)
Out[16]:
(array([0, 1, 0, 0, 1, 2, 1]),
array(['\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00',
'\x06\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00'],
dtype=object))
Two things off factorize's output that I could not understand and hence the follow-up questions -
The fourth element of the first output (=0) looks wrong, because it has same ID as the third element, but in b
, the fourth and third elements are different. Why so?
Why does the second output has an object dtype, while the dtype of b
was V16
. Is this also causing the wrong value mentioned in 1.
?
A bigger question could be - Does pandas.factorize
cover custom datatypes? From docs, I see :
values : sequence A 1-D sequence. Sequences that aren’t pandas objects are coerced to ndarrays before factorization.
In the provided sample case, we have a NumPy array, so one would assume no issues with the input, unless the docs didn't clarify about the custom datatype part?
System setup : Ubuntu 16.04, Python : 2.7.12, NumPy : 1.16.2, Pandas : 0.24.2.
On Python-3.x
System setup : Ubuntu 16.04, Python : 3.5.2, NumPy : 1.16.2, Pandas : 0.24.2.
Running the same setup, I get -
In [18]: b
Out[18]:
array([b'\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
b'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00',
b'\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
b'\x05\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00',
b'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00',
b'\x06\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00',
b'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00'],
dtype='|V16')
In [19]: pd.factorize(b)
Out[19]:
(array([0, 1, 0, 2, 1, 3, 1]),
array([b'\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
b'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00',
b'\x05\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00',
b'\x06\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00'],
dtype=object))
So, the first output off factorize
looks alright here. But, the second output has object dtype again, different from the input. So, the same question - Why this dtype change?
With such a custom datatype :
Why wrong labels
, uniques
and different uniques
dtype on Python2.x?
Why different uniques
dtype on Python3.x?
As for why V16
is coerced to object
, many functions in pandas
convert data to one of the data types the internal functions can handle, here. If the data type is not in the list, it becomes an object – and pandas doesn't convert the result back into the original dtype, it appears.
Regarding the discrepancy between Python 2 and Python 3: There's only one pandas codebase for both, so why do they give different results?
Turns out that Python 2 uses the string type (which are just arrays of bytes) to represent your data¹, and Python 3 the bytes type. The effect of this is that Python 2 uses a StringHashTable
for the factorization and Python 3 uses a PyObjectHashTable
, and the StringHashTable
gives incorrect results in your case. I believe that this is because the strings in the StringHashTable
are assumed to be zero-terminated, which is not the case for your strings – and indeed, if you only compare the rows up to the first zero byte, the first and fourth row look identical.
Conclusion: It's a bug, and we should probably file an issue for it.
¹ More detail: This call to ensure_object
returns an array of strings in Python 2, but an array of bytes in Python 3 (as can be seen by the b
prefix). Correspondingly, the hashtable chosen here is different.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With