Python quickly hash mutable object

Tags:

python

bytearray

I have a bytearray that I need to use as a key to a dictionary. Ideally I'd like to do this without copying memory the size of the bytearray. Is there any way to do this? Basically,

b = bytearray(b"abcd")  # some bytearray
d[bytes(b)] = x

Is there any faster way to do this? bytes(b) is an O(len(b)) operation, which is undesirable.

asked Oct 24 '12 00:10

RyanCheu

2 Answers

Any hash algorithm that actually does its job correctly will use O(len(b)) time. So the answer to "is there any faster way to do this" is no.

If your actual concern is memory usage, then you could, in principle, add a __hash__ method to a subclass of bytearray. But that's a pretty bad idea. Look what happens (this is a Python 2 session):

>>> class HashableBytearray(bytearray):
...     def __hash__(self):
...         return hash(str(self))
... 
>>> h = HashableBytearray('abcd')
>>> hash(h)
-2835746963027601024
>>> h[2] = 'z'
>>> hash(h)
-2835746963002600949

So the same object could hash to two different spots in the dictionary, which isn't supposed to happen. And it gets worse:

>>> d = dict()
>>> hb1 = HashableBytearray('abcd')
>>> hb2 = HashableBytearray('abcd')
>>> d[hb1] = 0
>>> d[hb2] = 1
>>> d
{bytearray(b'abcd'): 1}

Ok, so far, so good. The values are equal, so there should be only one item in the dictionary. Everything is working as expected. Now let's see what happens when we change hb1:

>>> hb1[2] = 'z'
>>> d[hb2] = 2
>>> d
{bytearray(b'abzd'): 1, bytearray(b'abcd'): 2}

See how even though hb2 didn't change at all, it created a new key-value pair in the dictionary this time?

Every time I passed a key to d, that key was equal to 'abcd'. But because the value of the first key changed after being added to the dictionary, Python couldn't tell that the value of the new key was the same as the old key had been when it was added. Now there are two key-value pairs in the dictionary, when there should be only one.
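The sessions above are from Python 2. For reference, here is a rough Python 3 version of the same experiment. It's only a sketch: the class name is reused from above, and hashing through `bytes(self)` is an assumption standing in for `hash(str(self))`, which would hash the repr text in Python 3.

```python
class HashableBytearray(bytearray):
    """bytearray subclass that hashes by its current contents (a bad idea)."""

    def __hash__(self):
        # Assumption: hash a bytes copy of the contents, standing in for
        # the original hash(str(self)).
        return hash(bytes(self))


d = {}
hb1 = HashableBytearray(b"abcd")
hb2 = HashableBytearray(b"abcd")

d[hb1] = 0
d[hb2] = 1                   # equal contents and equal hash: overwrites the entry
entries_before = len(d)

hb1[2] = ord("z")            # mutate the key object already stored in the dict
d[hb2] = 2                   # hb2 is unchanged, but no longer equals the stored key
entries_after = len(d)

print(entries_before, entries_after)
```

The dictionary ends up with two entries even though every key passed to it had the contents `abcd` at the time it was passed.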

This is only one of many ways that using mutable values as keys can lead to unpredictable and very wrong behavior. Just convert the bytearray to an immutable type, or work with immutable types in the first place.
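Converting to an immutable type might look like the following in Python 3. This is a minimal sketch; `payload` and `table` are illustrative names, not anything from the question.

```python
payload = bytearray(b"abcd")

table = {}
key = bytes(payload)          # one O(len(payload)) copy; the key is immutable
table[key] = "first"

payload[2] = ord("z")         # mutating the bytearray cannot touch the stored key

still_found = table[b"abcd"]                    # lookup by the original contents
changed_missing = bytes(payload) not in table   # b"abzd" was never inserted
print(still_found, changed_missing)
```

The copy is the price of safety here: once the key is a `bytes` object, nothing that later happens to `payload` can corrupt the dictionary.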

And for the inquisitive: sure, Python 2's buffer caches the first hash it computes, but that doesn't help at all. There are only two distinct key values here, so this should generate only two dict entries:

>>> a, b, c = bytearray('abcd'), bytearray('abcd'), bytearray('abzd')
>>> a_buf, b_buf, c_buf = buffer(a), buffer(b), buffer(c)
>>> d = {b_buf:1, c_buf:2}
>>> b[2] = 'z'
>>> d[a_buf] = 0

But it generates three:

>>> d
{<read-only buffer for 0x1004a2300, size -1, offset 0 at 0x100499cb0>: 1, 
 <read-only buffer for 0x1004a2420, size -1, offset 0 at 0x100499cf0>: 0, 
 <read-only buffer for 0x1004a22d0, size -1, offset 0 at 0x100499c70>: 2}
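`buffer` no longer exists in Python 3. Its closest replacement, `memoryview`, sidesteps this trap by refusing to hash a view over writable memory. A quick check, assuming CPython 3 (the variable names are illustrative):

```python
writable_view = memoryview(bytearray(b"abcd"))
try:
    hash(writable_view)
    writable_hashable = True
except ValueError:
    # A view over writable memory (such as a bytearray) cannot be hashed.
    writable_hashable = False

readonly_view = memoryview(b"abcd")
# A read-only view of plain bytes hashes like the bytes it exposes.
same_hash = hash(readonly_view) == hash(b"abcd")

print(writable_hashable, same_hash)
```

So a zero-copy view is only usable as a key when the underlying data is already immutable, which is the same conclusion as above.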

answered Sep 20 '22 19:09

senderle

If you're concerned about time, and the key that you are using is always the same object, you can use its id (in CPython, its address in memory) as the key in your dictionary:

b = bytearray(b"abcd")  # some byte array
d[id(b)] = x            # keyed by the object's identity, not its contents
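A small sketch of what keying by id does and does not give you (the names `a`, `b`, and the stored value are illustrative):

```python
a = bytearray(b"abcd")
b = bytearray(b"abcd")       # equal contents, but a different object

d = {id(a): "value for a"}

a[2] = ord("z")              # mutation does not change id(a)
found_after_mutation = d[id(a)]
equal_object_found = id(b) in d   # lookup is by identity, not by contents

print(found_after_mutation, equal_object_found)
```

One caveat: an id is only guaranteed unique while the object is alive, so keep a reference to the byte array for as long as its id is used as a key.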

If you're concerned about memory, you can use a good cryptographic hash function over your byte array, and you'll probably never get a collision (Git, for example, uses SHA-1, and there are discussions online about how likely an inadvertent SHA-1 collision is). If you're okay with that infinitesimal risk, you could:

import hashlib

b = bytearray(b"abcd")  # some byte array
d[hashlib.sha1(b).hexdigest()] = x

That takes O(n) time in the size of your byte array each time you calculate the hash. In exchange, a different byte array read in later that holds the same sequence of bytes will map to the same dictionary key.
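As a sketch of that property, with `digest_key` being a hypothetical helper name introduced only for this example:

```python
import hashlib


def digest_key(data):
    """Return a fixed-size, immutable key for a bytes-like object."""
    # hashlib accepts any bytes-like object, including a bytearray.
    return hashlib.sha1(data).hexdigest()


d = {}
first = bytearray(b"abcd")
d[digest_key(first)] = "x"

later = bytearray(b"abcd")            # a different object holding the same bytes
found = d[digest_key(later)]
key_size = len(digest_key(first))     # 40 hex characters, whatever len(data) is

print(found, key_size)
```

The dictionary only ever stores the 40-character digest string, so its memory use per key does not grow with the size of the byte array.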

And @senderle is absolutely right: you don't want to use an object that is actually mutable as a dictionary key when it is used by value (as opposed to through an immutable function of it, like id()). The hash of an object used as a dictionary key must not change; a changing hash violates an invariant that the dictionary relies on.

answered Sep 24 '22 19:09

Matt Anderson
