I have around 2 billion key-value pairs and I want to load them into Redis efficiently. I am currently using Python with pipelining, as documented by redis-py. How can I speed up the following approach?
import redis

def load(pdt_dict):
    """
    Load data into redis.

    Parameters
    ----------
    pdt_dict : Dict[str, str]
        To be stored in Redis
    """
    redIs = redis.Redis()
    pipe = redIs.pipeline()
    for key in pdt_dict.keys():
        pipe.hmset(self.seller + ":" + str(key), pdt_dict[key])
    pipe.execute()
First, regarding scale: Redis can hold up to 2^32 keys per instance and has been tested in practice with at least 250 million keys, so 2 billion keys is within the theoretical limit - in practice, your limit is the memory available on your system. Every hash, list, set, and sorted set can likewise hold up to 2^32 elements. Keys and string values can be binary data of any kind (you could store a JPEG image in a value, for instance), but neither may exceed 512 MB; to reduce memory usage and keep key lookups fast, keep individual keys short, ideally under 1 KB.
A few points regarding the question and sample code.
Pipelining isn't a silver bullet - you need to understand what it does before you use it. Pipelining batches several operations and sends them in bulk, together with their responses from the server. What you gain is that the network round trip per operation is replaced by a single round trip for the whole batch. But unbounded batches are a real drain on resources - you need to keep them small enough to be effective. As a rule of thumb, I usually aim for about 60 KB per pipeline, and since every dataset is different, so is the number of operations that fits in one. Assuming your key and value are ~1 KB each, you'd call pipeline.execute() every 60 operations or so.
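Here is a minimal sketch of that batching (the names data, load_batched, BATCH_SIZE, and the "seller" key prefix are illustrative, not from the question; I use SET rather than HMSET here since the docstring declares Dict[str, str] - see the next point):

import redis

# Tune BATCH_SIZE so each batch stays around 60 KB of payload.
BATCH_SIZE = 60

def load_batched(data):
    r = redis.Redis()
    pipe = r.pipeline(transaction=False)  # plain pipelining, no MULTI/EXEC
    for i, (key, value) in enumerate(data.items(), start=1):
        pipe.set("seller:" + str(key), value)
        if i % BATCH_SIZE == 0:
            pipe.execute()  # flush this batch to the server
    pipe.execute()  # flush whatever is left over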
Unless I grossly misunderstand, this code shouldn't run. You're using HMSET as if it were SET, so you're missing the field->value mapping that Hashes require (and self.seller is referenced inside a plain function, where self is undefined). Hashes (HMSET) and Strings (SET) are different data types and should therefore be used accordingly.
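To illustrate the difference (the keys and fields below are hypothetical; note also that recent redis-py versions deprecate hmset() in favor of hset() with a mapping argument):

import redis

r = redis.Redis()

# A String holds a single value under one key:
r.set("seller:123", "some serialized value")

# A Hash holds a field -> value mapping under one key:
r.hset("seller:123:attrs", mapping={"name": "widget", "price": "9.99"})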
It appears as if this one little loop is in charge of the entire two billion pairs - if that is the case, not only will the server running the code swap like crazy unless it has a lot of RAM to hold the dictionary, it will also be very ineffective (regardless of Python's speed). You need to parallelize the data insertion by running multiple instances of this process, along the lines of the sketch below.
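One hypothetical way to do that from a single machine (the chunking scheme, worker count, and key prefix are all assumptions, not from the original answer):

import redis
from multiprocessing import Pool

def load_chunk(items):
    # Each worker process gets its own connection and pipeline.
    r = redis.Redis()
    pipe = r.pipeline(transaction=False)
    for i, (key, value) in enumerate(items, start=1):
        pipe.set("seller:" + str(key), value)
        if i % 60 == 0:  # keep batches small, per the rule of thumb above
            pipe.execute()
    pipe.execute()

def parallel_load(items, workers=8):
    # Deal the (key, value) pairs round-robin into one chunk per worker.
    chunks = [items[i::workers] for i in range(workers)]
    with Pool(workers) as pool:
        pool.map(load_chunk, chunks)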
Are you connecting to Redis remotely? If so, the network may be limiting your performance.
Consider your Redis server's settings - perhaps these can be tweaked/tuned for better performance on this task, assuming it is indeed a bottleneck; one common example is sketched below.
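For instance (this is my assumption, not something the question establishes), persistence is a common knob during bulk loads - temporarily disabling snapshotting avoids fork-heavy background saves while you write. Verify this against your durability requirements before applying it:

import redis

r = redis.Redis()
r.config_set("save", "")          # disable RDB snapshots during the load
r.config_set("appendonly", "no")  # disable AOF during the load
# ... run the bulk load ...
r.config_set("save", "900 1 300 10 60 10000")  # restore (example policy)
r.config_set("appendonly", "yes")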