I have around 2 billion key-value pairs and I want to load them into Redis efficiently. I am currently using Python with pipelining, as documented by redis-py. How can I speed up the following approach?
import redis

def load(pdt_dict):
    """
    Load data into redis.

    Parameters
    ----------
    pdt_dict : Dict[str, str]
        To be stored in Redis
    """
    redIs = redis.Redis()
    pipe = redIs.pipeline()
    for key in pdt_dict.keys():
        pipe.hmset(self.seller + ":" + str(key), pdt_dict[key])
    pipe.execute()
First, regarding scale: Redis can hold up to 2^32 keys per instance and has been tested in practice with at least 250 million keys, so 2 billion keys is within the theoretical limit - in practice, your limit is the memory available on your system. Every hash, list, set, and sorted set can likewise hold up to 2^32 elements. Keys and string values can be binary data of any kind (you could store a JPEG image in a value, for instance), but neither may exceed 512 MB; to reduce memory usage and keep key lookups fast, keep individual keys short, ideally under 1 KB.
A few points regarding the question and sample code.
Pipelining isn't a silver bullet - you need to understand what it does before you use it. Pipelining batches several operations and sends them in bulk, together with their responses from the server. What you gain is that the network round trip per operation is replaced by a single round trip for the whole batch. But unbounded batches are a real drain on resources - you need to keep them small enough to be effective. As a rule of thumb, I usually aim for about 60 KB per pipeline, and since every dataset is different, so is the number of operations that fits in one. Assuming your key and value are ~1 KB each, you'd call pipeline.execute() every 60 operations or so.
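Here is a minimal sketch of that batching (the names data, load_batched, BATCH_SIZE, and the "seller" key prefix are illustrative, not from the question; I use SET rather than HMSET here since the docstring declares Dict[str, str] - see the next point):

import redis

# Tune BATCH_SIZE so each batch stays around 60 KB of payload.
BATCH_SIZE = 60

def load_batched(data):
    r = redis.Redis()
    pipe = r.pipeline(transaction=False)  # plain pipelining, no MULTI/EXEC
    for i, (key, value) in enumerate(data.items(), start=1):
        pipe.set("seller:" + str(key), value)
        if i % BATCH_SIZE == 0:
            pipe.execute()  # flush this batch to the server
    pipe.execute()  # flush whatever is left over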
Unless I grossly misunderstand, this code shouldn't run. You're using HMSET as if it were SET, so you're missing the field->value mapping that Hashes require (and self.seller is referenced inside a plain function, where self is undefined). Hashes (HMSET) and Strings (SET) are different data types and should therefore be used accordingly.
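To illustrate the difference (the keys and fields below are hypothetical; note also that recent redis-py versions deprecate hmset() in favor of hset() with a mapping argument):

import redis

r = redis.Redis()

# A String holds a single value under one key:
r.set("seller:123", "some serialized value")

# A Hash holds a field -> value mapping under one key:
r.hset("seller:123:attrs", mapping={"name": "widget", "price": "9.99"})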
It appears as if this one little loop is in charge of the entire two billion pairs - if that is the case, not only will the server running the code swap like crazy unless it has a lot of RAM to hold the dictionary, it will also be very ineffective (regardless of Python's speed). You need to parallelize the data insertion by running multiple instances of this process, along the lines of the sketch below.
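One hypothetical way to do that from a single machine (the chunking scheme, worker count, and key prefix are all assumptions, not from the original answer):

import redis
from multiprocessing import Pool

def load_chunk(items):
    # Each worker process gets its own connection and pipeline.
    r = redis.Redis()
    pipe = r.pipeline(transaction=False)
    for i, (key, value) in enumerate(items, start=1):
        pipe.set("seller:" + str(key), value)
        if i % 60 == 0:  # keep batches small, per the rule of thumb above
            pipe.execute()
    pipe.execute()

def parallel_load(items, workers=8):
    # Deal the (key, value) pairs round-robin into one chunk per worker.
    chunks = [items[i::workers] for i in range(workers)]
    with Pool(workers) as pool:
        pool.map(load_chunk, chunks)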
Are you connecting to Redis remotely? If so, the network may be limiting your performance.
Consider your Redis server's settings - perhaps these can be tweaked/tuned for better performance on this task, assuming it is indeed a bottleneck; one common example is sketched below.
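For instance (this is my assumption, not something the question establishes), persistence is a common knob during bulk loads - temporarily disabling snapshotting avoids fork-heavy background saves while you write. Verify this against your durability requirements before applying it:

import redis

r = redis.Redis()
r.config_set("save", "")          # disable RDB snapshots during the load
r.config_set("appendonly", "no")  # disable AOF during the load
# ... run the bulk load ...
r.config_set("save", "900 1 300 10 60 10000")  # restore (example policy)
r.config_set("appendonly", "yes")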