I have a data model like this: Fields: <ol> <li>counter number (e.g. 00888, 00777, 00123 etc)</li> <li>counter code (e.g. XA, XD, ZA, SI etc)</li> <li>start date (e.g. 2017-12-31 ...)</li> <li>end date (e.g. 2017-12-31 ...)</li> <li>Other counter date (e.g. xxxxx)</li> </ol> Current Datastructure organization is like this (root and multiple child format): <pre class="prettyprint"><code>counter_num + counter_code ---> start_date + end_date --> xxxxxxxx ---> start_date + end_date --> xxxxxxxx ---> start_date + end_date --> xxxxxxxx </code></pre> Example: <pre class="prettyprint"><code>00888 + XA ---> Jan 10 + Jan 20 --> xxxxxxxx ---> Jan 21 + Jan 31 --> xxxxxxxx ---> Feb 01 + Dec 31 --> xxxxxxxx 00888 + ZI ---> Jan 09 + Feb 24 --> xxxxxxxx ---> Feb 25 + Dec 31 --> xxxxxxxx 00777 + XA ---> Jan 09 + Feb 24 --> xxxxxxxx ---> Feb 25 + Dec 31 --> xxxxxxxx </code></pre> Today the retrieval happens in 2 ways: <pre class="prettyprint"><code>//Fetch unique counter data using all the composite keys counter_number + counter_code + date (start_date <= date <= end_date) //Fetch all the counter codes and corresponding data matching the below conditions counter_number + date (start_date <= date <= end_date) </code></pre> What's the best way to model this in redis as I need to cache some of the frequently hit data. I feel sorted sets should do this somehow, but unable to model it. UPDATE: Just to remove the confusion, the ask here is not for an SQL "BETWEEN" like query. 'Coz I don't know what the start_date and end_date values are. Think they are just column names. What I don't want is <pre class="prettyprint"><code>SELECT * FROM redis_db WHERE counter_num AND date_value BETWEEN start_date AND end_date </code></pre> What I want is <pre class="prettyprint"><code>SELECT * FROM redis_db WHERE counter_num AND start_date <= specifc_date AND end_date >= specific_date </code></pre> NOTE: The requirement is pretty much close to 2D indexing of what is proposed in Redis multi-dimensional indexing document https://redis.io/topics/indexes#multi-dimensional-indexes I understood the concept but unable to digest the implementation detail that is given.

I'm unlikely to get this done in time for the bounty, but what the hell... This sounds like a job for geohashing. Geohashing is what you do when you want to index a 2-dimensional (or higher) dataset. For example, if you have a database of cities and you want to be able to quickly respond to queries like "find all the cities within 50km of X", you use geohashing. For the purposes of this question, you can think of <code>start_date</code> and <code>end_date</code> as <code>x</code> and <code>y</code> coordinates. Normally in geohashing you're searching for points in your dataset near a particular point in space, or in a certain bounded region of space. In this case you just have a lower bound on one of the coordinates and an upper bound on the other one. But I suppose in practice the whole dataset is bounded anyway, so that's not a problem. It would be nice if there was a library for doing this in Redis. There probably is, if you look hard enough. The newer versions of Redis have built-in geohashing functionality. See the commands starting with <code>GEO</code>. But it doesn't claim to be very accurate, and it's designed for the surface of a sphere rather than a flat surface. So as far as I can see you have 3 options: <ul> <li>Map your search space to a small part of the sphere, preferably near the equator. Use the Redis GEO commands. To search, use <code>GEOSPHERE</code> on a circle covering the triangle you're trying to search, taking into account the inbuilt inaccuracy and the distortion you get by mapping onto the sphere, then filter the results to get the ones that are actually inside the triangle.</li> <li>Find some 3rd-party geohashing client for Redis which works on flat space and is more accurate than GEO.</li> <li>Read the rest of this answer, or some other primer on geohashing, then implement it yourself on top of Redis. This is the hardest (but most educational) option.</li> </ul> If you have a database that indexes data using a numerical ordering, such that you can do queries like "find all the rows/records for which <code>z</code> is between <code>a</code> and <code>b</code>", you can build a geohash index on top of it. Suppose the coordinates are (non-negative) integers <code>x</code> and <code>y</code>. Then you add an integer-valued column <code>z</code>, and index by <code>z</code>. To calculate <code>z</code>, write <code>x</code> and <code>y</code> in binary, then take alternate digits from each. Example: <pre class="prettyprint"><code>x = 969 = 0 1 1 1 1 0 0 1 0 0 1 y = 1130 = 1 0 0 0 1 1 0 1 0 1 0 z = 1750214 = 0110101011010011000110 </code></pre> Note that the index allows you to find, for example, all records positioned with <code>z</code> between <code>0101100000000000000000</code> and <code>0101101111111111111111</code> inclusive. In other words, all records for which <code>z</code> starts with <code>010110</code>. Or to put it another way, you can find all records for which <code>x</code> starts with <code>001</code> and <code>y</code> starts with <code>110</code>. This set of records corresponds to a square in the 2-dimensional space we are trying to search. Not all squares can be searched in this way. We'll call these ones searchable squares. Suppose the client sends a request for all records for which <code>(x,y)</code> is inside a particular rectangle. (Or a circle, or some other reasonable geometric shape.) Then you need to find a set of searchable squares which cover the rectangle. Then, for each of these squares you've chosen, query the database for records inside that square and send the results to the client. (But you'll have to filter the results, because not all the records in the square are actually in the original rectangle.) There's a balance to be struck. If you choose a small number of large special squares, you'll probably end up covering a much larger area of the map than you need; the query to the database will return lots of extra results that you'll have to filter out. Alternatively, if you use lots of little special squares, you'll be doing lots of queries to the database, many of which will return no results. I said above that <code>x</code> and <code>y</code> could be <code>start_time</code> and <code>end_time</code>. But actually the distribution of your dataset won't be as symmetrical as in most uses of geohashing. So the performance might be better (or worse) if you use <code>x = end_time + start_time</code> and <code>y = end_time - start_time</code>.

Composite indexing using Redis in a hierarchical data model

Tags:

redis

data-modeling

I have a data model like this:
Fields:

counter number (e.g. 00888, 00777, 00123 etc)
counter code (e.g. XA, XD, ZA, SI etc)
start date (e.g. 2017-12-31 ...)
end date (e.g. 2017-12-31 ...)
Other counter date (e.g. xxxxx)

Current Datastructure organization is like this (root and multiple child format):

counter_num + counter_code
       ---> start_date + end_date --> xxxxxxxx
       ---> start_date + end_date --> xxxxxxxx
       ---> start_date + end_date --> xxxxxxxx

Example:

00888 + XA
       ---> Jan 10 + Jan 20 --> xxxxxxxx
       ---> Jan 21 + Jan 31 --> xxxxxxxx
       ---> Feb 01 + Dec 31 --> xxxxxxxx

00888 + ZI
       ---> Jan 09 + Feb 24 --> xxxxxxxx
       ---> Feb 25 + Dec 31 --> xxxxxxxx

00777 + XA
       ---> Jan 09 + Feb 24 --> xxxxxxxx
       ---> Feb 25 + Dec 31 --> xxxxxxxx

Today the retrieval happens in 2 ways:

//Fetch unique counter data using all the composite keys
counter_number + counter_code + date (start_date <= date <= end_date)

//Fetch all the counter codes and corresponding data matching the below conditions
counter_number + date (start_date <= date <= end_date)

What's the best way to model this in redis as I need to cache some of the frequently hit data. I feel sorted sets should do this somehow, but unable to model it.

UPDATE:

Just to remove the confusion, the ask here is not for an SQL "BETWEEN" like query. 'Coz I don't know what the start_date and end_date values are. Think they are just column names.

What I don't want is

SELECT * FROM redis_db  
WHERE counter_num AND 
date_value BETWEEN start_date AND end_date

What I want is

SELECT * FROM redis_db
WHERE counter_num AND
start_date <= specifc_date AND end_date >= specific_date

NOTE: The requirement is pretty much close to 2D indexing of what is proposed in Redis multi-dimensional indexing document

https://redis.io/topics/indexes#multi-dimensional-indexes

I understood the concept but unable to digest the implementation detail that is given.

474

asked May 24 '17 16:05

manikawnth

2 Answers

I'm unlikely to get this done in time for the bounty, but what the hell...

This sounds like a job for geohashing. Geohashing is what you do when you want to index a 2-dimensional (or higher) dataset. For example, if you have a database of cities and you want to be able to quickly respond to queries like "find all the cities within 50km of X", you use geohashing.

For the purposes of this question, you can think of start_date and end_date as x and y coordinates. Normally in geohashing you're searching for points in your dataset near a particular point in space, or in a certain bounded region of space. In this case you just have a lower bound on one of the coordinates and an upper bound on the other one. But I suppose in practice the whole dataset is bounded anyway, so that's not a problem.

It would be nice if there was a library for doing this in Redis. There probably is, if you look hard enough. The newer versions of Redis have built-in geohashing functionality. See the commands starting with GEO. But it doesn't claim to be very accurate, and it's designed for the surface of a sphere rather than a flat surface.

So as far as I can see you have 3 options:

Map your search space to a small part of the sphere, preferably near the equator. Use the Redis GEO commands. To search, use GEOSPHERE on a circle covering the triangle you're trying to search, taking into account the inbuilt inaccuracy and the distortion you get by mapping onto the sphere, then filter the results to get the ones that are actually inside the triangle.
Find some 3rd-party geohashing client for Redis which works on flat space and is more accurate than GEO.
Read the rest of this answer, or some other primer on geohashing, then implement it yourself on top of Redis. This is the hardest (but most educational) option.

If you have a database that indexes data using a numerical ordering, such that you can do queries like "find all the rows/records for which z is between a and b", you can build a geohash index on top of it. Suppose the coordinates are (non-negative) integers x and y. Then you add an integer-valued column z, and index by z. To calculate z, write x and y in binary, then take alternate digits from each. Example:

x =     969 = 0 1 1 1 1 0 0 1 0 0 1 
y =    1130 =  1 0 0 0 1 1 0 1 0 1 0
z = 1750214 = 0110101011010011000110

Note that the index allows you to find, for example, all records positioned with z between 0101100000000000000000 and 0101101111111111111111 inclusive. In other words, all records for which z starts with 010110. Or to put it another way, you can find all records for which x starts with 001 and y starts with 110. This set of records corresponds to a square in the 2-dimensional space we are trying to search.

Not all squares can be searched in this way. We'll call these ones searchable squares. Suppose the client sends a request for all records for which (x,y) is inside a particular rectangle. (Or a circle, or some other reasonable geometric shape.) Then you need to find a set of searchable squares which cover the rectangle. Then, for each of these squares you've chosen, query the database for records inside that square and send the results to the client. (But you'll have to filter the results, because not all the records in the square are actually in the original rectangle.)

There's a balance to be struck. If you choose a small number of large special squares, you'll probably end up covering a much larger area of the map than you need; the query to the database will return lots of extra results that you'll have to filter out. Alternatively, if you use lots of little special squares, you'll be doing lots of queries to the database, many of which will return no results.

I said above that x and y could be start_time and end_time. But actually the distribution of your dataset won't be as symmetrical as in most uses of geohashing. So the performance might be better (or worse) if you use x = end_time + start_time and y = end_time - start_time.

answered Oct 02 '22 13:10

David Knipe

Because your question remains a bit vague on how you desire to query your data, it remains unclear on how to solve your question. With that in mind, however, here are my thoughts on how I might model your data:

Updated answer, detailing how to use SORTED SET

I have edited this answer to be able to store your values in a way that you can query by dynamic date ranges. This edit assumes that your database values are timestamps, as in the value is for a single time, not 2, as in your current setup.

Yes, you are correct that using Sorted Sets will be able to accomplish this. I suggest that you always use a Unix timestamp value for the score component in these sorted sets.

In case you were not already familiar with redis, let's explain indexing limitations. Redis is a simple key-value designed to quickly retrieve values by a key. Because of this design, it does not contain many features of your traditional DBMS, like indexing a column for instance.

In redis, you accomplish indexing by using a key, and the most nested key-like structures are available in HASH and SORTED SET, but you only get 2 key-like structures. In a HASH, you have the key (same as any data type), and a inner hash key, which can take the form of any string.

In a SORTED SET, you have the key (same as any data type), and a numeric value.

A HASH is nice to use to keep a grouped data organized.

A SORTED SET is nice if you want to query by a range of values. This could be a good fit for your data.

Your SORTED SET would look like the following:

key
00888:XA =>
            score       (date value)   value
            1452427200  (2016-01-10)   xxxxxxxx
            1452859200  (2016-01-10)   yyyyxxxx
            1453291200  (2016-01-10)   zzzzxxxx

Let's use a more intuitive example, the 2017 Juventus roster:

To produce the SORTED SET in the table below, issue this command in your redis client:

ZADD JUVENTUS 32 "Emil Audero" 1 "Gianluigi Buffon" 42 "Mattia Del Favero" 36 "Leonardo Loria" 25 "Neto" 15 "Andrea Barzagli" 4 "Medhi Benatia" 19 "Leonardo Bonucci" 3 "Giorgio Chiellini" 40 "Luca Coccolo" 29 "Paolo De Ceglie" 26 "Stephan Lichtsteiner" 12 "Alex Sandro" 24 "Daniele Rugani" 43 "Alessandro Semprini" 23 "Dani Alves" 22 "Kwadwo Asamoah" 7 "Juan Cuadrado" 6 "Sami Khedira" 18 "Mario Lemina" 46 "Mehdi Leris" 38 "Rolando Mandragora" 8 "Claudio Marchisio" 14 "Federico Mattiello" 45 "Simone Muratore" 20 "Marko Pjaca" 5 "Miralem Pjanic" 28 "Tomás Rincón" 27 "Stefano Sturaro" 21 "Paulo Dybala" 9 "Gonzalo Higuaín" 34 "Moise Kean" 17 "Mario Mandzukic"


 Jersey  Name                     Jersey  Name 
 32      Emil Audero              23     Dani Alves 
 1       Gianluigi Buffon         42      Mattia Del Favero 
 36      Leonardo Loria           25      Neto 
 15      Andrea Barzagli          4       Medhi Benatia 
 19      Leonardo Bonucci         3       Giorgio Chiellini 
 40      Luca Coccolo             29      Paolo De Ceglie 
 26      Stephan Lichtsteiner     12      Alex Sandro 
 24      Daniele Rugani           43      Alessandro Semprini 
 22     Kwadwo Asamoah            7     Juan Cuadrado 
 6     Sami Khedira               18     Mario Lemina 
 46     Mehdi Leris               38     Rolando Mandragora 
 8     Claudio Marchisio          14     Federico Mattiello 
 45     Simone Muratore           20     Marko Pjaca 
 5     Miralem Pjanic             28     Tomás Rincón 
 27     Stefano Sturaro           21     Paulo Dybala 
 9     Gonzalo Higuaín            34     Moise Kean 
 17     Mario Mandzukic

To query the roster by a range of jersey numbers:

ZRANGEBYSCORE JUVENTUS 1 5
Output: 
1) "Gianluigi Buffon"
2) "Giorgio Chiellini"
3) "Medhi Benatia"
4) "Miralem Pjanic"

Note that the scores are not returned, however ZRANGEBYSCORE command orders the results in ASC order by score. To add the scores, append "WITHSCORES" to the command, like so: ZRANGEBYSCORE JUVENTUS 1 5 WITHSCORES

By using ZRANGEBYSCORE, you should be able to query any key (counter number + counter code) with a date range, producing the values in that range.

Original: Below is my original answer, recommending HASH

Based on your examples, I recommend you use a HASH.

With a hash, you would have a main key to find the hash (Ex. 00888:XA). Then within the hash, you have key -> value pairs (Ex. 2017-01-10:2017-01-20 -> xxxxxxxx). I prefer to delimit or tokenize my keys' components with the colon char :, but you can use any delimiter.

HASH follows your example data structure very well:

key
00888:XA =>
            hashkey                  value
            2017-01-10:2017-01-20    xxxxxxxx
            2017-01-21:2017-01-31    yyyyxxxx
            2016-02-01:2016-12-31    zzzzxxxx

key
00888:ZI =>
            hashkey                  value
            2017-01-10:2017-01-20    xxxxxxxx
            2017-01-21:2017-01-31    xxxxyyyy
            2016-02-01:2016-12-31    xxxxzzzz

When querying for data, instead of GET key, you would query with HGET key hashkey. Same for setting values, instead of SET key value, use HSET key hashkey value.

Example commands

HSET  00777:XA  2017-01-10:2017-01-20  xxxxxxxx
HSET  00777:XA  2017-01-21:2017-01-31  yyyyyyyy
HSET  00777:XA  2016-02-01:2016-12-31  zzzzzzzz

(Note: there is also a HMSET to simplify this into a single command) Then:

HGET  00777:XA  2017-01-21:2017-01-31

Would return yyyyyyyy

Unless there is some specific performance consideration, or other goal for your data, I think Hashes will work great for your system.

It's also very convenient if you want to get all hashkeys or all values for a given hash, using commands like HKEYS, HVALS, or HGETALL.

answered Oct 02 '22 14:10

Julian Soro

Related questions
                            
                                python rq - how to trigger a job when multiple other jobs are finished? Multi job dependency work arround?
                            
                                does redis cluster use consistent hashing
                            
                                Laravel CRON or Event process respond to api request via long poll - how to re-vitalise the session
                            
                                Redis 5.0.5: WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128
                            
                                Best database for a Statistics System [closed]
                            
                                Redis full text search : reverse indexing or sunspot?
                            
                                Celery gives connection reset by peer
                            
                                Cannot connect to redis using jedis
                            
                                Does Action Cable 5 require Redis?
                            
                                What exactly is a "Null reply" in Redis?
                            
                                How to prevent Redis stream memory increases infinitely?
                            
                                Node.js: How Many Redis Clients?
                            
                                Iterating through keys in Redis
                            
                                Can redis pipeline multiple commands that depend on previous ones?
                            
                                Unable to stub redis in rspec tests
                            
                                Celery: access all previous results in a chain
                            
                                Using Redis to cache SQL result
                            
                                Azure Redis StackExchange.Redis ConnectionMultiplexer in ASP.net MVC
                            
                                Rails, Sidekiq - Redis NOAUTH
                            
                                docker redis -Can't open the log file: No such file or directory

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With