Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I store IP addresses and CIDR Ranges effectively

I have a use case which I do not know how to solve. I am asking it here to get some guidance on what I need to learn to solve this.

  • I need store IP addresses(lots of them, may be few hundred million). These can be as 1.1.1.2 or CIDR ranges like 1.1.1.1/24

The requirement is following

- Save all the IP address which comes as above format in-memory
- Search should work as following  
   - if I have IP address as 1.1.1.2 and 1.1.1.1/24, it should match the specific IP address 1.1.1.2 and not the CIDR range it falls into (1.1.1.1/24)
   - If specific IP address is not found but CIDR range is available, CIDR range is returned
   - if no match is found, return null/throw exception

Question
- What data structure(s) can help me solve this problem? Tries? - what approach you would take? - How to make sure it does not consume too much memory and search is fast (some reasonable trade-off will be there)

Thanks

like image 771
daydreamer Avatar asked Nov 04 '25 14:11

daydreamer


2 Answers

I would use a binary tree (this kind of trees are called Radix Trees):

  1. The bits of the IP address are used for navigating it starting from the MSB (e.g. 0 = left child and 1 = right child)
  2. All specific IP addresses are leaves while CIDR ranges are internal nodes (using dummy leaves to indicate "missing" IP addresses just like in red-black trees)
  3. Some nodes will be actual data that you have, other are only "navigation" nodes (they correspond to CIDR ranges you don't have, mark them with a flag)

Search: Simply navigate the tree with the address you have. If you end up in a leaf then you have that specific IP address otherwise the "lowest" node with a flag (see point 3) is the most specific CIDR range that you have.

Example: let's work with 8 bits IP and 2 bit "sections" addresses; after inserting 1.1.1.0/6 the tree will be (the number after the IP is the "contains" flag, nils are dummy leaves)

    <root> -- nil
      |
00.00.00.00/1 (0) -- nil
      |
01.00.00.00/2 (0) -- nil
      |
01.00.00.00/3 (0) -- nil
      |
01.01.00.00/4 (0) -- nil
      |
01.01.00.00/5 (0) -- nil
      |
01.01.01.00/6 (1) --nil
      |
     nil

If you look for the IP address 1.1.1.1 you will stop at 1.1.1.1/6, which is a CIDR range because it has nil children and is the most specific (down in the tree) that there is. If you now insert 1.1.1.1 the tree will be

    <root> -- nil
      |
00.00.00.00/1 (0) -- nil
      |
01.00.00.00/2 (0) -- nil
      |
01.00.00.00/3 (0) -- nil
      |
01.01.00.00/4 (0) -- nil
      |
01.01.00.00/5 (0) -- nil
      |
01.01.01.00/6 (1) -- nil
      |
01.01.01.00/7 (0) -- nil
      |
01.01.01.01 (1)

1.1.1.1 does not have leaves because it is an IP address. Finally let's insert 1.1.2.1

    <root> -- nil
      |
00.00.00.00/1 (0) -- nil
      |
01.00.00.00/2 (0) -- nil
      |
01.00.00.00/3 (0) -- nil
      |
01.01.00.00/4 (0) --------------------+
      |                               |
01.01.00.00/5 (0) -- nil        01.01.10.00/5 (0) -- nil
      |                               |
01.01.01.00/6 (1) -- nil        01.01.10.00/6 (0) -- nil        
      |                               |
01.01.01.00/7 (0) -- nil        01.01.10.00/7 (0) -- nil
      |                               |
01.01.01.01 (1)                 01.01.10.01 (1)
like image 78
BlackBear Avatar answered Nov 07 '25 16:11

BlackBear


First, you need to determine some decision points. This has to be completely deterministic. The only decision point you have stated is that it should be checked for a host address (/32) first, then look for a shorter mask.

  • Should it look for progressively shorter masks (longest match)?
  • If it includes a specific mask length, do you look at shorter or longer mask lengths, too, or is is restricted to that mask length, that mask length and shorter, or that mask length and longer? If you want to look at other lengths, do you want the longest match the way routing tables do it?
  • What do you do with addresses like your example, 1.1.1.1/24, where the address is longer than allowed by the mask length? Do you assume it to be a host-only address (1.1.1.1/32), or do you look for the subnet (1.1.1.0/24) where the above decision points come into play?

You are going to need to get over memory consumption concerns where you need this for a few hundred million addresses with any reasonable speed.

I will assume that host addresses will the largest number of addresses you have, and, as the mask lengths become shorter, you will have progressively fewer addresses for each mask length.

If you use hash tables (one for each of the 32 mask lengths), you could start at the longest mask length (/32) table and, if the match isn't found, look at short mask length tables, depending on the deterministic rules above.

For instance, with 1.1.1.1/24 you have in your question, assume you have aggregates of 1.1.1.0/24, 1.1.0.0/23, and 1.1.0.0/22.

Table /32
1.1.1.1

Table /31
not found

Table /30
not found

Table /29
not found

Table /28
not found

Table /27
not found

Table /26
not found

Table /25
not found

Table /24
1.1.1.0

Table /23
1.1.0.0

Table /22
1.1.0.0

Table /21
not found

Table /20
not found

Table /19
not found

Table /18
not found

Table /17
not found

Table /16
not found

Table /15
not found

Table /14
not found

Table /13
not found

Table /12
not found

Table /11
not found

Table /10
not found

Table /9
not found

Table /8
not found

Table /7
not found

Table /6
not found

Table /5
not found

Table /4
not found

Table /3
not found

Table /2
not found

Table /1
not found

For example, with 1.1.1.1/24: You could start looking for the host address in the /32 table. If you don't find that, then AND the address with the mask to get 1.1.1.0 in the /24 table. If you don't find that, AND with the next shorter mask and look in that table. Repeat until you find a match or the mask length gets to 0. Exactly how you do this depends on the the algorithm resulting from your decisions above.

This only takes up one 32-bit memory block for each host address or aggregate, but, with a few hundred million addresses, it won't be a small amount. Performance will be highly dependent on the hashing algorithm used, but it should perform quite well.

like image 20
Ron Maupin Avatar answered Nov 07 '25 15:11

Ron Maupin