Matching JSON keys with regex in Python

Tags: python, json, regex

I'm trying to find a regular expression which matches repeated keys on different levels of a nested JSON string representation. All my "solutions" suffer from catastrophic backtracking so far.

An example of that JSON string looks like this:

d = {
    "a": {
        "b": {
            "c": {
                "d": "v1",
                "key": "v2"
            }
        },
        "c": {
            "g": "v3",
            "key": "v4"
        },
        "key": "v5"
    }
}

The value of key is the target. My application knows all the object names leading to that key. With these names I can use a for loop to construct my final regex. So basically I need the parts to put in between the names.

Example: If I get "a" and "key" I could construct the following: "a"[^}]*"key". This matches the first "key" in my string d, the one with value v2.

What should happen, though, is that "a" + "key" matches the key with value v5. The key with value v2 should be matched only when the full path "a" + "b" + "c" + "key" comes in. The last case in this example would be matching the key with value v4 when "a" + "c" + "key" is given.
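To make the mismatch concrete, here is a small sketch using only the standard re module. The string is the example from above retyped; the variable names are illustrative.

```python
import re

# The nested example from the question, as a plain string.
d = '''{
    "a": {
        "b": {
            "c": {
                "d": "v1",
                "key": "v2"
            }
        },
        "c": {
            "g": "v3",
            "key": "v4"
        },
        "key": "v5"
    }
}'''

# Naive connector: anything except a closing brace between the two names.
naive = re.compile(r'"a"[^}]*"key"')

m = naive.search(d)
# Look at the text right after the matched "key" to see which one was hit.
following = d[m.end():m.end() + 6]
print(following)  # the v2 entry, not the wanted v5 one
```

The connector [^}]* stops at the first closing brace, so it can never skip over the nested "b" object to reach the "key" that sits directly under "a".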

So a complete regex for the last one would look similar to this:

"a"MATCH_EVERYTHING_IN_BETWEEN_REGEX"c"MATCH_EVERYTHING_IN_BETWEEN_REGEX"key":\s*(\[[^}]*?\]|".*?"|\d+\.*\d*) 

To be clear, I am looking for this MATCH_EVERYTHING_IN_BETWEEN_REGEX expression, which I can plug in as a connector. This is to make sure it matches only the key I have received the path for. The JSON string could be nested to an arbitrary depth.

Here is an online regex tester with the example: https://regex101.com/r/yNZ3wo/2

Note: I know this is not Python specific, but I'm also grateful for Python hints in this context. I thought about building my own parser, using a stack and counting { and }, but first I would like to make sure there is no easy regex solution.
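For comparison, the stack idea mentioned above could look roughly like the following. This is a minimal sketch, not a full JSON parser: it assumes well-formed input, the function name find_key_span is made up for illustration, key names are compared without unescaping, and objects inside arrays are attributed to the key that holds the array.

```python
def find_key_span(text, path):
    """Return (start, end) offsets of the key token reached via `path`, or None."""
    stack = []          # keys under which the currently open objects were opened
    current_key = None  # most recent key seen at the current level
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch == '"':
            # Scan to the closing quote, skipping backslash-escaped characters.
            j = i + 1
            while j < n and text[j] != '"':
                j += 2 if text[j] == '\\' else 1
            end = j + 1
            # A string followed by a colon is a key; otherwise it is a value.
            k = end
            while k < n and text[k].isspace():
                k += 1
            if k < n and text[k] == ':':
                current_key = text[i + 1:j]
                open_keys = [s for s in stack if s is not None]
                if open_keys + [current_key] == list(path):
                    return i, end
            i = end
            continue
        if ch == '{':
            # Remember which key this object belongs to (None for the root).
            stack.append(current_key)
            current_key = None
        elif ch == '}':
            # Leaving an object: restore the key it was opened under.
            current_key = stack.pop()
        i += 1
    return None


# Example input (assumed): the nested structure from the question.
d = '''{
    "a": {
        "b": {"c": {"d": "v1", "key": "v2"}},
        "c": {"g": "v3", "key": "v4"},
        "key": "v5"
    }
}'''

for path in (["a", "key"], ["a", "b", "c", "key"], ["a", "c", "key"]):
    start, end = find_key_span(d, path)
    # Print the path, the offsets of the key token, and the text after it.
    print(path, (start, end), d[end:end + 6])
```

Because strings are skipped as whole tokens, braces that appear inside string values do not disturb the depth tracking, which is something a pure brace-counting regex connector has trouble guaranteeing.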

EDIT: I know about the json library but this doesn't solve my case since I'm tracking the coordinates of my targets within the string representation inside an editor window. I'm not looking for the values themselves, I can access them from an associated dictionary.

asked Apr 26 '26 22:04 by loxosceles


1 Answer

This is hard. A possible solution is to use

  1. a recursive regex* to match nested braces
    (?<="a": )({(?>[^{}]|(?1))*})
  2. and then, continue the search for the key on the inner level using a trash-can approach, i.e. ignore the overall match and just look at a specific capturing group if it contains a value
    (here $2, add groups as needed):
    ({(?>[^{}]|(?1))*})|"key":\s*"([^"]*?)"
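The trash-can idea from step 2 can be illustrated in isolation. The sketch below deliberately uses the standard re module, which has no recursion, so the left alternative only swallows brace pairs without further nesting; with the regex package, the recursive group from step 1 takes its place. The sample string and names are illustrative.

```python
import re

# One level of the example: a nested object followed by the wanted key.
inner = '"c": {"g": "v3", "key": "v4"}, "key": "v5"'

# Left alternative: swallow a brace pair (non-nested only, since `re`
# cannot recurse). Right alternative: capture the value of "key".
trash_can = re.compile(r'\{[^{}]*\}|"key":\s*"([^"]*)"')

# Keep only the matches in which the capturing group took part.
hits = [(m.start(1), m.end(1), m.group(1))
        for m in trash_can.finditer(inner)
        if m.group(1) is not None]
print(hits)
```

The "key" inside the nested object is consumed by the left alternative and therefore never reaches the capturing group, so only the value at the current level is reported.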

Code sample:

import regex as re

# Same structure as in the question (the stray extra closing brace is removed).
test_str = """{
  "a": {
    "b": {
      "c": {
        "d": "v1",
        "key": "v2"
      }
    },
    "c": {
      "g": "v3",
      "key": "v4"
    },
    "key": "v5"
  }
}
"""

regex = r"(?<=\"a\": )({(?>[^{}]|(?1))*})"
innerRegex = r"({(?>[^{}]|(?1))*})|\"key\":\s*\"([^\"]*?)\""

matches = re.finditer(regex, test_str, re.DOTALL)

for match in matches:
    # Strip the outer braces so the object itself is not swallowed by group 1.
    inner = match.group()[1:-1]
    # Offset of `inner` within test_str, so positions refer to the full string.
    offset = match.start() + 1

    for innerMatch in re.finditer(innerRegex, inner, re.DOTALL):
        # Group 1 is the trash can (nested objects); only group 2 is of interest.
        if innerMatch.group(2) is not None:
            print("Found at {start}-{end}: {group}".format(
                start=offset + innerMatch.start(2),
                end=offset + innerMatch.end(2),
                group=innerMatch.group(2)))

Alternatively, continue the search on the next level (not shown in the code above).
Basically, you would repeat step 1 on the inner match in the same way, using the next name in the path, e.g.:

(?<="c": )({(?>[^{}]|(?1))*})

This should give you a head start.

*Python's built-in re module does not support regex recursion, so this needs the alternative regex package (pip install regex).

answered Apr 28 '26 12:04 by wp78de