Say I have a collection with these fields:
{
"category" : "ONE",
"data": [
{
"regex": "/^[0-9]{2}$/",
"type" : "TYPE1"
},
{
"regex": "/^[a-z]{3}$/",
"type" : "TYPE2"
}
// etc
]
}
So my input is "abc" so I'd like to obtain the corresponding type (or best match, although initially I'm assuming RegExes are exclusive). Is there any possible way to achieve this with decent performance? (that would be excluding iterating over each item of the RegEx array)
Please note the schema could be re-arranged if possible, as this project is still in the design phase. So alternatives would be welcomed.
Each category can have around 100 - 150 RegExes. I plan to have around 300 categories. But I do know that types are mutually exclusive.
Real world example for one category:
type1=^34[0-9]{4}$,
type2=^54[0-9]{4}$,
type3=^39[0-9]{4}$,
type4=^1[5-9]{2}$,
type5=^2[4-9]{2,3}$
MongoDB provides the functionality to search a pattern in a string during a query by writing a regular expression. A regular expression is a generalized way to match patterns with sequences of characters. MongoDB uses Perl compatible regular expressions(PCRE) version 8.42 along with UTF-8 support.
To match a character having special meaning in regex, you need to use a escape sequence prefix with a backslash ( \ ). E.g., \. matches "." ; regex \+ matches "+" ; and regex \( matches "(" . You also need to use regex \\ to match "\" (back-slash).
$[] The all positional operator $[] indicates that the update operator should modify all elements in the specified array field. The $[] operator has the following form: { <update operator>: { "<array>.$[]" : value } }
Describing the RegEx (Divide et Impera) would greatly help in limiting the number of Documents needed to be processed.
Some ideas in this direction:
[:alpha:]
, [:digit:]
, [:alnum:]
, etc.)Implementing each of these would add to the complexity (code and/or manual input) for Insertion and also some overhead for describing the searchterm
before the query.
Having mutually exclusive types in a category simplifies things, but what about between categories?
300 categories @ 100-150 RegExps/category => 30k to 45k RegExps
... some would surely be exact duplicates if not most of them.
In this approach I'll try to minimise the total number of Documents to be stored/queried in a reversed style vs. your initial proposed 'schema'.
Note: included only string lengths in this demo for narrowing, this may come naturally for manual input as it could reinforce a visual check over the RegEx
Consider rewiting the regexes
Collection with Documents as follows:
{
"max_length": NumberLong(2),
"min_length": NumberLong(2),
"regex": "^[0-9][2]$",
"types": [
"ONE/TYPE1",
"NINE/TYPE6"
]
},
{
"max_length": NumberLong(4),
"min_length": NumberLong(3),
"regex": "^2[4-9][2,3]$",
"types": [
"ONE/TYPE5",
"TWO/TYPE2",
"SIX/TYPE8"
]
},
{
"max_length": NumberLong(6),
"min_length": NumberLong(6),
"regex": "^39[0-9][4]$",
"types": [
"ONE/TYPE3",
"SIX/TYPE2"
]
},
{
"max_length": NumberLong(3),
"min_length": NumberLong(3),
"regex": "^[a-z][3]$",
"types": [
"ONE/TYPE2"
]
}
.. each unique RegEx as it's own document, having Categories it belongs to (extensible to multiple types per category)
Demo Aggregation code:
function () {
match=null;
query='abc';
db.regexes.aggregate(
{$match: {
max_length: {$gte: query.length},
min_length: {$lte: query.length},
types: /^ONE\//
}
},
{$project: {
regex: 1,
types: 1,
_id:0
}
}
).result.some(function(re){
if (query.match(new RegExp(re.regex))) return match=re.types;
});
return match;
}
Return for 'abc'
query:
[
"ONE/TYPE2"
]
this will run against only these two Documents:
{
"regex": "^2[4-9][2,3]$",
"types": [
"ONE/TYPE5",
"TWO/TYPE2",
"SIX/TYPE8"
]
},
{
"regex": "^[a-z][3]$",
"types": [
"ONE/TYPE2"
]
}
narrowed by the length 3
and having the category ONE
.
Could be narrowed even further by implementing POSIX
descriptors (easy to test against the searchterm
but have to input 2 RegExps in the DB)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With