Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I extract keys and values from a string when the value contains the separator between keys and values or the separator between pairs?

Tags:

rust

How can I efficiently extract the key-value pairs from a string into a HashMap when

  • key is always followed by : and then the value
  • value ends with a , followed by another key (sometimes whitespace and then key)
  • value can contain , : throughout
  • no value will include any key
  • the order of the keys are not fixed
  • the key names are known

For these key-value pairs

key1:value1, key2:this is, some value2, key3:anothe:r val,ue,

It should produce this HashMap:

"key1", "value1"
"key2", "this is, some value2"
"key3", "anothe:r val,ue"

I have tried the following code but it is no good with just a , as a delimiter as the value can contain commas throughout.

"key1:value1, key2:this is, some value2, key3:anothe:r val,ue,"
    .split(",")
    .map(|kv| kv.splitn(2, ":").collect::<Vec<&str>>())
    .filter(|vec| vec.len() == 2)
    .map(|vec| (vec[0].trim().into(), vec[1].trim().into()))
    .collect()

My thought would be to provide a list of keys: ["key1", "key2", "key3"] to use as delimiters

UPDATE:

Using @Lucretiel answer I have come up with:

fn key_value<'a>(keys: &[&str], mut command: &'a str) -> HashMap<&'a str, &'a str> {
    let mut hashmap = HashMap::new();
    loop {
        if let Some(key) = key(&keys, &command) {
            command = &command[key.len() + 1..];

            let value = value(&keys, &command);
            let trim: &[_] = &[',', ' '];
            command = &command[value.len()..].trim_start_matches(trim);

            hashmap.insert(key, value);
        } else {
            break;
        }
    }
    hashmap
}

fn key<'a>(keys: &[&str], command: &'a str) -> Option<&'a str> {
    let regex = format!("^({}):", keys.join("|"));
    let regex = regex::Regex::new(&regex).expect("Invalid regex");
    match regex.shortest_match(&command) {
        Some(position) => Some(&command[..position - 1]),
        None => None,
    }
}

fn value<'a>(keys: &[&str], command: &'a str) -> &'a str {
    let regex = format!(r#",\s*({}):"#, keys.join("|"));
    let regex = regex::Regex::new(&regex).expect("Invalid regex");
    match regex.find(&command) {
        Some(position) => &command[..position.start()],
        None => command,
    }
}

(Playground)

like image 792
James Avatar asked Apr 05 '20 09:04

James


1 Answers

The actual code to solve this is nontrivial, but it can be done. There's a lot of little fiddly edge cases, depending on what error cases you want to account for (for instance, do you require that every key in your known key list is present in the input string to parse? Do you allow duplicate keys? etc.). The basic algorithm looks like this:

  • while the key list isn't empty:
    • find the key that starts the string, matching ^{key}:. This is the current key.
      • if there is no such key, it's an error; the input is malformed
    • find the next earliest key in the string, matching ,\s*{key}:. This is the next key.
      • if there are no more keys, the remainder of the string is the value for this key
      • otherwise, all the content between the two found keys is the current value
    • add (current key, current value) to your hash table
    • remove current key from the key list (assuming you don't accept duplicate keys)
    • slice the (current key, current value) off the front of your input string
  • Once you're out of keys, return the hash map

There's no way to do this with a conventional grammar; as presented it's very ambiguous. However, if you structure your parsing around scanning for each subsequent key (assuming that keys never appear as substrings in values) you can successfully parse this kind of input.

The algorithm as described runs in quadratic time, but hypothetically it should be reducible to linear time if you create a composite regular expression to search for every key simultaneously:

,\s*(key1|key2|key3|...):

like image 148
Lucretiel Avatar answered Sep 29 '22 00:09

Lucretiel