Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the fastest correct way to detect that there are no duplicates in a JSON array?

I need to check if all items are unique in an array of serde_json::Value. Since this type does not implement Hash I came up with the following solution:

use serde_json::{json, Value};
use std::collections::HashSet;

fn is_unique(items: &[Value]) -> bool {
    let mut seen = HashSet::with_capacity(items.len());
    for item in items.iter() {
        if !seen.insert(item.to_string()) {
            return false;
        }
    }
    true
}

fn main() {
    let value1 = json!([1, 2]);
    assert!(is_unique(&value1.as_array().unwrap()));
    let value2 = json!([1, 1]);
    assert!(!is_unique(&value2.as_array().unwrap()));
}

I assume that it should only work if serde_json is built with preserve_order feature (to have objects serialized in the same order every time), but I am not 100% sure about it.

Main usage context:

JSON Schema validation. "uniqueItems" keyword implementation.

Related usage case

Deduplication of JSON arrays to optimize JSON Schema inference on them.

For example, the input data is [1, 2, {"foo": "bar"}]. A straightforward inference might output this:

{
    "type": "array", 
    "items": {
        "anyOf": [
            {"type": "integer"}, 
            {"type": "integer"},
            {"type": "object", "required": ["foo"]}
        ]
    }
}

values in items/anyOf can be reduced to only two values.

Question: What would be the most time-efficient and correct way to check that there are no duplicates in an arbitrary JSON array?

I used serde_json = "1.0.48"

Rust: 1.42.0

Playground

like image 589
Stranger6667 Avatar asked Nov 28 '25 00:11

Stranger6667


1 Answers

Converting each array item to a string is rather expensive – it requires at least one string allocation per item, and quite likely more than that. It's also difficult to make sure mappings (or "objects" in JSON language) are represented in a canonical form.

A faster and more robust alternative is to implement Hash for Value yourself. You need to define a newtype wrapper, since you can't implement a foreign trait on a foreign type. Here's a simple example implementation:

use serde_json::Value;
use std::hash::{Hash, Hasher};
use std::collections::hash_map::DefaultHasher;

#[derive(PartialEq)]
struct HashValue<'a>(pub &'a Value);

impl Eq for HashValue<'_> {}

impl Hash for HashValue<'_> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        use Value::*;
        match self.0 {
            Null => state.write_u32(3_221_225_473), // chosen randomly
            Bool(ref b) => b.hash(state),
            Number(ref n) => {
                if let Some(x) = n.as_u64() {
                    x.hash(state);
                } else if let Some(x) = n.as_i64() {
                    x.hash(state);
                } else if let Some(x) = n.as_f64() {
                    // `f64` does not implement `Hash`. However, floats in JSON are guaranteed to be
                    // finite, so we can use the `Hash` implementation in the `ordered-float` crate.
                    ordered_float::NotNan::new(x).unwrap().hash(state);
                }
            }
            String(ref s) => s.hash(state),
            Array(ref v) => {
                for x in v {
                    HashValue(x).hash(state);
                }
            }
            Object(ref map) => {
                let mut hash = 0;
                for (k, v) in map {
                    // We have no way of building a new hasher of type `H`, so we
                    // hardcode using the default hasher of a hash map.
                    let mut item_hasher = DefaultHasher::new();
                    k.hash(&mut item_hasher);
                    HashValue(v).hash(&mut item_hasher);
                    hash ^= item_hasher.finish();
                }
                state.write_u64(hash);
            }
        }
    }
}

The value for None is chosen randomly to make it unlikely to collide with other entries. To calculate hashes for floating point numbers, I used the ordered-float crate. For mappings, the code calculates a hash for each key/value pair and simply XORs these hashes together, which is order-independent. It's a bit unfortunate that we need to hardcode the hasher used for hashing the map entries. We could abstract that out by defining our own version of the Hash trait, and then derive concrete implementations of std::hash::Hash from our custom Hash trait, but this complicates the code quite a bit, so I wouldn't do that unless you need to.

We can't derive Eq, since Value does not implement Eq. However, I believe this is just an oversight, so I filed an issue to add an Eq implementation (which the PR has been accepted for, so it will land in some future release).

like image 133
Sven Marnach Avatar answered Nov 29 '25 19:11

Sven Marnach



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!