Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Intersection Between Two String Sets with Substring Compare

I know this is bike shedding but is there a way to get the set of strings C, between two (sorted) sets A,B of strings, where B is a sub string of A, with a complexity better than of A.size * B.size * comp_substr, as the naive solution I came up?

    std::copy_if(devices.cbegin(), devices.cend(),
                          std::back_inserter(ports),
                          [&comport_keys] (const auto& v) {
        return std::any_of(comport_keys.begin(),comport_keys.end(), [&v](auto& k) {
           return v.find(k) != std::string::npos;
        });
    });

The easier case of just where B is a string of A, with std::set_intersection would be pretty simple with a complexity of (A.size + B.size) * comp_substr, with would be even better if one had to sort it before (n * log(n)), but I don't know how to write the compare function for it, or rather the sort of both.

    #define BOOST_TEST_MODULE My Test

    #include <boost/test/included/unit_test.hpp>

    #include <vector>
    #include <string>
    #include <algorithm>
    #include <iterator>
    #include <set>

    BOOST_AUTO_TEST_CASE(TEST) {
        std::vector<std::string> devices{
                "tty1",
                "ttyOfk",
                "ttyS05",
                "bsd",
        }, ports{};

        const std::set<std::string> comport_keys{
                "ttyS",
                "ttyO",
                "ttyUSB",
                "ttyACM",
                "ttyGS",
                "ttyMI",
                "ttymxc",
                "ttyAMA",
                "ttyTHS",
                "ircomm",
                "rfcomm",
                "tnt",
                "cu",
                "ser",
        };

        std::sort(devices.begin(), devices.end());
        std::set_intersection(devices.cbegin(), devices.cend(),
                              comport_keys.cbegin(), comport_keys.cend(),
                              std::back_inserter(ports),
                              [&comport_keys] (auto a, auto b) {
            return a.find(b) != std::string::npos; //This is wrong
        });

        const std::vector<std::string>test_set {
                "ttyOfk",
                "ttyS05",
        };

        BOOST_TEST(ports == test_set);
    }
like image 372
Superlokkus Avatar asked May 03 '19 12:05

Superlokkus


1 Answers

Say we have two sets of strings: A and B. B contains a set of potential prefixes for the strings in A. So we want to take each element a from A and try to match it with all potential prefixes of B. If we find a matching prefix, we store our result a in C. The trivial solution works in O(|A| |B|). You ask: Can we optimize this?

You said, B is already sorted. Then we can build a generalised prefix tree on B in linear time and query it with each string in A to solve it in O(|A|+|B). The problem is, sorting B takes O(|B| log|B|) and the tree is non-trivial.

So I provide a simple solution with O(|A| log|B|) which is more efficient than O(|A|+|B|) if |A| is small, like in your example. B is still assumed to be sorted (the sorting is really the upper bound here...).

bool
validate_prefixes(const std::multiset<std::string>& keys) {
    auto itb = keys.begin(), it = itb;
    if(it == keys.end()) return false; //no keys
    for(++it; it != keys.end(); ++it) {
        if( (*it).find(*itb) != std::string::npos ) return false; //redundant keys
        itb++;
    }
    return true;
}

bool
copy_from_intersecting_prefixes(const std::vector<std::string>& data, 
                                std::multiset<std::string>& prefix_keys,
                                std::vector<std::string>& dest, bool check = false) {
    if(check && !validate_prefixes(prefix_keys)) return false;

    for(auto it_data = data.begin(); it_data != data.end(); ++it_data) {
        auto ptr = prefix_keys.insert(*it_data), ptrb = ptr;
        if(ptrb != prefix_keys.begin()) {  //if data is at the start, there is no prefix
            if( (*ptr).find(*(--ptrb)) != std::string::npos ) dest.push_back(*it_data);
        }
        prefix_keys.erase(ptr);
    } //Complexity: O(|data|) * O( log(|prefix_keys|) ) * O(substr) = loop*insert*find
    return check;
}

//.... in main()
std::multiset<std::string> tmp(comport_keys.begin(), comport_keys.end()); //copy const    
copy_from_intersecting_prefixes(devices, tmp, ports);

validate_prefixes enforces a precondition. It checks if we have at least one valid prefix and that the keys are not self-matching. E.g. we could have keys cu and cu2, but cu is a prefix for cu2, so they can't be both valid prefixes, either cu is too general or cu2 too specific. If we try to match cu3 with cu and cu2 this is inconsistent. Here validate_prefixes(comport_keys) returns true, but it might be nice to check it automatically.

copy_from_intersecting_prefixes does the actual asked work. It iterates over A, and puts a inside the ordered B. The prefix is smaller than prefix+ending, so if a corresponding prefix exists, it will occur before a in B. Because the keys are not self-matching, we know that the prefix will precede a in B. So we decrement the iterator from a and compare. Note that prefix might equal a, so we need multiset.

like image 147
BlueLemon Avatar answered Nov 01 '22 23:11

BlueLemon