Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Comparing secret data without giving away source

Issue:

Company A has secret data they don't want to give away to company B. Company B has secret data they don't want to give away to company A.

The secret data is IP addresses on both sides.

But the two companies want to know the number of overlapping IPs they have (IP addresses that both companies have in the database).

Without using a third party I can't think of a way to solve this issue without one party compromising their secret data set. Is there any type of hashing algo written to solve this problem?

like image 548
user2263572 Avatar asked Oct 13 '15 13:10

user2263572


People also ask

What is TruffleHog scan?

Enterprise. SEE IT IN ACTION. The enterprise version of TruffleHog provides everything you need to operationalize continuous secrets scanning across your company. TruffleHog runs across all your platforms quietly in the background and only alerts when verified secrets are leaked.

What is Keywhiz?

Keywhiz is a system for managing and distributing secrets. It can fit well with a service oriented architecture (SOA). Here is an overview in presentation format. Every organization has services or systems that require secrets.

Is Gitguardian open source?

The project is open-source and relies on Terraform, the popular infrastructure-as-code software tool by HashiCorp, to create and manage AWS canary tokens. Its intrusion detection is highly-sensitive – ggcanary uses AWS CloudTrail audit logs to track all types of actions performed on the canary tokens by attackers.


1 Answers

First I'll describe a simple but not very secure idea. Then I'll describe a way that I think it can be easily made much more secure. The basic idea is to have each company send an encoding of a one-way function to the other company.

Sending Programs

As a warm-up, let's first suppose that one company (let's say A) develops an ordinary computer program in some language and sends it to B; B will then run it, supplying its own list of email addresses as input, and the program will report how many of them are also used by A. At this point, B knows how many email addresses it shares with A. Then the process can be repeated, but with the roles of A and B reversed.

Sending SAT Instances

Implementing this program straightforwardly in a normal programming language would yield a program that is almost trivially easy to reverse-engineer. To mitigate this, first, instead of having the program report the count directly, let's reformulate the problem as a decision problem: Does the other company have at least k of the emails in the input? (This involves choosing some value k to test for; of course, if both parties agree then the whole procedure can be performed for many different values of k. (But see the last section for possible ramifications.)) Now the program can be represented instead as a SAT instance that takes as input (some bitstring encoding of) a list of email addresses, and outputs a single bit that indicates whether k or more of them also belong to the company that created the instance.

It's computationally easy to supply inputs to a SAT instance and read off the output bit, but when the instance is large, it's (in principle) very difficult to go in "the other direction" -- that is, to find a satisfying assignment of inputs, i.e., a list of email addresses that will drive the output bit to 1: SAT being an NP-hard problem, all known exact techniques take time exponential in the problem size.

Making it Harder with Hashing

[EDIT: Actually there are many more than (n choose k) possible hashes to be ORed together, since any valid subsequence (with gaps allowed) in the list of email addresses that contains at least k shared ones needs to turn the output bit on. If each email address takes at most b bits, then there are much more than 2^((n-k)b)*(n choose k) possibilities. It's probably only feasible to sample a small fraction of them, and I don't know if unsampled ones can be somehow turned into "don't-cares"...]

The SAT instance I propose here would certainly be very large, as it would have to be a disjunction (OR) of all (n choose k) possible allowed bitstrings. (Let's assume that email addresses are required to be listed in some particular order, to wipe off an n-factorial factor.) However it has a very regular structure that might make it amenable to analysis that could dramatically reduce the time required to solve it. To get around this, all we need to do is to require the receiver to hash the original input and supply this hash value as input instead. The resulting SAT instance will still look like the disjunction (OR) of (n choose k) possible valid bitstrings (which now represent hashes of lists of strings, rather than raw lists of strings) -- but, by choosing a hash size large enough and applying some logic minimisation to the resulting instance, I'm confident that any remaining telltale patterns can be removed. (If anyone with more knowledge in the area can confirm or deny this, please edit or comment.)

Possible Attacks

One weakness of this approach is that nothing stops the receiver from "running" (supplying inputs to) the SAT instance many times. So, choosing k too low allows the receiver to easily isolate the email addresses shared with the sender by rerunning the SAT instance many times using different k-combinations of their own addresses, and dummy values (e.g. invalid email addresses) for the remaining input bits. E.g. if k=2, then the receiver can simply try running all n^2 pairs of its own email addresses and invalid email addresses for the rest until a pair is found that turns the output bit on; either of these email addresses can then be paired with all remaining email addresses to detect them in linear time.

like image 172
j_random_hacker Avatar answered Nov 09 '22 04:11

j_random_hacker