 

Fastest way to compute the hash of a whole table [duplicate]

We need to be able to compute table hashes in an external environment and compare them to pre-computed hashes from an internal environment. The purpose is to ensure that data in the external environment has not been tampered with by a "rogue" database administrator. Users insist on this feature.

Currently, we do this by computing the individual hashes of each column value, performing a bit-XOR on the column hashes to get the row hash, then performing a bit-XOR on all the row hashes to arrive at the table hash. Pseudo-script below:

cursor hash_cur is
select /*+ PARALLEL(4) */
       -- HMAC each column value, keyed with the column's own name
       dbms_crypto.mac(column1_in_raw_type,
                       dbms_crypto.hmac_sh512,
                       utl_i18n.string_to_raw('COLUMN1_NAME', 'AL32UTF8')) as COLUMN1_NAME
       ...
from TABLE_NAME;

open hash_cur;
fetch hash_cur bulk collect into hashes;
close hash_cur;

-- XOR the column hashes into a row hash, then XOR each row hash
-- into the running table hash
for i in 1..hashes.count
loop
  rec := hashes(i);
  record_xor := rec.COLUMN1;
  record_xor := bit_xor(record_xor, rec.COLUMN2);
  ...
  record_xor := bit_xor(record_xor, rec.COLUMNN);

  table_xor := bit_xor(table_xor, record_xor);
end loop;
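
For reference, the bit_xor helper used above can be a thin wrapper around the built-in UTL_RAW.BIT_XOR (a minimal sketch; the function name comes from the pseudo-script):

create or replace function bit_xor (a in raw, b in raw) return raw
deterministic
is
begin
  -- XOR two equal-length RAW values (HMAC-SHA512 output is always 64 bytes)
  return utl_raw.bit_xor(a, b);
end bit_xor;
/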

The pseudo-script above will be run in parallel using dbms_job.
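
For instance, one job per slice of the table (a hedged sketch; hash_table_chunk is a hypothetical procedure that hashes one slice):

declare
  l_job binary_integer;
begin
  for i in 1 .. 4 loop
    -- hash_table_chunk is hypothetical: it would hash slice i of the table
    dbms_job.submit(l_job, 'hash_table_chunk(''TABLE_NAME'', ' || i || ');');
  end loop;
  commit;  -- submitted jobs only start running after commit
end;
/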

The problem with this is that certain tables hold terabytes of data, and the current performance does not meet our target. Hashing must be done "on the fly" because users want to perform the hash check themselves.

  1. Do you guys have a better way to perform whole-table hashing, or, more generally, to compare tables across environments connected by a low-latency and relatively low-bandwidth network?

It seems to me that the operation is more CPU-bound than I/O-bound. I am thinking of storing the table data in a BLOB instead, with the data properly arranged by record and then by column, and then hashing that output. This should make the operation completely I/O-bound.

  2. What is the fastest way to do this? Is there any way to do it within the SELECT clause of a query, to remove the overhead of PL/SQL-to-SQL engine context switches (see the sketch after this list)?
    • I was thinking of modifying a global BLOB for this.
    • I would also like to remove the I/O overhead of bulk-collecting the results.
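
For example, where Oracle 12c's STANDARD_HASH is available, the per-row hashing can stay entirely inside the SQL engine, with no PL/SQL context switch (a sketch; the column names are placeholders, and the '|' separator guards against ambiguous concatenations):

select standard_hash(col1 || '|' || col2 || '|' || col3, 'SHA512') as row_hash
  from table_name;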

Any suggestions that could lead me to a better-performing script would be greatly appreciated. Thanks.

asked Nov 20 '15 by user3367701




1 Answer

First of all, I think the way to approach "rogue administrators" is with a combination of Oracle's audit trail and Database Vault features.

That said, here's what I might try:

1) Create a custom ODCI aggregate function to compute a hash of multiple rows as an aggregate.

2) Create a VIRTUAL NOT NULL column on the table that is a SHA hash of all the columns in the table -- or all the ones you care about protecting. You'd keep this around all the time -- basically trading away some insert/update/delete performance in exchange for being able to compute hashes more quickly.

3) Create a non-unique index on that virtual column.

4) SELECT my_aggregate_hash_function(virtual_hash_column) FROM my_table to get the results.

Here's code:

Create an aggregate function to compute a SHA hash over a bunch of rows

CREATE OR REPLACE TYPE matt_hash_aggregate_impl AS OBJECT
(
  hash_value RAW(32000),
  CONSTRUCTOR FUNCTION matt_hash_aggregate_impl(SELF IN OUT NOCOPY matt_hash_aggregate_impl ) RETURN SELF AS RESULT,  
-- Called to initialize a new aggregation context
-- For analytic functions, the aggregation context of the *previous* window is passed in, so we only need to adjust as needed instead 
-- of creating the new aggregation context from scratch
  STATIC FUNCTION ODCIAggregateInitialize (sctx IN OUT matt_hash_aggregate_impl) RETURN NUMBER,
-- Called when a new data point is added to an aggregation context  
  MEMBER FUNCTION ODCIAggregateIterate (self IN OUT matt_hash_aggregate_impl, value IN raw ) RETURN NUMBER,
-- Called to return the computed aggregate from an aggregation context
  MEMBER FUNCTION ODCIAggregateTerminate (self IN matt_hash_aggregate_impl, returnValue OUT raw, flags IN NUMBER) RETURN NUMBER,
-- Called to merge two aggregation contexts into one (e.g., merging results of parallel slaves)
  MEMBER FUNCTION ODCIAggregateMerge (self IN OUT matt_hash_aggregate_impl, ctx2 IN matt_hash_aggregate_impl) RETURN NUMBER,
  -- ODCIAggregateDelete
  MEMBER FUNCTION ODCIAggregateDelete(self IN OUT matt_hash_aggregate_impl, value raw) RETURN NUMBER  
);

/

CREATE OR REPLACE TYPE BODY matt_hash_aggregate_impl IS

CONSTRUCTOR FUNCTION matt_hash_aggregate_impl(SELF IN OUT NOCOPY matt_hash_aggregate_impl ) RETURN SELF AS RESULT IS
BEGIN
  SELF.hash_value := null;
  RETURN;
END;


STATIC FUNCTION ODCIAggregateInitialize (sctx IN OUT matt_hash_aggregate_impl) RETURN NUMBER IS
BEGIN
  sctx := matt_hash_aggregate_impl ();
  RETURN ODCIConst.Success;
END;


MEMBER FUNCTION ODCIAggregateIterate (self IN OUT matt_hash_aggregate_impl, value IN raw ) RETURN NUMBER IS
BEGIN
  IF self.hash_value IS NULL THEN
    self.hash_value := dbms_crypto.hash(value, dbms_crypto.hash_sh1);
  ELSE 
      self.hash_value := dbms_crypto.hash(self.hash_value || value, dbms_crypto.hash_sh1);
  END IF;
  RETURN ODCIConst.Success;
END;

MEMBER FUNCTION ODCIAggregateTerminate (self IN matt_hash_aggregate_impl, returnValue OUT raw, flags IN NUMBER) RETURN NUMBER IS
BEGIN
  returnValue := dbms_crypto.hash(self.hash_value,dbms_crypto.hash_sh1);
  RETURN ODCIConst.Success;
END;

MEMBER FUNCTION ODCIAggregateMerge (self IN OUT matt_hash_aggregate_impl, ctx2 IN matt_hash_aggregate_impl) RETURN NUMBER IS
BEGIN
    self.hash_value := dbms_crypto.hash(self.hash_value || ctx2.hash_value, dbms_crypto.hash_sh1);
  RETURN ODCIConst.Success;
END;

-- ODCIAggregateDelete
MEMBER FUNCTION ODCIAggregateDelete(self IN OUT matt_hash_aggregate_impl, value raw) RETURN NUMBER IS
BEGIN
  raise_application_error(-20001, 'Invalid operation -- hash aggregate function does not support windowing!');
END;  

END;
/

CREATE OR REPLACE FUNCTION matt_hash_aggregate ( input raw) RETURN raw
PARALLEL_ENABLE AGGREGATE USING matt_hash_aggregate_impl;
/

Create a test table to work with (you can skip this since you have your real table)

create table mattmsi as select * from mtl_system_items where rownum <= 200000;

Create a virtual column hash of each row's data. Make sure it is NOT NULL

alter table mattmsi add compliance_hash generated always as ( dbms_crypto.hash(to_clob(inventory_item_id || segment1 || last_update_date || created_by || description), 3 /*dbms_crypto.hash_sh1*/) ) VIRTUAL not null ;
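
To sanity-check the virtual column before indexing it (a quick look; the column names follow the example above):

select inventory_item_id, compliance_hash
  from mattmsi
 where rownum <= 5;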

Create an index on the virtual column; this way you can compute your hash with a full scan of the narrow index instead of a full scan of the fat table.

create index msi_compliance_hash_n1 on mattmsi (compliance_hash);  
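
Optionally, confirm that the optimizer reads the narrow index rather than the full table (a sketch using DBMS_XPLAN):

explain plan for
select matt_hash_aggregate(compliance_hash)
  from (select compliance_hash from mattmsi order by compliance_hash);

select * from table(dbms_xplan.display);

The plan should show an index scan on msi_compliance_hash_n1 rather than a TABLE ACCESS FULL on mattmsi.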

Put it all together to compute your hash

SELECT matt_hash_aggregate(compliance_hash) from (select compliance_hash from mattmsi order by compliance_hash);

A few comments:

  1. I think it is important to use a hash to compute the aggregate (rather than merely doing a SUM() over the row-level hashes), because an attacker could forge the correct sum very easily.
  2. I don't think you can (easily?) use parallel query, because it is important that the rows be fed to the aggregate function in a consistent order, or else the hash value will change; see the sketch below for forcing serial execution.
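
If the table or session has a default degree of parallelism, one way to keep the evaluation deterministic is to force serial execution (a sketch, assuming the statement-level NO_PARALLEL hint of 11.2+):

select /*+ NO_PARALLEL */ matt_hash_aggregate(compliance_hash)
  from (select compliance_hash from mattmsi order by compliance_hash);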
answered Sep 24 '22 by Matthew McPeak