Fastest way to compute the hash of a whole table [duplicate]

Tags: sql, oracle, plsql, database-performance, oracle12c

We need to compute table hashes in an external environment and compare them to pre-computed hashes from an internal environment. The purpose is to ensure that data in the external environment has not been tampered with by a "rogue" database administrator. Users insist on this feature.

Currently, we do this by computing the individual hash of each column value, performing a bit-xor on the column hashes to get the row hash, then performing a bit-xor on all the row hashes to come up with the table hash. Pseudo-script below:

-- Hash every column value, keyed by its column name
cursor hash_cur is
select /*+ PARALLEL(4) */
       dbms_crypto.mac(column1_in_raw_type, dbms_crypto.hmac_sh512, utl_raw.cast_to_raw('COLUMN1_NAME')) as COLUMN1_NAME
       ...
from TABLE_NAME;

open hash_cur;
fetch hash_cur bulk collect into hashes;
close hash_cur;

for i in 1..hashes.count
loop
  rec := hashes(i);
  -- XOR the column hashes together to get the row hash
  record_xor := rec.COLUMN1_NAME;
  record_xor := utl_raw.bit_xor(record_xor, rec.COLUMN2_NAME);
  ...
  record_xor := utl_raw.bit_xor(record_xor, rec.COLUMNN_NAME);

  -- XOR the row hash into the running table hash
  table_xor := utl_raw.bit_xor(table_xor, record_xor);
end loop;

The pseudo-script above will be run in parallel by using dbms_job.
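
To make the scheme concrete outside the database, here is a minimal Python model of it. It is only a sketch: the function names (`column_mac`, `row_hash`, `table_hash`) and the sample rows are made up for illustration, and Python's `hmac` module stands in for `dbms_crypto.mac`. It shows why the work can be split across parallel jobs: XOR is commutative and associative, so partial table hashes can be combined in any order.

```python
import hashlib
import hmac
from functools import reduce


def column_mac(value: bytes, column_name: str) -> bytes:
    """HMAC-SHA512 of one column value, keyed by the column name."""
    return hmac.new(column_name.encode(), value, hashlib.sha512).digest()


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def row_hash(row: dict) -> bytes:
    """XOR of the per-column MACs of one row."""
    return reduce(xor_bytes, (column_mac(v, name) for name, v in row.items()))


def table_hash(rows) -> bytes:
    """XOR of all row hashes, starting from 64 zero bytes (the SHA-512 digest size)."""
    return reduce(xor_bytes, (row_hash(r) for r in rows), bytes(64))


# Hypothetical sample data; values are already in raw (bytes) form.
rows = [
    {"COLUMN1": b"alice", "COLUMN2": b"100"},
    {"COLUMN1": b"bob", "COLUMN2": b"250"},
    {"COLUMN1": b"carol", "COLUMN2": b"75"},
]

full = table_hash(rows)

# Two "parallel jobs" each hash a slice; their results are XORed together.
part_a = table_hash(rows[:2])
part_b = table_hash(rows[2:])
combined = xor_bytes(part_a, part_b)
```

Here `combined` equals `full`, and hashing the rows in reverse order gives the same value as well.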

The problem is that certain tables hold terabytes of data, and the current performance falls short of what we want to achieve. Hashing must be done "on-the-fly" because users want to perform hash checks themselves.

Do you guys have a better way to perform whole table hashing, or basically comparing tables from different environments which are connected by a low-latency and relatively low-bandwidth network?

It seems to me that the operation is more CPU-bound than I/O-bound. I am thinking of storing the table data in a blob instead, where data is arranged by record, then by column, and then hashing the output file. This should make the operation completely I/O-bound.
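
A rough sketch of that blob idea, modeled in Python rather than PL/SQL. Everything here is an assumption made for illustration: the helper names (`serialize_row`, `hash_stream`), the sample rows, the choice of SHA-512, and the length-prefixed layout, which is used so that field boundaries stay unambiguous once the values are laid out back to back.

```python
import hashlib
import io
import struct


def serialize_row(columns) -> bytes:
    """Lay out one record column by column, each prefixed with its 4-byte length."""
    out = bytearray()
    for col in columns:
        out += struct.pack(">I", len(col)) + col
    return bytes(out)


def hash_stream(stream, chunk_size=1 << 20) -> str:
    """Hash a file-like object in fixed-size chunks so memory use stays constant."""
    h = hashlib.sha512()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        h.update(chunk)
    return h.hexdigest()


# Hypothetical rows, arranged by record, then by column.
rows = [(b"1", b"alice"), (b"2", b"bob")]
blob = b"".join(serialize_row(r) for r in rows)

digest = hash_stream(io.BytesIO(blob))
```

With this layout a single pass over the blob produces the hash, and the chunk size only affects how much is read at a time, not the result.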

What is the fastest way to do this? Is there any way to do this within the select clause of a query, to remove the overhead of the PL/SQL-to-SQL engine context switch?
- I was thinking of modifying a global blob for this.
- I would also like to remove the I/O overhead of bulk collecting the results.

Any suggestions that can lead me to a better performing script would be greatly appreciated. Thanks.

asked Nov 20 '15 10:11

user3367701

1 Answer

First of all, I think the way to approach "rogue administrators" is with a combination of Oracle's audit trail and Database Vault features.

That said, here's what I might try:

1) Create a custom ODCI aggregate function that computes a hash over multiple rows as an aggregate.
2) Create a VIRTUAL NOT NULL column on the table that is a SHA hash of all the columns in the table, or all the ones you care about protecting. You'd keep this around all the time, basically trading away some insert/update/delete performance in exchange for being able to compute hashes more quickly.
3) Create a non-unique index on that virtual column.
4) SELECT my_aggregate_hash_function(virtual_hash_column) FROM my_table to get the results.

Here's code:

Create an aggregate function to compute a SHA hash over a bunch of rows

CREATE OR REPLACE TYPE matt_hash_aggregate_impl AS OBJECT
(
  hash_value RAW(32000),
  CONSTRUCTOR FUNCTION matt_hash_aggregate_impl(SELF IN OUT NOCOPY matt_hash_aggregate_impl ) RETURN SELF AS RESULT,  
-- Called to initialize a new aggregation context
-- For analytic functions, the aggregation context of the *previous* window is passed in, so we only need to adjust as needed instead 
-- of creating the new aggregation context from scratch
  STATIC FUNCTION ODCIAggregateInitialize (sctx IN OUT matt_hash_aggregate_impl) RETURN NUMBER,
-- Called when a new data point is added to an aggregation context  
  MEMBER FUNCTION ODCIAggregateIterate (self IN OUT matt_hash_aggregate_impl, value IN raw ) RETURN NUMBER,
-- Called to return the computed aggregate from an aggregation context
  MEMBER FUNCTION ODCIAggregateTerminate (self IN matt_hash_aggregate_impl, returnValue OUT raw, flags IN NUMBER) RETURN NUMBER,
-- Called to merge two aggregation contexts into one (e.g., merging results of parallel slaves)
  MEMBER FUNCTION ODCIAggregateMerge (self IN OUT matt_hash_aggregate_impl, ctx2 IN matt_hash_aggregate_impl) RETURN NUMBER,
  -- ODCIAggregateDelete
  MEMBER FUNCTION ODCIAggregateDelete(self IN OUT matt_hash_aggregate_impl, value raw) RETURN NUMBER  
);

/

CREATE OR REPLACE TYPE BODY matt_hash_aggregate_impl IS

CONSTRUCTOR FUNCTION matt_hash_aggregate_impl(SELF IN OUT NOCOPY matt_hash_aggregate_impl ) RETURN SELF AS RESULT IS
BEGIN
  SELF.hash_value := null;
  RETURN;
END;


STATIC FUNCTION ODCIAggregateInitialize (sctx IN OUT matt_hash_aggregate_impl) RETURN NUMBER IS
BEGIN
  sctx := matt_hash_aggregate_impl ();
  RETURN ODCIConst.Success;
END;


MEMBER FUNCTION ODCIAggregateIterate (self IN OUT matt_hash_aggregate_impl, value IN raw ) RETURN NUMBER IS
BEGIN
  IF self.hash_value IS NULL THEN
    self.hash_value := dbms_crypto.hash(value, dbms_crypto.hash_sh1);
  ELSE 
      self.hash_value := dbms_crypto.hash(self.hash_value || value, dbms_crypto.hash_sh1);
  END IF;
  RETURN ODCIConst.Success;
END;

MEMBER FUNCTION ODCIAggregateTerminate (self IN matt_hash_aggregate_impl, returnValue OUT raw, flags IN NUMBER) RETURN NUMBER IS
BEGIN
  returnValue := dbms_crypto.hash(self.hash_value,dbms_crypto.hash_sh1);
  RETURN ODCIConst.Success;
END;

MEMBER FUNCTION ODCIAggregateMerge (self IN OUT matt_hash_aggregate_impl, ctx2 IN matt_hash_aggregate_impl) RETURN NUMBER IS
BEGIN
    self.hash_value := dbms_crypto.hash(self.hash_value || ctx2.hash_value, dbms_crypto.hash_sh1);
  RETURN ODCIConst.Success;
END;

-- ODCIAggregateDelete
MEMBER FUNCTION ODCIAggregateDelete(self IN OUT matt_hash_aggregate_impl, value raw) RETURN NUMBER IS
BEGIN
  raise_application_error(-20001, 'Invalid operation -- hash aggregate function does not support windowing!');
END;  

END;
/

CREATE OR REPLACE FUNCTION matt_hash_aggregate ( input raw) RETURN raw
PARALLEL_ENABLE AGGREGATE USING matt_hash_aggregate_impl;
/

Create a test table to work with (you can skip this since you have your real table)

create table mattmsi as select * from mtl_system_items where rownum <= 200000;

Create a virtual column hash of each row's data. Make sure it is `NOT NULL`

alter table mattmsi add compliance_hash generated always as ( dbms_crypto.hash(to_clob(inventory_item_id || segment1 || last_update_date || created_by || description), 3 /*dbms_crypto.hash_sh1*/) ) VIRTUAL not null ;

Create an index on the virtual column; this way you can compute your hash with a full scan of the narrow index instead of a full scan of the fat table

create index msi_compliance_hash_n1 on mattmsi (compliance_hash);

Put it all together to compute your hash

SELECT matt_hash_aggregate(compliance_hash) from (select compliance_hash from mattmsi order by compliance_hash);

A few comments:

I think it is important to use a hash to compute the aggregate (rather than merely doing a SUM() over the row-level hashes), because an attacker could forge the correct sum very easily.
I don't think you can (easily?) use parallel query because it is important that the rows be fed to the aggregate function in a consistent order, or else the hash value will change.
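
Both comments can be illustrated with a small Python model of the aggregate. This is a sketch under assumptions, not the PL/SQL itself: `chained_hash` mirrors the Iterate/Terminate logic above, with byte concatenation standing in for `||`; `xor_all` is a stand-in for an order-independent combiner such as SUM() or bit-xor; and the row values are made up.

```python
import hashlib
from functools import reduce


def sha1(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()


def chained_hash(row_hashes) -> bytes:
    """Fold each row hash into a running SHA-1 state, then hash the state once more."""
    state = None
    for h in row_hashes:
        state = sha1(h) if state is None else sha1(state + h)
    # Assumption: empty input is hashed as empty bytes; the PL/SQL version does not define this case.
    return sha1(state if state is not None else b"")


def xor_all(row_hashes) -> bytes:
    """Order-independent combiner: byte-wise XOR of all 20-byte row hashes."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), row_hashes, bytes(20))


# Hypothetical row-level hashes (what the virtual column would hold).
row_hashes = [sha1(b"row-1"), sha1(b"row-2"), sha1(b"row-3")]
forged_row = sha1(b"row inserted twice")

# The chained hash depends on the order in which rows are fed in.
in_order = chained_hash(sorted(row_hashes))
reordered = chained_hash(sorted(row_hashes, reverse=True))
order_matters = in_order != reordered

# An XOR combiner cannot see a row that was inserted twice; the chained hash can.
xor_fooled = xor_all(row_hashes + [forged_row, forged_row]) == xor_all(row_hashes)
chain_detects = chained_hash(sorted(row_hashes + [forged_row, forged_row])) != in_order
```

All three flags come out true, which is why the query above feeds the aggregate from a subquery with ORDER BY compliance_hash, and why splitting the input across parallel slaves in an uncontrolled order would change the result.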

answered Sep 24 '22 21:09

Matthew McPeak
