In a table with > 100k rows, how can I efficiently shuffle the values of a specific column?
Table definition:
CREATE TABLE person
(
id integer NOT NULL,
first_name character varying,
last_name character varying,
CONSTRAINT person_pkey PRIMARY KEY (id)
)
In order to anonymize data, I have to shuffle the values of the 'first_name' column in place (I'm not allowed to create a new table).
My try:
with
first_names as (
select row_number() over (order by random()),
first_name as new_first_name
from person
),
ids as (
select row_number() over (order by random()),
id as ref_id
from person
)
update person
set first_name = new_first_name
from first_names, ids
where id = ref_id;
It takes hours to complete.
Is there an efficient way to do it?
This one takes 5 seconds to shuffle 500.000 rows on my laptop:
with names as (
select id, first_name, last_name,
lead(first_name) over w as first_1,
lag(first_name) over w as first_2
from person
window w as (order by random())
)
update person
set first_name = coalesce(first_1, first_2)
from names
where person.id = names.id;
The idea is to pick the "next" name after sorting the data randomly. Which is just as good as picking a random name.
There is a chance that not all names are shuffled, but if you run it two or three times, this should be good enough.
Here is a test setup on SQLFiddle: http://sqlfiddle.com/#!15/15713/1
The query on the right hand side checks if any first name stayed the same after the "randomizing"
The problem with postgres is every update mean delete
+ insert
SELECT
instead UPDATE
to see what is the performance of CTE.
CREATE TABLE new_table AS
SELECT * ....
DROP oldtable;
Rename new_table to old_table
CREATE index and constrains
Sorry that isnt an option for you :(
EDIT: After reading a_horse_with_no_name
looks like you need
with
first_names as (
select row_number() over (order by random()) rn,
first_name as new_first_name
from person
),
ids as (
select row_number() over (order by random()) rn,
id as ref_id
from person
)
update person
set first_name = new_first_name
from first_names
join ids
on first_names.rn = ids.rn
where id = ref_id;
Again for performance question is better if you provide the ANALYZE / EXPLAIN
result.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With