In a table with > 100k rows, how can I efficiently shuffle the values of a specific column?
Table definition:
CREATE TABLE person
(
  id integer NOT NULL,
  first_name character varying,
  last_name character varying,
 CONSTRAINT person_pkey PRIMARY KEY (id)
)
In order to anonymize data, I have to shuffle the values of the 'first_name' column in place (I'm not allowed to create a new table).
My try:
with
first_names as (
select row_number() over (order by random()),
       first_name as new_first_name
from person
),
ids as (
select row_number() over (order by random()), 
       id as ref_id
from person
)
update person
set first_name = new_first_name
from first_names, ids
where id = ref_id;
It takes hours to complete.
Is there an efficient way to do it?
This one takes 5 seconds to shuffle 500.000 rows on my laptop:
with names as (
  select id, first_name, last_name,
         lead(first_name) over w as first_1,
         lag(first_name) over w as first_2
  from person
  window w as (order by random())
)
update person
  set first_name = coalesce(first_1, first_2)
from names 
where person.id = names.id;
The idea is to pick the "next" name after sorting the data randomly. Which is just as good as picking a random name.
There is a chance that not all names are shuffled, but if you run it two or three times, this should be good enough.
Here is a test setup on SQLFiddle: http://sqlfiddle.com/#!15/15713/1
The query on the right hand side checks if any first name stayed the same after the "randomizing"
The problem with postgres is every update mean delete + insert
SELECT instead UPDATE to see what is the performance of CTE.
CREATE TABLE new_table AS 
     SELECT * ....
DROP oldtable;
Rename new_table to old_table
CREATE index and constrains
Sorry that isnt an option for you :(
EDIT: After reading  a_horse_with_no_name
looks like you need
with
first_names as (
    select row_number() over (order by random()) rn,
           first_name as new_first_name
    from person
),
ids as (
    select row_number() over (order by random()) rn, 
           id as ref_id
    from person
)
update person
set first_name = new_first_name
from first_names
join ids
  on first_names.rn = ids.rn
where id = ref_id;
Again for performance question is better if you provide the ANALYZE / EXPLAIN result.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With