Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

PostgreSQL shuffle column values

In a table with > 100k rows, how can I efficiently shuffle the values of a specific column?

Table definition:

CREATE TABLE person
(
  id integer NOT NULL,
  first_name character varying,
  last_name character varying,
 CONSTRAINT person_pkey PRIMARY KEY (id)
)

In order to anonymize data, I have to shuffle the values of the 'first_name' column in place (I'm not allowed to create a new table).

My try:

with
first_names as (
select row_number() over (order by random()),
       first_name as new_first_name
from person
),
ids as (
select row_number() over (order by random()), 
       id as ref_id
from person
)
update person
set first_name = new_first_name
from first_names, ids
where id = ref_id;

It takes hours to complete.

Is there an efficient way to do it?

like image 408
Serge Avatar asked Nov 05 '15 21:11

Serge


2 Answers

This one takes 5 seconds to shuffle 500.000 rows on my laptop:

with names as (
  select id, first_name, last_name,
         lead(first_name) over w as first_1,
         lag(first_name) over w as first_2
  from person
  window w as (order by random())
)
update person
  set first_name = coalesce(first_1, first_2)
from names 
where person.id = names.id;

The idea is to pick the "next" name after sorting the data randomly. Which is just as good as picking a random name.

There is a chance that not all names are shuffled, but if you run it two or three times, this should be good enough.

Here is a test setup on SQLFiddle: http://sqlfiddle.com/#!15/15713/1

The query on the right hand side checks if any first name stayed the same after the "randomizing"

like image 129
a_horse_with_no_name Avatar answered Oct 08 '22 15:10

a_horse_with_no_name


The problem with postgres is every update mean delete + insert

  • You can check the analyze with using a SELECT instead UPDATE to see what is the performance of CTE
  • You can turn off index so update are faster
  • But the best solution I use when need update all the rows is create the table again

.

CREATE TABLE new_table AS 
     SELECT * ....


DROP oldtable;

Rename new_table to old_table

CREATE index and constrains

Sorry that isnt an option for you :(

EDIT: After reading a_horse_with_no_name

looks like you need

with
first_names as (
    select row_number() over (order by random()) rn,
           first_name as new_first_name
    from person
),
ids as (
    select row_number() over (order by random()) rn, 
           id as ref_id
    from person
)
update person
set first_name = new_first_name
from first_names
join ids
  on first_names.rn = ids.rn
where id = ref_id;

Again for performance question is better if you provide the ANALYZE / EXPLAIN result.

like image 6
Juan Carlos Oropeza Avatar answered Oct 08 '22 14:10

Juan Carlos Oropeza