I use the following query to create my table.
create table t1 (url varchar(250) unique);
Then I insert about 500 urls, twice. I am expecting that the second time I had the URLs that no new entries show up in my table, but instead my count value doubles for:
select count(*) from t1;
What I want is that when I try and add a url that is already in my table, it is skipped. Have I declared something in my table deceleration incorrect?
I am using RedShift from AWS.
Sample
urlenrich=# insert into seed(url, source) select 'http://www.google.com', '1';
INSERT 0 1
urlenrich=# select * from seed;
url | wascrawled | source | date_crawled
-----------------------+------------+--------+--------------
http://www.google.com | 0 | 1 |
(1 row)
urlenrich=# insert into seed(url, source) select 'http://www.google.com', '1';
INSERT 0 1
urlenrich=# select * from seed;
url | wascrawled | source | date_crawled
-----------------------+------------+--------+--------------
http://www.google.com | 0 | 1 |
http://www.google.com | 0 | 1 |
(2 rows)
Output of \d seed
urlenrich=# \d seed
Table "public.seed"
Column | Type | Modifiers
--------------+-----------------------------+-----------
url | character varying(250) |
wascrawled | integer | default 0
source | integer | not null
date_crawled | timestamp without time zone |
Indexes:
"seed_url_key" UNIQUE, btree (url)
Figured out the problem
Amazon RedShift does not enforce constraints...
As explained here http://docs.aws.amazon.com/redshift/latest/dg/t_Defining_constraints.html
They said they may get around to changing it at some point.
NEW 11/21/2013 RDS has added support for PostGres, if you need unique and such an postgres rds instance is now the best way to go.
In redshift, constraints are recommended but doesn't take effect, constraints will just help to the query planner to select better ways to perform the query.
Usually, columnar databases do not manage indexes or constraints.
Although Amazon Redshift doesn't support unique constraints, there are some ways to delete duplicated records that can be helpful. See the following link for the details.
copy data from Amazon s3 to Red Shift and avoid duplicate rows
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With