Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to check if a value already exists to avoid duplicates?

Tags:

sql

php

mysql

I've got a table of URLs and I don't want any duplicate URLs. How do I check to see if a given URL is already in the table using PHP/MySQL?

like image 949
Gilean Avatar asked Sep 14 '08 01:09

Gilean


People also ask

How do I avoid duplicates in SQL?

The go to solution for removing duplicate rows from your result sets is to include the distinct keyword in your select statement. It tells the query engine to remove duplicates to produce a result set in which every row is unique.

Does full outer join remove duplicates?

With help of it you can eliminate the duplicates on selected columns. @chirag-mehta - Yes, my sql with look like that.


2 Answers

If you don't want to have duplicates you can do following:

  • add uniqueness constraint
  • use "REPLACE" or "INSERT ... ON DUPLICATE KEY UPDATE" syntax

If multiple users can insert data to DB, method suggested by @Jeremy Ruten, can lead to an error: after you performed a check someone can insert similar data to the table.

like image 144
aku Avatar answered Sep 19 '22 19:09

aku


To answer your initial question, the easiest way to check whether there is a duplicate is to run an SQL query against what you're trying to add!

For example, were you to want to check for the url http://www.example.com/ in the table links, then your query would look something like

SELECT * FROM links WHERE url = 'http://www.example.com/';

Your PHP code would look something like

$conn = mysql_connect('localhost', 'username', 'password');
if (!$conn)
{
    die('Could not connect to database');
}
if(!mysql_select_db('mydb', $conn))
{
    die('Could not select database mydb');
}

$result = mysql_query("SELECT * FROM links WHERE url = 'http://www.example.com/'", $conn);

if (!$result)
{
    die('There was a problem executing the query');
}

$number_of_rows = mysql_num_rows($result);

if ($number_of_rows > 0)
{
    die('This URL already exists in the database');
}

I've written this out longhand here, with all the connecting to the database, etc. It's likely that you'll already have a connection to a database, so you should use that rather than starting a new connection (replace $conn in the mysql_query command and remove the stuff to do with mysql_connect and mysql_select_db)

Of course, there are other ways of connecting to the database, like PDO, or using an ORM, or similar, so if you're already using those, this answer may not be relevant (and it's probably a bit beyond the scope to give answers related to this here!)

However, MySQL provides many ways to prevent this from happening in the first place.

Firstly, you can mark a field as "unique".

Lets say I have a table where I want to just store all the URLs that are linked to from my site, and the last time they were visited.

My definition might look something like this:-

CREATE TABLE links
(
    url VARCHAR(255) NOT NULL,
    last_visited TIMESTAMP
)

This would allow me to add the same URL over and over again, unless I wrote some PHP code similar to the above to stop this happening.

However, were my definition to change to

CREATE TABLE links
(
  url VARCHAR(255)  NOT NULL,
  last_visited TIMESTAMP,
  PRIMARY KEY (url)
)

Then this would make mysql throw an error when I tried to insert the same value twice.

An example in PHP would be

$result = mysql_query("INSERT INTO links (url, last_visited) VALUES ('http://www.example.com/', NOW()", $conn);

if (!$result)
{
    die('Could not Insert Row 1');
}

$result2 = mysql_query("INSERT INTO links (url, last_visited) VALUES ('http://www.example.com/', NOW()", $conn);

if (!$result2)
{
    die('Could not Insert Row 2');
}

If you ran this, you'd find that on the first attempt, the script would die with the comment Could not Insert Row 2. However, on subsequent runs, it'd die with Could not Insert Row 1.

This is because MySQL knows that the url is the Primary Key of the table. A Primary key is a unique identifier for that row. Most of the time, it's useful to set the unique identifier for a row to be a number. This is because MySQL is quicker at looking up numbers than it is looking up text. Within MySQL, keys (and espescially Primary Keys) are used to define relationships between two tables. For example, if we had a table for users, we could define it as

CREATE TABLE users (
  username VARCHAR(255)  NOT NULL,
  password VARCHAR(40) NOT NULL,
  PRIMARY KEY (username)
)

However, when we wanted to store information about a post the user had made, we'd have to store the username with that post to identify that the post belonged to that user.

I've already mentioned that MySQL is faster at looking up numbers than strings, so this would mean we'd be spending time looking up strings when we didn't have to.

To solve this, we can add an extra column, user_id, and make that the primary key (so when looking up the user record based on a post, we can find it quicker)

CREATE TABLE users (
  user_id INT(10)  NOT NULL AUTO_INCREMENT,
  username VARCHAR(255)  NOT NULL,
  password VARCHAR(40)  NOT NULL,
  PRIMARY KEY (`user_id`)
)

You'll notice that I've also added something new here - AUTO_INCREMENT. This basically allows us to let that field look after itself. Each time a new row is inserted, it adds 1 to the previous number, and stores that, so we don't have to worry about numbering, and can just let it do this itself.

So, with the above table, we can do something like

INSERT INTO users (username, password) VALUES('Mez', 'd3571ce95af4dc281f142add33384abc5e574671');

and then

INSERT INTO users (username, password) VALUES('User', '988881adc9fc3655077dc2d4d757d480b5ea0e11');

When we select the records from the database, we get the following:-

mysql> SELECT * FROM users;
+---------+----------+------------------------------------------+
| user_id | username | password                                 |
+---------+----------+------------------------------------------+
|       1 | Mez      | d3571ce95af4dc281f142add33384abc5e574671 |
|       2 | User     | 988881adc9fc3655077dc2d4d757d480b5ea0e11 |
+---------+----------+------------------------------------------+
2 rows in set (0.00 sec)

However, here - we have a problem - we can still add another user with the same username! Obviously, this is something we don't want to do!

mysql> SELECT * FROM users;
+---------+----------+------------------------------------------+
| user_id | username | password                                 |
+---------+----------+------------------------------------------+
|       1 | Mez      | d3571ce95af4dc281f142add33384abc5e574671 |
|       2 | User     | 988881adc9fc3655077dc2d4d757d480b5ea0e11 |
|       3 | Mez      | d3571ce95af4dc281f142add33384abc5e574671 |
+---------+----------+------------------------------------------+
3 rows in set (0.00 sec)

Lets change our table definition!

CREATE TABLE users (
  user_id INT(10)  NOT NULL AUTO_INCREMENT,
  username VARCHAR(255)  NOT NULL,
  password VARCHAR(40)  NOT NULL,
  PRIMARY KEY (user_id),
  UNIQUE KEY (username)
)

Lets see what happens when we now try and insert the same user twice.

mysql> INSERT INTO users (username, password) VALUES('Mez', 'd3571ce95af4dc281f142add33384abc5e574671');
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO users (username, password) VALUES('Mez', 'd3571ce95af4dc281f142add33384abc5e574671');
ERROR 1062 (23000): Duplicate entry 'Mez' for key 'username'

Huzzah!! We now get an error when we try and insert the username for the second time. Using something like the above, we can detect this in PHP.

Now, lets go back to our links table, but with a new definition.

CREATE TABLE links
(
    link_id INT(10)  NOT NULL AUTO_INCREMENT,
    url VARCHAR(255)  NOT NULL,
    last_visited TIMESTAMP,
    PRIMARY KEY (link_id),
    UNIQUE KEY (url)
)

and let's insert "http://www.example.com" into the database.

INSERT INTO links (url, last_visited) VALUES ('http://www.example.com/', NOW());

If we try and insert it again....

ERROR 1062 (23000): Duplicate entry 'http://www.example.com/' for key 'url'

But what happens if we want to update the time it was last visited?

Well, we could do something complex with PHP, like so:-

$result = mysql_query("SELECT * FROM links WHERE url = 'http://www.example.com/'", $conn);

if (!$result)
{
    die('There was a problem executing the query');
}

$number_of_rows = mysql_num_rows($result);

if ($number_of_rows > 0)
{
    $result = mysql_query("UPDATE links SET last_visited = NOW() WHERE url = 'http://www.example.com/'", $conn);

    if (!$result)
    {
        die('There was a problem updating the links table');
    }
}

Or, even grab the id of the row in the database and use that to update it.

$result = mysql_query("SELECT * FROM links WHERE url = 'http://www.example.com/'", $conn);

if (!$result)
{
    die('There was a problem executing the query');
}

$number_of_rows = mysql_num_rows($result);

if ($number_of_rows > 0)
{
    $row = mysql_fetch_assoc($result);

    $result = mysql_query('UPDATE links SET last_visited = NOW() WHERE link_id = ' . intval($row['link_id'], $conn);

    if (!$result)
    {
        die('There was a problem updating the links table');
    }
}

But, MySQL has a nice built in feature called REPLACE INTO

Let's see how it works.

mysql> SELECT * FROM links;
+---------+-------------------------+---------------------+
| link_id | url                     | last_visited        |
+---------+-------------------------+---------------------+
|       1 | http://www.example.com/ | 2011-08-19 23:48:03 |
+---------+-------------------------+---------------------+
1 row in set (0.00 sec)

mysql> INSERT INTO links (url, last_visited) VALUES ('http://www.example.com/', NOW());
ERROR 1062 (23000): Duplicate entry 'http://www.example.com/' for key 'url'
mysql> REPLACE INTO links (url, last_visited) VALUES ('http://www.example.com/', NOW());
Query OK, 2 rows affected (0.00 sec)

mysql> SELECT * FROM links;
+---------+-------------------------+---------------------+
| link_id | url                     | last_visited        |
+---------+-------------------------+---------------------+
|       2 | http://www.example.com/ | 2011-08-19 23:55:55 |
+---------+-------------------------+---------------------+
1 row in set (0.00 sec)

Notice that when using REPLACE INTO, it's updated the last_visited time, and not thrown an error!

This is because MySQL detects that you're attempting to replace a row. It knows the row that you want, as you've set url to be unique. MySQL figures out the row to replace by using the bit that you passed in that should be unique (in this case, the url) and updating for that row the other values. It's also updated the link_id - which is a bit unexpected! (In fact, I didn't realise this would happen until I just saw it happen!)

But what if you wanted to add a new URL? Well, REPLACE INTO will happily insert a new row if it can't find a matching unique row!

mysql> REPLACE INTO links (url, last_visited) VALUES ('http://www.stackoverflow.com/', NOW());
Query OK, 1 row affected (0.00 sec)

mysql> SELECT * FROM links;
+---------+-------------------------------+---------------------+
| link_id | url                           | last_visited        |
+---------+-------------------------------+---------------------+
|       2 | http://www.example.com/       | 2011-08-20 00:00:07 |
|       3 | http://www.stackoverflow.com/ | 2011-08-20 00:01:22 |
+---------+-------------------------------+---------------------+
2 rows in set (0.00 sec)

I hope this answers your question, and gives you a bit more information about how MySQL works!

like image 29
Mez Avatar answered Sep 20 '22 19:09

Mez