Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

mysql accent insensitive and dotted insensitive search

Tags:

php

search

mysql

The Problem: I am trying to implement a search algorithm that shows the results even when dotted chars are provided. In other words: SELECT 'über' = 'uber' or SELECT 'mas' = 'maş' these results will return true. This would apply for every single char in the following array:

$arr = array('ş' => 's', 'ç' => 'c', 'ö' => 'o', 'ü' => 'u' and so on ...);

The Solution In My Mind: Along with the original column, I can have a particular column that stores the English names. So before storing 'über' to database, I will also convert it to 'uber' in php and then will store both 'über' (as the original) and 'uber' (as the searchable) to the database.

But then, even though I've searched for this the whole day, I still believe that there should be a simplier and cleaner way to accomplish the task since this would mean (more or less) to store the same data twice in the database. So guys, what do you think is the solution the only way to go or you know a better approach?

EDIT

For accent insensitive I've seen the posts on SO, they are working but since I am also considering the dotted chars, I had to ask this question.

EDIT2

I cannot post the whole table structure and code exactly for some reasons but I'll provide a close example.

myusers | CREATE TABLE `myusers` (
id int auto_increment not null primary key,
email varchar(100) COLLATE latin1_general_ci not null,
fullname varchar(75) COLLATE latin1_general_ci not null)
PRIMARY KEY('id')
) ENGINE=MyISAM AUTO_INCREMENET=2 DEFAULT CHARSET=latin1 COLLATE latin1_general_ci |

The above is the structure of the table. Here comes the inserts and selects:

INSERT INTO myusers (fullname) VALUES ('Agüeda');
INSERT INTO myusers (fullname) VALUES ('Agueda');

SELECT * FROM myusers WHERE fullname = 'Agüeda' COLLATE latin1_general_ci 

+----+-------+----------+
| id | email | fullname |
+----+-------+----------+
|  1 |       | Agüeda   |
+----+-------+----------+
1 row in set (0.00 sec)

SELECT * FROM myusers WHERE fullname = 'agueda' COLLATE latin1_general_ci 

+----+-------+----------+
| id | email | fullname |
+----+-------+----------+
|  2 |       | Agueda   |
+----+-------+----------+
1 row in set (0.00 sec)

Well, the desired result is obviously when agueda is searched both 'Agueda' and 'Agüeda' will return, but that's not the case. As I mentioned above, I have created a new column and store the whole name in English characters and make the search from there as well. But still, it costs me a two times search (because I am also searching from the original columns which rank higher in the search result). There should be a better way...

like image 870
Shaokan Avatar asked Oct 09 '11 21:10

Shaokan


2 Answers

1) Write your own collation. latin1_general_diacriticinsensitive. I wouldn't even know where to begin, though :).

2) Use regex and character groups: /[uü]ber/

3) The Solution In Your Mind. I'd personally use this, since design is all about compromise and this is a simple solution with just a 100% space overhead. Granted, the space overhead might eventually turn into a speed overhead, especially with MySQL, but that's to worry about later. This is also very easy to undo if need be.

like image 77
aib Avatar answered Oct 17 '22 17:10

aib


Just use an appropriate collation. For instance:

create table test(
    foo text
) collate = utf8_unicode_ci;
insert into test values('Agüeda');
insert into test values('Agueda');
select * from test where foo = 'Agueda';

This gives your two rows.

like image 20
Artefacto Avatar answered Oct 17 '22 18:10

Artefacto