
How to do an accent and case-insensitive search in MediaWiki database?

Let's pretend that I have these page titles in my wiki (MediaWiki 1.19.4):

SOMETHIng
Sómethìng
SomêthÏng
SÒmetHínG

If a user searches for something, I want all four pages to be returned as results.

At the moment the only thing I could think of is this query (MySQL Percona 5.5.30-30.2):

SELECT page_title
FROM page
WHERE page_title LIKE '%something%' COLLATE utf8_general_ci

Which only returns SOMETHIng.

I must be on the right track, because if I search for sóméthíng or SÓMÉTHÍNG, I get SOMETHIng as the result. How could I modify the query so that I get the other results as well? Performance is not critical here, since the page table contains only ~2K rows.

This is the table definition with the relevant bits:

CREATE TABLE page (
    (...)
    page_title VARCHAR(255) NOT NULL DEFAULT '' COLLATE latin1_bin,
    (...)
    UNIQUE INDEX name_title (page_namespace, page_title),
)

The table definition must not be modified, since this is a stock installation of MediaWiki and AFAIK its code expects this field to be defined that way (i.e. Unicode stored as binary data).

MM. asked Apr 15 '13 11:04


1 Answer

The MediaWiki TitleKey extension is basically designed for this, but it only does case-folding. However, if you don't mind hacking it a bit, and have the PHP iconv extension installed, you could edit TitleKey_body.php and replace the method:

static function normalize( $text ) {
    global $wgContLang;
    return $wgContLang->caseFold( $text );
}

with e.g.:

static function normalize( $text ) {
    return strtoupper( iconv( 'UTF-8', 'US-ASCII//TRANSLIT', $text ) );
}

and (re)run rebuildTitleKeys.php.

The TitleKey extension stores its normalized titles in a separate table, surprisingly named titlekey. It's intended to be accessed through the MediaWiki search interface, but if you want, you can certainly query it directly too, e.g. like this:

SELECT page.* FROM page
  JOIN titlekey ON tk_page = page_id
WHERE tk_namespace = 0 AND tk_key = 'SOMETHING';
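
The query above matches the whole title exactly, whereas the question used a substring pattern. A minimal sketch of a substring variant follows. It assumes that normalize() has been replaced as shown earlier and rebuildTitleKeys.php has been rerun, so that every tk_key is upper-case ASCII. It also assumes that tk_key is stored with a binary collation, which makes LIKE case-sensitive, hence the UPPER() call. The search term itself must already be free of accents, since UPPER() does not strip them.

```sql
-- Sketch only: substring search against the normalized keys.
-- Assumes tk_key holds upper-case ASCII (modified normalize() + rebuilt keys).
SELECT page.page_title
FROM page
  JOIN titlekey ON tk_page = page_id
WHERE tk_namespace = 0
  AND tk_key LIKE CONCAT('%', UPPER('something'), '%');
```

The leading wildcard prevents MySQL from using an index on tk_key, which should be acceptable for the ~2K rows mentioned in the question.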
Ilmari Karonen answered Sep 24 '22 18:09