
Thesaurus class or API for PHP [edited]

TL;DR Summary: I need a single command-line application that I can use to get synonyms and other related words. It needs to be multilingual and to work cross-platform. Can anyone suggest a suitable program, or help me with the ones I've already found? Thanks.


Longer version: I've been tasked with writing a system in PHP that can come up with alternative suggestions for words entered by the user. I need to find a thesaurus application / API or similar which I can use to generate these suggestions.

Importantly, it needs to be multilingual (English, Danish, French and German). This rules out most of the software that I managed to find using Google. It also needs to be cross-platform (it needs to work on Linux and Windows).

My research has led me to two promising candidates: WordNet and Stardict.

I've been focusing on WordNet so far, calling it from PHP using the shell_exec() function, and I've managed to use it to create a very promising prototype PHP page, but so far in English only. I'm struggling to make it work with multiple languages.

The WordNet site has external links to WordNet projects in other languages (e.g. DanNet for Danish), but although they're often called WordNet, they seem to use a variety of database formats and software, which makes them unsuitable for me. I need a consistent interface that I can call from my PHP program.

Stardict looked more promising from that perspective: they provide dictionaries in many languages in a standard DB format for the one application.

But the downside of Stardict is that it's primarily a GUI app. Calling it from the command line launches the GUI. There is apparently a command-line version (SDCV), but it seems quite out of date (last update 2006), and is only for Linux.

Can anyone help me with my problems with either of these programs? Or else, can anyone suggest any other alternative software or API that I could use?

Many thanks.

Spudley asked Apr 28 '11 11:04


2 Answers

You could try to leverage PostgreSQL's full text search functionality:

http://www.postgresql.org/docs/9.0/static/textsearch.html

You can configure it with any of the available languages and all sorts of collations to fit your needs. PostgreSQL 9.1 adds some extra collation functionality that you may want to look into if the approach seems reasonable.

The basic steps would be (for each language):

  1. Create the needed table (collated appropriately). For our purposes, a single column is enough, e.g.:

    create table dict_en (
      word text check (word = lower(word)) primary key
    );
    
  2. Fetch the needed dictionary/thesaurus files (those from aspell/OpenOffice should work).

  3. Configure text search (see link above, namely section 12.6) using the relevant files.

  4. Insert the whole dictionary into the table. (Surely there's a csv file somewhere...)

  5. And finally index the vector, e.g.:

    create index on dict_en using gin (to_tsvector('english', word));
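
Steps 3 and 4 could look roughly like the sketch below. It follows the thesaurus dictionary example in section 12.6 of the docs, and it makes several assumptions: that a file in PostgreSQL's own thesaurus format has been installed as `thesaurus_en.ths` under `$SHAREDIR/tsearch_data`, and that the word list is a plain text file with one lowercase word per line. The names `thesaurus_en`, `english_syn` and the file path are made up for illustration.

```sql
-- Step 3: register a thesaurus dictionary (hypothetical name thesaurus_en).
create text search dictionary thesaurus_en (
  template = thesaurus,
  dictfile = thesaurus_en,              -- file name without the .ths extension
  dictionary = pg_catalog.english_stem  -- subdictionary used to normalize input
);

-- Work on a copy so the built-in 'english' configuration stays untouched.
create text search configuration english_syn (copy = pg_catalog.english);

-- Try the thesaurus first, then fall back to the stemmer.
alter text search configuration english_syn
  alter mapping for asciiword, asciihword, hword_asciipart
  with thesaurus_en, english_stem;

-- Step 4: bulk-load the word list (placeholder path, read by the server).
-- Duplicate or non-lowercase words will be rejected by the table constraints.
copy dict_en (word) from '/path/to/words_en.txt';
```

If you go this way, use `'english_syn'` instead of `'english'` in the index and in the queries. Server-side `copy` from a file needs elevated privileges; psql's `\copy` reads the file on the client instead.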
    

You can now run queries that use this index:

    -- Find words related to `:word`
    select word
    from dict_en
    where to_tsvector('english', word) @@ plainto_tsquery('english', :word)
    and word <> :word;

You might need to create a separate database or schema for each language, and add an additional tsvector column if Postgres refuses to index the expression because of the language parameter (I read the full text search docs a long time ago). The details on this are in section 12.2, and I'm sure you'll know how to adjust the above if this is the case.
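
If the expression index turns out to be a problem, the stored-column variant from section 12.2 might look like this. It is only a sketch: the column name `word_tsv`, the index name and the trigger name are arbitrary choices, and `:word` is the same bind placeholder as in the query above.

```sql
-- Store the vector in its own column instead of indexing an expression.
alter table dict_en add column word_tsv tsvector;
update dict_en set word_tsv = to_tsvector('english', word);

create index dict_en_word_tsv_idx on dict_en using gin (word_tsv);

-- Keep the column current for rows inserted or updated later,
-- using the built-in tsvector_update_trigger function.
create trigger dict_en_word_tsv_update
  before insert or update on dict_en
  for each row execute procedure
  tsvector_update_trigger(word_tsv, 'pg_catalog.english', word);

-- Same lookup as before, now against the stored column.
select word
from dict_en
where word_tsv @@ plainto_tsquery('english', :word)
and word <> :word;
```

For the other languages the same pattern should apply with a table per language (for example a hypothetical `dict_da`) and the matching built-in configuration, such as `'danish'`, `'french'` or `'german'`.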

Whatever the implementation details, though, I believe the approach should work.

Denis de Bernardy answered Oct 23 '22 05:10


There is a PHP example of using a thesaurus API here:

http://thesaurus.altervista.org/testphp

Available for Italian, English, French, German, Spanish and Portuguese.

Fenton answered Oct 23 '22 07:10