Language detection with data in PostgreSQL

I have a table in PostgreSQL with a text column. For testing purposes, I need a library or tool that can identify the language of each text.

The solution does not need to run inside PostgreSQL, because I'm having problems installing languages there; anything that can connect to the database, retrieve the texts, and identify their language is welcome.

I tried Lingua::Identify, as suggested in the answers, directly in a Perl script. It worked, but the results are not precise enough.

The texts I want to identify come from the web and most are in Portuguese, but Lingua::Identify classifies many of them as French, Italian, or Spanish, which are similar languages.

I need something more precise.

I added the java and r tags because those are the languages I'm using in the system, so a solution using them would be easy to implement, but solutions in any language are welcome.

Renato Dinhani asked Jan 21 '12 20:01



1 Answer

You can use PL/Perl (CREATE FUNCTION langof(text) LANGUAGE plperlu AS ...) with the Lingua::Identify CPAN module.

Perl script:

#!/usr/bin/perl
use strict;
use warnings;
use Lingua::Identify qw(langof);

# Slurp all of STDIN (or the files named on the command line) into memory.
my $text = do { local $/; <> };

# In scalar context, langof returns the most probable language code.
my $lang = langof($text);
print "$lang\n";

And the function:

create or replace function langof( text ) returns varchar(2)
immutable returns null on null input
language plperlu as $perlcode$
    use Lingua::Identify qw(langof);
    return langof( shift );
$perlcode$;

Works for me:

filip@filip=# select langof('Pójdź, kiń-że tę chmurność w głąb flaszy');
 langof
--------
 pl
(1 row)

Time: 1.801 ms

PL/Perl on Windows

The PL/Perl language library (plperl.dll) comes preinstalled with the latest Windows installer of PostgreSQL.

But to use PL/Perl, you also need the Perl interpreter itself, specifically Perl 5.14 (at the time of this writing). The most common installer is ActiveState's, but it's not free. A free one comes from Strawberry Perl. Make sure you have PERL514.DLL in place.

After installing Perl, log in to your PostgreSQL database and try to run

CREATE LANGUAGE plperlu;

Language identification library

If quality is your concern, you have some options: You can improve Lingua::Identify yourself (it's open source) or you could try another library. I found this one, which is commercial but looks promising.
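
If you go the route of improving the identification yourself, one common lever for closely related languages is to score a text against short, language-specific function words and to restrict the candidates to the languages you actually expect. The following is a minimal Python sketch of that idea, not Lingua::Identify's actual implementation. The word lists, the function names rank_languages and detect, and the candidates parameter are all illustrative assumptions, and the lists are far too small for real use.

```python
import re
import unicodedata

# Hypothetical, hand-picked function words per language. A real system would
# derive much larger lists (or character n-gram profiles) from a corpus.
SMALL_WORDS = {
    "pt": {"não", "você", "uma", "com", "para", "os", "do", "da", "em", "é",
           "que", "o", "um", "isso", "muito", "também", "mais"},
    "es": {"el", "los", "las", "una", "con", "para", "del", "en", "es",
           "que", "y", "un", "no", "muy", "también", "más", "pero"},
    "fr": {"le", "les", "une", "avec", "pour", "du", "des", "en", "est",
           "que", "et", "un", "ne", "pas", "très", "aussi", "plus", "mais"},
    "it": {"il", "gli", "le", "una", "con", "per", "del", "della", "in", "è",
           "che", "e", "un", "non", "molto", "anche", "più", "ma"},
}


def rank_languages(text, candidates=None):
    """Return (language, hits) pairs for the candidate languages, best first."""
    langs = list(candidates) if candidates is not None else list(SMALL_WORDS)
    # Normalise to NFC first so "a" + combining tilde is treated as "ã".
    normalised = unicodedata.normalize("NFC", text).lower()
    tokens = re.findall(r"\w+", normalised)
    scores = {
        lang: sum(token in SMALL_WORDS[lang] for token in tokens)
        for lang in langs
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


def detect(text, candidates=None):
    """Return the best language code, or None if nothing matched or the top two tie."""
    ranked = rank_languages(text, candidates)
    if not ranked or ranked[0][1] == 0:
        return None  # no known function word found
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # ambiguous: better to flag than to guess
    return ranked[0][0]
```

Returning None for ties and empty matches lets you set doubtful texts aside for manual review instead of counting them as French, Italian, or Spanish. Passing something like candidates=["pt", "es"] narrows the decision to the languages you expect in the data.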

filiprem answered Sep 17 '22 17:09