
How to find singular in the plural when some letters change? What is the best approach?

How can I find the singular in the plural when some letters change?

Following situation:

  • The German word Schließfach is a lockbox.
  • The plural is Schließfächer.

As you can see, the letter a has changed to ä. As a result, the singular is no longer a substring of the plural; from a regex point of view they are different strings.

Maybe the tags I chose below are off the mark, and maybe regex is not the right tool for this. I've seen that naturaljs (natural.NounInflector()) provides this functionality out of the box for English words. Are there similar solutions for German?

What is the best approach? How can I find the singular in the plural in German?

Lonely asked Nov 12 '20 14:11



2 Answers

I once had to build a text processor that parsed many languages, in registers ranging from very casual to very formal. One of its tasks was to identify whether certain words were related (for example, a noun in a title that referred to a list of things, where the list was sometimes labeled with the plural form).

IIRC, 70-90% of singular and plural word pairs across all the languages we supported had a "Levenshtein distance" of less than 3 or 4. (Eventually several dictionaries were added to improve accuracy, because distance alone produced many false positives.) Another interesting finding was that the longer the words, the more likely a distance of 3 or less indicated a relationship in meaning.

Here's an example of the libraries we used:

const fastLevenshtein = require('fast-levenshtein');

console.log('Raw Distances:')
console.log('Score 1:', fastLevenshtein.get('Schließfächer', 'Schließfach'));
// -> 3
console.log('Score 2:', fastLevenshtein.get('Blumtach', 'Blumtächer'));
// -> 3
console.log('Score 3:', fastLevenshtein.get('schließfächer', 'Schliessfaech'));
// -> 7
console.log('Score 4:', fastLevenshtein.get('not-it', 'Schliessfaech'));
// -> 12
console.log('Score 5:', fastLevenshtein.get('not-it', 'Schiesse'));
// -> 8


/**
 * Additional strategy for dealing with various other languages:
 *   "deburr" the strings to strip diacritics before checking the distance.
 */

const deburr = require('lodash.deburr');
console.log('Deburred Distances:')
// Deburr each input string, not the numeric result of get().
console.log('Score 1:', fastLevenshtein.get(deburr('Schließfächer'), deburr('Schließfach')));
// -> 2
console.log('Score 2:', fastLevenshtein.get(deburr('Blumtach'), deburr('Blumtächer')));
// -> 2
console.log('Score 3:', fastLevenshtein.get(deburr('schließfächer'), deburr('Schliessfaech')));
// -> 4


// Lower than the raw distances, because ä -> a and ß -> ss no longer count as edits. Helpful in other similar use cases too.
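
The thresholding idea described in this answer can be sketched without any dependency. This is only a minimal sketch: `levenshtein` and `looksRelated` are hypothetical helper names, and the cutoff of 3 and the minimum word length of 5 are illustrative assumptions, not values from the original project. As noted above, distance alone produces false positives, so a real system would likely pair it with a dictionary.

```javascript
// Plain dynamic-programming Levenshtein distance (two-row variant).
function levenshtein(a, b) {
  const s = Array.from(a);
  const t = Array.from(b);
  let prev = Array.from({ length: t.length + 1 }, (_, j) => j);
  for (let i = 1; i <= s.length; i++) {
    const curr = [i];
    for (let j = 1; j <= t.length; j++) {
      const cost = s[i - 1] === t[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution or match
      );
    }
    prev = curr;
  }
  return prev[t.length];
}

// Hypothetical helper: maxDistance and minLength are assumed defaults.
// Short words are rejected because a small distance means little for them.
function looksRelated(a, b, maxDistance = 3, minLength = 5) {
  const x = a.toLowerCase();
  const y = b.toLowerCase();
  if (Math.min(x.length, y.length) < minLength) return false;
  return levenshtein(x, y) <= maxDistance;
}

console.log(looksRelated('Schließfach', 'Schließfächer')); // -> true
console.log(looksRelated('not-it', 'Schiesse'));           // -> false
console.log(looksRelated('Tag', 'Tor'));                   // -> false (too short)
```

The minimum-length guard reflects the observation that longer words are more reliably related at small distances; the exact numbers would need tuning per language.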
Dan Levy answered Nov 15 '22 08:11


You can use a stemmer (which is in fact a lemmatizer) from the nlp.js library, which has models for 40 languages.

const { StemmerDe } = require('@nlpjs/lang-de');

const stemmer = new StemmerDe();
console.log(stemmer.stemWord('Schließfach'));
console.log(stemmer.stemWord('Schließfächer'));
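
To show why reducing both forms to a common stem answers the question, here is a deliberately crude, dependency-free sketch. It is not the algorithm nlp.js uses: `naiveStem` and `sameStem` are hypothetical names, and the suffix list is an assumption that will mis-handle many German nouns. It only illustrates the idea of folding umlauts and stripping a plural ending before comparing.

```javascript
// Crude normalizer, NOT the nlp.js stemmer: fold umlauts and ß, then strip
// one common ending. The suffix list is an illustrative assumption.
function naiveStem(word) {
  const folded = word
    .toLowerCase()
    .replace(/ä/g, 'a')
    .replace(/ö/g, 'o')
    .replace(/ü/g, 'u')
    .replace(/ß/g, 'ss');
  // The lookbehind keeps at least three characters so short words survive.
  return folded.replace(/(?<=.{3})(ern|er|en|e|n|s)$/, '');
}

// Two words are treated as forms of the same noun if their stems match.
function sameStem(a, b) {
  return naiveStem(a) === naiveStem(b);
}

console.log(naiveStem('Schließfach'));               // -> schliessfach
console.log(naiveStem('Schließfächer'));             // -> schliessfach
console.log(sameStem('Schließfach', 'Schließfächer')); // -> true
```

A real stemmer applies many more rules than this, which is why a library is the better choice in practice.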
Jindřich answered Nov 15 '22 09:11