How can I find the singular in the plural when some letters change?
Consider the following situation: Schließfach is a lockbox, and its plural is Schließfächer.
As you can see, the letter a has changed to ä. Because of this, the first word is no longer a substring of the second one; as far as a regex is concerned, they are different strings.
Maybe I'm in the wrong corner with the tags I chose below, and maybe regex is not the right tool for this. I've seen that natural (natural.NounInflector) provides this functionality out of the box for English words. Are there similar solutions for German?
What is the best approach to finding the singular within the plural in German?
In English, if a word ends in –s, –sh, –ch, –x, or –z, you add –es. For almost all other nouns, add –s to pluralize.
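As a rough illustration of that rule, here is a minimal sketch. The function name `pluralizeEnglish` is made up for this example, and it covers only the two regular cases stated above; irregular English nouns and German umlaut plurals such as Schließfächer are not handled.

```javascript
// Hypothetical helper: implements only the two regular English rules above.
function pluralizeEnglish(noun) {
  // Nouns ending in s, sh, ch, x or z take "es".
  if (/(s|sh|ch|x|z)$/i.test(noun)) {
    return noun + 'es';
  }
  // Almost all other nouns simply take "s".
  return noun + 's';
}

console.log(pluralizeEnglish('box'));  // boxes
console.log(pluralizeEnglish('lock')); // locks
```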
I once had to build a text processor that parsed many languages, in registers from very casual to very formal. One of the tasks was to identify whether certain words were related (for example, a noun in a title that referred to a list of things, sometimes labeled with a plural form).
IIRC, 70-90% of singular and plural word forms across all the languages we supported had a Levenshtein distance of less than 3 or 4. (Eventually several dictionaries were added to improve accuracy, because distance alone produced many false positives.) Another interesting finding was that the longer the words, the more likely it was that a distance of 3 or fewer indicated a relationship in meaning.
Here's an example of the libraries we used:
const fastLevenshtein = require('fast-levenshtein');
console.log('Raw Distances:');
console.log('Score 1:', fastLevenshtein.get('Schließfächer', 'Schließfach'));
// -> 3
console.log('Score 2:', fastLevenshtein.get('Blumtach', 'Blumtächer'));
// -> 3
console.log('Score 3:', fastLevenshtein.get('schließfächer', 'Schliessfaech'));
// -> 7
console.log('Score 4:', fastLevenshtein.get('not-it', 'Schliessfaech'));
// -> 12
console.log('Score 5:', fastLevenshtein.get('not-it', 'Schiesse'));
// -> 8
/**
 * Additional strategy for dealing with various other languages:
 * "deburr" the strings to strip diacritics before checking the distance.
 * Note that deburr must be applied to each string, not to the numeric result.
 */
const deburr = require('lodash.deburr');
console.log('Deburred Distances:');
console.log('Score 1:', fastLevenshtein.get(deburr('Schließfächer'), deburr('Schließfach')));
// -> 2
console.log('Score 2:', fastLevenshtein.get(deburr('Blumtach'), deburr('Blumtächer')));
// -> 2
console.log('Score 3:', fastLevenshtein.get(deburr('schließfächer'), deburr('Schliessfaech')));
// -> 4
// deburr maps "ä" to "a" and "ß" to "ss", so the umlaut change no longer adds to the distance.
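The distance heuristic described above could be wrapped into a small helper along these lines. This is only a sketch without dependencies: the function names, the hand-rolled Levenshtein implementation, and the default threshold of 3 are assumptions for illustration, not what fast-levenshtein or lodash do internally.

```javascript
// Hypothetical helpers; plain JavaScript, no npm packages required.

// Fold case, "ß" and diacritics so that "ä" and "a" compare as equal.
function normalize(word) {
  return word
    .toLowerCase()
    .replace(/ß/g, 'ss')
    .normalize('NFD')                 // split "ä" into "a" + combining mark
    .replace(/[\u0300-\u036f]/g, ''); // drop the combining marks
}

// Dynamic-programming Levenshtein distance, keeping only two rows.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumed threshold: a distance of at most 3 counts as "probably related".
function isLikelyInflection(a, b, maxDistance = 3) {
  return levenshtein(normalize(a), normalize(b)) <= maxDistance;
}

console.log(isLikelyInflection('Schließfach', 'Schließfächer')); // true
console.log(isLikelyInflection('Schließfach', 'not-it'));        // false
```

A fixed threshold like this is exactly where the false positives come from: short words such as Hand and Hund are one edit apart yet unrelated, so in practice you would probably scale the threshold with word length or back it with a dictionary.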
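You can use a stemmer from the nlp.js library, which has models for 40 languages. A stemmer reduces inflected forms to a common stem (a rougher relative of a lemmatizer), so if both calls below return the same stem, the two words are most likely forms of the same noun.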
const { StemmerDe } = require('@nlpjs/lang-de');
const stemmer = new StemmerDe();
console.log(stemmer.stemWord('Schließfach'));
console.log(stemmer.stemWord('Schließfächer'));