
Why is stanford corenlp gender identification nondeterministic?

I have the following results. As you can see, the name edward gets different results (null and MALE). This has happened with several names.

edward, Gender: null
james, Gender: MALE
karla, Gender: null
edward, Gender: MALE

Additionally, how can I customize the gender dictionaries? I want to add Spanish and Chinese names.

user3390236 asked Dec 14 '25 18:12



1 Answer

You have raised a lot of issues!

1.) Karla is not in the default gender mappings file, which is why it gets null.

2.) If you want to make your own custom file, it should be in this format (\t stands for a tab character):

JOHN\tMALE

There should be one NAME\tGENDER entry per line.

The GenderAnnotator can take only one mappings file, so you need to make a new file that contains the default entries plus the names you want to add.

The default gender mappings file is in the stanford-corenlp-3.5.2-models.jar file.

You can extract the default gender mappings file from that jar in this manner:

  • mkdir tmp-stanford-models-expanded

  • cp /path/of/stanford-corenlp-3.5.2-models.jar tmp-stanford-models-expanded

  • cd tmp-stanford-models-expanded

  • jar xf stanford-corenlp-3.5.2-models.jar

  • there should now be a directory tmp-stanford-models-expanded/edu

  • the file you want is tmp-stanford-models-expanded/edu/stanford/nlp/models/gender/first_name_map_small
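Once the default file is extracted, combining it with your own entries into a single file could look roughly like the sketch below. This is not part of CoreNLP: the class name GenderDictionaryMerger and the three command-line paths are made up for illustration. It also assumes both files are UTF-8 and upper-cases names to match the JOHN example above; whether the annotator actually needs upper-case names is an assumption worth checking against the extracted file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not a CoreNLP class.
class GenderDictionaryMerger {

    // Merges NAME<TAB>GENDER lines. If a name appears in both lists,
    // the custom entry overrides the default one.
    public static List<String> merge(List<String> defaultLines, List<String> customLines) {
        Map<String, String> genderByName = new LinkedHashMap<>();
        List<String> all = new ArrayList<>(defaultLines);
        all.addAll(customLines);
        for (String line : all) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] parts = line.split("\t");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Expected NAME<TAB>GENDER but got: " + line);
            }
            // Assumption: upper-case both fields to mirror the JOHN\tMALE example.
            String name = parts[0].trim().toUpperCase(Locale.ROOT);
            String gender = parts[1].trim().toUpperCase(Locale.ROOT);
            genderByName.put(name, gender);
        }
        List<String> merged = new ArrayList<>();
        for (Map.Entry<String, String> entry : genderByName.entrySet()) {
            merged.add(entry.getKey() + "\t" + entry.getValue());
        }
        return merged;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: GenderDictionaryMerger <default-file> <custom-file> <output-file>");
            return;
        }
        Path defaults = Paths.get(args[0]); // e.g. the extracted first_name_map_small
        Path custom = Paths.get(args[1]);   // your own NAME<TAB>GENDER entries
        Path output = Paths.get(args[2]);   // the file to pass as gender.firstnames
        List<String> merged = merge(
                Files.readAllLines(defaults, StandardCharsets.UTF_8),
                Files.readAllLines(custom, StandardCharsets.UTF_8));
        Files.write(output, merged, StandardCharsets.UTF_8);
    }
}
```

The output file is the one you would then point the "gender.firstnames" property at. Whether Spanish or Chinese names in that file are matched at runtime also depends on how the tokenizer splits your text, which this sketch does not address.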

3.) Build your pipeline in this manner to use your custom gender dictionary:

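import java.util.Properties;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

Properties props = new Properties();
// gender is listed before ner on purpose (see point 4)
props.setProperty("annotators",
    "tokenize, ssplit, pos, lemma, gender, ner");
// path to your custom NAME\tGENDER mappings file
props.setProperty("gender.firstnames", "/path/to/your/gender_dictionary.txt");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);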

4.) Try running gender BEFORE ner in your pipeline (see my ordering of the annotators above). The RegexNERSequenceClassifier (which is the class that adds the Gender tags) can get blocked if tokens already have NER tags. It looks to me like running the gender annotator first will fix the problem, so make sure gender comes before ner when you build the pipeline.

The sequence "edward james karla edward" is tagged "O O PERSON PERSON" by the NER tagger. I am not entirely sure why those first two tokens get "O" for their NER tags. I would note that "Edward James Karla Edward" yields "PERSON PERSON PERSON PERSON". Keep in mind that the NER tagger factors in position in the sentence, so perhaps being lowercased at the beginning of the sentence is causing the first token "edward" to be marked as "O"?

If you have any issues with this, please let me know and I will be happy to help more!

TL;DR

1.) Karla gets null because that name is not in the gender dictionary.

2.) You can make your own gender mappings file with NAME\tGENDER entries; make sure the property "gender.firstnames" is set to the path of your new gender mappings file.

3.) Make sure the gender annotator goes before the ner annotator; this should fix the problem!

StanfordNLPHelp answered Dec 16 '25 23:12
