I am writing a script to reverse all genders in a piece of text, so all gendered words are swapped - "man" is swapped with "woman", "she" is swapped with "he", etc. But there is an ambiguity as to whether "her" should be replaced with "him" or "his".
Okay. Let's look at this like a linguist might. I am thinking aloud here.
"Her" is a pronoun. It can either be a:
1. possessive pronoun
   This is her book.
2. personal pronoun
   Give it to her. (after preposition)
   He wrote her a letter. (indirect object)
   He treated her for a cold. (direct object)
So let's look at case (1), the possessive pronoun. That is, it is a pronoun in the "genitive" case (meaning it marks possession; okay, that detail isn't quite as important as the next one).
In this case, "her" is acting as a "determiner". Determiners may occur in two places in a sentence (this is a simplification):
Det + Noun ("her book")
Det + Adj + Noun ("her nice book")
So to figure out if "her" is a determiner, you could use this logic:
a. If the word following "her" is a noun, then "her" is a determiner.
b. If the two words following "her" are an adjective and then a noun, then "her" is a determiner.
And if you establish that "her" is a determiner, then you know that you must replace it with "his", which is also a determiner (aka genitive noun, aka possessive pronoun).
If it doesn't match criteria (a) and (b) above, then you could possibly conclude that it is not a determiner, which means it must be a personal pronoun. In that case, you would replace "her" with "him".
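Here is a minimal sketch of rules (a) and (b) in Python. It assumes a tiny hand-written word list in place of a real dictionary, and the names `NOUNS`, `ADJECTIVES`, `is_determiner`, and `swap_her` are made up for illustration. It also ignores punctuation between words, which is one of the simplifications already admitted here.

```python
import re

# Toy lexicon (an assumption for this sketch); a real script needs a full dictionary.
NOUNS = {"book", "letter", "cold", "dog", "car"}
ADJECTIVES = {"nice", "old", "red"}

WORD = re.compile(r"[A-Za-z']+")


def is_determiner(words, i):
    """Return True if words[i] ("her") is followed by Noun or by Adj + Noun."""
    nxt = words[i + 1] if i + 1 < len(words) else ""
    nxt2 = words[i + 2] if i + 2 < len(words) else ""
    if nxt in NOUNS:                             # rule (a): Det + Noun
        return True
    return nxt in ADJECTIVES and nxt2 in NOUNS   # rule (b): Det + Adj + Noun


def swap_her(sentence):
    """Replace each "her" with "his" (determiner) or "him" (anything else)."""
    matches = list(WORD.finditer(sentence))
    words = [m.group().lower() for m in matches]
    out, last = [], 0
    for i, m in enumerate(matches):
        if words[i] != "her":
            continue
        new = "his" if is_determiner(words, i) else "him"
        if m.group()[0].isupper():               # keep sentence-initial capitals
            new = new.capitalize()
        out.append(sentence[last:m.start()] + new)
        last = m.end()
    out.append(sentence[last:])                  # untouched tail of the sentence
    return "".join(out)


print(swap_her("This is her book."))
print(swap_her("Give it to her."))
```

Any word missing from the lists makes the check fall through to "him", so the quality of the result depends entirely on how complete the dictionary is.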
You wouldn't even have to do the tests below, but I'll try to describe them anyway.
Looking at (2) from above: personal pronoun, rather than possessive. This gets trickier.
The examples above show "her" occurring in 3 ways:
(1) Give it to her. (after preposition. we call this the "object of a preposition".)
So you could maybe devise a rule: "If 'her' occurs immediately after a preposition, then it should be treated as a noun, so we would replace it with 'him'".
The next two are tricky. "her" can either be a direct object or an indirect object.
(2) He wrote her a letter. (indirect object)
(3) He treated her for a cold. (direct object)
Syntactically, how can we tell the difference?
A direct object occurs immediately after a verb.
If you have a verb, followed by a noun, then that noun is a direct object. eg:
He treated her.*
If you have a verb, followed by a noun, followed by a prepositional phrase, then the noun is a direct object.
He treated her for a cold. ("her" is a noun, and it comes immediately after the verb "treated". "for a cold" is a prepositional phrase.)
Which means that you could say: "If you have Verb + Noun + Prep, then the noun is a direct object." Since the noun is a direct object, it is a personal pronoun, so use "him". (Note: you only have to check for a preposition, not the entire prepositional phrase, since the phrase will always begin with a preposition.)
If it is an indirect object, then you'll have the form "verb + noun + noun".
He wrote her a letter. ("her" is a noun, "letter" is a noun. well, "a letter" is a "noun phrase", so you'd have to account for determiners as well.)
So... if "her" is a direct object, indirect object, or obj of prep, you could change it to "him", otherwise, change it to "his".
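The three personal-pronoun rules can be sketched the same way. This is only an illustration under loud assumptions: the word lists are toy examples, the names `her_role` and `replacement_for_her` are invented, and the indirect-object rule only fires when the following noun phrase starts with an article ("a", "an", "the"), because a bare Verb + "her" + Noun, as in "He read her book", is more likely possessive.

```python
import re

# Toy word lists (assumptions for this sketch, far from complete).
PREPOSITIONS = {"to", "for", "with", "from", "at", "by", "of", "on", "in"}
VERBS = {"wrote", "treated", "gave", "saw", "told"}
ARTICLES = {"a", "an", "the"}


def her_role(words, i):
    """Guess the role of words[i] == "her"; return None if no rule matches."""
    prev = words[i - 1] if i > 0 else ""
    nxt = words[i + 1] if i + 1 < len(words) else ""
    if prev in PREPOSITIONS:
        return "object of preposition"   # Prep + her
    if prev in VERBS:
        if nxt in ARTICLES:
            return "indirect object"     # Verb + her + noun phrase
        if nxt == "" or nxt in PREPOSITIONS:
            return "direct object"       # Verb + her, or Verb + her + Prep
    return None


def replacement_for_her(sentence):
    """Return "him" or "his" for the first "her" in the sentence."""
    words = re.findall(r"[a-z']+", sentence.lower())
    i = words.index("her")               # raises ValueError if "her" is absent
    return "him" if her_role(words, i) else "his"


for s in ["Give it to her.", "He wrote her a letter.",
          "He treated her for a cold.", "This is her book."]:
    print(s, "->", replacement_for_her(s))
```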
This method seems a lot more complicated, so I'd just start by checking whether "her" is a determiner (see above): if it is, use "his"; otherwise, just use "him".
So, the above has a lot of simplifications. It doesn't cover "interrupting phrases", or clause structures, or constituency tests, or embedded clauses, or punctuation, or anything like that.
Also, this solution requires a dictionary - a list of "nouns" and "verbs" and "prepositions" so that you can determine the lexical category of each word in the sentence.
And even there, man, natural language processing is hard. You'd want to do some sort of "training" for your model to have a good solution. BUT for very simple things, try some of the stuff described above.
Sorry for being so verbose! (None of the existing answers gave any hard data, or precise linguistic definitions, so here goes.)
Given the scope of your project: reversing all gender-related words, it appears that :
Furthermore, Regex too seems a poor choice of tool; natural language is just not a regular language ;-).
Instead, you should consider introducing Part-of-Speech (POS) tagging, possibly with a hint of Named Entity Recognition, and then apply substitution rules based on the extra info the tagging supplied.
This may seem like a lot of work, but if, for example, your scripting language happens to be Python, you can leverage NLTK to implement all this with relatively little effort.
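As a rough sketch of what tag-based substitution could look like: Penn Treebank tags, which NLTK's default English tagger emits, mark possessive pronouns as `PRP$` and personal pronouns as `PRP`, so the "her" ambiguity reduces to a lookup on (word, tag). The `SWAPS` table and the function names below are illustrative assumptions, and the table covers only a handful of pronouns. `swap_text` needs NLTK installed plus its tokenizer and tagger data (fetched once with `nltk.download`), and how well it works depends on how accurately the tagger labels each "her".

```python
# (lowercased word, Penn Treebank tag) -> replacement. Deliberately tiny.
SWAPS = {
    ("her", "PRP$"): "his",   # possessive: "her book" -> "his book"
    ("her", "PRP"): "him",    # personal:   "to her"   -> "to him"
    ("his", "PRP$"): "her",
    ("him", "PRP"): "her",
    ("she", "PRP"): "he",
    ("he", "PRP"): "she",
}


def swap_tagged(tagged):
    """Swap gendered pronouns in a list of (token, tag) pairs; return tokens."""
    out = []
    for word, tag in tagged:
        new = SWAPS.get((word.lower(), tag))
        if new is None:
            out.append(word)                     # not in the table: keep as-is
        elif word[0].isupper():
            out.append(new.capitalize())         # preserve capitalisation
        else:
            out.append(new)
    return out


def swap_text(text):
    """Tokenize and tag with NLTK, then apply swap_tagged."""
    import nltk  # third-party; imported here so swap_tagged works without it
    return swap_tagged(nltk.pos_tag(nltk.word_tokenize(text)))


# Hand-tagged input, so this line runs without NLTK:
print(swap_tagged([("She", "PRP"), ("read", "VBD"), ("her", "PRP$"), ("book", "NN")]))
```

Keeping the substitution table separate from the tagging step means the rules can be checked with hand-tagged input, and the tagger can be replaced later without touching them.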