I am curious to know whether there is a package or method for R to guess gender from first names.
I am thinking of running it on the U.S. Congress as a test.
I need this to work over several European languages.
CRAN does not have such a package.
CRAN has the gender package, but it works only on English names.
Issue solved by the genderizeR package. See links in my self-answer.
There is now a package on CRAN specifically for this: gender.
From the description:
Encodes gender based on names and dates of birth, using either the Social Security Administration's data set of first names by year of birth or the Census Bureau data from 1789 to 1940, both from the United States of America. By using these data sets instead of lists of male and female names, this package is able to more accurately guess the gender of a name, and it is able to report the probability that a name was male or female.
It also has a very helpful vignette demonstrating typical uses.
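As a rough illustration of what a call might look like, here is a minimal sketch. It assumes the gender package is installed along with its companion genderdata package (which, as far as I know, the "ssa" method needs). The example names and the year range are my own choices, not taken from the vignette.

```r
# Example first names (arbitrary choices for illustration).
first_names <- c("Madison", "Hillary", "James")

# Run only if both packages are available; the "ssa" method is assumed
# to read its data from the separately installed genderdata package.
if (requireNamespace("gender", quietly = TRUE) &&
    requireNamespace("genderdata", quietly = TRUE)) {
  # years gives the range of birth years to consider for each name.
  res <- gender::gender(first_names, years = c(1930, 2012), method = "ssa")
  print(res)
}
```

Since the underlying data are from the United States only, results for non-English names should be treated with caution.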
I believe the answer is "no," but you could still use R to analyze this. Any answer will necessarily be probabilistic, since some names are ambiguous and others are too rare to appear in any reference list. This Stack Overflow question has some helpful suggestions, but its links are out of date. US census data is a good place to start. From the 1990 United States census, you can find name directories and metadata at http://www.census.gov/genealogy/www/data/1990surnames/names_files.html. Some interesting issues are discussed in http://www.census.gov/srd/papers/pdf/rr97-2.pdf and http://www.census.gov/population/www/documentation/twps07/twps07.pdf.
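To make the probabilistic idea concrete, here is a small base R sketch. The counts in name_counts are made-up toy numbers, not census figures, and guess_gender and its threshold argument are hypothetical names introduced only for this illustration. With real data you would build the table from male and female name-frequency files.

```r
# Hypothetical toy counts; real values would come from name-frequency files.
name_counts <- data.frame(
  name   = c("mary", "james", "jamie", "drew"),
  female = c(2629, 5, 120, 3),
  male   = c(9, 3318, 80, 60),
  stringsAsFactors = FALSE
)

# Estimate P(female | name) and label a name only when the estimate is
# at least `threshold` away from a coin flip in either direction.
guess_gender <- function(names, counts, threshold = 0.9) {
  idx <- match(tolower(names), counts$name)
  p_female <- counts$female[idx] / (counts$female[idx] + counts$male[idx])
  guess <- ifelse(is.na(p_female), NA_character_,
           ifelse(p_female >= threshold, "F",
           ifelse(p_female <= 1 - threshold, "M", "ambiguous")))
  data.frame(name = names, p_female = round(p_female, 3), guess = guess,
             stringsAsFactors = FALSE)
}

result <- guess_gender(c("Mary", "JAMES", "jamie", "zyx"), name_counts)
print(result)
```

Names missing from the table come back as NA, and names whose counts are split between the sexes are flagged as ambiguous rather than forced into one category.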
Please don't accept this as an answer, as it is based on others' answers and links. I have added this function to the qdap package, as it fits the package.
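library(qdap)

# Defaults: every name is assigned F or M
name2sex(qcv(mary, jenn, linda, JAME, GABRIEL, OLIVA,
    tyler, jamie, JAMES, tyrone, cheryl, drew))

# Second argument FALSE: ambiguous names are returned as B, unknown names as NA
name2sex(qcv(mary, jenn, linda, JAME, GABRIEL, OLIVA,
    tyler, jamie, JAMES, tyrone, cheryl, drew), FALSE)

# Same, but with fuzzy matching (third argument) for unrecognized names
name2sex(qcv(mary, jenn, linda, JAME, GABRIEL, OLIVA,
    tyler, jamie, JAMES, tyrone, cheryl, drew), FALSE, TRUE)

# Assign F or M, but without fuzzy matching: unrecognized names stay NA
name2sex(qcv(mary, jenn, linda, JAME, GABRIEL, OLIVA,
    tyler, jamie, JAMES, tyrone, cheryl, drew), TRUE, FALSE)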
## > name2sex(qcv(mary, jenn, linda, JAME, GABRIEL, OLIVA,
## + tyler, jamie, JAMES, tyrone, cheryl, drew))
## [1] F F F M M F M F M M F M
## Levels: F M
## > name2sex(qcv(mary, jenn, linda, JAME, GABRIEL, OLIVA,
## + tyler, jamie, JAMES, tyrone, cheryl, drew), FALSE)
## [1] B <NA> F B B F B B B M F B
## Levels: B F M
## > name2sex(qcv(mary, jenn, linda, JAME, GABRIEL, OLIVA,
## + tyler, jamie, JAMES, tyrone, cheryl, drew), FALSE, TRUE)
## [1] B F F B B F B B B M F B
## Levels: B F M
## > name2sex(qcv(mary, jenn, linda, JAME, GABRIEL, OLIVA,
## + tyler, jamie, JAMES, tyrone, cheryl, drew), TRUE, FALSE)
## [1] F <NA> F M M F M F M M F M
## Levels: F M
Edit: I added a fuzzy.match argument that attempts to guess gender for unrecognized names based on fuzzy matching, though this is computationally expensive.
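For readers wondering what fuzzy matching could look like in principle, here is a minimal base R sketch. It is not qdap's implementation: the known lookup table, the fuzzy_gender helper, and the max_dist cutoff are all hypothetical, and the sketch uses plain edit distance via adist.

```r
# Hypothetical lookup table of known names and their genders.
known <- c(mary = "F", jennifer = "F", linda = "F", james = "M", tyrone = "M")

# Return the gender of the exact match if there is one; otherwise fall
# back to the closest known name by edit distance, within max_dist edits.
fuzzy_gender <- function(name, lookup, max_dist = 2) {
  key <- tolower(name)
  if (key %in% names(lookup)) return(unname(lookup[key]))
  d <- adist(key, names(lookup))[1, ]   # distance to every known name
  if (min(d) > max_dist) return(NA_character_)
  unname(lookup[which.min(d)])
}

guesses <- vapply(c("JAME", "lynda", "xqzzy"), fuzzy_gender,
                  character(1), lookup = known)
print(guesses)
```

Each unrecognized name is compared against every entry in the lookup table, which hints at why this kind of matching gets expensive with a large name list.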