
Is there a package to determine gender from English first names? [closed]

Tags: r

I am curious to know whether there is a package or method for R to guess gender from first names.

I am thinking of running it on the U.S. Congress as a test.

I need this to work over several European languages.

At the time of asking, CRAN did not have such a package.

CRAN now has the gender package, but it works only on English names.

Issue solved by the genderizeR package. See links in my self-answer.

Fr. asked May 28 '13 22:05


3 Answers

There is now a package on CRAN specifically for this: gender

From the description:

Encodes gender based on names and dates of birth, using either the Social Security Administration's data set of first names by year of birth or the Census Bureau data from 1789 to 1940, both from the United States of America. By using these data sets instead of lists of male and female names, this package is able to more accurately guess the gender of a name, and it is able to report the probability that a name was male or female.

It also has a very helpful vignette demonstrating typical uses.
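As a rough sketch of what a call might look like, assuming the gender package and its companion genderdata data package are both installed (the example names and the year range are arbitrary choices for illustration, not taken from the package description):

```r
# Sketch only: the lookup runs when both packages are available.
# The example names and the 1932-2012 range are illustrative assumptions.
example_names <- c("mary", "james", "jamie")

gender_result <- NULL
if (requireNamespace("gender", quietly = TRUE) &&
    requireNamespace("genderdata", quietly = TRUE)) {
  # method = "ssa" selects the Social Security Administration data set
  gender_result <- gender::gender(example_names,
                                  years = c(1932, 2012),
                                  method = "ssa")
  print(gender_result)
} else {
  message("Install 'gender' and the 'genderdata' data package to run this.")
}
```

Because the underlying data are names by year of birth, narrowing the year range to the cohort you are studying should give a more relevant estimate than a single list of male and female names.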

Ben answered Oct 24 '22 05:10


I believe the answer is "no," but you could still use R to analyze this. The result would necessarily be probabilistic, since some names are ambiguous or unique. This Stack Overflow question has some helpful suggestions, but its links are out of date. US census data is a good place to start: name directories and metadata from the United States census are at http://www.census.gov/genealogy/www/data/1990surnames/names_files.html. Some interesting issues are discussed in http://www.census.gov/srd/papers/pdf/rr97-2.pdf and http://www.census.gov/population/www/documentation/twps07/twps07.pdf.
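To make the probabilistic idea concrete, here is a minimal base R sketch. The counts in `name_counts` are invented for illustration (real counts would come from the census name files), and `guess_gender` and its `threshold` argument are hypothetical names, not part of any package:

```r
# Toy counts invented for illustration; real counts would come from
# census or Social Security name files.
name_counts <- data.frame(
  name   = c("mary", "james", "jamie", "tyler"),
  female = c(9900, 50, 600, 20),
  male   = c(100, 9950, 400, 980),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns the share of female bearers of each name
# and a label that is only assigned when the share is decisive.
guess_gender <- function(x, counts, threshold = 0.9) {
  idx <- match(tolower(x), counts$name)   # NA for names not in the table
  p_female <- counts$female[idx] / (counts$female[idx] + counts$male[idx])
  label <- ifelse(is.na(p_female), NA_character_,
           ifelse(p_female >= threshold, "F",
           ifelse(p_female <= 1 - threshold, "M", "ambiguous")))
  data.frame(name = x, p_female = p_female, guess = label,
             stringsAsFactors = FALSE)
}

gender_guess <- guess_gender(c("Mary", "JAMES", "jamie", "zelphia"),
                             name_counts)
print(gender_guess)
```

With these toy counts, "Mary" and "JAMES" get a definite label, "jamie" is flagged as ambiguous, and the unlisted "zelphia" comes back as NA. Keeping the proportion alongside the label lets you choose a stricter or looser threshold later.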

J. Win. answered Oct 24 '22 07:10


Please don't accept this as an answer, as it is based on others' answers and links. I have added this function to the qdap package, as it fits the package.

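library(qdap)

# Defaults: every name is resolved to F or M
name2sex(qcv(mary, jenn, linda, JAME, GABRIEL, OLIVA, 
    tyler, jamie, JAMES, tyrone, cheryl, drew))

# Second argument FALSE: names used for both sexes come back as B,
# and unrecognized names are NA
name2sex(qcv(mary, jenn, linda, JAME, GABRIEL, OLIVA, 
    tyler, jamie, JAMES, tyrone, cheryl, drew), FALSE)

# Third argument TRUE (fuzzy.match): unrecognized names are fuzzy matched
name2sex(qcv(mary, jenn, linda, JAME, GABRIEL, OLIVA, 
    tyler, jamie, JAMES, tyrone, cheryl, drew), FALSE, TRUE)

# Second TRUE, third FALSE: F or M is predicted, unrecognized names stay NA
name2sex(qcv(mary, jenn, linda, JAME, GABRIEL, OLIVA, 
    tyler, jamie, JAMES, tyrone, cheryl, drew), TRUE, FALSE)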


## > name2sex(qcv(mary, jenn, linda, JAME, GABRIEL, OLIVA, 
## +     tyler, jamie, JAMES, tyrone, cheryl, drew))
##  [1] F F F M M F M F M M F M
## Levels: F M

## > name2sex(qcv(mary, jenn, linda, JAME, GABRIEL, OLIVA, 
## +     tyler, jamie, JAMES, tyrone, cheryl, drew), FALSE)
##  [1] B    <NA> F    B    B    F    B    B    B    M    F    B   
## Levels: B F M

## > name2sex(qcv(mary, jenn, linda, JAME, GABRIEL, OLIVA, 
## +     tyler, jamie, JAMES, tyrone, cheryl, drew), FALSE, TRUE)
##  [1] B F F B B F B B B M F B
## Levels: B F M

## > name2sex(qcv(mary, jenn, linda, JAME, GABRIEL, OLIVA, 
## +     tyler, jamie, JAMES, tyrone, cheryl, drew), TRUE, FALSE)
##  [1] F    <NA> F    M    M    F    M    F    M    M    F    M   
## Levels: F M

Edit: I added a fuzzy.match argument that attempts to guess the gender of unrecognized names through fuzzy matching, though this is computationally expensive.
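For readers curious how fuzzy matching of names can work in general, here is a small base R sketch using edit distance. It is an assumption-laden illustration, not qdap's implementation: the `known` table, the `fuzzy_sex` helper and the `max_dist` cutoff are all hypothetical.

```r
# Hypothetical reference table; the sexes are assigned for illustration only.
known <- data.frame(
  name = c("jennifer", "james", "olivia", "linda"),
  sex  = c("F", "M", "F", "F"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: assign the sex of the closest known name,
# but only if it is within max_dist edits; otherwise return NA.
fuzzy_sex <- function(x, ref, max_dist = 2) {
  x <- tolower(x)
  d <- utils::adist(x, ref$name)            # edit distances, one row per input
  best <- apply(d, 1, which.min)            # index of the closest known name
  best_d <- d[cbind(seq_along(x), best)]    # distance to that closest name
  ifelse(best_d <= max_dist, ref$sex[best], NA_character_)
}

fuzzy_result <- fuzzy_sex(c("JAME", "OLIVA", "xyzzy"), known)
print(fuzzy_result)
```

Every input is compared against every reference name, so the cost grows with the product of the two list sizes, which is one reason this kind of matching gets expensive on a full name dictionary.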

Tyler Rinker answered Oct 24 '22 05:10