
Java name parse library?

Tags: java, parsing

I'm searching for a library similar in functionality to the Perl Lingua::EN::NameParse module. Essentially, I'd like to parse strings like 'Mr. Bob R. Smith' into prefix, first name, last name, and name suffix components. Google hasn't been much help in finding something like this, and I'd prefer not to roll my own if possible. Does anyone know of an OSS Java library that can do this in a sophisticated way?

ackdesha asked Aug 25 '09 18:08



2 Answers

I can't believe nobody has shared a library for this. I looked on GitHub and found a JavaScript name parser that could easily be translated to Java: https://github.com/joshfraser/JavaScript-Name-Parser

I also modified the code in one of the answers to work a little better and have included a test case:

import java.util.ArrayList;
import java.util.List;

import org.apache.commons.lang.StringUtils;

public class NameParser {
    private String firstName = "";
    private String lastName = "";
    private String middleName = "";
    private List<String> middleNames = new ArrayList<String>();
    private List<String> titlesBefore = new ArrayList<String>();
    private List<String> titlesAfter = new ArrayList<String>();
    private String[] prefixes = { "dr", "mr", "ms", "atty", "prof", "miss", "mrs" };
    private String[] suffixes = { "jr", "sr", "ii", "iii", "iv", "v", "vi", "esq", "2nd", "3rd", "jd", "phd",
            "md", "cpa" };

    public NameParser() {
    }

    public NameParser(String name) {
        parse(name);
    }

    private void reset() {
        firstName = lastName = middleName = "";
        middleNames = new ArrayList<String>();
        titlesBefore = new ArrayList<String>();
        titlesAfter = new ArrayList<String>();
    }

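    private boolean isOneOf(String checkStr, String[] titles) {
        // Compare whole tokens, ignoring case and punctuation, so that "Jr." matches "jr"
        // but a surname such as "Vincent" is not mistaken for the suffix "v".
        String normalized = checkStr.toLowerCase().replaceAll("[^a-z0-9]", "");
        for (String title : titles) {
            if (normalized.equals(title))
                return true;
        }
        return false;
    }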

    public void parse(String name) {
        if (StringUtils.isBlank(name))
            return;
        this.reset();
        String[] words = name.split(" ");
        boolean isFirstName = false;

        for (String word : words) {
            if (StringUtils.isBlank(word))
                continue;
            if (word.charAt(word.length() - 1) == '.') {
                if (!isFirstName && !this.isOneOf(word, prefixes)) {
                    firstName = word;
                    isFirstName = true;
                } else if (isFirstName) {
                    middleNames.add(word);
                } else {
                    titlesBefore.add(word);
                }
            } else {
                if (word.endsWith(","))
                    word = StringUtils.chop(word);
                if (isFirstName == false) {
                    firstName = word;
                    isFirstName = true;
                } else {
                    middleNames.add(word);
                }
            }
        }
        if (middleNames.size() > 0) {
            boolean stop = false;
            List<String> toRemove = new ArrayList<String>();
            for (int i = middleNames.size() - 1; i >= 0 && !stop; i--) {
                String str = middleNames.get(i);
                if (this.isOneOf(str, suffixes)) {
                    titlesAfter.add(str);
                } else {
                    lastName = str;
                    stop = true;
                }
                toRemove.add(str);
            }
            if (StringUtils.isBlank(lastName) && titlesAfter.size() > 0) {
                lastName = titlesAfter.get(titlesAfter.size() - 1);
                titlesAfter.remove(titlesAfter.size() - 1);
            }
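            // toRemove holds exactly the trailing entries (suffixes plus last name), so drop
            // them by position; remove(Object) would delete an earlier duplicate instead.
            middleNames.subList(middleNames.size() - toRemove.size(), middleNames.size()).clear();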
        }
    }

    public String getFirstName() {
        return firstName;
    }

    public String getLastName() {
        return lastName;
    }

    public String getMiddleName() {
        if (StringUtils.isBlank(this.middleName)) {
            for (String name : middleNames) {
                middleName += (name + " ");
            }
            middleName = StringUtils.chop(middleName);
        }
        return middleName;
    }

    public List<String> getTitlesBefore() {
        return titlesBefore;
    }

    public List<String> getTitlesAfter() {
        return titlesAfter;
    }

}

Test case:

import org.junit.Assert;

import org.junit.Test;

public class NameParserTest {

    private class TestData {
        String name;

        String firstName;
        String lastName;
        String middleName;

        public TestData(String name, String firstName, String middleName, String lastName) {
            super();
            this.name = name;
            this.firstName = firstName;
            this.lastName = lastName;
            this.middleName = middleName;
        }

    }

    @Test
    public void test() {

        TestData td[] = { new TestData("Henry \"Hank\" J. Fasthoff IV", "Henry", "\"Hank\" J.", "Fasthoff"),
                new TestData("April A. (Caminez) Bentley", "April", "A. (Caminez)", "Bentley"),
                new TestData("fff lll", "fff", "", "lll"),
                new TestData("fff mmmmm lll", "fff", "mmmmm", "lll"),
                new TestData("fff mmm1      mm2 lll", "fff", "mmm1 mm2", "lll"),
                new TestData("Mr. Dr. Tom Jones", "Tom", "", "Jones"),
                new TestData("Robert P. Bethea Jr.", "Robert", "P.", "Bethea"),
                new TestData("Charles P. Adams, Jr.", "Charles", "P.", "Adams"),
                new TestData("B. Herbert Boatner, Jr.", "B.", "Herbert", "Boatner"),
                new TestData("Bernard H. Booth IV", "Bernard", "H.", "Booth"),
                new TestData("F. Laurens \"Larry\" Brock", "F.", "Laurens \"Larry\"", "Brock"),
                new TestData("Chris A. D'Amour", "Chris", "A.", "D'Amour") };

        NameParser bp = new NameParser();
        for (int i = 0; i < td.length; i++) {
            bp.parse(td[i].name);
            Assert.assertEquals(td[i].firstName, bp.getFirstName());
            Assert.assertEquals(td[i].lastName, bp.getLastName());
            Assert.assertEquals(td[i].middleName, bp.getMiddleName());
        }
    }

}
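
As a quick check against the example from the question ('Mr. Bob R. Smith'), here is a condensed, dependency-free sketch of the same idea: peel known titles off the front, peel known suffixes off the back, and treat what is left as first, middle, and last name. The class name NameSplitSketch, its method names, and the short prefix/suffix lists are all made up for illustration and do not come from any library. It is not a replacement for the class above, and it will get plenty of real-world names wrong.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not part of any published library.
final class NameSplitSketch {
    // Deliberately short, illustrative lists; real data needs many more entries.
    static final List<String> PREFIXES = Arrays.asList("mr", "mrs", "ms", "miss", "dr", "prof");
    static final List<String> SUFFIXES = Arrays.asList("jr", "sr", "ii", "iii", "iv", "esq", "phd", "md");

    // Lower-case a token and drop punctuation so "Jr." compares equal to "jr".
    static String normalize(String token) {
        return token.toLowerCase().replaceAll("[^a-z0-9]", "");
    }

    // Returns {prefix, first, middle, last, suffix}; missing parts are empty strings.
    static String[] split(String name) {
        List<String> tokens = new ArrayList<>();
        for (String raw : name.trim().split("\\s+")) {
            String token = raw.replaceAll(",+$", ""); // "Adams," -> "Adams"
            if (!token.isEmpty()) tokens.add(token);
        }
        if (tokens.isEmpty()) return new String[] {"", "", "", "", ""};

        List<String> prefix = new ArrayList<>();
        List<String> suffix = new ArrayList<>();
        // Peel titles off the front, always leaving at least one token for the name itself.
        while (tokens.size() > 1 && PREFIXES.contains(normalize(tokens.get(0)))) {
            prefix.add(tokens.remove(0));
        }
        // Peel suffixes off the back, again leaving at least one token.
        while (tokens.size() > 1 && SUFFIXES.contains(normalize(tokens.get(tokens.size() - 1)))) {
            suffix.add(0, tokens.remove(tokens.size() - 1));
        }
        String first = tokens.remove(0);
        String last = tokens.isEmpty() ? "" : tokens.remove(tokens.size() - 1);
        String middle = String.join(" ", tokens); // whatever remains between first and last
        return new String[] {String.join(" ", prefix), first, middle, last, String.join(" ", suffix)};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("Mr. Bob R. Smith")));
        System.out.println(Arrays.toString(split("Charles P. Adams, Jr.")));
    }
}
```

Matching whole normalized tokens rather than prefixes of tokens is a deliberate choice here: it keeps a surname that merely begins with "v" or "md" from being classified as a suffix.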
James O'Brien answered Oct 10 '22 19:10


Maybe you could try the GATE named entity extraction component? It has built-in JAPE grammars and gazetteer lists to extract first names, last names, etc., among other things. See this page.

Anand answered Oct 10 '22 19:10