How much work does it take to extract partial (without the city) or complete postal addresses from unstructured text using an NLP framework? Are NLP frameworks efficient at this? Also, how difficult is it to "train" Named Entity Recognition modules to recognize new locations?
As long as most addresses are correctly formatted and regular, i.e. they contain a contact name, street number and street name separated by commas, a rule-based framework may be enough.
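To give an idea of what the rule-based route looks like, here is a minimal sketch using only Python's `re` module. The layout it assumes ("contact name, street number, street name", with an optional city) is taken from the description above; the pattern itself, the group names and the `extract_addresses` helper are illustrative assumptions, not part of any particular framework.

```python
import re

# Assumed layout: "<Contact Name>, <number>, <Street Name>[, <City>]"
# Everything here is a hypothetical rule set for illustration only.
ADDRESS_RE = re.compile(
    r"(?P<name>[A-Z][\w'-]*(?: [A-Z][\w'-]*)+),\s*"   # two or more capitalized words
    r"(?P<number>\d+[A-Za-z]?),\s*"                   # street number such as 42 or 7B
    r"(?P<street>[A-Z][\w'-]*(?: [\w'-]+)*)"          # street name, greedy up to punctuation
    r"(?:,\s*(?P<city>[A-Z][A-Za-z'-]*(?: [A-Z][A-Za-z'-]*)*))?"  # optional city
)

def extract_addresses(text):
    """Return one dict per match; 'city' is None for partial addresses."""
    return [m.groupdict() for m in ADDRESS_RE.finditer(text)]

print(extract_addresses("Please send it to John Smith, 42, Baker Street, London before Friday."))
print(extract_addresses("Contact: Jane Doe, 7B, Rue de la Paix."))
```

Such rules break quickly once the text drifts from the assumed layout (missing commas, a street name followed directly by ordinary words, and so on), which is where the statistical approaches come in.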
Unstructured or partially structured text will require more preprocessing and statistical modeling, e.g. morpho-syntactic analysis and CRFs (conditional random fields). The Stanford tools are the most popular for this purpose. It may also be interesting to look for a corpus with intermediate annotations: not only "LOC", but also "NUMBER", "STREETNAME", "CITY", etc., so that locations can be extracted even when they are incomplete. For this kind of annotation, you may want to look at tree-structured approaches.
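As a rough illustration of the fine-grained annotation idea, the sketch below shows per-token features of the kind a CRF tagger is typically given, and how BIO-style tags ("B-NUMBER", "B-STREETNAME", "I-STREETNAME", "O") can be grouped back into spans so that a partial address still yields usable pieces. The feature set, the BIO encoding and both function names are assumptions made for the example; the tags are written by hand here, whereas in practice they would come from a trained model.

```python
def token_features(tokens, i):
    """Simple morpho-syntactic features for token i (illustrative feature set)."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_digit": tok.isdigit(),
        "is_title": tok.istitle(),
        "suffix3": tok[-3:].lower(),
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

def decode_bio(tokens, tags):
    """Group BIO tags into (label, text) spans; a stray I- tag starts a new span."""
    spans, label, words = [], None, []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("I-") and label == tag[2:]:
            words.append(tok)          # continue the current span
            continue
        if label is not None:          # close the span in progress
            spans.append((label, " ".join(words)))
        if tag == "O":
            label, words = None, []
        else:
            label, words = tag[2:], [tok]
    if label is not None:
        spans.append((label, " ".join(words)))
    return spans

tokens = ["Meet", "me", "at", "221", "Baker", "Street", "tomorrow"]
tags = ["O", "O", "O", "B-NUMBER", "B-STREETNAME", "I-STREETNAME", "O"]  # hand-written
print(decode_bio(tokens, tags))
print(token_features(tokens, 3))
```

Here the sentence contains no city, yet the number and street name are still recovered as separate spans, which a single coarse "LOC" label would not allow.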
So the amount of work mostly depends on how regular the expressions you are looking for are.