datamade / Usaddress

Licence: other
🇺🇸 a python library for parsing unstructured address strings into address components

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Usaddress

Pytorch Bert Crf Ner
Korean named entity recognizer built with KoBERT and CRF (BERT+CRF based Named Entity Recognition model for Korean)
Stars: ✭ 236 (-79.74%)
Mutual labels:  natural-language-processing, crf
Open Sesame
A frame-semantic parsing system based on a softmax-margin SegRNN.
Stars: ✭ 170 (-85.41%)
Mutual labels:  natural-language-processing, crf
Ncrfpp
NCRF++, a Neural Sequence Labeling Toolkit. Easy use to any sequence labeling tasks (e.g. NER, POS, Segmentation). It includes character LSTM/CNN, word LSTM/CNN and softmax/CRF components.
Stars: ✭ 1,767 (+51.67%)
Mutual labels:  natural-language-processing, crf
Libpostal
A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
Stars: ✭ 3,312 (+184.29%)
Mutual labels:  natural-language-processing, address
Convai Bot 1337
NIPS Conversational Intelligence Challenge 2017 Winner System: Skill-based Conversational Agent with Supervised Dialog Manager
Stars: ✭ 65 (-94.42%)
Mutual labels:  natural-language-processing
Unet Crf Rnn
Edge-aware U-Net with CRF-RNN layer for Medical Image Segmentation
Stars: ✭ 63 (-94.59%)
Mutual labels:  crf
Slate
A Super-Lightweight Annotation Tool for Experts: Label text in a terminal with just Python
Stars: ✭ 61 (-94.76%)
Mutual labels:  natural-language-processing
Fromscratch
Stars: ✭ 61 (-94.76%)
Mutual labels:  natural-language-processing
Get started with deep learning for text with allennlp
Getting started with AllenNLP and PyTorch by training a tweet classifier
Stars: ✭ 69 (-94.08%)
Mutual labels:  natural-language-processing
Hackerrank
This is the Repository where you can find all the solution of the Problems which you solve on competitive platforms mainly HackerRank and HackerEarth
Stars: ✭ 68 (-94.16%)
Mutual labels:  natural-language-processing
Kor2vec
Library for Korean morpheme and word vector representation
Stars: ✭ 64 (-94.51%)
Mutual labels:  natural-language-processing
Repo 2017
Python codes in Machine Learning, NLP, Deep Learning and Reinforcement Learning with Keras and Theano
Stars: ✭ 1,123 (-3.61%)
Mutual labels:  natural-language-processing
Text Analytics With Python
Learn how to process, classify, cluster, summarize, understand syntax, semantics and sentiment of text data with the power of Python! This repository contains code and datasets used in my book, "Text Analytics with Python" published by Apress/Springer.
Stars: ✭ 1,132 (-2.83%)
Mutual labels:  natural-language-processing
Emnlp2018 nli
Repository for NLI models (EMNLP 2018)
Stars: ✭ 62 (-94.68%)
Mutual labels:  natural-language-processing
Touchdown
Cornell Touchdown natural language navigation and spatial reasoning dataset.
Stars: ✭ 69 (-94.08%)
Mutual labels:  natural-language-processing
How To Mine Newsfeed Data And Extract Interactive Insights In Python
A practical guide to topic mining and interactive visualizations
Stars: ✭ 61 (-94.76%)
Mutual labels:  natural-language-processing
Multilingual Latent Dirichlet Allocation Lda
A Multilingual Latent Dirichlet Allocation (LDA) Pipeline with Stop Words Removal, n-gram features, and Inverse Stemming, in Python.
Stars: ✭ 64 (-94.51%)
Mutual labels:  natural-language-processing
Intent classifier
Stars: ✭ 67 (-94.25%)
Mutual labels:  natural-language-processing
Gpt2
PyTorch Implementation of OpenAI GPT-2
Stars: ✭ 64 (-94.51%)
Mutual labels:  natural-language-processing
Languagetoys
Random fun with statistical language models.
Stars: ✭ 63 (-94.59%)
Mutual labels:  natural-language-processing

usaddress

usaddress is a Python library for parsing unstructured address strings into address components, using advanced NLP methods. Try it out on our web interface! For those who aren't Python developers, we also have an API.

What this can do: Using a probabilistic model, it makes (very educated) guesses in identifying address components, even in tricky cases where rule-based parsers typically break down.

What this cannot do: It cannot identify address components with perfect accuracy, nor can it verify that a given address is correct/valid.

It also does not normalize the address, although a separate library built on top of usaddress does.

How to use the usaddress python library

  1. Install usaddress with pip, a tool for installing and managing python packages (beginner's guide here).

In the terminal,

pip install usaddress
  2. Parse some addresses!

Note that parse and tag are different methods:

import usaddress

addr = '123 Main St. Suite 100 Chicago, IL'

# The parse method splits the address string into tokens and labels each token.
# expected output (shown as a Python 2 repr; Python 3 prints the strings without the u prefix):
# [(u'123', 'AddressNumber'), (u'Main', 'StreetName'), (u'St.', 'StreetNamePostType'), (u'Suite', 'OccupancyType'), (u'100', 'OccupancyIdentifier'), (u'Chicago,', 'PlaceName'), (u'IL', 'StateName')]
usaddress.parse(addr)

# The tag method tries to be a little smarter:
# it merges consecutive components, strips commas, and returns an address type.
# expected output: (OrderedDict([('AddressNumber', u'123'), ('StreetName', u'Main'), ('StreetNamePostType', u'St.'), ('OccupancyType', u'Suite'), ('OccupancyIdentifier', u'100'), ('PlaceName', u'Chicago'), ('StateName', u'IL')]), 'Street Address')
usaddress.tag(addr)
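
To make the difference between the two methods concrete, here is a rough sketch of the kind of merging that tag does on top of parse. It is not usaddress's actual implementation: merge_parsed is a hypothetical helper, it works on a hard-coded list shaped like the parse output above, and the choice to raise ValueError when a label shows up in two separate places is an assumption made for illustration.

```python
from collections import OrderedDict


def merge_parsed(parsed):
    """Merge (token, label) pairs into one string per label (hypothetical helper)."""
    merged = OrderedDict()
    previous_label = None
    for token, label in parsed:
        # Strip separators such as the trailing comma in 'Chicago,'.
        token = token.strip(" ,;")
        if not token:
            continue
        if label == previous_label:
            # Consecutive tokens with the same label belong to one component.
            merged[label] += " " + token
        elif label in merged:
            # The label already appeared earlier, separated by other labels.
            raise ValueError("label %r appears in two separate places" % label)
        else:
            merged[label] = token
        previous_label = label
    return merged


# A list shaped like the output of usaddress.parse(addr) shown above.
parsed = [
    ("123", "AddressNumber"),
    ("Main", "StreetName"),
    ("St.", "StreetNamePostType"),
    ("Suite", "OccupancyType"),
    ("100", "OccupancyIdentifier"),
    ("Chicago,", "PlaceName"),
    ("IL", "StateName"),
]

components = merge_parsed(parsed)
print(components["PlaceName"])
```

A dictionary keyed by label is convenient when you want a single field such as the place name, while the token list from parse keeps the original order and punctuation of the input.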

How to use this development code (for the nerds)

usaddress uses parserator, a library for making and improving probabilistic parsers - specifically, parsers that use python-crfsuite's implementation of conditional random fields. Parserator allows you to train the usaddress parser's model (a .crfsuite settings file) on labeled training data, and provides tools for adding new labeled training data.
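
A conditional random field labels each token using features of that token and its position in the sequence. The sketch below shows what such per-token features could look like for an address. The feature names and the functions token_features and address_features are hypothetical and chosen only for illustration; they are not the features usaddress or parserator actually use.

```python
import re


def token_features(token):
    """Describe one address token with a few simple features (hypothetical set)."""
    core = token.strip(",.;")
    return {
        "lower": core.lower(),
        "length": len(core),
        "is_digits": core.isdigit(),
        "has_digit": bool(re.search(r"\d", core)),
        "ends_with_comma": token.endswith(","),
        "ends_with_period": token.endswith("."),
    }


def address_features(address):
    """Split an address on whitespace and build one feature dict per token."""
    tokens = address.split()
    features = [token_features(token) for token in tokens]
    if features:
        # Position in the sequence is often informative for sequence labelers.
        features[0]["address_start"] = True
        features[-1]["address_end"] = True
    return tokens, features


tokens, features = address_features("123 Main St. Suite 100 Chicago, IL")
for token, feature in zip(tokens, features):
    print(token, feature)
```

A trained model then weighs features like these, together with the labels of neighboring tokens, to choose the most likely label sequence for the whole address.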

Building & testing the code in this repo

To build a development version of usaddress on your machine, run the following code in your command line:

git clone https://github.com/datamade/usaddress.git
cd usaddress
pip install -r requirements.txt
python setup.py develop
parserator train training/labeled.xml usaddress

Then run the testing suite to confirm that everything is working properly:

nosetests .

Having trouble building the code? Open an issue and we'd be glad to help you troubleshoot.

Adding new training data

If usaddress is consistently failing on particular address patterns, you can adjust the parser's behavior by adding new training data to the model. Follow our guide in the training directory, and be sure to make a pull request so that we can incorporate your contribution into our next release!

Bad Parses / Bugs

Report issues in the issue tracker

If an address was parsed incorrectly, please let us know! You can either open an issue or (if you're adventurous) add new training data to improve the parser's model. When possible, please send over a few real-world examples of similar address patterns, along with some info about the source of the data - this will help us train the parser and improve its performance.

If something in the library is not behaving intuitively, it is a bug, and should be reported.

Note on Patches/Pull Requests

  • Fork the project.
  • Make your feature addition or bug fix.
  • Send us a pull request. Bonus points for topic branches!

Copyright

Copyright (c) 2014 Atlanta Journal Constitution. Released under the MIT License.
