datamade / Probablepeople

Licence: other
👪 a python library for parsing unstructured western names into name components.

Projects that are alternatives to, or similar to, Probablepeople

Swiftpascalinterpreter
Simple Swift interpreter for the Pascal language inspired by the Let’s Build A Simple Interpreter article series.
Stars: ✭ 270 (-36.77%)
Mutual labels:  parse
Omnifocus
Scripts for OmniFocus
Stars: ✭ 316 (-26%)
Mutual labels:  parse
Nlp
[UNMAINTAINED] Extract values from strings and fill your structs with nlp.
Stars: ✭ 367 (-14.05%)
Mutual labels:  parse
Pubmed parser
📋 A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset
Stars: ✭ 274 (-35.83%)
Mutual labels:  parse
Demoinfocs Golang
High performance CS:GO demo parser for Go (demoinfo)
Stars: ✭ 288 (-32.55%)
Mutual labels:  parse
Js Quantities
JavaScript library for quantity calculation and unit conversion
Stars: ✭ 335 (-21.55%)
Mutual labels:  parse
Parse Domain
Splits a hostname into subdomains, domain and (effective) top-level domains.
Stars: ✭ 261 (-38.88%)
Mutual labels:  parse
Gographviz
Parses the Graphviz DOT language in golang
Stars: ✭ 416 (-2.58%)
Mutual labels:  parse
Graph Viz D3 Js
Graphviz web D3.js renderer
Stars: ✭ 297 (-30.44%)
Mutual labels:  parse
Human Interval
Human readable time distances for javascript
Stars: ✭ 360 (-15.69%)
Mutual labels:  parse
Crawlertutorial
Minimal web-crawler tutorial (fetch, parse, search, multiprocessing, API), using PTT as the example
Stars: ✭ 282 (-33.96%)
Mutual labels:  parse
Themer
Themer is a colorscheme generator and manager for your desktop.
Stars: ✭ 289 (-32.32%)
Mutual labels:  parse
Instagram
A simple imitation of the Instagram app.
Stars: ✭ 346 (-18.97%)
Mutual labels:  parse
Httparse
A push parser for the HTTP 1.x protocol in Rust.
Stars: ✭ 271 (-36.53%)
Mutual labels:  parse
Args
Toolkit for building command line interfaces
Stars: ✭ 399 (-6.56%)
Mutual labels:  parse
Angourimath
Open-source symbolic algebra library for C# and F#. One of the most powerful in .NET
Stars: ✭ 266 (-37.7%)
Mutual labels:  parse
Parse Torrent
Parse a torrent identifier (magnet uri, .torrent file, info hash)
Stars: ✭ 325 (-23.89%)
Mutual labels:  parse
Breakdance
It's time for your markup to get down! HTML to markdown converter. Breakdance is a highly pluggable, flexible and easy to use.
Stars: ✭ 418 (-2.11%)
Mutual labels:  parse
Parse Sdk Flutter
A Dart or Flutter plugin for Parse Server... Enjoy!
Stars: ✭ 407 (-4.68%)
Mutual labels:  parse
Nlp Cube
Natural Language Processing Pipeline - Sentence Splitting, Tokenization, Lemmatization, Part-of-speech Tagging and Dependency Parsing
Stars: ✭ 353 (-17.33%)
Mutual labels:  parse

probablepeople

probablepeople is a Python library for parsing unstructured romanized name or company strings into components, using advanced NLP methods. It is based on usaddress, a Python library for parsing addresses.

Try it out on our web interface! For those who aren't python developers, we also have an API.

What this can do: Using a probabilistic model, it makes (very educated) guesses in identifying name or corporation components, even in tricky cases where rule-based parsers typically break down.

What this cannot do: It cannot identify components with perfect accuracy, nor can it verify that a given name/company is correct/valid.

probablepeople learns how to parse names/companies through a body of training data. If you have examples of names/companies that stump this parser, please send them over! By adding more examples to the training data, probablepeople can continue to learn and improve.

How to use the probablepeople python library

  1. Install probablepeople with pip, a tool for installing and managing python packages (beginner's guide here)

    In the terminal,

    pip install probablepeople  
    
  2. Parse some names/companies!

    Note that parse and tag are different methods:

    import probablepeople as pp
    name_str='Mr George "Gob" Bluth II'
    corp_str='Sitwell Housing Inc'
    
    # The parse method will split your string into components, and label each component.
    pp.parse(name_str) # expected output: [('Mr', 'PrefixMarital'), ('George', 'GivenName'), ('"Gob"', 'Nickname'), ('Bluth', 'Surname'), ('II', 'SuffixGenerational')]
    pp.parse(corp_str) # expected output: [('Sitwell', 'CorporationName'), ('Housing', 'CorporationName'), ('Inc', 'CorporationLegalType')]
    
    # The tag method will try to be a little smarter:
    # it merges consecutive components, strips commas, and also returns the type of the string ('Person' or 'Corporation')
    pp.tag(name_str) # expected output: (OrderedDict([('PrefixMarital', 'Mr'), ('GivenName', 'George'), ('Nickname', '"Gob"'), ('Surname', 'Bluth'), ('SuffixGenerational', 'II')]), 'Person')
    pp.tag(corp_str) # expected output: (OrderedDict([('CorporationName', 'Sitwell Housing'), ('CorporationLegalType', 'Inc')]), 'Corporation')
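
To make the difference between parse and tag concrete, here is a minimal sketch of the merging step that the comments above describe. It is not probablepeople's actual implementation: the function name merge_labeled_tokens is hypothetical, the input is hard-coded to the parse output shown above so the snippet runs without the library installed, and the ValueError is only a stand-in for whatever the library does when a label shows up in two separate places.

```python
from collections import OrderedDict


def merge_labeled_tokens(labeled_tokens):
    """Merge consecutive tokens that share a label, stripping commas.

    Hypothetical helper for illustration only; it mimics the behaviour
    described for pp.tag, not the library's real code.
    """
    merged = OrderedDict()
    previous_label = None
    for token, label in labeled_tokens:
        token = token.strip(",")
        if not token:
            continue
        if label == previous_label:
            # Same label as the previous token: extend that component.
            merged[label] += " " + token
        elif label in merged:
            # The label was seen earlier but not directly before this token.
            raise ValueError("label %r appears in two separate places" % label)
        else:
            merged[label] = token
        previous_label = label
    return merged


# Hard-coded copy of the pp.parse(corp_str) output shown above.
parsed = [
    ("Sitwell", "CorporationName"),
    ("Housing", "CorporationName"),
    ("Inc", "CorporationLegalType"),
]
tagged = merge_labeled_tokens(parsed)
print(tagged)
```

The two CorporationName tokens collapse into a single 'Sitwell Housing' entry, which matches the shape of the pp.tag result above.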
    

For the nerds:

Probablepeople uses parserator, a library for making and improving probabilistic parsers - specifically, parsers that use python-crfsuite's implementation of conditional random fields. Parserator allows you to train probablepeople's model (a .crfsuite settings file) on labeled training data, and provides tools for easily adding new labeled training data.

Building & testing development code

git clone https://github.com/datamade/probablepeople.git  
cd probablepeople  
pip install -r requirements.txt  
python setup.py develop
make all
nosetests .  

Creating/adding labeled training data (.xml outfile) from unlabeled raw data (.csv infile)

If there are name/company formats that the parser isn't performing well on, you can add them to training data. As probablepeople continually learns about new cases, it will continually become smarter and more robust.

NOTE: The model doesn't need many examples to learn about new patterns - if you are trying to get probablepeople to perform better on a specific type of name, start with a few (<5) examples, check performance, and then add more examples as necessary.

For this parser, we are keeping person names and organization names separate in the training data. The two training files used to produce the model are:

  • name_data/labeled/person_labeled.xml for people
  • name_data/labeled/company_labeled.xml for organizations.

To add your own training examples, first put your unlabeled raw data in a csv. Then:

parserator label [infile] [outfile] probablepeople  

[infile] is your raw csv and [outfile] is the appropriate training file to write to. For example, if you put raw strings in my_companies.csv, you'd use:

parserator label my_companies.csv name_data/labeled/company_labeled.xml probablepeople

The parserator label command will start a console labeling task, where you will be prompted to label raw strings via the command line. For more info on using parserator, see the parserator documentation.
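
If your raw strings live in Python rather than in a spreadsheet, the csv step can be sketched as below. This assumes the labeler expects one raw string per row, in the first column; check the parserator documentation for the exact input format. The function name write_raw_strings and the example strings are hypothetical, and the file goes to a temporary directory so the snippet has no side effects on a real checkout.

```python
import csv
import os
import tempfile


def write_raw_strings(path, raw_strings):
    """Write one unlabeled string per row, in the first column."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for raw in raw_strings:
            # csv.writer quotes values that contain commas, so a string
            # such as "Bluth Company, LLC" stays in a single column.
            writer.writerow([raw])


# Hypothetical raw company strings to be labeled later.
examples = ["Sitwell Housing Inc", "Bluth Company, LLC"]
csv_path = os.path.join(tempfile.mkdtemp(), "my_companies.csv")
write_raw_strings(csv_path, examples)

# Read the file back to confirm each string survived as one field.
with open(csv_path, newline="", encoding="utf-8") as handle:
    rows = [row[0] for row in csv.reader(handle)]
print(rows)
```

Using the csv module rather than plain string joins matters here because names and company strings often contain commas.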

Re-training the model

If you've added new training data, you will need to re-train the model. To set multiple files as traindata, separate them with commas.

parserator train [traindata] probablepeople  

probablepeople allows for multiple model files: person for person names only, company for company names only, or generic (both). Here are examples of commands for training each model:

parserator train name_data/labeled/person_labeled.xml,name_data/labeled/company_labeled.xml probablepeople --modelfile=generic
parserator train name_data/labeled/person_labeled.xml probablepeople --modelfile=person
parserator train name_data/labeled/company_labeled.xml probablepeople --modelfile=company
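
If you re-train from a script, the comma-separated traindata argument can be assembled as in the sketch below. It only builds the argument list and does not run anything; build_train_command is a hypothetical helper, and the --modelfile flag and file paths are taken from the commands above.

```python
def build_train_command(training_files, module="probablepeople", modelfile=None):
    """Build the argument list for a `parserator train` call (not executed here)."""
    # Multiple training files are passed as one comma-separated argument.
    command = ["parserator", "train", ",".join(training_files), module]
    if modelfile is not None:
        command.append("--modelfile=" + modelfile)
    return command


command = build_train_command(
    [
        "name_data/labeled/person_labeled.xml",
        "name_data/labeled/company_labeled.xml",
    ],
    modelfile="generic",
)
print(" ".join(command))
```

The resulting list can be handed to subprocess.run in an environment where parserator is installed.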

Errors and Bugs

If something is not behaving intuitively, it is a bug and should be reported. Report it here by creating an issue: https://github.com/datamade/probablepeople/issues

Help us fix the problem as quickly as possible by following Mozilla's guidelines for reporting bugs.

Patches and Pull Requests

Your patches are welcome. Here's our suggested workflow:

  • Fork the project.
  • Add your labeled examples.
  • Send us a pull request with a description of your work.

Copyright

Copyright (c) 2014 Atlanta Journal Constitution. Released under the MIT License.