comtravo / ctparse

Licence: MIT license
ctparse - Parse natural language time expressions in python

Background

The package ctparse is a pure Python package for parsing time expressions from natural language (i.e. strings). In many ways it builds on concepts similar to those of Facebook’s duckling package (https://github.com/facebook/duckling). However, for the time being it only targets times and only German and English text.

In principle ctparse can be used to detect time expressions in a text; however, its main use case is the semantic interpretation of such expressions. Detecting time expressions in the first place can, in our experience, be done more efficiently (and precisely) using e.g. CRFs or other models targeted at this specific task.

ctparse is designed with the use case in mind where interpretation of time expressions is done under the following assumptions:

  • All expressions are relative to some pre-defined reference time.
  • Unless explicitly specified in the time expression, valid resolutions are in the future relative to the reference time (e.g. 12.5. will be the next 12th of May, but 12.5.2012 should correctly resolve to the 12th of May 2012).
  • If in doubt, resolutions in the near future are more likely than resolutions in the far future (not implemented yet, but any resolution more than e.g. 3 months in the future is extremely unlikely).
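
The second assumption can be illustrated with a minimal sketch that uses only the standard library. The helper resolve_day_month is hypothetical and not part of ctparse; it only shows the "next occurrence unless a year is given" idea, and it ignores edge cases such as the 29th of February.

```python
from datetime import date
from typing import Optional


def resolve_day_month(day: int, month: int, ref: date,
                      year: Optional[int] = None) -> date:
    """Resolve a day/month pair, preferring the next occurrence on or after ``ref``.

    Hypothetical helper for illustration only; not ctparse's implementation.
    """
    if year is not None:
        # An explicit year always wins, even if the date lies in the past.
        return date(year, month, day)
    candidate = date(ref.year, month, day)
    if candidate < ref:
        # The date has already passed this year, so roll over to next year.
        candidate = date(ref.year + 1, month, day)
    return candidate


ref = date(2018, 3, 12)
print(resolve_day_month(12, 5, ref))        # 2018-05-12
print(resolve_day_month(12, 2, ref))        # 2019-02-12
print(resolve_day_month(12, 5, ref, 2012))  # 2012-05-12
```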

The specific comtravo use-case is resolving time expressions in booking requests which almost always refer to some point in time within the next 4-8 weeks.

ctparse is currently language agnostic: it accepts German and English expressions without being told which language to expect. Support for further languages might be added in the future. The main reason for this design is that in real-world communication people often write in one language (their business language) but express times using constructs from their mother tongue and/or what they believe to be the correct way to express dates in the target language. This leads to German text with English time expressions and vice versa. Running language detection upfront on the complete original text is obviously no solution; rather, it would make the problem worse.
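
One way to picture language-agnostic matching is a single pattern per concept that accepts both spellings, so that no language detection is needed. The following is only a sketch under that assumption: MONTH_PATTERNS and find_month are made-up names, the month list is deliberately incomplete, and ctparse's actual patterns may look different.

```python
import re
from typing import Optional

# One pattern per month, each accepting the German and the English spelling.
# Only three months are listed to keep the sketch short.
MONTH_PATTERNS = {
    3: re.compile(r"\b(?:march|märz|mar)\b", re.IGNORECASE),
    5: re.compile(r"\b(?:may|mai)\b", re.IGNORECASE),
    10: re.compile(r"\b(?:october|oktober|oct|okt)\b", re.IGNORECASE),
}


def find_month(text: str) -> Optional[int]:
    """Return the number of the first listed month found in ``text``, if any."""
    for number, pattern in MONTH_PATTERNS.items():
        if pattern.search(text):
            return number
    return None


# German sentence with an English month name, and the other way round.
print(find_month("Flug am 5. May nach Berlin"))    # 5
print(find_month("flight on 12 Oktober, please"))  # 10
```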

Example

from ctparse import ctparse
from datetime import datetime

# Set reference time
ts = datetime(2018, 3, 12, 14, 30)
ctparse('May 5th 2:30 in the afternoon', ts=ts)

This should return a Time object represented as Time[0-29]{2018-05-05 14:30 (X/X)}. This indicates that characters 0-29 were used in the resolution, that the resolved date and time is the 5th of May 2018 at 14:30, and that this resolution is based neither on a day of week (first X) nor on a part of day (second X).

Latent time

Normally, ctparse will anchor time expressions to the reference time. For example, when parsing the time expression 8:00 pm, ctparse will resolve the expression to 8 pm after the reference time, as follows:

parse = ctparse("8:00 pm", ts=datetime(2020, 1, 1, 7, 0), latent_time=True) # default
# parse.resolution -> Time(2020, 1, 1, 20, 00)

This behavior can be customized using the option latent_time=False, which returns a time resolution that is not anchored to a particular date:

parse = ctparse("8:00 pm", ts=datetime(2020, 1, 1, 7, 0), latent_time=False)
# parse.resolution -> Time(None, None, None, 20, 00)
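
The anchoring described above can be sketched without ctparse. The helper anchor_time_of_day below is hypothetical, and rolling over to the next day when the clock time has already passed is an assumption of this sketch, not a documented ctparse behavior.

```python
from datetime import datetime, time, timedelta


def anchor_time_of_day(tod: time, ref: datetime) -> datetime:
    """Return the first datetime at or after ``ref`` whose clock time is ``tod``.

    Hypothetical helper for illustration only; not ctparse's implementation.
    """
    candidate = datetime.combine(ref.date(), tod)
    if candidate < ref:
        # The clock time has already passed today, so use tomorrow (assumption).
        candidate += timedelta(days=1)
    return candidate


ref = datetime(2020, 1, 1, 7, 0)
print(anchor_time_of_day(time(20, 0), ref))  # 2020-01-01 20:00:00
print(anchor_time_of_day(time(6, 0), ref))   # 2020-01-02 06:00:00
```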

Implementation

ctparse, like duckling, combines a rule- and regular-expression-based system with some probabilistic modeling. In this sense it resembles a PCFG (probabilistic context-free grammar).

Rules

At its core, ctparse is a collection of production rules over sequences of regular expressions and (intermediate) productions.

Productions are either of type Time, Interval or Duration and can have certain predicates (e.g. whether a Time is a part of day like 'afternoon').

A typical rule then looks like this:

@rule(predicate('isDate'), dimension(Interval))

That is, this rule applies when the intermediate productions yield something that has a date, followed by something that is an interval (as in 'May 5th 9-10').

The actual production is a python function with the following signature:

@rule(predicate('isDate'), dimension(Interval))
def ruleDateInterval(ts, d, i):
  """
  param ts: datetime - the current reference time
  param d: Time - a time that contains at least a full date
  param i: Interval - some Interval
  """
  if not (i.t_from.isTOD and i.t_to.isTOD):
    return None
  return Interval(
    t_from=Time(year=d.year, month=d.month, day=d.day,
                hour=i.t_from.hour, minute=i.t_from.minute),
    t_to=Time(year=d.year, month=d.month, day=d.day,
              hour=i.t_to.hour, minute=i.t_to.minute))

This production returns a new interval on the date matched by predicate('isDate'), spanning the times encoded in dimension(Interval). If the latter encodes anything other than times of day (TOD), no production is returned, i.e. the rule matched but failed.
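
To try the logic of this production without installing ctparse, the following sketch re-creates it with minimal stand-in classes. The Time and Interval dataclasses here are simplified assumptions that carry only the fields used above, and the isTOD definition (hour set, no date parts) is a guess at the intended meaning, not ctparse's own code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Time:
    """Minimal stand-in for ctparse's Time; only the fields used below."""
    year: Optional[int] = None
    month: Optional[int] = None
    day: Optional[int] = None
    hour: Optional[int] = None
    minute: Optional[int] = None

    @property
    def isTOD(self) -> bool:
        # Assumed meaning: a pure time of day, i.e. hour set and no date parts.
        return (self.hour is not None and self.year is None
                and self.month is None and self.day is None)


@dataclass
class Interval:
    """Minimal stand-in for ctparse's Interval."""
    t_from: Time
    t_to: Time


def date_interval(d: Time, i: Interval) -> Optional[Interval]:
    """Same logic as ruleDateInterval, without the decorator or ``ts``."""
    if not (i.t_from.isTOD and i.t_to.isTOD):
        return None  # the rule matched, but the production failed
    return Interval(
        t_from=Time(d.year, d.month, d.day, i.t_from.hour, i.t_from.minute),
        t_to=Time(d.year, d.month, d.day, i.t_to.hour, i.t_to.minute),
    )


# 'May 5th' combined with '9-10' gives an interval on that date.
may5 = Time(year=2018, month=5, day=5)
nine_to_ten = Interval(Time(hour=9, minute=0), Time(hour=10, minute=0))
print(date_interval(may5, nine_to_ten))
```

If either end of the interval already carries a date, date_interval returns None, which mirrors the "matched but failed" case.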

Technical Background

Some observations on the problem:

  • Each rule is a combination of regular expressions and productions.
  • Consequently, each production must originate in a sequence of regular expressions that must have matched (parts of) the text.
  • Hence, only subsequences of all regular expressions in all rules can lead to a successful production.

To this end the algorithm proceeds as follows:

  1. Input a string and a reference time.
  2. Find all matches of all regular expressions from all rules in the input string. Each regular expression is assigned an identifier.
  3. Find all distinct sequences of these matches in which no two matches overlap and no gap is left in between.
  4. To each such subsequence, apply all rules at all possible positions until no further rules can be applied, in which case one solution is produced.
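
Steps 2 and 3 can be sketched with the standard library as follows. The names Match, find_matches and contiguous_sequences are hypothetical, the four patterns are toy examples, and treating whitespace between two matches as "no gap" is an assumption of this sketch rather than a statement about ctparse's internals.

```python
import re
from typing import Dict, List, NamedTuple


class Match(NamedTuple):
    rule_id: int  # identifier of the regular expression that matched
    start: int
    end: int
    text: str


def find_matches(text: str, patterns: Dict[int, str]) -> List[Match]:
    """Step 2: collect every match of every pattern, tagged with its id."""
    found = []
    for rule_id, pattern in patterns.items():
        for m in re.finditer(pattern, text):
            if m.end() > m.start():  # ignore empty matches
                found.append(Match(rule_id, m.start(), m.end(), m.group()))
    return sorted(found, key=lambda m: (m.start, m.end))


def contiguous_sequences(text: str, matches: List[Match]) -> List[List[Match]]:
    """Step 3: maximal chains of matches without overlap or gap."""
    def follows(a: Match, b: Match) -> bool:
        # b may follow a if only whitespace lies between them (assumption).
        return b.start >= a.end and text[a.end:b.start].strip() == ""

    def extend(chain: List[Match]) -> List[List[Match]]:
        followers = [m for m in matches if follows(chain[-1], m)]
        if not followers:
            return [chain]
        return [seq for m in followers for seq in extend(chain + [m])]

    # A chain starts at any match that cannot follow another match.
    starts = [m for m in matches if not any(follows(p, m) for p in matches)]
    return [seq for s in starts for seq in extend([s])]


text = "May 5th 9-10"
patterns = {1: r"May", 2: r"\d+(?:st|nd|rd|th)", 3: r"\d+", 4: r"-"}
for seq in contiguous_sequences(text, find_matches(text, patterns)):
    print([m.text for m in seq])
# ['May', '5']
# ['May', '5th', '9', '-', '10']
```

The first sequence is a dead end because the bare number 5 leaves "th" uncovered. This is the kind of candidate that the scoring described next is meant to deprioritize.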

Obviously, not all sequences of matching expressions and not all sequences of rules applied on top lead to meaningful results. Here the PCFG kicks in:

  • Based on example data (corpus.py) a model is calibrated to predict how likely a production is to lead to a/the correct result. Instead of doing a breadth first search, the most promising productions are applied first.
  • Resolutions are produced until there are no more resolutions or a timeout is hit.
  • Using the same model, the highest-scoring resolution among all those produced is returned.
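
The search strategy in these three points can be sketched as a generic best-first search over a priority queue. Everything here is illustrative: best_first, the toy graph and the hand-picked scores are assumptions standing in for ctparse's rule applications and its calibrated model.

```python
import heapq
import itertools
import time
from typing import Callable, Iterable, Optional, TypeVar

S = TypeVar("S")


def best_first(start: S,
               expand: Callable[[S], Iterable[S]],
               score: Callable[[S], float],
               timeout: float = 1.0) -> Optional[S]:
    """Expand the highest-scoring state first.

    States that cannot be expanded any further count as resolutions. The
    best-scoring resolution found before the timeout is returned.
    """
    deadline = time.monotonic() + timeout
    tie = itertools.count()  # tie-breaker so the heap never compares states
    heap = [(-score(start), next(tie), start)]
    resolutions = []
    while heap and time.monotonic() < deadline:
        _, _, state = heapq.heappop(heap)
        children = list(expand(state))
        if not children:
            resolutions.append(state)
            continue
        for child in children:
            heapq.heappush(heap, (-score(child), next(tie), child))
    return max(resolutions, key=score, default=None)


# Toy search space: each key can be rewritten into the listed successors.
graph = {"start": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"]}
scores = {"start": 0.0, "a": 0.9, "b": 0.4, "a1": 0.2, "a2": 0.8, "b1": 0.6}

best = best_first("start", lambda s: graph.get(s, []), scores.__getitem__)
print(best)  # a2
```

With a very short timeout the loop may stop before the best resolution has been reached, which is the trade-off the timeout accepts.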

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.
