
.. image:: https://travis-ci.org/5j9/wikitextparser.svg?branch=master
    :target: https://travis-ci.org/5j9/wikitextparser

.. image:: https://codecov.io/github/5j9/wikitextparser/coverage.svg?branch=master
    :target: https://codecov.io/github/5j9/wikitextparser

.. image:: https://readthedocs.org/projects/wikitextparser/badge/?version=latest
    :target: http://wikitextparser.readthedocs.io/en/latest/?badge=latest

==============
WikiTextParser
==============

.. Quick Start Guide

A simple-to-use WikiText parsing library for `MediaWiki <https://www.mediawiki.org/wiki/MediaWiki>`_.

The purpose is to allow users to easily extract and/or manipulate templates, template parameters, parser functions, tables, external links, wikilinks, lists, etc. found in wikitext.

.. contents:: Table of Contents

Installation
------------

- Python 3.5+ is required
- ``pip install 'setuptools>=36.2.1'``
- ``pip install wikitextparser``

Usage
-----

.. code:: python

    >>> import wikitextparser as wtp

WikiTextParser can detect sections, parser functions, templates, wiki links, external links, arguments, tables, wiki lists, and comments in your wikitext. The following sections are a quick overview of some of these functionalities.

You may also want to have a look at the test modules for more examples and probable pitfalls (expected failures).

Templates
---------

.. code:: python

>>> parsed = wtp.parse("{{text|value1{{text|value2}}}}")
>>> parsed.templates
[Template('{{text|value1{{text|value2}}}}'), Template('{{text|value2}}')]
>>> parsed.templates[0].arguments
[Argument("|value1{{text|value2}}")]
>>> parsed.templates[0].arguments[0].value = 'value3'
>>> print(parsed)
{{text|value3}}

The pformat method returns a pretty-printed string for templates:

.. code:: python

    >>> parsed = wtp.parse('{{t1 |b=b|c=c| d={{t2|e=e|f=f}} }}')
    >>> t1, t2 = parsed.templates
    >>> print(t2.pformat())
    {{t2
        | e = e
        | f = f
    }}
    >>> print(t1.pformat())
    {{t1
        | b = b
        | c = c
        | d = {{t2
            | e = e
            | f = f
        }}
    }}

The Template.rm_dup_args_safe and Template.rm_first_of_dup_args methods can be used to clean up pages using `duplicate arguments in template calls <https://en.wikipedia.org/wiki/Category:Pages_using_duplicate_arguments_in_template_calls>`_:

.. code:: python

    >>> t = wtp.Template('{{t|a=a|a=b|a=a}}')
    >>> t.rm_dup_args_safe()
    >>> t
    Template('{{t|a=b|a=a}}')
    >>> t = wtp.Template('{{t|a=a|a=b|a=a}}')
    >>> t.rm_first_of_dup_args()
    >>> t
    Template('{{t|a=a}}')

Template parameters:

.. code:: python

    >>> param = wtp.parse('{{{a|b}}}').parameters[0]
    >>> param.name
    'a'
    >>> param.default
    'b'
    >>> param.default = 'c'
    >>> param
    Parameter('{{{a|c}}}')
    >>> param.append_default('d')
    >>> param
    Parameter('{{{a|{{{d|c}}}}}}')

WikiLinks
---------

.. code:: python

    >>> wl = wtp.parse('... [[title#fragment|text]] ...').wikilinks[0]
    >>> wl.title = 'new_title'
    >>> wl.fragment = 'new_fragment'
    >>> wl.text = 'X'
    >>> wl
    WikiLink('[[new_title#new_fragment|X]]')
    >>> del wl.text
    >>> wl
    WikiLink('[[new_title#new_fragment]]')

All WikiLink properties support get, set, and delete operations.
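
For example, continuing the snippet above, the fragment can be removed the same way (a small sketch based on the delete support just described; the repr follows the pattern of the earlier examples):

.. code:: python

    >>> del wl.fragment
    >>> wl
    WikiLink('[[new_title]]')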

Sections
--------

.. code:: python

>>> parsed = wtp.parse("""
... == h2 ==
... t2
... === h3 ===
... t3
... === h3 ===
... t3
... == h22 ==
... t22
... {{text|value3}}
... [[Z|X]]
... """)
>>> parsed.sections
[Section('\n'),
 Section('== h2 ==\nt2\n=== h3 ===\nt3\n=== h3 ===\nt3\n'),
 Section('=== h3 ===\nt3\n'),
 Section('=== h3 ===\nt3\n'),
 Section('== h22 ==\nt22\n{{text|value3}}\n[[Z|X]]\n')]
>>> parsed.sections[1].title = 'newtitle'
>>> print(parsed)

==newtitle==
t2
=== h3 ===
t3
=== h3 ===
t3
== h22 ==
t22
{{text|value3}}
[[Z|X]]
>>> del parsed.sections[1].title
>>>> print(parsed)

t2
=== h3 ===
t3
=== h3 ===
t3
== h22 ==
t22
{{text|value3}}
[[Z|X]]

Tables
------

Extracting cell values of a table:

.. code:: python

    >>> p = wtp.parse("""{|
    ... |  Orange    ||   Apple   ||   more
    ... |-
    ... |   Bread    ||   Pie     ||   more
    ... |-
    ... |   Butter   || Ice cream ||  and more
    ... |}""")
    >>> p.tables[0].data()
    [['Orange', 'Apple', 'more'],
     ['Bread', 'Pie', 'more'],
     ['Butter', 'Ice cream', 'and more']]

By default, values are arranged according to colspan and rowspan attributes:

.. code:: python

    >>> t = wtp.Table("""{| class="wikitable sortable"
    ... |-
    ... ! a !! b !! c
    ... |-
    ... !colspan = "2" | d || e
    ... |-
    ... |}""")
    >>> t.data()
    [['a', 'b', 'c'], ['d', 'd', 'e']]
    >>> t.data(span=False)
    [['a', 'b', 'c'], ['d', 'e']]

Calling the cells method of a Table returns table cells as Cell objects. Cell objects provide methods for getting or setting each cell's attributes or values individually:

.. code:: python

    >>> cell = t.cells(row=1, column=1)
    >>> cell.attrs
    {'colspan': '2'}
    >>> cell.set('colspan', '3')
    >>> print(t)
    {| class="wikitable sortable"
    |-
    ! a !! b !! c
    |-
    !colspan = "3" | d || e
    |-
    |}

HTML attributes of Table, Cell, and Tag objects are accessible via get_attr, set_attr, has_attr, and del_attr methods.
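
As a quick sketch of these accessors, continuing with the table from above (the exact reprs reflect the colspan value set earlier; signatures follow the method names listed above):

.. code:: python

    >>> cell = t.cells(row=1, column=1)
    >>> cell.has_attr('colspan')
    True
    >>> cell.get_attr('colspan')
    '3'
    >>> cell.set_attr('style', 'color: red;')  # add/overwrite an attribute
    >>> cell.del_attr('style')                 # and remove it again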

Lists
-----

The get_lists method provides access to lists within the wikitext.

.. code:: python

    >>> parsed = wtp.parse(
    ...     'text\n'
    ...     '* list item a\n'
    ...     '* list item b\n'
    ...     '** sub-list of b\n'
    ...     '* list item c\n'
    ...     '** sub-list of b\n'
    ...     'text'
    ... )
    >>> wikilist = parsed.get_lists()[0]
    >>> wikilist.items
    [' list item a', ' list item b', ' list item c']

The sublists method can be used to get all sub-lists of the current list or just sub-lists of specific items:

.. code:: python

    >>> wikilist.sublists()
    [WikiList('** sub-list of b\n'), WikiList('** sub-list of b\n')]
    >>> wikilist.sublists(1)[0].items
    [' sub-list of b']

It also accepts an optional pattern argument that works like the one in get_lists, except that the current list's pattern is automatically prepended to it:

.. code:: python

    >>> wikilist = wtp.WikiList('#a\n#b\n##ba\n#*bb\n#:bc\n#c', '\#')
    >>> wikilist.sublists()
    [WikiList('##ba\n'), WikiList('#*bb\n'), WikiList('#:bc\n')]
    >>> wikilist.sublists(pattern='\*')
    [WikiList('#*bb\n')]

One type of list can be converted to another using the convert method. Specifying the starting pattern of the desired lists makes them easier to find and improves performance:

.. code:: python

    >>> wl = wtp.WikiList(
    ...     ':*A1\n:*#B1\n:*#B2\n:*:continuing A1\n:*A2',
    ...     pattern=':\*'
    ... )
    >>> print(wl)
    :*A1
    :*#B1
    :*#B2
    :*:continuing A1
    :*A2
    >>> wl.convert('#')
    >>> print(wl)
    #A1
    ##B1
    ##B2
    #:continuing A1
    #A2

Tags
----

Accessing HTML tags:

.. code:: python

    >>> p = wtp.parse('text<ref name="c">citation</ref>\n<references/>')
    >>> ref, references = p.get_tags()
    >>> ref.name = 'X'
    >>> ref
    Tag('<X name="c">citation</X>')
    >>> references
    Tag('<references/>')

WikiTextParser is able to handle common usages of HTML and extension tags. However, it is not a full-fledged HTML parser and may fail on edge cases or malformed HTML input. Please open an issue on GitHub if you encounter bugs.

Miscellaneous
-------------

The parent and ancestors methods can be used to access a node's parent or ancestors, respectively:

.. code:: python

    >>> template_d = wtp.parse("{{a|{{b|{{c|{{d}}}}}}}}").templates[3]
    >>> template_d.ancestors()
    [Template('{{c|{{d}}}}'),
     Template('{{b|{{c|{{d}}}}}}'),
     Template('{{a|{{b|{{c|{{d}}}}}}}}')]
    >>> template_d.parent()
    Template('{{c|{{d}}}}')
    >>> _.parent()
    Template('{{b|{{c|{{d}}}}}}')
    >>> _.parent()
    Template('{{a|{{b|{{c|{{d}}}}}}}}')
    >>> _.parent()  # Returns None

Use the optional type_ argument if looking for ancestors of a specific type:

.. code:: python

    >>> parsed = wtp.parse('{{a|{{#if:{{b{{c<!---->}}}}}}}}')
    >>> comment = parsed.comments[0]
    >>> comment.ancestors(type_='ParserFunction')
    [ParserFunction('{{#if:{{b{{c<!---->}}}}}}')]

To delete/remove any object from its parents, use del object[:] or del object.string.
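
A minimal sketch of both forms (the exact whitespace in the result depends on the surrounding text):

.. code:: python

    >>> parsed = wtp.parse('a {{t1}} b {{t2}} c')
    >>> del parsed.templates[0][:]      # remove {{t1}} from the parsed text
    >>> del parsed.templates[0].string  # {{t2}} is now the first template
    >>> print(parsed)
    a  b  c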

The remove_markup function or plain_text method can be used to remove wiki markup:

.. code:: python

    >>> from wikitextparser import remove_markup, parse
    >>> s = "'''a'''<!--comment--> [[b|c]] [[d]]"
    >>> remove_markup(s)
    'a c d'
    >>> parse(s).plain_text()
    'a c d'

Compared with mwparserfromhell
------------------------------

`mwparserfromhell <https://github.com/earwig/mwparserfromhell>`_ is a mature and widely used library with nearly the same purpose as wikitextparser. The main reason that led me to create wikitextparser was that mwparserfromhell could not parse wikitext in certain situations that I needed it for. See mwparserfromhell's issues `40 <https://github.com/earwig/mwparserfromhell/issues/40>`_, `42 <https://github.com/earwig/mwparserfromhell/issues/42>`_, `88 <https://github.com/earwig/mwparserfromhell/issues/88>`_, and other related issues. In many of those situations wikitextparser may be able to give you more acceptable results.

Also note that wikitextparser still uses a `0.x.y version <https://semver.org/>`_, meaning that the API is not stable and may change in future versions.

The tokenizer in mwparserfromhell is written in C. Tokenization in wikitextparser is mostly done using the regex library, which is also implemented in C. I have not rigorously compared the two libraries in terms of performance, i.e. execution time and memory usage. In my limited experience, wikitextparser performs decently in realistic cases, should be able to compete, and may even have slight performance benefits in some situations.
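
If you want to compare the two on your own workload, here is a rough timing sketch (not a rigorous benchmark; it assumes both libraries are installed, and the sample text is made up — substitute a real page for meaningful numbers):

.. code:: python

    import timeit

    import mwparserfromhell as mwh
    import wikitextparser as wtp

    # Hypothetical sample text; repeat a small snippet to get a sizable input.
    text = '{{t|a=1|b={{u|c}}}} [[w|x]] plain text ' * 1000

    # Time template extraction in each library over 10 runs.
    print(timeit.timeit(lambda: wtp.parse(text).templates, number=10))
    print(timeit.timeit(lambda: mwh.parse(text).filter_templates(), number=10))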

If you have had a chance to compare these libraries in terms of performance or capabilities, please share your experience by opening an issue on GitHub.

Some of the unique features of wikitextparser are: providing access to individual cells of each table, pretty-printing templates, a WikiList class with rudimentary methods to work with `lists <https://www.mediawiki.org/wiki/Help:Lists>`_, and a few other functions.

Known issues and limitations
----------------------------

- The contents of templates/parameters are not known to offline parsers. For example, an offline parser cannot know whether the markup [[{{z|a}}]] should be treated as a wikilink or not; that depends on the inner workings of the {{z}} template. In these situations wikitextparser uses a best guess: [[{{z|a}}]] is treated as a wikilink (why else would anyone call a template inside wikilink markup? And even if it is not a wikilink, usually no harm is done).
- Localized namespace names are unknown, so for example [[File:...]] links are treated as normal wikilinks. mwparserfromhell has a similar issue, see `#87 <https://github.com/earwig/mwparserfromhell/issues/87>`_ and `#136 <https://github.com/earwig/mwparserfromhell/issues/136>`_. As a workaround, `Pywikibot <https://www.mediawiki.org/wiki/Manual:Pywikibot>`_ can be used to determine the namespace.
- `Linktrails <https://www.mediawiki.org/wiki/Help:Links>`_ are language dependent and are not supported (they are also `not supported by mwparserfromhell <https://github.com/earwig/mwparserfromhell/issues/82>`_). However, given the trail pattern and knowing that wikilink.span[1] is the ending position of a wikilink, it is possible to compute a WikiLink's linktrail; see the sketch after this list.
- Templates adjacent to external links are never considered part of the link. In reality, this depends on the contents of the template. Example: parse('http://example.com{{dead link}}').external_links[0].url == 'http://example.com'
- The list of `valid extension tags <https://www.mediawiki.org/wiki/Parser_extension_tags>`_ depends on the extensions installed on the wiki. The tags method currently only supports the ones used on English Wikipedia. A configuration option might be added in the future to address this issue.
- wikitextparser currently does not provide an `ast.walk <https://docs.python.org/3/library/ast.html#ast.walk>`_-like method yielding all descendant nodes.
- `Parser functions <https://www.mediawiki.org/wiki/Help:Extension:ParserFunctions>`_ and `magic words <https://www.mediawiki.org/wiki/Help:Magic_words>`_ are not evaluated.
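
As a sketch of the linktrail computation mentioned above (assuming the English trail pattern [a-z]+; as noted, the pattern is language dependent):

.. code:: python

    >>> import re
    >>> parsed = wtp.parse('[[apple]]s are red')
    >>> wl = parsed.wikilinks[0]
    >>> # wl.span[1] is the ending position of the wikilink in the text.
    >>> match = re.match('[a-z]+', str(parsed)[wl.span[1]:])
    >>> match.group() if match else ''
    's'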

Credits
-------

- `python <https://www.python.org/>`_
- `regex <https://bitbucket.org/mrabarnett/mrab-regex/>`_
- `wcwidth <https://github.com/jquast/wcwidth>`_