gambolputty / wiktionary-de-parser

Licence: MIT license

Extract data from German Wiktionary XML files. Allows you to add your own extraction methods 🚀

wiktionary-de-parser

This is a Python module to extract data from German Wiktionary XML files (for Python 3.7+). It allows you to add your own extraction methods.

Installation

pip install wiktionary-de-parser

Features

Extracts flexion tables, genus, IPA, language, lemma, part of speech (basic), syllables, raw Wikitext
Allows you to add your own extraction methods (pass them as argument)
Yields one record per section, not per page (a word can have multiple meanings, and therefore multiple sections on a Wiktionary page)
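Because records arrive per section, entries that belong to the same page can be regrouped by title afterwards. The following is a minimal sketch, assuming every record carries the 'title' key shown in the example output; the helper name group_by_title and the sample records are invented for illustration and are not part of the module.

```python
from collections import defaultdict


def group_by_title(records):
    """Collect per-section records under their page title (hypothetical helper)."""
    pages = defaultdict(list)
    for record in records:
        # Assumption: each record has a 'title' key, as in the example output
        pages[record['title']].append(record)
    return dict(pages)


# Invented sample records standing in for what Parser would yield
sample = [
    {'title': 'Bank', 'pos': {'Substantiv': []}},
    {'title': 'Bank', 'pos': {'Substantiv': []}},
    {'title': 'Abend', 'pos': {'Substantiv': []}},
]

pages = group_by_title(sample)
print({title: len(sections) for title, sections in pages.items()})
```

Note that this keeps every record in memory, so for a full dump it may be preferable to filter records first.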

Usage

from bz2 import BZ2File
from wiktionary_de_parser import Parser

bzfile_path = '/tmp/dewiktionary-latest-pages-articles-multistream.xml.bz2'
bz_file = BZ2File(bzfile_path)

for record in Parser(bz_file):
    # Skip records that are not German entries
    if record.get('lang_code') != 'de':
        continue
    # do stuff with 'record'

Note: This example loads a compressed German Wiktionary dump file that was downloaded beforehand.
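The language filter from the loop above can also be wrapped in a small reusable function. This is only a sketch: german_records and its limit parameter are hypothetical names, and the sample list stands in for the records that Parser yields.

```python
from itertools import islice


def german_records(records, limit=None):
    """Return an iterator over records whose 'lang_code' is 'de' (hypothetical helper).

    'limit' optionally caps the number of records, which is handy when
    experimenting with a large dump.
    """
    german = (r for r in records if r.get('lang_code') == 'de')
    return islice(german, limit)


# Invented sample records; with the real module you would pass Parser(bz_file)
sample = [
    {'lemma': 'Abend', 'lang_code': 'de'},
    {'lemma': 'evening', 'lang_code': 'en'},
    {'lemma': 'unknown'},  # no 'lang_code' key at all
    {'lemma': 'Morgen', 'lang_code': 'de'},
]

for record in german_records(sample, limit=1):
    print(record['lemma'])
```

Since the function only wraps generators, the input is consumed lazily and the dump is never loaded as a whole.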

Adding new extraction methods

An extraction method takes the following arguments:

title (string): The title of the current Wiktionary page
text (string): The Wikitext of the current word entry/section
current_record (Dict): A dictionary with all values of the current iteration (e.g. current_record['lang_code'])

It must return a Dict with the results, or False if the record could not be processed successfully.

# Create a new extraction method
def my_method(title, text, current_record):
    # do stuff; my_data is a placeholder for whatever you extract
    my_data = len(text)
    return {'my_field': my_data} if my_data else False

# Pass a list with all extraction methods to the class constructor:
for record in Parser(bz_file, custom_methods=[my_method]):
    # .get() avoids a KeyError in case the field was not set
    print(record.get('my_field'))
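As a slightly more concrete illustration of the contract described above, here is a sketch of an extraction method that counts wiki links in a section. The method name count_wikilinks, the field name 'wikilink_count', and the sample text are all invented for this example; only the three-argument signature and the Dict-or-False return value come from the description above.

```python
import re

# Matches simple [[target]] and [[target|label]] links (nested links are ignored)
WIKILINK = re.compile(r'\[\[[^\[\]]+\]\]')


def count_wikilinks(title, text, current_record):
    """Hypothetical extraction method: count [[...]] links in the Wikitext."""
    count = len(WIKILINK.findall(text))
    # Return a Dict with the result, or False if nothing was found
    return {'wikilink_count': count} if count else False


# Invented sample Wikitext, so the method can be tried without a dump file
sample_text = "Der [[Abend]] folgt auf den [[Nachmittag|späten Nachmittag]]."
print(count_wikilinks('Abend', sample_text, {}))
```

It would be passed to the parser in the same way as my_method, via custom_methods=[count_wikilinks].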

Output

Example output for the word "Abend":

{'flexion': {'Akkusativ Plural': 'Abende',
             'Akkusativ Singular': 'Abend',
             'Dativ Plural': 'Abenden',
             'Dativ Singular': 'Abend',
             'Genitiv Plural': 'Abende',
             'Genitiv Singular': 'Abends',
             'Genus': 'm',
             'Nominativ Plural': 'Abende',
             'Nominativ Singular': 'Abend'},
 'inflected': False,
 'ipa': ['ˈaːbn̩t', 'ˈaːbm̩t'],
 'lang': 'Deutsch',
 'lang_code': 'de',
 'lemma': 'Abend',
 'pos': {'Substantiv': []},
 'rhymes': ['aːbn̩t'],
 'syllables': ['Abend'],
 'title': 'Abend'}
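A record like this is a plain dictionary, so it can be post-processed with ordinary Python. The sketch below collects the distinct word forms from the 'flexion' table; the helper name inflected_forms is hypothetical, and it assumes that 'Genus' is the only non-form key in that table, as in the example output.

```python
def inflected_forms(record):
    """Return the sorted, distinct word forms of a record (hypothetical helper)."""
    flexion = record.get('flexion') or {}
    # Assumption: 'Genus' holds the grammatical gender, not a word form
    return sorted({form for key, form in flexion.items() if key != 'Genus'})


# Shortened copy of the "Abend" record shown above
abend = {
    'flexion': {'Akkusativ Plural': 'Abende',
                'Akkusativ Singular': 'Abend',
                'Dativ Plural': 'Abenden',
                'Dativ Singular': 'Abend',
                'Genitiv Plural': 'Abende',
                'Genitiv Singular': 'Abends',
                'Genus': 'm',
                'Nominativ Plural': 'Abende',
                'Nominativ Singular': 'Abend'},
    'lemma': 'Abend',
}

print(inflected_forms(abend))
```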

Development

This project uses Poetry.

Install Poetry.
Clone this repository.
Run poetry install inside the project folder to install dependencies.
Change wiktionary_de_parser/run.py to your needs.
Run poetry run python wiktionary_de_parser/run.py to run the parser, or poetry run pytest to run the tests.

License

MIT license