gambolputty / wiktionary-de-parser

Licence: MIT license
Extract data from German Wiktionary XML files. Allows you to add your own extraction methods 🚀

Programming Languages

python

Projects that are alternatives of or similar to wiktionary-de-parser

CISTEM
Stemmer for German
Stars: ✭ 33 (+50%)
Mutual labels:  german, german-language
german-tutorial
German tutorial for absolute beginners
Stars: ✭ 52 (+136.36%)
Mutual labels:  german, german-language
german-nouns
A list of ~100,000 German nouns and their grammatical properties compiled from WiktionaryDE as CSV file. Plus a module to look up the data and parse compound words.
Stars: ✭ 101 (+359.09%)
Mutual labels:  wiktionary, german-language
l2kurz
German short introduction to LaTeX
Stars: ✭ 19 (-13.64%)
Mutual labels:  german
TheoLog
Lecture materials for "Theoretische Informatik und Logik" (Theoretical Computer Science and Logic), Faculty of Computer Science, TU Dresden
Stars: ✭ 20 (-9.09%)
Mutual labels:  german
Legal-Entity-Recognition
A Dataset of German Legal Documents for Named Entity Recognition
Stars: ✭ 98 (+345.45%)
Mutual labels:  german
covid-ampel-widget
🚦 Traffic-light widget that shows the RKI's current 🦠 coronavirus figures (incidence) for the districts of 🇩🇪 Germany on a smartphone
Stars: ✭ 15 (-31.82%)
Mutual labels:  german-language
NumberRush
A number based React game to help you learn German numbers! 🇩🇪
Stars: ✭ 20 (-9.09%)
Mutual labels:  german
DAnki
DAnki: Automate deck creation for Anki to learn german
Stars: ✭ 16 (-27.27%)
Mutual labels:  german
Twelveish
🕛 Twelveish - Android Wear/Wear OS Watch Face
Stars: ✭ 29 (+31.82%)
Mutual labels:  german
docs
blaulichtSMS API (interface description)
Stars: ✭ 15 (-31.82%)
Mutual labels:  german
destatiscleanr
Imports and cleans data from official German statistical offices to jump-start the data analysis
Stars: ✭ 47 (+113.64%)
Mutual labels:  german
Deutsch-ohne-Tottasten
A german keyboard layout without dead keys
Stars: ✭ 26 (+18.18%)
Mutual labels:  german
urteile-gesetze-web
Web-Frontend des juristischen Informationssystems urteile-gesetze.de
Stars: ✭ 16 (-27.27%)
Mutual labels:  german
de.javascript.info
Modern JavaScript Tutorial in German
Stars: ✭ 33 (+50%)
Mutual labels:  german
paywallr
🔓 Web extension for reading articles locked behind paywalls of over 50 german newspapers, e.g. Frankfurter Allgemeine Zeitung, Leipziger Volkszeitung & Hamburger Abendblatt
Stars: ✭ 63 (+186.36%)
Mutual labels:  german
10kGNAD
Ten Thousand German News Articles Dataset for Topic Classification
Stars: ✭ 63 (+186.36%)
Mutual labels:  german
GENADEV OS
An AArch64 hobbyist OS for the Raspberry Pi 3 B+
Stars: ✭ 14 (-36.36%)
Mutual labels:  german
sequence tagging
Named Entity Recognition (LSTM + CRF + FastText) with models for [historic] German
Stars: ✭ 25 (+13.64%)
Mutual labels:  german
zork-german
German-Language Translation of Zork (Unreleased) (Infocom)
Stars: ✭ 30 (+36.36%)
Mutual labels:  german

wiktionary-de-parser

This is a Python module to extract data from German Wiktionary XML files (for Python 3.7+). It allows you to add your own extraction methods.

Installation

pip install wiktionary-de-parser

Features

  • Extracts inflection (flexion) tables, grammatical gender (genus), IPA, language, lemma, part of speech (basic), syllables, and raw Wikitext
  • Allows you to add your own extraction methods (pass them as an argument)
  • Yields one record per section, not per page (a word can have multiple meanings, so a single Wiktionary page can contain multiple sections)

Usage

from bz2 import BZ2File
from wiktionary_de_parser import Parser

bzfile_path = '/tmp/dewiktionary-latest-pages-articles-multistream.xml.bz2'
bz_file = BZ2File(bzfile_path)

for record in Parser(bz_file):
    # skip every record that is not a German entry
    if record.get('lang_code') != 'de':
        continue
    # do stuff with 'record'

Note: In this example we load a compressed Wiktionary dump file that was obtained from here.
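Once records are flowing, a common next step is to aggregate them. The following is a minimal sketch, not part of the library: count_parts_of_speech is a hypothetical helper, and it assumes each record is a dict whose 'pos' value is a dict keyed by part-of-speech name, as in the example output in the Output section. It accepts any iterable of records, so it is demonstrated on hand-written dicts rather than on a real dump.

```python
from collections import Counter


def count_parts_of_speech(records, lang_code='de'):
    """Count part-of-speech labels over the records of one language.

    Hypothetical helper: assumes record['pos'] is a dict keyed by
    part-of-speech name; records without that shape are skipped.
    """
    counts = Counter()
    for record in records:
        if record.get('lang_code') != lang_code:
            continue
        pos = record.get('pos')
        if isinstance(pos, dict):
            counts.update(pos.keys())
    return counts


# Hand-written stand-ins for parser output (illustrative data only).
sample_records = [
    {'title': 'Abend', 'lang_code': 'de', 'pos': {'Substantiv': []}},
    {'title': 'Morgen', 'lang_code': 'de', 'pos': {'Substantiv': []}},
    {'title': 'laufen', 'lang_code': 'de', 'pos': {'Verb': []}},
    {'title': 'evening', 'lang_code': 'en', 'pos': {'Substantiv': []}},
    {'title': 'kaputt', 'lang_code': 'de'},  # no 'pos' key: skipped
]

pos_counts = count_parts_of_speech(sample_records)
print(pos_counts.most_common())
```

With a real dump, the same helper could presumably be called as count_parts_of_speech(Parser(bz_file)), since the parser is iterated in the same way as the list above.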

Adding new extraction methods

An extraction method takes the following arguments:

  • title (string): The title of the current Wiktionary page
  • text (string): The Wikitext of the current word entry/section
  • current_record (Dict): A dictionary with all values of the current iteration (e.g. current_record['lang_code'])

It must return a Dict with the results, or False if the record could not be processed successfully.

# Create a new extraction method
def my_method(title, text, current_record):
    # do stuff; as a placeholder, count the characters of the Wikitext
    my_data = len(text)
    return {'my_field': my_data} if my_data else False

# Pass a list with all extraction methods to the class constructor:
for record in Parser(bz_file, custom_methods=[my_method]):
    # .get() avoids a KeyError in case the method returned False
    print(record.get('my_field'))
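As a more concrete illustration of the (title, text, current_record) contract, here is a sketch of an extraction method that collects wikilink targets from a section. The name extract_links, the 'links' field and the sample Wikitext are all made up for this example and are not part of the library; only the [[target]] and [[target|label]] link syntax is standard Wikitext.

```python
import re

# Matches [[target]], [[target|label]] and [[target#anchor]];
# only the target part is captured.
WIKILINK_RE = re.compile(r'\[\[([^\]|#]+)[^\]]*\]\]')


def extract_links(title, text, current_record):
    """Collect the distinct wikilink targets of a section, in order.

    Hypothetical extraction method: returns a dict on success and
    False when the section contains no links.
    """
    seen = []
    for match in WIKILINK_RE.finditer(text):
        target = match.group(1).strip()
        if target and target not in seen:
            seen.append(target)
    return {'links': seen} if seen else False


# Hand-written sample Wikitext, not copied from a real entry.
sample_text = (
    "{{Bedeutungen}}\n"
    ":[1] [[Tageszeit]] zwischen [[Nachmittag]] und [[Nacht]]\n"
    ":[2] siehe auch [[Nacht|die Nacht]]\n"
)

result = extract_links('Abend', sample_text, {'lang_code': 'de'})
print(result)
```

Because the function is plain Python with no dependency on the parser, it can be tested on small strings like this before being passed in via custom_methods=[extract_links].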

Output

Example output for the word "Abend":

{'flexion': {'Akkusativ Plural': 'Abende',
             'Akkusativ Singular': 'Abend',
             'Dativ Plural': 'Abenden',
             'Dativ Singular': 'Abend',
             'Genitiv Plural': 'Abende',
             'Genitiv Singular': 'Abends',
             'Genus': 'm',
             'Nominativ Plural': 'Abende',
             'Nominativ Singular': 'Abend'},
 'inflected': False,
 'ipa': ['ˈaːbn̩t', 'ˈaːbm̩t'],
 'lang': 'Deutsch',
 'lang_code': 'de',
 'lemma': 'Abend',
 'pos': {'Substantiv': []},
 'rhymes': ['aːbn̩t'],
 'syllables': ['Abend'],
 'title': 'Abend'}
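A record like this is an ordinary dict, so it can be post-processed without the parser. The sketch below assumes the 'flexion' layout shown above, where case/number keys map to word forms and a 'Genus' key holds the grammatical gender; inflected_forms is a hypothetical helper, and other entries may use additional keys that it does not handle.

```python
def inflected_forms(record):
    """Return the sorted distinct word forms from record['flexion'].

    Assumes the layout of the "Abend" example: every key except
    'Genus' maps to an inflected form of the lemma.
    """
    flexion = record.get('flexion') or {}
    return sorted({form for key, form in flexion.items() if key != 'Genus'})


# Subset of the "Abend" record shown above.
abend = {
    'flexion': {'Akkusativ Plural': 'Abende',
                'Akkusativ Singular': 'Abend',
                'Dativ Plural': 'Abenden',
                'Dativ Singular': 'Abend',
                'Genitiv Plural': 'Abende',
                'Genitiv Singular': 'Abends',
                'Genus': 'm',
                'Nominativ Plural': 'Abende',
                'Nominativ Singular': 'Abend'},
    'lemma': 'Abend',
}

forms = inflected_forms(abend)
gender = abend['flexion'].get('Genus')
print(forms, gender)
```

Collapsing the table into a set of forms is convenient for building a reverse lookup from an inflected form back to its lemma.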

Development

This project uses Poetry.

  1. Install Poetry.
  2. Clone this repository.
  3. Run poetry install inside of the project folder to install dependencies.
  4. Change wiktionary_de_parser/run.py to your needs.
  5. Run poetry run python wiktionary_de_parser/run.py to run the parser, or poetry run pytest to run the tests.

License

MIT © Gregor Weichbrodt
