
yohasebe / Wp2txt

License: MIT
WP2TXT extracts plain text data from a Wikipedia dump file (encoded in XML and compressed with Bzip2), stripping all MediaWiki markup and other metadata.

Programming Languages

ruby

Projects that are alternatives of or similar to Wp2txt

Sejong Corpus
Korean Sejong corpus download and simple analysis
Stars: ✭ 116 (-20%)
Mutual labels:  corpus
Khcoder
KH Coder: for Quantitative Content Analysis or Text Mining
Stars: ✭ 126 (-13.1%)
Mutual labels:  corpus
Git Wiki Theme
A revolutionary full-featured wiki for GitHub Pages and Jekyll. You don't need to compile it!
Stars: ✭ 139 (-4.14%)
Mutual labels:  wikipedia
Mwoffliner
Scrape any online MediaWiki-powered wiki (such as Wikipedia) to your local filesystem
Stars: ✭ 121 (-16.55%)
Mutual labels:  wikipedia
Dialog corpus
Corpora for training Chinese and English chatbot/dialogue systems
Stars: ✭ 1,662 (+1046.21%)
Mutual labels:  corpus
Awesome Chatbot
Awesome chatbot projects, corpora, papers, and tutorials, including Chinese chatbots
Stars: ✭ 1,785 (+1131.03%)
Mutual labels:  corpus
Datasets
Poetry-related datasets developed by THUAIPoet (Jiuge) group.
Stars: ✭ 111 (-23.45%)
Mutual labels:  corpus
Ultimate Java Resources
Java programming: an all-in-one Java learning resource, updated daily. Covers all algorithms and data structures along with Java development, from beginner to advanced. Join via the Discord link.
Stars: ✭ 143 (-1.38%)
Mutual labels:  wikipedia
Cluedatasetsearch
Search all Chinese NLP datasets, plus commonly used English NLP datasets
Stars: ✭ 2,112 (+1356.55%)
Mutual labels:  corpus
Prosody
Helsinki Prosody Corpus and A System for Predicting Prosodic Prominence from Text
Stars: ✭ 139 (-4.14%)
Mutual labels:  corpus
Isbntools
Python app/framework for 'all things ISBN', including metadata, descriptions, covers...
Stars: ✭ 122 (-15.86%)
Mutual labels:  wikipedia
Awesome Hungarian Nlp
A curated list of NLP resources for Hungarian
Stars: ✭ 121 (-16.55%)
Mutual labels:  corpus
Code Docstring Corpus
Preprocessed Python functions and docstrings for automated code documentation (code2doc) and automated code generation (doc2code) tasks.
Stars: ✭ 137 (-5.52%)
Mutual labels:  corpus
Mediawiker
Mediawiker is a plugin for the Sublime Text editor that lets you use it as a wiki editor on MediaWiki-based sites such as Wikipedia and many others.
Stars: ✭ 120 (-17.24%)
Mutual labels:  wikipedia
Wikit
Wikipedia summaries from the command line
Stars: ✭ 141 (-2.76%)
Mutual labels:  wikipedia
Colibri Core
Colibri Core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e., patterns with one or more gaps of either fixed or dynamic size) in a quick and memory-efficient way. At its core is the tool ``colibri-patternmodeller``, which allows you to build, view, manipulate, and query pattern models.
Stars: ✭ 112 (-22.76%)
Mutual labels:  corpus
Kiwix Js
Fully portable & lightweight ZIM reader in JavaScript
Stars: ✭ 130 (-10.34%)
Mutual labels:  wikipedia
Huggle3 Qt Lx
Huggle is an anti-vandalism tool for use on MediaWiki-based projects
Stars: ✭ 143 (-1.38%)
Mutual labels:  wikipedia
Clue
Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpora, and leaderboard
Stars: ✭ 2,425 (+1572.41%)
Mutual labels:  corpus
Gossiping Chinese Corpus
Chinese Q&A corpus from the PTT Gossiping board
Stars: ✭ 137 (-5.52%)
Mutual labels:  corpus

WP2TXT

Wikipedia dump file to text converter

IMPORTANT: This project is still a work in progress; it can be slow, unstable, and even destructive! Use it with caution.

About

WP2TXT extracts plain text data from a Wikipedia dump file (encoded in XML and compressed with Bzip2), stripping all MediaWiki markup and other metadata. It is originally intended for researchers who want an easy way to obtain open-source multilingual corpora, but it may be handy for other purposes as well.

UPDATE: Version 0.9.1 adds a new --num-threads option, which improves performance significantly. Note also that the --category option is now enabled by default, resulting in an output format somewhat different from previous versions. Check the new format against the test data in the data/output_samples folder before going on to convert a huge Wikipedia dump.
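
A minimal invocation using the new option might look like this (the dump file name below is a placeholder; adjust the thread count to your machine):

$ wp2txt -i xxwiki-yyyymmdd-pages-articles.xml.bz2 -o ./out --num-threads=4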

Features

  • Convert Wikipedia dump files in various languages (I hope).
  • Create output files of specified size.
  • Allow users to specify text elements to be extracted/converted (page titles, section titles, lists, and tables).

Installation

$ gem install wp2txt
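
After installation, you can check that the wp2txt executable is available using the --version option documented below:

$ wp2txt --version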

Usage

Obtain a Wikipedia dump file (from here) with a file name such as:

xxwiki-yyyymmdd-pages-articles.xml.bz2

where xx is a language code such as en (English) or ja (Japanese), and yyyymmdd is the date of creation (e.g. 20120601).
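
If you already know which dump you want, you can fetch it directly from the command line; the URL below assumes the standard dumps.wikimedia.org directory layout and reuses the example date above:

$ curl -O https://dumps.wikimedia.org/enwiki/20120601/enwiki-20120601-pages-articles.xml.bz2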

Command line options are as follows:

Usage: wp2txt [options]
where [options] are:
           --input-file, -i:   Wikipedia dump file with .bz2 (compressed) or
                               .txt (uncompressed) format
       --output-dir, -o <s>:   Output directory (default: current directory)
--convert, --no-convert, -c:   Output in plain text (converting from XML)
                               (default: true)
      --list, --no-list, -l:   Show list items in output (default: true)
--heading, --no-heading, -d:   Show section titles in output (default: true)
    --title, --no-title, -t:   Show page titles in output (default: true)
                --table, -a:   Show table source code in output
               --inline, -n:   leave inline template notations unmodified
            --multiline, -m:   leave multiline template notations unmodified
                  --ref, -r:   leave reference notations in the format
                               [ref]...[/ref]
             --redirect, -e:   Show redirect destination
  --marker, --no-marker, -k:   Show symbols prefixed to list items,
                               definitions, etc. (Default: true)
             --category, -g:   Show article category information
        --file-size, -f <i>:   Approximate size (in MB) of each output file
                               (default: 10)
      -u, --num-threads=<i>:   Number of threads to be spawned (capped to the number of CPU cores; 
                               set 99 to spawn max num of threads) (default: 4)
              --version, -v:   Print version and exit
                 --help, -h:   Show this message
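
For example, the following command (file names are placeholders) would convert an English dump into plain text files of roughly 50 MB each, including table source code and suppressing list markers, using only the options documented above:

$ wp2txt -i enwiki-20120601-pages-articles.xml.bz2 -o ./text -f 50 -a --no-marker --num-threads=4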

Caveats

  • Certain types of data, such as mathematical equations and computer source code, are not properly converted. Please remember that this software is originally intended for collecting “sentences” for linguistic studies.
  • Extraction of normal text data can sometimes fail for various reasons (e.g., mismatched begin/end tags, language-specific formatting conventions).
  • The conversion process can take far longer than you might expect. It could take several hours or more when dealing with a huge dataset such as the English Wikipedia on a low-spec machine.
  • Because of the nature of the task, WP2TXT requires a lot of computing power and consumes considerable memory and storage. The process could therefore halt unexpectedly; in the worst case it may even get stuck without terminating gracefully. Please understand this and use the software at your own risk.

Useful Link

Author

References

The author would appreciate your mentioning one of these in your research.

License

This software is distributed under the MIT License. Please see the LICENSE file.
