
marcusklang / Wikiforia

Licence: gpl-2.0
A Utility Library for Wikipedia dumps

Programming Languages

java
68154 projects - #9 most used programming language

Projects that are alternatives of or similar to Wikiforia

cti-stix-elevator
OASIS Cyber Threat Intelligence (CTI) TC Open Repository: Convert STIX 1.2 XML to STIX 2.x JSON
Stars: ✭ 42 (+35.48%)
Mutual labels:  converter, xml
Gelatin
Transform text files to XML, JSON, or YAML
Stars: ✭ 150 (+383.87%)
Mutual labels:  xml, converter
I7j Pdfhtml
pdfHTML is an iText 7 add-on for Java that allows you to easily convert HTML and CSS into standards compliant PDFs that are accessible, searchable and usable for indexing.
Stars: ✭ 104 (+235.48%)
Mutual labels:  xml, converter
Node Js2xmlparser
Popular Node.js module for parsing JavaScript objects into XML
Stars: ✭ 171 (+451.61%)
Mutual labels:  xml, converter
json2xml
json to xml converter in python3
Stars: ✭ 76 (+145.16%)
Mutual labels:  converter, xml
wikitable2csv
A web tool to convert Wiki tables to CSV 📈
Stars: ✭ 112 (+261.29%)
Mutual labels:  converter, wikipedia
Goxml2json
XML to JSON converter written in Go (no schema, no structs)
Stars: ✭ 170 (+448.39%)
Mutual labels:  xml, converter
FigmaConvertXib
FigmaConvertXib is a tool for exporting design elements from figma.com and generating files for a project's iOS .xib / Android .xml
Stars: ✭ 111 (+258.06%)
Mutual labels:  converter, xml
Wikiteam
Tools for downloading and preserving wikis. We archive wikis, from Wikipedia to tiniest wikis. As of 2020, WikiTeam has preserved more than 250,000 wikis.
Stars: ✭ 404 (+1203.23%)
Mutual labels:  wikipedia, xml
Asciidoctor Pdf
📃 Asciidoctor PDF: A native PDF converter for AsciiDoc based on Asciidoctor and Prawn, written entirely in Ruby.
Stars: ✭ 868 (+2700%)
Mutual labels:  converter
Raun
Tool to watch the recent changes of Wikimedia Foundation projects, live.
Stars: ✭ 15 (-51.61%)
Mutual labels:  wikipedia
Xlconverter
Convert Excel File to Objects
Stars: ✭ 11 (-64.52%)
Mutual labels:  converter
Wiki Degrees
Calculator for finding the degrees of separation and the shortest path between two Wikipedia articles.
Stars: ✭ 12 (-61.29%)
Mutual labels:  wikipedia
Apps Android Wikiedudashboard
Access WikiEdu Dashboard from Android App.
Stars: ✭ 20 (-35.48%)
Mutual labels:  wikipedia
Locator Tool
Tool to add {{Location}} or {{Object location}} to images on Wikimedia Commons
Stars: ✭ 11 (-64.52%)
Mutual labels:  wikipedia
Tdeskdroid
Telegram Desktop to Android theme converter
Stars: ✭ 28 (-9.68%)
Mutual labels:  converter
Arf Converter
Bulk ARF file converter
Stars: ✭ 10 (-67.74%)
Mutual labels:  converter
Gulp Xslt
XSLT transformation plugin for gulp
Stars: ✭ 9 (-70.97%)
Mutual labels:  xml
Evreflection
Reflection based (Dictionary, CKRecord, NSManagedObject, Realm, JSON and XML) object mapping with extensions for Alamofire and Moya with RxSwift or ReactiveSwift
Stars: ✭ 954 (+2977.42%)
Mutual labels:  xml
Yaidom
Yet another immutable XML DOM-like API
Stars: ✭ 27 (-12.9%)
Mutual labels:  xml

Wikiforia

What is it?

It is a library and a tool for parsing Wikipedia XML dumps and converting them into plain text for other tools to use.

Why use it?

Subjectively, it generates good results and is reasonably fast. On my laptop (4 physical cores, 8 logical threads, 2.3 GHz Haswell Core i7) it averages about 6000 pages/sec, or roughly 10 minutes for a 2014-08-18 Swedish Wikipedia dump (6000 pages/sec over 10 minutes works out to about 3.6 million pages). Your results may of course vary.

How to use?

Download a multistream Wikipedia bzip2 dump. It consists of two files: one with the index and one with the pages.

For the Swedish Wikipedia dump from 2014-08-18, the files are named:

svwiki-20140818-pages-articles-multistream-index.txt.bz2
svwiki-20140818-pages-articles-multistream.xml.bz2

Make sure the file names are kept intact; otherwise the automatic language resolution does not work. English is used as the default language, and the language configuration does affect parsing quality.
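The exact resolution logic is internal to Wikiforia; purely as an illustration (the class and method names here are hypothetical, not Wikiforia's actual code), a minimal sketch of the kind of prefix parsing that the standard dump naming convention makes possible:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not Wikiforia's actual code.
public class LanguageFromFilename {

    // Standard dump names look like:
    // svwiki-20140818-pages-articles-multistream.xml.bz2
    private static final Pattern DUMP_NAME =
            Pattern.compile("^([a-z\\-]+)wiki-\\d{8}-pages-articles-multistream.*");

    // Returns the language code prefix, e.g. "sv", falling back to "en".
    public static String resolveLanguage(String filename) {
        Matcher m = DUMP_NAME.matcher(filename);
        return m.matches() ? m.group(1) : "en";
    }

    public static void main(String[] args) {
        // Prints "sv"
        System.out.println(resolveLanguage(
                "svwiki-20140818-pages-articles-multistream.xml.bz2"));
    }
}

If the files are renamed, no language prefix can be recovered, which is why the default (English) configuration would be used instead.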

Both compressed files must be placed in the same directory for the command below to work properly.

To run it all, go to the dist/ directory in your terminal and run:

java -jar wikiforia-1.0-SNAPSHOT.jar 
     -pages [path to the file ending with multistream.xml.bz2] 
     -output [output xml path]

This will run with default settings, i.e. the number of cores you have and a batch size of 100. These settings can be overridden; for a full listing of options, just run:

java -jar wikiforia-1.0-SNAPSHOT.jar

Output

The output from the tool is an XML file with the following structure (example data):

<?xml version="1.0" encoding="utf-8"?>
<pages>

<page id="4" title="Alfred" revision="1386155063000" type="text/x-wiki" ns-id="0" ns-name="">Alfred, 
with a new line</page>

<page id="10" title="Template:Infobox" revision="1386155040000" type="text/x-wiki" ns-id="10" ns-name="Template">Template stuff</page>
</pages>

Attribute information

id
The Wikipedia page id.
title
The title of the Wikipedia page.
revision
The revision as given by the dump, converted to milliseconds since the UNIX epoch.
type
The format; always text/x-wiki in this version of the tool.
ns-id
The namespace id; 0 is the principal namespace, which contains all articles. See [Namespaces at Wikipedia](http://en.wikipedia.org/wiki/Wikipedia:Namespace) for more information.
ns-name
The localized name of the namespace; for 0 it is usually just an empty string.
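
Downstream tools can consume this output with any streaming XML parser. As a minimal sketch (plain StAX from the JDK, not part of Wikiforia itself; the class name is just for illustration), this reads an output file and prints each page title with its text length:

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class ReadPages {
    public static void main(String[] args) throws Exception {
        // args[0] is the path to the XML file produced by Wikiforia.
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream(args[0]), "utf-8");
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "page".equals(reader.getLocalName())) {
                String title = reader.getAttributeValue(null, "title");
                String text = reader.getElementText(); // text may span multiple lines
                System.out.println(title + ": " + text.length() + " chars");
            }
        }
        reader.close();
    }
}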

Plaintext export

Contributed by @smhumayun: support for a plain text output format on top of the existing XML format. Use case: extracting text only from Wikipedia, e.g. to use as a corpus for machine learning experiments.

To run it, download wikiforia-x.y.z.jar from the dist/ directory, open your terminal, cd to the download location, and run:

java -jar wikiforia-x.y.z.jar 
     -pages [path to the file ending with multistream.xml.bz2] 
     -output [output path]
     -outputformat plain-text

Remarks

Empty articles, for which no text could be found, are not included. This covers redirects and most templates and categories, because they contain no useful text. If you use the API, you can still extract this information.

Language support

270 language-specific configurations have been generated from the publicly available Wikimedia source tree. The quality of these autogenerated configurations is uncertain, as they are untested. Please confirm that your language works, or report it if it does not, so that the issue can be mitigated.

English is used as the fallback language when parsing.

API

The code can also be used directly to extract more information.

More information about the API will be added; for now, take a look at se.lth.cs.nlp.wikiforia.App and the convert method to get an idea of how to use the code.
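
Until the API is documented in more detail, one low-commitment option is to drive the command-line entry point from code. A minimal sketch, assuming se.lth.cs.nlp.wikiforia.App exposes the jar's main method as suggested above (the paths are placeholders):

import se.lth.cs.nlp.wikiforia.App;

public class ConvertExample {
    public static void main(String[] args) throws Exception {
        // Same options as on the command line.
        App.main(new String[] {
                "-pages", "svwiki-20140818-pages-articles-multistream.xml.bz2",
                "-output", "output.xml"
        });
    }
}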

Credits

Peter Exner, the author of KOSHIK. The Sweble code is partially based on the KOSHIK version.

Sweble, developed by the Open Source Research Group at the Friedrich-Alexander-University of Erlangen-Nuremberg. This library is used to parse the Wikimarkup.

Woodstox, a fast XML parser, used to parse the XML input and write the XML output.

Apache Commons, a collection of useful and excellent libraries. Commons CLI is used for the command-line options.

Wikipedia, without which this project would be useless. Test data has been extracted from the Swedish Wikipedia and is covered by the CC BY-SA 3.0 licence.

Licence

The licence is GPLv2.
