
spencermountain / Wtf_wikipedia

Licence: mit
a pretty-committed wikipedia markup parser

Programming Languages

javascript
184084 projects - #8 most used programming language

Labels

Projects that are alternatives of or similar to Wtf wikipedia

semantic-document-relations
Implementation, trained models and result data for the paper "Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles"
Stars: ✭ 21 (-95.58%)
Mutual labels:  wikipedia
OA-signalling
A project to coordinate implementing a system to signal whether references cited on Wikipedia are free to reuse
Stars: ✭ 19 (-96%)
Mutual labels:  wikipedia
Jivesearch
A search engine that doesn't track you.
Stars: ✭ 364 (-23.37%)
Mutual labels:  wikipedia
wikicrush
Processor scripts for Wikipedia dumps to crush them into a dense binary format that is easy to pathfind with.
Stars: ✭ 46 (-90.32%)
Mutual labels:  wikipedia
verssion
RSS feeds of stable release versions, as found in Wikipedia.
Stars: ✭ 15 (-96.84%)
Mutual labels:  wikipedia
Wit
WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.
Stars: ✭ 271 (-42.95%)
Mutual labels:  wikipedia
TWLight
Library Card Platform for The Wikipedia Library
Stars: ✭ 55 (-88.42%)
Mutual labels:  wikipedia
Wikiteam
Tools for downloading and preserving wikis. We archive wikis, from Wikipedia to tiniest wikis. As of 2020, WikiTeam has preserved more than 250,000 wikis.
Stars: ✭ 404 (-14.95%)
Mutual labels:  wikipedia
copyvios
A copyright violation detector running on Wikimedia Cloud Services
Stars: ✭ 32 (-93.26%)
Mutual labels:  wikipedia
Adam qas
ADAM - A Question Answering System. Inspired from IBM Watson
Stars: ✭ 330 (-30.53%)
Mutual labels:  wikipedia
ngx proxy wiki
Wikipedia Reverse Proxy
Stars: ✭ 44 (-90.74%)
Mutual labels:  wikipedia
WikimediaUI-Style-Guide
Wikimedia Design Style Guide with user interface focus, authored by Wikimedia Foundation Design team.
Stars: ✭ 93 (-80.42%)
Mutual labels:  wikipedia
Wikipedia Map
A web app for visualizing the connections between Wikipedia pages.
Stars: ✭ 302 (-36.42%)
Mutual labels:  wikipedia
pywikibot-scripts
Own pywikibot scripts (for Wikimedia projects)
Stars: ✭ 16 (-96.63%)
Mutual labels:  wikipedia
Wptools
Wikipedia tools (for Humans): easily extract data from Wikipedia, Wikidata, and other MediaWikis
Stars: ✭ 371 (-21.89%)
Mutual labels:  wikipedia
CiteUnseen
https://en.wikipedia.org/wiki/User:SuperHamster/CiteUnseen
Stars: ✭ 13 (-97.26%)
Mutual labels:  wikipedia
Wikipediakit
Wikipedia API Client Framework for Swift on macOS, iOS, watchOS, and tvOS
Stars: ✭ 270 (-43.16%)
Mutual labels:  wikipedia
Mwparserfromhell
A Python parser for MediaWiki wikicode
Stars: ✭ 440 (-7.37%)
Mutual labels:  wikipedia
Kiwix Android
Kiwix for Android
Stars: ✭ 390 (-17.89%)
Mutual labels:  wikipedia
Fel
Fast Entity Linker Toolkit for training models to link entities to KnowledgeBase (Wikipedia) in documents and queries.
Stars: ✭ 319 (-32.84%)
Mutual labels:  wikipedia
wtf_wikipedia
parse data from wikipedia
npm install wtf_wikipedia
it is very, very hard. we're not joking.
const wtf = require('wtf_wikipedia')

wtf.fetch('Toronto Raptors').then((doc) => {
  doc.sentences(0).text()
  //'The Toronto Raptors are a Canadian professional basketball team ...'

  let coach = doc.infobox().get('coach')
  coach.text() //'Nick Nurse'
})

.text

get clean plaintext:

let str = `[[Greater_Boston|Boston]]'s [[Fenway_Park|baseball field]] has a {{convert|37|ft}} wall. <ref>Field of our Fathers: By Richard Johnson</ref>`
wtf(str).text()
// "Boston's baseball field has a 37ft wall."
let doc = await wtf.fetch('Glastonbury', 'en')
doc.text()
// 'Glastonbury is a town and civil parish in Somerset, England, situated at a dry point ...'

.json

get all the data from a page:

let doc = await wtf.fetch('Whistling')

doc.json()
// { categories: ['Oral communication', 'Vocal skills'], sections: [{ title: 'Techniques' }], ...}

the default json output is really verbose, but you can cherry-pick things like this:

// get just the links:
doc.links().map((link) => link.json())
//[{ page: 'Theatrical superstitions', text: 'superstitions' }]

// just the images:
doc.images(0).json()
// { file: 'Image:Duveneck Whistling Boy.jpg', url: 'https://commons.wiki...' }

// json for a particular section:
doc.sections('see also').links(0).json()
// { page: 'Slide Whistle' }

run it on the client-side:

<script src="https://unpkg.com/wtf_wikipedia"></script>
<script>
  // follow a redirect:
  wtf.fetch('On a Friday', function (err, doc) {
    let members = doc.infobox().get('current members')
    members.links().map((l) => l.page())
    //['Thom Yorke', 'Jonny Greenwood', 'Colin Greenwood'...]
  })
</script>

or from Deno/typescript/webpack:

import wtf from 'wtf_wikipedia'

full wikipedia dumps

With this library, in conjunction with dumpster-dive, you can parse the whole English Wikipedia in an afternoon.

npm install -g dumpster-dive
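
the exact flags are documented in the dumpster-dive readme; roughly, you point it at a downloaded xml dump and it writes the parsed documents into MongoDB, something like:

dumpster ./enwiki-latest-pages-articles.xml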

Tutorials

Plugins

these add all sorts of new functionality:

wtf.extend(require('wtf-plugin-classify'))
wtf.fetch('Toronto Raptors').then((doc) => doc.classify())
// 'Organization/SportsTeam'

wtf.extend(require('wtf-plugin-summary'))
wtf.fetch('Pulp Fiction').then((doc) => doc.summary())
// 'a 1994 American crime film'

wtf.extend(require('wtf-plugin-person'))
wtf.fetch('David Bowie').then((doc) => doc.birthDate())
// {year:1947, date:8, month:1}

wtf.extend(require('wtf-plugin-i18n'))
wtf.fetch('Ziggy Stardust', 'fr').then((doc) => {
  doc.infobox().json()
  //{ nom:{text:"Ziggy Stardust"}, oeuvre:{text:"The Rise and Fall of Ziggy Stardust"} }
})
Available plugins:

  • classify - person/place/thing
  • summary - short description text
  • person - birth/death information
  • category - parse all articles in a category
  • i18n - improves multilingual template coverage
  • wtf-mlb - fetch baseball data
  • wtf-nhl - fetch hockey data
  • nsfw - flag sexual/graphic/adult articles
  • image - additional methods for .images()
  • html - output html
  • wikitext - output wikitext
  • markdown - output markdown
  • latex - output latex

Ok first, 🛀

Wikitext is no small thing.

Consider:

this library supports many recursive shenanigans, deprecated and obscure template variants, and illicit wiki-shorthands.

What it does:

  • Detects and parses redirects and disambiguation pages
  • Parse infoboxes into a formatted key-value object
  • Handles recursive templates and links- like [[.. [[...]] ]]
  • Per-sentence plaintext and link resolution
  • Parse and format internal links
  • creates image thumbnail urls from File:XYZ.png filenames
  • Properly resolve dynamic templates like {{CURRENTMONTH}} and {{CONVERT ..}}
  • Parse images, headings, and categories
  • converts 'DMS-formatted' (59°12'7.7"N) geo-coordinates to lat/lng
  • parse and combine citation and reference metadata
  • Eliminate xml, latex, css, and table-sorting cruft

What it doesn't do:

  • external 'transcluded' page data [1]
  • AST output
  • smart (or 'pretty') formatting of html in infoboxes or galleries [1]
  • maintain perfect page order [1]
  • per-sentence references (by 'section' element instead)
  • maintain template or infobox css styling
  • large tables that span different sections [1]

It is built to be as flexible as possible. In all cases, it tries to fail in considerate ways.

How about html scraping..?

Wikimedia's official parser turns wikitext ➔ HTML.

if you prefer this screen-scraping workflow, you can pluck at parts of a page like that.

that's cool!

getting structured data this way is still a complex, weird process. Manually spelunking the html is sometimes just as tricky and error-prone as scanning the wikitext itself.

The contributors to this library have come to that conclusion, as many others have.

This library has (lovingly) borrowed a lot of code and data from the parsoid project, and is grateful to those contributors.

enough chat.

flip your wikitext into a Doc object

import wtf from 'wtf_wikipedia'

let txt = `
==Wood in Popular Culture==
* harry potter's wand
* the simpsons fence
`
wtf(txt)
// Document {text(), json(), lists()...}

doc.links()

let str = `Whistling is featured in a number of television shows, such as [[Lassie (1954 TV series)|''Lassie'']], and the title theme for ''[[The X-Files]]''.`
let doc = wtf(str)
doc.links().map((l) => l.page())
// [ 'Lassie (1954 TV series)',  'The X-Files' ]

doc.text()

returns nice plain-text of the article

var wiki =
  "[[Greater_Boston|Boston]]'s [[Fenway_Park|baseball field]] has a {{convert|37|ft}} wall.<ref>{{cite web|blah}}</ref>"
var text = wtf(wiki).text()
//"Boston's baseball field has a 37ft wall."

doc.sections():

a section is a heading '==Like This=='

wtf(page).sections(1).children() //traverse nested sections
wtf(page).sections('see also').remove() //delete one

doc.sentences()

let s = wtf(page).sentences(4)
s.links()
s.bolds()
s.italics()

doc.categories()

let doc = await wtf.fetch('Whistling')
doc.categories()
//['Oral communication', 'Vocal music', 'Vocal skills']

doc.images()

let img = wtf(page).images(0)
img.url() // the full-size wikimedia-hosted url
img.thumbnail() // 300px, by default
img.format() // jpg, png, ..

Fetch

This library can grab and automatically parse articles from any MediaWiki API. This includes any language, any wiki-project, and most 3rd-party wikis.

// 3rd-party wiki
let doc = await wtf.fetch('https://muppet.fandom.com/wiki/Miss_Piggy')

// wikipedia français
doc = await wtf.fetch('Tony Hawk', 'fr')
doc.sentences(0).text() // 'Tony Hawk est un skateboarder professionnel et un acteur ...'

// accept an array, or wikimedia pageIDs
let docs = await wtf.fetch(['Whistling', 2983], { follow_redirects: false })

// article from german wikivoyage
wtf.fetch('Toronto', { lang: 'de', wiki: 'wikivoyage' }).then((doc) => {
  console.log(doc.sentences(0).text()) // 'Toronto ist die Hauptstadt der Provinz Ontario'
})

you may also pass the wikipedia page id as a parameter instead of the page title:

let doc = await wtf.fetch(64646, 'de')

the fetch method follows redirects.

fetch categories:

wtf.category(title, [lang], [options | callback])

retrieves all pages and sub-categories belonging to a given category:

let result = await wtf.category('Category:Politicians_from_Paris')
//{
//  pages: [{title: 'Paul Bacon', pageid: 1266127 }, ...],
//  categories: [ {title: 'Category:Mayors of Paris' } ]
//}

to fetch and parse all pages in a category, in an optimized way, see wtf-plugin-category
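
the plain wtf.category() result can also be fed back into wtf.fetch() by hand, without the plugin - a rough sketch:

let res = await wtf.category('Category:Politicians_from_Paris')
let docs = await wtf.fetch(res.pages.map((p) => p.title))
// docs is an array of parsed Documents, one per page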

fetch random article:

wtf.random([lang], [options], [callback])

fetches a random wikipedia article, from a given language or domain

wtf.random().then((doc) => {
  console.log(doc.title(), doc.categories())
  //'Whistling'  ['Oral communication', 'Vocal skills']
})

Good practice:

The wikipedia api is pretty welcoming, though it recommends three things if you're going to hit it heavily -

  • pass an Api-User-Agent header, so they can easily identify and throttle misbehaving scripts
  • bundle multiple pages into one request, as an array (say, groups of 5?)
  • run requests serially, or at least slowly.

wtf
  .fetch(['Royal Cinema', 'Aldous Huxley'], 'en', {
    'Api-User-Agent': '[email protected]',
  })
  .then((docList) => {
    let links = docList.map((doc) => doc.links())
    console.log(links)
  })

API

  • .title() - get/set the title of the page from the first-sentence
  • .pageID() - get/set the wikimedia id of the page, if we have it.
  • .wikidata() - get/set the wikidata id of the page, if we have it.
  • .domain() - get/set the domain of the wiki we're on, if we have it.
  • .url() - (try to) generate the url for the current article
  • .lang() - get/set the current language (used for url method)
  • .namespace() - get/set the wikimedia namespace of the page, if we have it
  • .isRedirect() - if the page is just a redirect to another page
  • .redirectTo() - the page this redirects to
  • .isDisambiguation() - is this a placeholder page to direct you to one-of-many possible pages
  • .categories() -
  • .sections() - return a list, or given-index of the Document's sections
  • .paragraphs() - return a list, or given-index of Paragraphs, in all sections
  • .sentences() - return a list, or given-index of all sentences in the document
  • .images() -
  • .links() - return a list, or given-index of all links, in all parts of the document
  • .lists() - sections in a page where each line begins with a bullet point
  • .tables() - return a list, or given-index of all structured tables in the document
  • .templates() - any type of structured-data elements, typically wrapped like {{this}}
  • .infoboxes() - specific type of template, that appear on the top-right of the page
  • .references() - return a list, or given-index of 'citations' in the document
  • .coordinates() - geo-locations that appear on the page
  • .text() - plaintext, human-readable output for the page
  • .json() - a 'stringifyable' output of the page's main data
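
a quick sketch tying a few of these together (outputs are illustrative):

let doc = await wtf.fetch('Whistling')
doc.title() // 'Whistling'
doc.isRedirect() // false
doc.sections().map((sec) => sec.title()) // ['Techniques', ...]
doc.links().length // how many internal links the page has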

Section

  • .title() - the name of the section, between ==these tags==
  • .index() - which number section is this, in the whole document.
  • .indentation() - how many steps deep into the table of contents it is
  • .sentences() - return a list, or given-index, of sentences in this section
  • .paragraphs() - return a list, or given-index, of paragraphs in this section
  • .links() -
  • .tables() -
  • .templates() -
  • .infoboxes() -
  • .coordinates() -
  • .lists() -
  • .interwiki() - any links to other language wikis
  • .images() - return a list, or given index, of any images in this section
  • .references() - return a list, or given index, of 'citations' in this section
  • .remove() - remove the current section from the document
  • .nextSibling() - a section following this one, under the current parent: eg. 1920s → 1930s
  • .lastSibling() - a section before this one, under the current parent: eg. 1930s → 1920s
  • .children() - any sections more specific than this one: eg. History → [PreHistory, 1920s, 1930s]
  • .parent() - the section, broader than this one: eg. 1920s → History
  • .text() -
  • .json() -
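
for example, walking the table of contents with these methods (section names here are hypothetical):

let history = doc.sections('History')
history.indentation() // 0 - a top-level section
history.children().map((sec) => sec.title()) // ['PreHistory', '1920s', '1930s']
history.children()[0].parent().title() // 'History'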

Paragraph

  • .sentences() -
  • .references() -
  • .lists() -
  • .images() -
  • .links() -
  • .interwiki() -
  • .text() - generate readable plaintext for this paragraph
  • .json() - generate some generic data for this paragraph in JSON format
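
a small sketch, given a parsed doc:

let para = doc.paragraphs(0)
para.sentences().map((s) => s.text()) // plaintext of each sentence in the paragraph
para.links().map((l) => l.page()) // pages linked from this paragraph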

Sentence

  • .links() -
  • .bolds() -
  • .italics() -
  • .json() -

Image

  • .url() - return url to full size image
  • .thumbnail() - return url to thumbnail (pass size to customize)
  • .links() - any links from the caption (if present)
  • .format() - get file format (e.g. jpg)
  • .json() - return some generic metadata for this image
  • .text() - does nothing

Infobox

  • .links() -
  • .keyValue() - generate simple key:value strings from this infobox
  • .image() - grab the main image from this infobox
  • .get() - lookup properties from their key
  • .template() - which infobox, eg 'Infobox Person'
  • .text() - generate readable plaintext for this infobox
  • .json() - generate some generic 'stringifyable' data for this infobox
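
echoing the Toronto Raptors example near the top of this readme (values are illustrative):

let infobox = doc.infobox()
infobox.get('coach').text() // 'Nick Nurse'
infobox.keyValue() // { coach: 'Nick Nurse', ... }
infobox.template() // eg 'infobox nba team'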

List

  • .lines() - get an array of each member of the list
  • .links() - get all links mentioned in this list
  • .text() - generate readable plaintext for this list
  • .json() - generate some generic easily-parsable data for this list
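
using the 'Wood in Popular Culture' snippet from earlier, and assuming each list member behaves like a sentence:

let list = wtf(txt).lists(0)
list.lines().map((line) => line.text())
// [ "harry potter's wand", 'the simpsons fence' ]
list.links() // any links mentioned in those lines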

Reference

  • .title() - generate human-facing text for this reference
  • .links() - get any links mentioned in this reference
  • .text() - returns nothing
  • .json() - generate some generic metadata data for this reference

Table

  • .links() - get any links mentioned in this table
  • .keyValue() - generate a simple list of key:value objects for this table
  • .text() - returns nothing
  • .json() - generate some useful metadata data for this table
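
a rough sketch of pulling rows out of the first table on a page (the keys come from the table's own headers, so these values are illustrative):

let table = doc.tables(0)
table.keyValue() // [ { Year: '1994', Film: 'Pulp Fiction' }, ... ]
table.json() // generic metadata for the table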

Configuration

Adding new methods:

you can add new methods to any class of the library, with wtf.extend()

wtf.extend((models) => {
  // throw this method in there...
  models.Doc.prototype.isPerson = function () {
    return this.categories().find((cat) => cat.match(/people/))
  }
})

let doc = await wtf.fetch('Stephen Harper')
doc.isPerson() //hmm?

Adding new templates:

does your wiki use a {{foo}} template? Add a custom parser for it:

wtf.extend((models, templates) => {
  // create a custom parser function
  templates.foo = (text, data) => {
    data.templates.push({ name: 'foo', cool: true })
    return 'new-text'
  }

  // array-syntax allows easy-labeling of parameters
  templates.bar = ['a', 'b', 'c']

  // number-syntax for returning by param # '{{name|zero|one|two}}'
  templates.baz = 0

  // replace the template with a string '{{asterisk}}' -> '*'
  templates.asterisk = '*'
})

Notes:

3rd-party wikis

by default, a public API is provided by any installed MediaWiki application. This means that most wikis have an open api, even if they don't realize it. Some wikis may turn this feature off.

It can usually be found by visiting http://mywiki.com/api.php

to fetch pages from a 3rd-party wiki:

wtf.fetch('Kermit', { domain: 'muppet.fandom.com' }).then((doc) => {
  console.log(doc.text())
})

some wikis will change the path of their API, from ./api.php to elsewhere. If your api has a different path, you can set it like so:

wtf.fetch('[email protected]_FIL,_Lisbon', { domain: 'www.mixesdb.com', path: 'db/api.php' }).then((doc) => {
  console.log(doc.templates('player'))
})

for image-urls to work properly, the wiki should also have Special:Redirect enabled. Some wikis (like Wikia) have intentionally disabled this.

i18n and multi-language:

wikitext is (amazingly) used across all languages, wikis, and even in right-to-left languages. This parser actually does an okay job at it too.

Wikipedia i18n language information for Redirects, Infoboxes, Categories, and Images is included in the library, with pretty-decent coverage.

To improve coverage of i18n templates, use wtf-plugin-i18n

Please make a PR if you see something missing for your language.

Builds:

this library ships separate client-side and server-side builds, to preserve filesize.

the browser version uses fetch() and the server version uses require('https').

Performance:

It is not the fastest parser, and is very unlikely to beat a single-pass parser in C or Java.

Using dumpster-dive, this library can parse the full English Wikipedia in around 4 hours on a macbook.

That's about 100 pages/second, per thread.

See also:

alternative javascript parsers:

and many more!

MIT
