molybdenum-99 / Infoboxer

License: MIT
Wikipedia information extraction library

Programming Languages

ruby

Projects that are alternatives of or similar to Infoboxer

Mwoffliner
Scrape any online Mediawiki motorised wiki (like Wikipedia) to your local filesystem
Stars: ✭ 121 (-17.69%)
Mutual labels:  wikipedia, mediawiki
Huggle3 Qt Lx
Huggle is an anti-vandalism tool for use on MediaWiki based projects
Stars: ✭ 143 (-2.72%)
Mutual labels:  wikipedia, mediawiki
wikibot
Examples of MediaWiki bots (for Wikipedia and Wikidata) built with the MediaWiki module of the CeJS automation library.
Stars: ✭ 26 (-82.31%)
Mutual labels:  mediawiki, wikipedia
Mediawiki
MediaWiki API wrapper in python http://pymediawiki.readthedocs.io/en/latest/
Stars: ✭ 89 (-39.46%)
Mutual labels:  wikipedia, mediawiki
Apps Android Wikipedia
📱The official Wikipedia app for Android!
Stars: ✭ 1,350 (+818.37%)
Mutual labels:  wikipedia, mediawiki
discord-wiki-bot
Wiki-Bot is a bot with the purpose to easily search for and link to wiki pages. Wiki-Bot shows short descriptions and additional info about the pages and is able to resolve redirects and follow interwiki links.
Stars: ✭ 69 (-53.06%)
Mutual labels:  mediawiki, wikipedia
DiscordWikiBot
Discord bot for Wikimedia projects and MediaWiki wiki sites
Stars: ✭ 30 (-79.59%)
Mutual labels:  mediawiki, wikipedia
Wikipedia Mirror
🌐 Guide and tools to run a full offline mirror of Wikipedia.org with three different approaches: Nginx caching proxy, Kiwix + ZIM dump, and MediaWiki/XOWA + XML dump
Stars: ✭ 160 (+8.84%)
Mutual labels:  wikipedia, mediawiki
Wikiteam
Tools for downloading and preserving wikis. We archive wikis, from Wikipedia to tiniest wikis. As of 2020, WikiTeam has preserved more than 250,000 wikis.
Stars: ✭ 404 (+174.83%)
Mutual labels:  wikipedia, mediawiki
Wptools
Wikipedia tools (for Humans): easily extract data from Wikipedia, Wikidata, and other MediaWikis
Stars: ✭ 371 (+152.38%)
Mutual labels:  wikipedia, mediawiki
Linq To Wiki
.Net library to access MediaWiki API
Stars: ✭ 93 (-36.73%)
Mutual labels:  wikipedia, mediawiki
Jwiki
📖 A library for effortlessly interacting with Wikipedia/MediaWiki
Stars: ✭ 69 (-53.06%)
Mutual labels:  wikipedia, mediawiki
Mwclient
Python client library to interface with the MediaWiki API
Stars: ✭ 221 (+50.34%)
Mutual labels:  wikipedia, mediawiki
cassandra-GLAM-tools
Support GLAMs in monitoring and evaluating their cooperation with Wikimedia projects
Stars: ✭ 17 (-88.44%)
Mutual labels:  mediawiki, wikipedia
Mediawiki
🌻 The collaborative editing software that runs Wikipedia. Mirror from https://gerrit.wikimedia.org/g/mediawiki/core. See https://mediawiki.org/wiki/Developer_access for contributing.
Stars: ✭ 2,752 (+1772.11%)
Mutual labels:  wikipedia, mediawiki
wikiapi
JavaScript MediaWiki API for node.js
Stars: ✭ 28 (-80.95%)
Mutual labels:  mediawiki, wikipedia
copyvios
A copyright violation detector running on Wikimedia Cloud Services
Stars: ✭ 32 (-78.23%)
Mutual labels:  mediawiki, wikipedia
Mwparserfromhell
A Python parser for MediaWiki wikicode
Stars: ✭ 440 (+199.32%)
Mutual labels:  wikipedia, mediawiki
Mediawiker
Mediawiker is a plugin for Sublime Text editor that adds possibility to use it as Wiki Editor on Mediawiki based sites like Wikipedia and many other.
Stars: ✭ 120 (-18.37%)
Mutual labels:  wikipedia, mediawiki
Isbntools
python app/framework for 'all things ISBN' including metadata, descriptions, covers...
Stars: ✭ 122 (-17.01%)
Mutual labels:  wikipedia

Infoboxer

Infoboxer is a pure-Ruby Wikipedia (and generic MediaWiki) client and parser, targeting information extraction (hence the name).

It can be useful in tasks like:

  • getting a plaintext abstract of an article (the paragraphs before the first heading);
  • getting structured data variables from a page's infobox;
  • listing a page's sections and counting the paragraphs, images and tables in them;
  • converting some huge "comparison table" to data;
  • and much, much more!
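
To make the first task concrete, here is a deliberately naive sketch of what "paragraphs before the first heading" means for raw wikitext. It is plain Ruby and does not use Infoboxer at all: `naive_abstract` is a hypothetical helper, and the regular-expression approach is a simplification that ignores templates, markup and other things a real parser has to handle.

```ruby
# Hypothetical helper for illustration only; Infoboxer works on a parsed
# tree, not on regular expressions over raw wikitext.
def naive_abstract(wikitext)
  # Everything before the first "== Heading ==" line is the lead section.
  lead = wikitext.split(/^==.*==\s*$/, 2).first.to_s
  # Paragraphs are separated by blank lines; drop empty chunks.
  lead.split(/\n{2,}/).map(&:strip).reject(&:empty?)
end

sample = <<~WIKI
  Argentina is a country in South America.

  Its capital is Buenos Aires.

  == History ==
  Text of the history section.
WIKI

puts naive_abstract(sample)
```

The sketch only shows the shape of the task; real pages need a real parser.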

The whole idea is this: you can have any Wikipedia page as a parsed tree with an obvious structure, you can navigate that tree easily, and you have a bunch of high-level helper methods, so typical information extraction tasks should be super easy, one-liners in the best cases.

(For those already thinking "Why should you do this, we already have DBpedia?" -- please read the "Reasons" page in our wiki.)

Showcase

Infoboxer.wikipedia.
  get('Breaking Bad (season 1)').
  sections('Episodes').templates(name: 'Episode table').
  fetch('episodes').templates(name: /^Episode list/).
  fetch_hashes('EpisodeNumber', 'EpisodeNumber2', 'Title', 'ShortSummary')
# => [{"EpisodeNumber"=>#<Var(EpisodeNumber): 1>, "EpisodeNumber2"=>#<Var(EpisodeNumber2): 1>, "Title"=>#<Var(Title): Pilot>, "ShortSummary"=>#<Var(ShortSummary): Walter White, a 50-year old che...>},
#     {"EpisodeNumber"=>#<Var(EpisodeNumber): 2>, "EpisodeNumber2"=>#<Var(EpisodeNumber2): 2>, "Title"=>#<Var(Title): Cat's in the Bag...>, "ShortSummary"=>#<Var(ShortSummary): Walt and Jesse try to dispose o...>},
#     ...and so on

Do you feel it now?

You can also take a look at the Showcase.

Usage

Install gem

Install it as usual: add gem 'infoboxer' to your Gemfile, then run bundle install.

Or just [sudo] gem install infoboxer if you prefer.

Grab the page

# From English Wikipedia
page = Infoboxer.wikipedia.get('Argentina')
# or
page = Infoboxer.wp.get('Argentina')

# From other language Wikipedia:
page = Infoboxer.wikipedia('fr').get('Argentina')

# From any wiki with the same engine:
page = Infoboxer.wiki('http://companywiki.com').get('Our Product')

See more examples and options at Retrieving pages.
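
For orientation, the sketch below shows the kind of MediaWiki API request that can be used to fetch a page's source. It is an illustration only, not Infoboxer's client code: the helper name `page_source_url` is hypothetical, and the parameters are the standard MediaWiki `action=query` ones, which Infoboxer may or may not send in exactly this form.

```ruby
require 'uri'

# Hypothetical helper (not part of Infoboxer): builds a MediaWiki API URL
# that asks for the raw wikitext of a single page.
def page_source_url(api_url, title)
  query = URI.encode_www_form(
    action: 'query',
    prop: 'revisions',
    rvprop: 'content',
    rvslots: 'main',
    titles: title,
    format: 'json'
  )
  "#{api_url}?#{query}"
end

# Different language editions differ only in the host name.
EN_API = 'https://en.wikipedia.org/w/api.php'
FR_API = 'https://fr.wikipedia.org/w/api.php'

puts page_source_url(EN_API, 'Argentina')
puts page_source_url(FR_API, 'Argentina')
```

The script only builds and prints URLs; it makes no network requests.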

Play with page

Basically, a page is a tree of Nodes; you can think of it as a kind of DOM.

So, you can navigate it:

# Simple traversal and inspection
node = page.children.first.children.first
node.to_tree
node.to_text

# Various lookups
page.lookup(:Template, name: /^Infobox/)

See Tree navigation basics.
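
As a rough mental model of the DOM-like tree, the sketch below builds a tiny tree by hand and searches it by node type. `ToyNode` is a hypothetical stand-in written for this illustration; it is not Infoboxer's Node class, and its `lookup`, `to_text` and `to_tree` only imitate the general idea of the methods with the same names.

```ruby
# ToyNode is a hypothetical stand-in for illustration only;
# it is not Infoboxer's Node class.
class ToyNode
  attr_reader :type, :text, :children

  def initialize(type, text = nil, children = [])
    @type = type
    @text = text
    @children = children
  end

  # Depth-first search for all descendants of the given type.
  def lookup(type)
    children.flat_map do |child|
      found = child.type == type ? [child] : []
      found + child.lookup(type)
    end
  end

  # Concatenate the text of this node and everything below it.
  def to_text
    [text, *children.map(&:to_text)].compact.join
  end

  # Indented outline of the subtree, one node per line.
  def to_tree(level = 0)
    label = text ? "<#{type}> #{text.inspect}" : "<#{type}>"
    line = "#{'  ' * level}#{label}\n"
    line + children.map { |child| child.to_tree(level + 1) }.join
  end
end

page = ToyNode.new(:Page, nil, [
  ToyNode.new(:Paragraph, nil, [
    ToyNode.new(:Text, 'Argentina is a country in '),
    ToyNode.new(:Wikilink, 'South America'),
    ToyNode.new(:Text, '.')
  ]),
  ToyNode.new(:Heading, 'History')
])

puts page.to_tree
puts page.lookup(:Wikilink).map(&:to_text).inspect
puts page.children.first.to_text
```

Searching by type from any node, rather than walking children by hand, is what keeps typical extraction code short.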

On top of the basic navigation, Infoboxer adds some useful shortcuts for convenience and brevity, which allow things like this:

page.section('Episodes').tables.first

See Navigation shortcuts.

To see it all put together, also take a look at Data extraction tips and tricks.

infoboxer executable

Just try the infoboxer command.

Without any options, it starts an IRB session with Infoboxer required and included into the main namespace.

With the -w option, it provides a shortcut to the MediaWiki instance you want, like this:

$ infoboxer -w https://en.wikipedia.org/w/api.php
> get('Argentina')
 => #<Page(title: "Argentina", url: "https://en.wikipedia.org/wiki/Argentina"): ....

You can also use shortcuts like infoboxer -w wikipedia for common wikis (and, just for fun, infoboxer -wikipedia works too).

Advanced topics

  • Reasons for Infoboxer creation;
  • Parsing quality (TL;DR: very good, but not ideal);
  • Performance (TL;DR: 0.1-0.4 sec for parsing the largest pages);
  • Localization (TL;DR: for now, you'll need some work to use Infoboxer's most advanced features with non-English or non-Wikimedia wikis; basic and mid-level features always work);
  • If you plan to use Wikipedia or sister projects data in production, please consider Wikipedia terms and conditions.

Compatibility

As of now, Infoboxer is reported to be compatible with any MRI Ruby since 2.0.0 (1.9.3 was supported previously and dropped in Infoboxer 0.2.0). In Travis CI tests, JRuby fails due to a bug in old Java 7/Java 8 SSL certificate support (see here), and Rubinius fails 3 specs out of 500 for reasons that have not been investigated yet.

Therefore, those Ruby implementations are excluded from the Travis config, though they may still work for you.

License

MIT.