google-research-datasets / Wit

Licence: other
WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

Projects that are alternatives of or similar to Wit

Alfred Searchio
Alfred workflow to auto-suggest search results from multiple search engines and languages.
Stars: ✭ 250 (-7.75%)
Mutual labels:  wikipedia, multilingual
OA-signalling
A project to coordinate implementing a system to signal whether references cited on Wikipedia are free to reuse
Stars: ✭ 19 (-92.99%)
Mutual labels:  wikipedia
pywikibot-scripts
Own pywikibot scripts (for Wikimedia projects)
Stars: ✭ 16 (-94.1%)
Mutual labels:  wikipedia
verssion
RSS feeds of stable release versions, as found in Wikipedia.
Stars: ✭ 15 (-94.46%)
Mutual labels:  wikipedia
academic
Jekyll theme with a focus on simplicity, typography and flexibility
Stars: ✭ 71 (-73.8%)
Mutual labels:  multilingual
copyvios
A copyright violation detector running on Wikimedia Cloud Services
Stars: ✭ 32 (-88.19%)
Mutual labels:  wikipedia
CiteUnseen
https://en.wikipedia.org/wiki/User:SuperHamster/CiteUnseen
Stars: ✭ 13 (-95.2%)
Mutual labels:  wikipedia
Qbr
A webcam-based 3x3x3 rubik's cube solver written in Python 3 and OpenCV.
Stars: ✭ 122 (-54.98%)
Mutual labels:  multilingual
blazor-ui-messages
Localization messages for Telerik UI for Blazor components: https://www.telerik.com/blazor-ui
Stars: ✭ 24 (-91.14%)
Mutual labels:  multilingual
WikimediaUI-Style-Guide
Wikimedia Design Style Guide with user interface focus, authored by Wikimedia Foundation Design team.
Stars: ✭ 93 (-65.68%)
Mutual labels:  wikipedia
kwx
BERT, LDA, and TFIDF based keyword extraction in Python
Stars: ✭ 33 (-87.82%)
Mutual labels:  multilingual
react-translator-component
React language translation module for developing a multilingual project.
Stars: ✭ 13 (-95.2%)
Mutual labels:  multilingual
few-shot-lm
The source code of "Language Models are Few-shot Multilingual Learners" (MRL @ EMNLP 2021)
Stars: ✭ 32 (-88.19%)
Mutual labels:  multilingual
wikicrush
Processor scripts for Wikipedia dumps to crush them into a dense binary format that is easy to pathfind with.
Stars: ✭ 46 (-83.03%)
Mutual labels:  wikipedia
Odio
odio is now Strimio!
Stars: ✭ 262 (-3.32%)
Mutual labels:  multilingual
semantic-document-relations
Implementation, trained models and result data for the paper "Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles"
Stars: ✭ 21 (-92.25%)
Mutual labels:  wikipedia
context-cards
Wikipedia page previews for any site
Stars: ✭ 29 (-89.3%)
Mutual labels:  wikipedia
nuclear
Polymorphic and Multilingual CMS powered by Laravel
Stars: ✭ 31 (-88.56%)
Mutual labels:  multilingual
Laravel Translator
An Eloquent translator for Laravel
Stars: ✭ 275 (+1.48%)
Mutual labels:  multilingual
Wikipediakit
Wikipedia API Client Framework for Swift on macOS, iOS, watchOS, and tvOS
Stars: ✭ 270 (-0.37%)
Mutual labels:  wikipedia

WIT: Wikipedia-based Image Text Dataset

The Wikipedia-based Image Text (WIT) Dataset is a large multimodal multilingual dataset. WIT is composed of a curated set of 37.6 million entity-rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal machine learning models.

Key Advantages

A few unique advantages of WIT:

  • The largest multimodal dataset (at the time of this writing) by the number of image-text examples.
  • A massively multilingual dataset (first of its kind) with coverage of 100+ languages.
  • A diverse collection of concepts and real-world entities.
  • Challenging real-world test sets.

You can learn more about the WIT dataset in our arXiv paper.

WIT Example

Wikipedia Page

For example, let's take the Wikipedia page for Half Dome in Yosemite, CA.

WIT Wikipedia Half Dome Image

Wikipedia Page with Annotations of what we can Extract

From this page, we highlight the key pieces of data that we can extract: images, their respective text snippets, and some contextual metadata.

WIT Half Dome Page with Annotations

By extracting and filtering these carefully, we get clean, high-quality image-text examples that can be used in multimodal modeling.
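To make the extraction step concrete, here is a minimal, illustrative sketch (not the pipeline used to build WIT) of pulling images, captions, and section context from a live Wikipedia article with requests and BeautifulSoup. The CSS selectors and the User-Agent string are assumptions about Wikipedia's current HTML layout.

```python
# Illustrative sketch only -- this is NOT the pipeline used to build WIT.
# The selectors below are assumptions about Wikipedia's current HTML layout.
import requests
from bs4 import BeautifulSoup

def extract_image_text_pairs(page_url):
    """Return (image_url, caption, page_title, section_title) tuples for one article."""
    html = requests.get(page_url, headers={"User-Agent": "wit-demo/0.1"}).text
    soup = BeautifulSoup(html, "html.parser")
    page_title = soup.find("h1").get_text(strip=True)
    pairs = []
    for figure in soup.select("figure"):
        img, caption_tag = figure.find("img"), figure.find("figcaption")
        if img is None or caption_tag is None:
            continue
        caption = caption_tag.get_text(" ", strip=True)
        # Use the nearest preceding heading as lightweight contextual metadata.
        heading = figure.find_previous(["h2", "h3"])
        section_title = heading.get_text(" ", strip=True) if heading else ""
        pairs.append((img.get("src", ""), caption, page_title, section_title))
    return pairs

for pair in extract_image_text_pairs("https://en.wikipedia.org/wiki/Half_Dome")[:3]:
    print(pair)
```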

Motivation

Multimodal visio-linguistic models rely on rich datasets to help them learn to model the relationship between images and texts. Having large image-text datasets can significantly improve performance, as shown by recent works. Furthermore, the lack of language coverage in existing datasets (which are mostly English-only) also impedes research in the multilingual multimodal space – we consider this a lost opportunity given the potential of leveraging images (as a language-agnostic medium) to help improve multilingual textual understanding.

To address these challenges and advance research on multilingual, multimodal learning we created the Wikipedia-based Image Text (WIT) Dataset. WIT is created by extracting multiple different texts associated with an image (e.g., as shown in the above image) from Wikipedia articles and Wikimedia image links. This was accompanied by rigorous filtering to only retain high quality image-text sets.
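The released data went through the authors' own filtering pipeline, which is not reproduced here; the snippet below only sketches the kind of simple heuristics one could apply to candidate image-text pairs. The thresholds and the langdetect dependency are illustrative assumptions, not the actual WIT filters.

```python
# Illustrative quality heuristics -- not the actual filters used to build WIT.
# `langdetect` (pip install langdetect) is an assumed dependency for language ID.
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def keep_example(caption, image_url, expected_lang, seen, min_words=3):
    """Decide whether to keep one candidate image-text pair."""
    if len(caption.split()) < min_words:      # drop trivially short captions
        return False
    if (image_url, caption) in seen:          # drop exact duplicates
        return False
    try:
        if detect(caption) != expected_lang:  # caption should match the page language
            return False
    except LangDetectException:
        return False
    seen.add((image_url, caption))
    return True

seen = set()
print(keep_example("Half Dome as viewed from Glacier Point", "img1.jpg", "en", seen))
```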

The resulting dataset contains over 37.6 million image-text sets – making WIT the largest multimodal dataset (at the time of this writing) with unparalleled multilingual coverage – with 12K+ examples in each of 108 languages (53 languages have 100K+ image-text pairs).

WIT: Dataset Numbers

Type             Train     Val      Test     Total / Unique
Rows / Tuples    37.13M    261.8K   210.7K   37.6M
Unique Images    11.4M     58K      57K      11.5M
Ref. Text        16.9M     150K     104K     17.2M / 16.7M
Attr. Text       34.8M     193K     200K     35.2M / 10.9M
Alt Text         5.3M      29K      29K      5.4M / 5.3M
Context Texts    -         -        -        119.8M

WIT: Image-Text Stats by Language

Image-Text       # Lang    Uniq. Images     # Lang
total > 1M       9         images > 1M      6
total > 500K     10        images > 500K    12
total > 100K     36        images > 100K    35
total > 50K      15        images > 50K     17
total > 14K      38        images > 13K     38
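Tallies like the ones above could be recomputed with pandas once the data is released; the file name and the column names ("language", "image_url") in the sketch below are assumptions about the eventual TSV release format, not a documented schema.

```python
# Sketch of recomputing per-language tallies with pandas.
# The file name and column names ("language", "image_url") are assumptions
# about the eventual release format, not a documented schema.
import pandas as pd

df = pd.read_csv("wit_train_shard.tsv.gz", sep="\t", compression="gzip")

pairs_per_lang = df.groupby("language").size()
images_per_lang = df.groupby("language")["image_url"].nunique()

print("languages with >100K image-text pairs:", int((pairs_per_lang > 100_000).sum()))
print("languages with >100K unique images:   ", int((images_per_lang > 100_000).sum()))
```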

Get WIT

We believe that such a powerful and diverse dataset will aid researchers in building better multimodal multilingual models and in identifying better learning and representation techniques, leading to improved machine learning models on real-world tasks over visio-linguistic data.

Please stay tuned; we will share details on how to download the WIT dataset.
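In the meantime, a typical first step for multimodal pretraining on data like WIT is pairing each text with its downloaded image bytes. The sketch below assumes a row with image_url and caption fields (the field names are assumptions) and uses requests and Pillow; replace the placeholder URL with a real image URL once the data is available.

```python
# Sketch: pair a WIT-style row's text with its downloaded image.
# Field names ("image_url", "caption") are assumptions about the release format.
import io
import requests
from PIL import Image

def fetch_example(row):
    """Download the referenced image and return (PIL.Image, caption)."""
    resp = requests.get(row["image_url"],
                        headers={"User-Agent": "wit-demo/0.1"}, timeout=10)
    resp.raise_for_status()
    image = Image.open(io.BytesIO(resp.content)).convert("RGB")
    return image, row["caption"]

row = {
    "image_url": "https://upload.wikimedia.org/.../Half_Dome.jpg",  # placeholder URL
    "caption": "Half Dome as viewed from Glacier Point",
}
image, caption = fetch_example(row)
print(image.size, caption)
```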

Availability

We hope to make the WIT dataset available for download by March 20, 2021 (tentatively).

License

This data is available under the Creative Commons Attribution-ShareAlike 3.0 Unported license.

Contact

For any questions, please contact [email protected].

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].