
LuminosoInsight / exquisite-corpus

License: MIT
Put together a multilingual corpus from a variety of sources. Used for wordfreq and word embeddings.

This code represents the build process for wordfreq, among other things. I've made it public because it's good to know where the data in wordfreq comes from. However, I make no promises that you'll be able to run it if you don't work at Luminoso.

Dependencies

Exquisite Corpus makes use of various libraries and command-line tools to process data correctly and efficiently. Because it is meant to run on a development machine, it uses the best and fastest libraries it can, which leads to somewhat complex system requirements.

You will need these programming environments installed:

  • Python 3.4 or later
  • Haskell, installed with haskell-stack, used to compile and run wikiparsec

You also need certain tools to be available:

  • The C library for mecab (apt install libmecab-dev)
  • The ICU Unicode libraries (apt install libicu-dev)
  • The JSON processor jq (apt install jq)
  • The XML processor xml2 (apt install xml2)
  • The HTTP downloader curl (apt install curl)
  • wikiparsec (https://github.com/LuminosoInsight/wikiparsec)
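
If you want a quick sanity check that the command-line tools are on your PATH before starting a long build, a loop like this (an ad-hoc check, not part of the project) reports anything missing; the mecab and ICU entries above are C libraries rather than commands, so they aren't covered by it:

for tool in jq xml2 curl stack; do command -v "$tool" >/dev/null || echo "missing: $tool"; done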

Installation

Some steps here probably need to be filled in better.

  • Install system-level dependencies:
apt install python3-dev haskell-stack libmecab-dev libicu-dev jq xml2 curl
  • Clone, build, and install wikiparsec:
git clone https://github.com/LuminosoInsight/wikiparsec
cd wikiparsec
stack install
  • If you are building alignment files for the parallel corpus:

    • Compile fast_align by following the instructions at https://github.com/clab/fast_align
    • Create a symbolic link in this directory to the fast_align executable, which is found in the directory where fast_align was compiled (see the example after this list)
  • Finally, return to this directory and install exquisite-corpus itself, along with the Python dependencies it manages:

pip install -e .
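
As an example of the symbolic-link step above: if you compiled fast_align with cmake in ~/src/fast_align (a hypothetical path; use wherever you actually built it), the executable usually ends up in its build directory, and the link would be created from this directory with:

ln -s ~/src/fast_align/build/fast_align ./fast_align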

Getting data

Most of the data in Exquisite Corpus will be downloaded from places where it can be found on the Web. However, one input must be provided separately: Twitter data cannot be redistributed under the Twitter API's terms of use.

If you have a collection of tweets, put their text in data/raw/twitter-2015.txt, one tweet per line. Or just put an empty file there.
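
For example, to create the empty placeholder file:

mkdir -p data/raw
touch data/raw/twitter-2015.txt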

Building

Make sure you have lots of disk space available in the data directory, which may have to be a symbolic link to an external hard disk.
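
For example, assuming an external drive mounted at /mnt/bigdisk (a hypothetical mount point), you could keep the data there and link to it:

mkdir -p /mnt/bigdisk/exquisite-corpus-data
ln -s /mnt/bigdisk/exquisite-corpus-data data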

Run:

snakemake -j 8

...and wait a day or two for results, or a crash that may tell you what you need to fix.
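
If you want to see what Snakemake plans to do before committing a machine to it for days, a dry run (a standard Snakemake option, not specific to this project) lists the jobs without running them:

snakemake -n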

To build the parallel corpus, run ./build.sh parallel. If you want alignment files for an already-built parallel corpus, or if you want to build the parallel corpus and the alignments together, run ./build.sh alignment.
