
high-moctane / Nextword Data

Licence: other
Dataset for nextword.


Projects that are alternatives to or similar to Nextword Data

analytics.en
Collaborative technical documentation for Adobe Analytics
Stars: ✭ 12 (-60%)
Mutual labels:  english
Afinn
AFINN sentiment analysis in Python
Stars: ✭ 356 (+1086.67%)
Mutual labels:  english
Rhvoice
a free and open source speech synthesizer for Russian and other languages
Stars: ✭ 750 (+2400%)
Mutual labels:  english
The Economist
The Economist (经济学人), continuously updated
Stars: ✭ 2,995 (+9883.33%)
Mutual labels:  english
Ruby Hacking Guide.github.com
Ruby Hacking Guide Translation
Stars: ✭ 305 (+916.67%)
Mutual labels:  english
Silero Models
Silero Models: pre-trained STT models and benchmarks made embarrassingly simple
Stars: ✭ 522 (+1640%)
Mutual labels:  english
similar-english-words
Give me a word and I’ll give you an array of words that differ by a single letter.
Stars: ✭ 25 (-16.67%)
Mutual labels:  english
Meetup Presentations dc
R-Ladies Washington DC Chapter meetup presentations
Stars: ✭ 18 (-40%)
Mutual labels:  english
Dsmr Reader
DSMR-protocol reader, telegram data storage and energy consumption visualizer. Can be used for reading the smart meter DSMR (Dutch Smart Meter Requirements) P1 port yourself at your home. You will need a cable and hardware that can run Linux software. Free for non-commercial use. A Docker implementation can be found here: https://github.com/xirixiz/dsmr-reader-docker
Stars: ✭ 327 (+990%)
Mutual labels:  english
Mouse Dictionary
📘A super fast dictionary for Chrome/Firefox
Stars: ✭ 670 (+2133.33%)
Mutual labels:  english
Bertweet
BERTweet: A pre-trained language model for English Tweets (EMNLP-2020)
Stars: ✭ 282 (+840%)
Mutual labels:  english
Borgert Cms
Borgert is a CMS Open Source created with Laravel Framework 5.6
Stars: ✭ 298 (+893.33%)
Mutual labels:  english
Awesome English
A collection of awesome study resources for learners of English.
Stars: ✭ 560 (+1766.67%)
Mutual labels:  english
kengdic
Joe Speigle's Korean/English dictionary database
Stars: ✭ 76 (+153.33%)
Mutual labels:  english
Awesome Berlin
🇩🇪 A guide aiming to help newcomers to have a successful start in Berlin!
Stars: ✭ 753 (+2410%)
Mutual labels:  english
go-pluralize
Pluralize and singularize any word (golang adaptation of https://www.npmjs.com/package/pluralize)
Stars: ✭ 60 (+100%)
Mutual labels:  english
Most Frequent Technology English Words
Common English vocabulary that programmers encounter at work
Stars: ✭ 4,711 (+15603.33%)
Mutual labels:  english
Natas
Python 3 library for processing historical English
Stars: ✭ 28 (-6.67%)
Mutual labels:  english
Diacritics Map
Map of more than 1,200 diacritics and ligatures to english alphabet equivalents.
Stars: ✭ 17 (-43.33%)
Mutual labels:  english
Chrome Extension Udemy Translate
Translate Udemy's subtitles into Chinese, English, etc. (Disneyplus+netflix+udemy+lynda+hulu+hbo now+primevideo)
Stars: ✭ 553 (+1743.33%)
Mutual labels:  english

Nextword-data

A dataset for nextword.

Install

  1. (Recommended) Star this repository (`・ω・´)★

  2. Visit the releases page.

  3. Download the zip or tar.gz archive.

    You can choose either the larger or the smaller dataset.

    Variant  Zip size   Total size
    Small    152.2 MB   493.1 MB
    Large    483.3 MB   1.63 GB

  4. Decompress the downloaded data.

  5. Set the $NEXTWORD_DATA_PATH environment variable to the decompressed directory.

    Example:

    export NEXTWORD_DATA_PATH=/path/to/nextword-data
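
A program that consumes the dataset might resolve the directory from that variable along these lines. This is only a minimal Python sketch: the function name `nextword_data_dir` and the error messages are hypothetical and are not part of nextword itself.

```python
import os
from pathlib import Path


def nextword_data_dir(environ=os.environ) -> Path:
    """Return the dataset directory named by NEXTWORD_DATA_PATH.

    Hypothetical helper for illustration; raises RuntimeError when the
    variable is unset or does not point at an existing directory.
    """
    raw = environ.get("NEXTWORD_DATA_PATH")
    if not raw:
        raise RuntimeError("NEXTWORD_DATA_PATH is not set")
    path = Path(raw).expanduser()
    if not path.is_dir():
        raise RuntimeError(f"{path} is not a directory")
    return path
```

Passing the environment as a parameter keeps the check easy to exercise without touching the real process environment.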
    

Uninstall

  1. Remove the $NEXTWORD_DATA_PATH environment variable.

  2. Remove the nextword-data directory.

Format

Each line has the form: (n-1)-gram, tab, space-separated candidates, newline.

Candidates are sorted by how often they appear, most frequent first.

Example

You can find the line

empty milk	bottles carton bottle cartons cans

at line 59349 in file 3gram-e.txt.

This line says that "bottles" is the most likely word after "empty milk", and "carton" is the next most likely.
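
To make the format concrete, here is a minimal Python sketch that splits one such line into its prefix and candidate list, plus a simple linear scan over a file. The helper names `parse_line` and `lookup` are hypothetical, and the sketch assumes candidates are separated by spaces, as in the example line above.

```python
from typing import List, Tuple


def parse_line(line: str) -> Tuple[str, List[str]]:
    """Split '<(n-1)-gram>\\t<candidates>\\n' into (prefix, candidates)."""
    prefix, _, rest = line.rstrip("\n").partition("\t")
    return prefix, rest.split()


def lookup(path: str, prefix: str) -> List[str]:
    """Scan a data file and return the candidates for `prefix`, or []."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            key, candidates = parse_line(line)
            if key == prefix:
                return candidates
    return []


prefix, candidates = parse_line("empty milk\tbottles carton bottle cartons cans\n")
print(prefix)         # empty milk
print(candidates[0])  # bottles
```

A linear scan is the simplest thing that works; a real consumer would likely want something faster on files of this size.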

Recipe

  1. Fetch data.

    $ mkdir fetch
    $ nwgen-fetch fetch
    
  2. Run the following xonsh script.

    dstdir = "dstdir"
    mkdir -p @(dstdir)/format
    mkdir -p @(dstdir)/concat

    # Format each n-gram order with its own -min-count threshold,
    # running up to 8 nwgen-format processes in parallel (xargs -P 8).
    ls fetch | grep 1gram | xargs -P 8 -I fname nwgen-format -max-candidates 10000 -min-count 10000 @(dstdir)/format/fname fetch/fname

    ls fetch | grep 2gram | xargs -P 8 -I fname nwgen-format -max-candidates 10000 -min-count 2000 @(dstdir)/format/fname fetch/fname

    ls fetch | grep 3gram | xargs -P 8 -I fname nwgen-format -max-candidates 10000 -min-count 400 @(dstdir)/format/fname fetch/fname

    ls fetch | grep 4gram | xargs -P 8 -I fname nwgen-format -max-candidates 10000 -min-count 300 @(dstdir)/format/fname fetch/fname

    ls fetch | grep 5gram | xargs -P 8 -I fname nwgen-format -max-candidates 10000 -min-count 200 @(dstdir)/format/fname fetch/fname

    # Concatenate the formatted 1-gram files into a single file.
    nwgen-concat @(dstdir)/concat/1gram.txt.gz @(dstdir)/format/1gram*
    
    for n in [2,3,4,5]:
        for c in [chr(i) for i in range(97, 97+26)]:
            nwgen-concat @(dstdir)/concat/@(n)gram-@(c).txt.gz @(dstdir)/format/@(n)gram-@(c)*
    
    cp -R @(dstdir)/concat @(dstdir)/data
    
    gunzip @(dstdir)/data/*
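
As a cross-check on the loops above, the following Python sketch lists the file names the recipe would be expected to leave in the data directory after decompression. It assumes a single `1gram.txt` plus one `<n>gram-<letter>.txt` file per order 2 to 5 and per letter a to z, matching the `3gram-e.txt` name used in the Format example. The function name `expected_files` is hypothetical.

```python
import string


def expected_files():
    """File names assumed to result from the concat and gunzip steps."""
    names = ["1gram.txt"]
    for n in range(2, 6):                 # 2-gram through 5-gram
        for c in string.ascii_lowercase:  # one file per letter a-z
            names.append(f"{n}gram-{c}.txt")
    return names


names = expected_files()
print(len(names))  # 105
```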
    

Notice

Nextword-data is based on Google Books Ngram Viewer English Version 20120701, which is distributed under the Creative Commons Attribution 3.0 Unported License. See NOTICE.txt.

License

Nextword-data is distributed under the Creative Commons Attribution 4.0 International License. See LICENSE.txt.
