high-moctane / Nextword Data
Licence: other
Dataset for nextword.
Nextword-data
A dataset for nextword.
Install
- (Recommended) Star this repository (`・ω・´)★
- Visit the releases page.
- Download the zip or tar.gz archive. You can choose the larger or the smaller one.

  |       | Zip size | Total size |
  |-------|----------|------------|
  | Small | 152.2 MB | 493.1 MB   |
  | Large | 483.3 MB | 1.63 GB    |

- Decompress the downloaded data.
- Set the $NEXTWORD_DATA_PATH environment variable.
  Example:
  export NEXTWORD_DATA_PATH=/path/to/nextword-data
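To check the install, it can help to confirm that the variable is set and points at a real directory. The following is a minimal sketch, not part of nextword itself: the helper name `resolve_data_path` is hypothetical, and the assumption that the decompressed files end in `.txt` is taken from the file name mentioned in the Example section.

```python
import os
from pathlib import Path


def resolve_data_path(environ=os.environ):
    """Return the directory named by NEXTWORD_DATA_PATH, or raise if unusable.

    Hypothetical helper for illustration; nextword does not ship this function.
    """
    raw = environ.get("NEXTWORD_DATA_PATH")
    if not raw:
        raise RuntimeError("NEXTWORD_DATA_PATH is not set")
    path = Path(raw).expanduser()
    if not path.is_dir():
        raise RuntimeError(f"NEXTWORD_DATA_PATH is not a directory: {path}")
    return path


if __name__ == "__main__":
    try:
        data_dir = resolve_data_path()
    except RuntimeError as err:
        print(f"problem: {err}")
    else:
        # Assumption: the decompressed data files carry a .txt extension.
        count = len(list(data_dir.glob("*.txt")))
        print(f"{data_dir}: {count} .txt files")
```

Run it in the same shell where you exported the variable; if it reports a problem, recheck the path you passed to `export`.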
Uninstall
- Remove the $NEXTWORD_DATA_PATH environment variable.
- Remove the nextword-data directory.
Format
Each line has the form: (n-1)gram, a tab, the candidates, and a newline.
Candidates are sorted by how often they appear, most frequent first.
Example
You can find the line
empty milk bottles carton bottle cartons cans
at line 59349 in file 3gram-e.txt.
In the file, a tab separates "empty milk" from the candidates.
This line says that "bottles" is the most likely word after "empty milk",
and "carton" is the next most likely.
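A line in this format can be read with a few lines of Python. This is a minimal sketch, assuming the candidates are separated by single spaces as in the example line; `parse_line` and `next_words` are hypothetical names for illustration, not part of nextword.

```python
def parse_line(line):
    """Split one data line into (context words, candidate words)."""
    # The context and the candidates are separated by the first tab.
    context, _, candidates = line.rstrip("\n").partition("\t")
    return tuple(context.split()), candidates.split()


def next_words(lines, context, limit=3):
    """Return up to `limit` candidates for `context` from an iterable of lines."""
    wanted = tuple(context.split())
    for line in lines:
        key, candidates = parse_line(line)
        if key == wanted:
            # Candidates are already ordered, so a slice keeps the best ones.
            return candidates[:limit]
    return []


# The example line from 3gram-e.txt, with the tab written explicitly.
sample = ["empty milk\tbottles carton bottle cartons cans\n"]
print(next_words(sample, "empty milk", limit=2))  # ['bottles', 'carton']
```

Passing an open file object as `lines` works the same way, although a linear scan like this is slow on the full data set.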
Recipe
- Fetch data.
  $ mkdir fetch
  $ nwgen-fetch fetch
- Run xonsh script.
  dstdir = "dstdir"
  mkdir -p @(dstdir)/format
  mkdir -p @(dstdir)/concat
  ls fetch | grep 1gram | xargs -P 8 -I fname nwgen-format -max-candidates 10000 -min-count 10000 @(dstdir)/format/fname fetch/fname
  ls fetch | grep 2gram | xargs -P 8 -I fname nwgen-format -max-candidates 10000 -min-count 2000 @(dstdir)/format/fname fetch/fname
  ls fetch | grep 3gram | xargs -P 8 -I fname nwgen-format -max-candidates 10000 -min-count 400 @(dstdir)/format/fname fetch/fname
  ls fetch | grep 4gram | xargs -P 8 -I fname nwgen-format -max-candidates 10000 -min-count 300 @(dstdir)/format/fname fetch/fname
  ls fetch | grep 5gram | xargs -P 8 -I fname nwgen-format -max-candidates 10000 -min-count 200 @(dstdir)/format/fname fetch/fname
  nwgen-concat @(dstdir)/concat/1gram.txt.gz @(dstdir)/format/1gram*
  for n in [2,3,4,5]:
      for c in [chr(i) for i in range(97, 97+26)]:
          nwgen-concat @(dstdir)/concat/@(n)gram-@(c).txt.gz @(dstdir)/format/@(n)gram-@(c)*
  cp -R @(dstdir)/concat @(dstdir)/data
  gunzip @(dstdir)/data/*
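After the recipe finishes, the data directory should hold one 1-gram file plus one file per n and per lowercase letter. The sketch below lists those names so you can compare them with what is on disk. It assumes the names follow the pattern of 3gram-e.txt from the Example section; `expected_data_files` is a hypothetical helper, not one of the nwgen tools.

```python
import string


def expected_data_files():
    """List the file names the concat and gunzip steps are assumed to leave behind."""
    names = ["1gram.txt"]  # 1-grams are concatenated into a single file
    for n in range(2, 6):  # 2-grams through 5-grams
        for c in string.ascii_lowercase:  # same letters as chr(97)..chr(122)
            names.append(f"{n}gram-{c}.txt")
    return names


files = expected_data_files()
print(len(files))  # 105
print(files[:3])   # ['1gram.txt', '2gram-a.txt', '2gram-b.txt']
```

Comparing this list against the contents of the data directory is a quick way to spot a shard that failed to build.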
Notice
Nextword-data is based on Google Books Ngram Viewer English Version 20120701,
which is distributed under the Creative Commons Attribution 3.0 Unported license.
See NOTICE.txt.
License
Nextword-data is distributed under the Creative Commons Attribution 4.0 International license.
See LICENSE.txt.