All Projects → soskek → Bookcorpus

soskek / Bookcorpus

Licence: mit
Crawl BookCorpus

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Bookcorpus

bots-zoo
No description or website provided.
Stars: ✭ 59 (-86.68%)
Mutual labels:  crawler, scraper
Weibo terminator workflow
Update Version of weibo_terminator, This is Workflow Version aim at Get Job Done!
Stars: ✭ 259 (-41.53%)
Mutual labels:  crawler, scraper
weibo-scraper
Simple Weibo Scraper
Stars: ✭ 50 (-88.71%)
Mutual labels:  crawler, scraper
papercut
Papercut is a scraping/crawling library for Node.js built on top of JSDOM. It provides basic selector features together with features like Page Caching and Geosearch.
Stars: ✭ 15 (-96.61%)
Mutual labels:  crawler, scraper
Xcrawler
快速、简洁且强大的PHP爬虫框架
Stars: ✭ 344 (-22.35%)
Mutual labels:  crawler, scraper
arachnod
High performance crawler for Nodejs
Stars: ✭ 17 (-96.16%)
Mutual labels:  crawler, scraper
lightnovel epub
🍭 epub generator for (light)novels (轻) 小说 epub 生成器,支持站点:轻之国度、轻小说文库
Stars: ✭ 89 (-79.91%)
Mutual labels:  crawler, scraper
Ruiji.net
crawler framework, distributed crawler extractor
Stars: ✭ 220 (-50.34%)
Mutual labels:  crawler, scraper
Autoscraper
A Smart, Automatic, Fast and Lightweight Web Scraper for Python
Stars: ✭ 4,077 (+820.32%)
Mutual labels:  crawler, scraper
Hquery.php
An extremely fast web scraper that parses megabytes of invalid HTML in a blink of an eye. PHP5.3+, no dependencies.
Stars: ✭ 295 (-33.41%)
Mutual labels:  crawler, scraper
Polite
Be nice on the web
Stars: ✭ 253 (-42.89%)
Mutual labels:  crawler, scraper
Gosint
OSINT Swiss Army Knife
Stars: ✭ 401 (-9.48%)
Mutual labels:  crawler, scraper
Skrape.it
A Kotlin-based testing/scraping/parsing library providing the ability to analyze and extract data from HTML (server & client-side rendered). It places particular emphasis on ease of use and a high level of readability by providing an intuitive DSL. It aims to be a testing lib, but can also be used to scrape websites in a convenient fashion.
Stars: ✭ 231 (-47.86%)
Mutual labels:  crawler, scraper
dijnet-bot
Az összes számlád még egy helyen :)
Stars: ✭ 17 (-96.16%)
Mutual labels:  crawler, scraper
Annie
👾 Fast and simple video download library and CLI tool written in Go
Stars: ✭ 16,369 (+3595.03%)
Mutual labels:  crawler, scraper
MyCrawler
我的爬虫合集
Stars: ✭ 55 (-87.58%)
Mutual labels:  crawler, scraper
Media Scraper
Scrapes all photos and videos in a web page / Instagram / Twitter / Tumblr / Reddit / pixiv / TikTok
Stars: ✭ 206 (-53.5%)
Mutual labels:  crawler, scraper
Goose Parser
Universal scrapping tool, which allows you to extract data using multiple environments
Stars: ✭ 211 (-52.37%)
Mutual labels:  crawler, scraper
Rcrawler
An R web crawler and scraper
Stars: ✭ 274 (-38.15%)
Mutual labels:  crawler, scraper
Freshonions Torscraper
Fresh Onions is an open source TOR spider / hidden service onion crawler hosted at zlal32teyptf4tvi.onion
Stars: ✭ 348 (-21.44%)
Mutual labels:  crawler, scraper

Homemade BookCorpus

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

Clawling could be difficult due to some issues of the website. Also, please consider another option such as using publicly available files at your own risk.

For example,

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@


These are scripts to reproduce BookCorpus by yourself.

BookCorpus is a popular large-scale text corpus, espetially for unsupervised learning of sentence encoders/decoders. However, BookCorpus is no longer distributed...

This repository includes a crawler collecting data from smashwords.com, which is the original source of BookCorpus. Collected sentences may partially differ but the number of them will be larger or almost the same. If you use the new corpus in your work, please specify that it is a replica.

How to use

Prepare URLs of available books. However, this repository already has a list as url_list.jsonl which was a snapshot I (@soskek) collected on Jan 19-20, 2019. You can use it if you'd like.

python -u download_list.py > url_list.jsonl &

Download their files. Downloading is performed for txt files if possible. Otherwise, this tries to extract text from epub. The additional argument --trash-bad-count filters out epub files whose word count is largely different from its official stat (because it may imply some failure).

python download_files.py --list url_list.jsonl --out out_txts --trash-bad-count

The results are saved into the directory of --out (here, out_txts).

Postprocessing

Make concatenated text with sentence-per-line format.

python make_sentlines.py out_txts > all.txt

If you want to tokenize them into segmented words by Microsoft's BlingFire, run the below. You can use another choices for this by yourself.

python make_sentlines.py out_txts | python tokenize_sentlines.py > all.tokenized.txt

Disclaimer

For example, you can refer to terms of smashwords.com. Please use the code responsibly and adhere to respective copyright and related laws. I am not responsible for any plagiarism or legal implication that rises as a result of this repository.

Requirement

  • python3 is recommended
  • beautifulsoup4
  • progressbar2
  • blingfire
  • html2text
  • lxml
pip install -r requirements.txt

Note on Errors

  • It is expected some error messages are shown, e.g., Failed: epub and txt, File is not a zip file or Failed to open. But, the number of failures will be much less than one of successes. Don't worry.

Acknowledgement

epub2txt.py is derived and modified from https://github.com/kevinxiong/epub2txt/blob/master/epub2txt.py

Citation

If you found this code useful, please cite it with the URL.

@misc{soskkobayashi2018bookcorpus,
    author = {Sosuke Kobayashi},
    title = {Homemade BookCorpus},
    howpublished = {\url{https://github.com/BIGBALLON/cifar-10-cnn}},
    year = {2018}
}

Also, the original papers which made the original BookCorpus are as follows:

Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, Sanja Fidler. "Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books." arXiv preprint arXiv:1506.06724, ICCV 2015.

@InProceedings{Zhu_2015_ICCV,
    title = {Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books},
    author = {Zhu, Yukun and Kiros, Ryan and Zemel, Rich and Salakhutdinov, Ruslan and Urtasun, Raquel and Torralba, Antonio and Fidler, Sanja},
    booktitle = {The IEEE International Conference on Computer Vision (ICCV)},
    month = {December},
    year = {2015}
}
@inproceedings{moviebook,
    title = {Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books},
    author = {Yukun Zhu and Ryan Kiros and Richard Zemel and Ruslan Salakhutdinov and Raquel Urtasun and Antonio Torralba and Sanja Fidler},
    booktitle = {arXiv preprint arXiv:1506.06724},
    year = {2015}
}

Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. "Skip-Thought Vectors." arXiv preprint arXiv:1506.06726, NIPS 2015.

@article{kiros2015skip,
    title={Skip-Thought Vectors},
    author={Kiros, Ryan and Zhu, Yukun and Salakhutdinov, Ruslan and Zemel, Richard S and Torralba, Antonio and Urtasun, Raquel and Fidler, Sanja},
    journal={arXiv preprint arXiv:1506.06726},
    year={2015}
}
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].