catseye / Guten-gutter

Licence: Unlicense
Strips boilerplate from Project Gutenberg text files

Programming languages: Python, shell

*** NOTE: Discontinued. *** The author now considers this to be a poor method. A superior method is to extract the text from the HTML files instead. See the Anne of Green Garbles notes for more information.


Guten-gutter

Guten-gutter is a command-line filter for stripping the boilerplate from Project Gutenberg text files. I was using gutenizer for this purpose, but it has some shortcomings, and it failed to properly strip several Project Gutenberg texts, so I wrote this as a more robust replacement. Like the Project Gutenberg texts themselves, it is in the public domain.
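To make the idea concrete, here is a minimal sketch of sentinel-based stripping in Python. It is not Guten-gutter's actual code: the helper name `strip_boilerplate` and the exact marker patterns are assumptions for illustration, and real Project Gutenberg files vary more than this handles.

```python
import re

# Assumed sentinel shapes; real files vary, and this is not Guten-gutter's own logic.
START_RE = re.compile(
    r"^\*\*\*\s*START OF (THE|THIS) PROJECT GUTENBERG EBOOK", re.IGNORECASE
)
END_RE = re.compile(
    r"^\*\*\*\s*END OF (THE|THIS) PROJECT GUTENBERG EBOOK", re.IGNORECASE
)


def strip_boilerplate(text):
    """Return the text between the start and end sentinels (hypothetical helper).

    If a sentinel is missing, that side of the text is left untrimmed.
    """
    lines = text.splitlines()
    start = 0
    end = len(lines)
    for i, line in enumerate(lines):
        if START_RE.match(line):
            # The body begins on the line after the start sentinel.
            start = i + 1
        elif END_RE.match(line):
            # The body ends on the line before the end sentinel.
            end = i
            break
    return "\n".join(lines[start:end]).strip("\n")
```

A filter built only on sentinels like this one still leaves things such as "Produced by" credits and illustration markers in the body, which is presumably why the tool is organised around several processors.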

Usage

If you want to get just the book's text out of a Project Gutenberg text file:

    script/guten-gutter pg10662.txt > The_Night_Land.txt

If you want to do that to an entire collection of Project Gutenberg files:

    mkdir cleaned
    script/guten-gutter ../gutenberg/*.txt --output-dir=cleaned

To use Guten-gutter from any working directory, add the script directory in this repository to your PATH. For example, you might add this line to your .bashrc:

    export PATH=/path/to/this/repo/script:$PATH

An easy way to accomplish this is to dock Guten-gutter using shelf:

    shelf_dockgh catseye/Guten-gutter

Tests

A small test script, test.sh, is included with this distribution.

TODO

Rewrite ProducedByProcessor as a StartSentinelProcessor (or otherwise have it ignore the end sentinel)

Make IllustrationProcessor handle multiple lines
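For the second item, one possible approach is sketched below. It assumes illustration markers open with `[Illustration` at the start of a line and close with `]` at the end of a line; the function name `drop_illustrations` is hypothetical, and how IllustrationProcessor actually works is not shown here.

```python
def drop_illustrations(lines):
    """Remove '[Illustration ...]' blocks, including ones spanning several lines.

    Hypothetical helper: assumes a block opens with '[Illustration' at the
    start of a line and closes with ']' at the end of a line.
    """
    kept = []
    inside = False
    for line in lines:
        stripped = line.strip()
        if not inside and stripped.startswith("[Illustration"):
            # The block opens here; it stays open unless this line also closes it.
            inside = not stripped.endswith("]")
            continue
        if inside:
            # Skip caption lines until one ends with the closing bracket.
            if stripped.endswith("]"):
                inside = False
            continue
        kept.append(line)
    return kept
```

This sketch would swallow the rest of the text if a block is never closed, and it would stop early on a caption line that happens to end with a bracketed aside, so a real implementation would likely want a limit on block length.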
