dbashford / Textract

Licence: MIT

node.js module for extracting text from html, pdf, doc, docx, xls, xlsx, csv, pptx, png, jpg, gif, rtf and more!

Labels

html nodejs extraction

Projects that are alternatives of or similar to Textract

SevenZipSharp

Fork of SevenZipSharp on CodePlex

Stars: ✭ 171 (-87.47%)

Mutual labels: extraction

Uritemplate

PHP URI Template (RFC 6570) supports both URI expansion & extraction

Stars: ✭ 310 (-77.29%)

Mutual labels: extraction

Tika Python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

Stars: ✭ 997 (-26.96%)

Mutual labels: extraction

Stanford-NER-Python

Stanford Named Entity Recognizer (NER) - Python Wrapper

Stars: ✭ 63 (-95.38%)

Mutual labels: extraction

coq-simple-io

IO for Gallina

Stars: ✭ 21 (-98.46%)

Mutual labels: extraction

Stanford Openie Python

Stanford Open Information Extraction made simple!

Stars: ✭ 348 (-74.51%)

Mutual labels: extraction

ti recover

Appcelerator Titanium APK source code recovery tool

Stars: ✭ 17 (-98.75%)

Mutual labels: extraction

Email Extractor

The main functionality is to extract all the emails from one or several URLs

Stars: ✭ 81 (-94.07%)

Mutual labels: extraction

tabula-sharp

Extract tables from PDF files (port of tabula-java)

Stars: ✭ 38 (-97.22%)

Mutual labels: extraction

Puree

Metadata extraction from the Pure Research Information System.

Stars: ✭ 8 (-99.41%)

Mutual labels: extraction

Table-Detection-Extraction

Detect the tables in a form and extract the tables as well as the cells of the tables.

Stars: ✭ 35 (-97.44%)

Mutual labels: extraction

AutoIt-Ripper

Extract AutoIt scripts embedded in PE binaries

Stars: ✭ 101 (-92.6%)

Mutual labels: extraction

Garbro

Visual Novels resource browser

Stars: ✭ 764 (-44.03%)

Mutual labels: extraction

H2PC TagExtraction

An application made to extract assets from cache files of H2v using BlamLib by KornnerStudios.

Stars: ✭ 12 (-99.12%)

Mutual labels: extraction

Locky

Stars: ✭ 61 (-95.53%)

Mutual labels: extraction

RDMP

Research Data Management Platform (RDMP) is an open source application for the loading, linking, anonymisation and extraction of datasets stored in relational databases.

Stars: ✭ 20 (-98.53%)

Mutual labels: extraction

Unrpa

A program to extract files from the RPA archive format.

Stars: ✭ 313 (-77.07%)

Mutual labels: extraction

Florentino

Fast Static File Analysis Framework

Stars: ✭ 92 (-93.26%)

Mutual labels: extraction

Stegextract

Detect hidden files and text in images

Stars: ✭ 79 (-94.21%)

Mutual labels: extraction

Ppe

Probabilistic plane extraction

Stars: ✭ 16 (-98.83%)

Mutual labels: extraction

textract

A text extraction node module.

Currently Extracts...

HTML, HTM
ATOM, RSS
Markdown
EPUB
XML, XSL
PDF
DOC, DOCX
ODT, OTT (experimental, feedback needed!)
RTF
XLS, XLSX, XLSB, XLSM, XLTX
CSV
ODS, OTS
PPTX, POTX
ODP, OTP
ODG, OTG
PNG, JPG, GIF
DXF
application/javascript
All text/* mime-types.

In almost all cases above, what textract cares about is the mime type. So .html and .htm, both possessing the same mime type, will be extracted. Other extensions that share mime types with those above should also extract successfully. For example, application/vnd.ms-excel is the mime type for .xls, but also for 5 other file types.

Does textract not extract from files of the type you need? Add an issue or submit a pull request. In many cases textract is already capable; it is just not paying attention to the mime type you are interested in.

Install

npm install textract

Extraction Requirements

Note: if any of the requirements below are missing, textract will still run and extract every file type it is capable of handling. Missing a requirement does not prevent you from using textract; it only prevents extraction of the file types that depend on it.

PDF extraction requires pdftotext be installed, link
DOC extraction requires antiword be installed, link, unless on OSX in which case textutil (installed by default) is used.
RTF extraction requires unrtf be installed, link, unless on OSX in which case textutil (installed by default) is used.
PNG, JPG and GIF require tesseract to be available, link. Images need to be pretty clear, high DPI and made almost entirely of just text for tesseract to be able to accurately extract the text.
DXF extraction requires drawingtotext be available, link

Configuration

Configuration can be passed into textract. The following configuration options are available

preserveLineBreaks: When using the command line this is set to true to preserve stdout readability. When using the library via node this is set to false. Pass this in as true and textract will not strip any line breaks.
preserveOnlyMultipleLineBreaks: Some extractors, like PDF, insert line breaks at the end of every line, even in the middle of a sentence. If this option (default false) is set to true, then any instances of a single line break are removed but multiple line breaks are preserved. Check your output with this option, though: it does not preserve paragraphs unless they are separated by multiple breaks.
exec: Some extractors (dxf) use node's exec functionality. This setting allows for providing config to exec execution. One reason you might want to provide this config is if you are dealing with very large files. You might want to increase the exec maxBuffer setting.
[ext].exec: Each extractor can take specific exec config. Keep in mind that many extractors are responsible for extracting multiple types, so, for instance, the odt extractor is what you would configure for odt, odg and related types. Check the extractors to see which one you want to configure. At the bottom of each is a list of types for which the extractor is responsible.
tesseract.lang: A pass-through to tesseract allowing for setting of language for extraction. ex: { tesseract: { lang:"chi_sim" } }
tesseract.cmd: tesseract.lang allows a quick means to provide the most popular tesseract option, but if you need to configure more options, you can simply pass cmd. cmd is the string that matches the command-line options you want to pass to tesseract. For instance, to provide language and psm, you would pass { tesseract: { cmd:"-l chi_sim -psm 10" } }
pdftotextOptions: This is a proxy options object to the library textract uses for pdf extraction: pdf-text-extract. Options include ownerPassword, userPassword if you are extracting text from password protected PDFs. IMPORTANT: textract modifies the pdf-text-extract layout default so that, instead of layout: layout, it uses layout:raw. It is not suggested you modify this without understanding what trouble that might get you in. See this GH issue for why textract overrides that library's default.
typeOverride: Used with fromUrl, if set, rather than using the content-type from the URL request, will use the provided typeOverride.
includeAltText: When extracting HTML, whether or not to include alt text with the extracted text. By default this is false.
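
The effect described for preserveOnlyMultipleLineBreaks can be illustrated with a short sketch. This is not textract's own implementation: the function name is hypothetical, and replacing a removed break with a space (rather than nothing) is an assumption made so that words do not run together.

```javascript
// Hypothetical illustration, not textract's internal code.
// Replaces a lone line break with a space (assumption) and leaves
// runs of two or more line breaks untouched.
function collapseSingleLineBreaks(text) {
  // A newline that follows a non-newline character and is not
  // itself followed by another newline is a "single" break.
  return text.replace(/([^\n])\n(?!\n)/g, '$1 ');
}

const sample = 'first line\nstill the same sentence\n\nnew paragraph';
console.log(collapseSingleLineBreaks(sample));
```

If paragraphs in your source are separated by only one break, a transformation like this merges them, which is why the option's output is worth checking.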

To use this configuration at the command line, prefix each option with --.

Ex: textract image.png --tesseract.lang=deu
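
When using the library from Node, the same options are passed as a plain object. The sketch below uses only option names documented above; every value is a placeholder, and the withOverrides helper is hypothetical, added here only to show how a shared config might be reused per call.

```javascript
// Illustrative config using only option names documented above.
// All values are placeholders chosen for this sketch.
const config = {
  preserveLineBreaks: true,
  exec: { maxBuffer: 500000 },
  tesseract: { lang: 'deu' },
  pdftotextOptions: { userPassword: 'example-password' },
  includeAltText: true,
};

// Hypothetical helper: layers per-call overrides on top of shared settings.
// Object.assign copies top-level keys only, so nested objects are shared.
function withOverrides(base, overrides) {
  return Object.assign({}, base, overrides);
}

// typeOverride only applies to fromUrl, so it is added per call.
const urlConfig = withOverrides(config, { typeOverride: 'text/html' });
console.log(urlConfig);
```

The resulting object would be passed as the config argument of any of the API methods listed under Usage.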

Usage

Command Line

If textract is installed globally, via npm install -g textract, then the following command will write the extracted text to the console for a file on the file system.

$ textract pathToFile

Flags

Configuration flags can be passed into textract via the command line.

textract pathToFile --preserveLineBreaks false

Parameters like exec.maxBuffer can be passed as you'd expect.

textract pathToFile --exec.maxBuffer 500000

And multiple flags can be used together.

textract pathToFile --preserveLineBreaks false --exec.maxBuffer 500000
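
To make the relationship between dotted flags and the nested config object concrete, here is a rough sketch of how such flags could be turned into a config. It is an illustration only: flagsToConfig and coerce are hypothetical names, and textract's actual command-line parsing may differ in its details.

```javascript
// Hypothetical sketch, not textract's CLI parser.
// Converts "true"/"false" and numeric strings to native values.
function coerce(raw) {
  if (raw === 'true') return true;
  if (raw === 'false') return false;
  if (raw !== '' && !Number.isNaN(Number(raw))) return Number(raw);
  return raw;
}

// Turns ['--exec.maxBuffer', '500000'] or ['--tesseract.lang=deu']
// into nested objects keyed by the dotted path.
function flagsToConfig(argv) {
  const config = {};
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (!arg.startsWith('--')) continue; // positional, e.g. the file path
    let key = arg.slice(2);
    let raw = 'true'; // a bare flag is treated as true (assumption)
    const eq = key.indexOf('=');
    if (eq !== -1) {
      raw = key.slice(eq + 1);
      key = key.slice(0, eq);
    } else if (i + 1 < argv.length && !argv[i + 1].startsWith('--')) {
      raw = argv[++i];
    }
    const parts = key.split('.');
    let target = config;
    for (const part of parts.slice(0, -1)) {
      if (typeof target[part] !== 'object' || target[part] === null) {
        target[part] = {};
      }
      target = target[part];
    }
    target[parts[parts.length - 1]] = coerce(raw);
  }
  return config;
}

console.log(flagsToConfig(['pathToFile', '--preserveLineBreaks', 'false', '--exec.maxBuffer', '500000']));
```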

Node

Import

var textract = require('textract');

APIs

There are several ways to extract text. For all methods, an error object and the extracted text are passed to a callback, in that order.

error will contain informative text about why the extraction failed. If textract does not currently extract files of the type provided, a typeNotFound flag will be set on the error object.
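
A callback can use the typeNotFound flag to tell an unsupported file type apart from other failures. The helper below is a minimal sketch: describeResult is a hypothetical name, and it assumes the error carries a message property, as standard JavaScript Error objects do.

```javascript
// Hypothetical helper: summarises the (error, text) pair that
// textract passes to its callbacks.
function describeResult(error, text) {
  if (error) {
    if (error.typeNotFound) {
      // textract has no extractor registered for this type
      return 'unsupported file type: ' + error.message;
    }
    return 'extraction failed: ' + error.message;
  }
  return 'extracted ' + text.length + ' characters';
}

// Usage, assuming textract is installed:
//   textract.fromFileWithPath(filePath, function (error, text) {
//     console.log(describeResult(error, text));
//   });
```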

File

textract.fromFileWithPath(filePath, function( error, text ) {})

textract.fromFileWithPath(filePath, config, function( error, text ) {})

File + mime type

textract.fromFileWithMimeAndPath(type, filePath, function( error, text ) {})

textract.fromFileWithMimeAndPath(type, filePath, config, function( error, text ) {})

Buffer + mime type

textract.fromBufferWithMime(type, buffer, function( error, text ) {})

textract.fromBufferWithMime(type, buffer, config, function( error, text ) {})

Buffer + file name/path

textract.fromBufferWithName(name, buffer, function( error, text ) {})

textract.fromBufferWithName(name, buffer, config, function( error, text ) {})

URL

When passing a URL, the URL can either be a string, or a node.js URL object. Using the URL object allows fine grained control over the URL being used.

textract.fromUrl(url, function( error, text ) {})

textract.fromUrl(url, config, function( error, text ) {})
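
Because every method above takes an error-first callback as its last argument, the API can be adapted to Promises with Node's util.promisify. The wrapper below is a sketch, not something textract ships: promisifyTextract is a hypothetical helper, and it takes the module as a parameter so it makes no assumption about how textract is loaded.

```javascript
const { promisify } = require('util');

// Hypothetical helper (not part of textract): wraps the callback-style
// methods named in this README so they return Promises instead.
function promisifyTextract(textract) {
  const names = [
    'fromFileWithPath',
    'fromFileWithMimeAndPath',
    'fromBufferWithMime',
    'fromBufferWithName',
    'fromUrl',
  ];
  const wrapped = {};
  for (const name of names) {
    if (typeof textract[name] === 'function') {
      // promisify appends the callback as the final argument, so the
      // optional config argument can still be passed or omitted.
      wrapped[name] = promisify(textract[name].bind(textract));
    }
  }
  return wrapped;
}

// Usage, assuming textract is installed:
//   const textract = promisifyTextract(require('textract'));
//   const text = await textract.fromFileWithPath(filePath, { preserveLineBreaks: true });
```

A rejected Promise carries the same error object the callback would have received, so the typeNotFound flag remains available in a catch handler.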

Testing Notes

Running Tests on a Mac?

sudo port install tesseract-chi-sim
sudo port install tesseract-eng
You will also want to disable textract's usage of textutil as the tests are based on output from antiword.
- Go into /lib/extractors/{doc|doc-osx|rtf} and modify the code under if ( os.platform() === 'darwin' ) {. Uncomment the commented lines in these sections.

Stars: ✭ 1,365