sajari / Docconv

Licence: MIT
Converts PDF, DOC, DOCX, XML, HTML, RTF, etc. to plain text

docconv

A Go wrapper library to convert PDF, DOC, DOCX, XML, HTML, RTF, ODT, Pages documents and images (see optional dependencies below) to plain text.

Note for returning users: the Go import path for this package has been moved to code.sajari.com/docconv.

Installation

If you haven't set up Go before, you first need to install Go.

To fetch and build the code:

$ go get code.sajari.com/docconv/...

This will also build the command line tool docd into $GOPATH/bin. Make sure that $GOPATH/bin is in your PATH environment variable.

Dependencies

tidy, wv, poppler-utils, unrtf, https://github.com/JalfResi/justext

Example install of dependencies (apt-based systems such as Debian or Ubuntu; package names may differ elsewhere):

$ sudo apt-get install poppler-utils wv unrtf tidy
$ go get github.com/JalfResi/justext

Optional dependencies

To add image support to the docconv library you first need to install and build gosseract.

Now you can add -tags ocr to any go command when building/fetching/testing docconv to include support for processing images:

$ go get -tags ocr code.sajari.com/docconv/...

The build may fail on macOS, which you can fix by installing tesseract via brew:

$ brew install tesseract

docd tool

The docd tool can be run in three ways:

  1. as a service on port 8888 (by default)

    Documents can be sent as a multipart POST request and the plain text (body) and meta information are then returned as a JSON object.

  2. as a service exposed from within a Docker container

    This also runs as a service, but from within a Docker container. There are three build scripts, one each for the debian, alpine, and appengine versions.

    The debian version uses the Debian package repository, so dependency versions can vary between builds. The alpine version uses a very cut-down Linux distribution to produce a container of about 40 MB. It also locks the dependency versions for consistency, but may miss out on future updates. The appengine version is a flex-based custom runtime for Google Cloud.

  3. from the command line

    Documents can be sent as an argument, e.g.

    $ docd -input document.pdf
    

Optional flags

  • addr - the bind address for the HTTP server, default is ":8888"
  • log-level
    • 0: errors & critical info
    • 1: includes 0 and logs each request as well
    • 2: includes 1 and logs the response payloads
  • readability-length-low - sets the readability length low if the ?readability=1 parameter is set
  • readability-length-high - sets the readability length high if the ?readability=1 parameter is set
  • readability-stopwords-low - sets the readability stopwords low if the ?readability=1 parameter is set
  • readability-stopwords-high - sets the readability stopwords high if the ?readability=1 parameter is set
  • readability-max-link-density - sets the readability max link density if the ?readability=1 parameter is set
  • readability-max-heading-distance - sets the readability max heading distance if the ?readability=1 parameter is set
  • readability-use-classes - comma separated list of readability classes to use if the ?readability=1 parameter is set
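
The readability flags above only take effect when a request carries the ?readability=1 parameter. As a small sketch of how a caller might build such a request URL in Go: the helper name convertURL and the "/convert" path are assumptions made for illustration, while the addresses and the readability=1 parameter come from the text above.

```go
package main

import (
	"fmt"
	"net/url"
)

// convertURL builds a request URL for a docd instance listening on addr.
// The "/convert" path is an assumption; readability=1 is the query
// parameter that enables the readability-* settings.
func convertURL(addr string, readability bool) string {
	u := url.URL{Scheme: "http", Host: addr, Path: "/convert"}
	if readability {
		q := url.Values{}
		q.Set("readability", "1")
		u.RawQuery = q.Encode()
	}
	return u.String()
}

func main() {
	fmt.Println(convertURL("localhost:8888", false)) // http://localhost:8888/convert
	fmt.Println(convertURL("localhost:8000", true))  // http://localhost:8000/convert?readability=1
}
```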

How to start the service

$ # This will only log errors and critical info
$ docd -log-level 0

$ # This will run on port 8000 and log each request
$ docd -addr :8000 -log-level 1

Example usage (code)

Some basic code is shown below, but normally you would accept the file by HTTP or open it from the file system.

This should be enough to get you started though.

Use case 1: run locally

Note: this assumes you have the dependencies installed.

package main

import (
	"fmt"
	"log"

	"code.sajari.com/docconv"
)

func main() {
	// Convert the file at the given path to plain text.
	res, err := docconv.ConvertPath("your-file.pdf")
	if err != nil {
		log.Fatal(err)
	}
	// Print the conversion result.
	fmt.Println(res)
}

Use case 2: request over the network

package main

import (
	"fmt"
	"log"

	"code.sajari.com/docconv/client"
)

func main() {
	// Create a new client, using the default endpoint (localhost:8888)
	c := client.New()

	res, err := client.ConvertPath(c, "your-file.pdf")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(res)
}