Pure Go Full Text Search of PDF Files

This library implements full text search for PDFs.

The public APIs are in index_search.go.

There are some command-line programs that demonstrate the library's functionality.

examples/pdf_search_demo.go demonstrates the main APIs.
examples/index.go builds an index over a set of PDFs.
examples/search.go searches the index built by examples/index.go.

Binary versions (executables) of these three programs are available in releases. There are 64-bit binaries for Windows, Mac and Linux. The binaries do not require a UniDoc license.

Installation

git clone https://github.com/PaperCutSoftware/pdfsearch

Replace uniDocLicenseKey and companyName in unidoc_glue.go with valid UniDoc license fields.

cd pdfsearch/examples
go build pdf_search_demo.go
go build index.go
go build search.go

examples/pdf_search_demo.go

Usage: ./pdf_search_demo -f <PDF path> <search term>

Example: ./pdf_search_demo -f PDF32000_2008.pdf cubic Bézier curve

The example will search PDF32000_2008.pdf for cubic Bézier curve.

pdf_search_demo.go shows how to use the APIs in index_search.go to

create indexes over PDFs,
search those indexes using full-text search, and
mark up PDFs with the locations of the search matches on pages.
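
To illustrate the idea behind the first two steps, here is a minimal, stdlib-only sketch of a per-page inverted index. It is not the API in index_search.go, which hands search off to bleve. The location and index types and the tokenize, add and search functions are all hypothetical names made up for this sketch, and the page text is hard-coded rather than extracted from a PDF.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode"
)

// location identifies one page of one document (hypothetical type).
type location struct {
	Doc  string
	Page int
}

// index maps a lower-cased term to the set of pages that contain it.
type index map[string]map[location]bool

// tokenize splits text into lower-cased runs of letters and digits.
func tokenize(text string) []string {
	return strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// add records every term of a page's text under that page's location.
func (idx index) add(loc location, text string) {
	for _, term := range tokenize(text) {
		if idx[term] == nil {
			idx[term] = map[location]bool{}
		}
		idx[term][loc] = true
	}
}

// search returns the pages that contain every term of query,
// sorted by document name and then by page number.
func (idx index) search(query string) []location {
	terms := tokenize(query)
	if len(terms) == 0 {
		return nil
	}
	var hits []location
	for loc := range idx[terms[0]] {
		all := true
		for _, t := range terms[1:] {
			if !idx[t][loc] { // lookup in a missing (nil) set yields false
				all = false
				break
			}
		}
		if all {
			hits = append(hits, loc)
		}
	}
	sort.Slice(hits, func(i, j int) bool {
		if hits[i].Doc != hits[j].Doc {
			return hits[i].Doc < hits[j].Doc
		}
		return hits[i].Page < hits[j].Page
	})
	return hits
}

func main() {
	idx := index{}
	// Hard-coded page text stands in for text extracted from PDFs.
	idx.add(location{"a.pdf", 1}, "A cubic Bézier curve has four control points.")
	idx.add(location{"a.pdf", 2}, "Line segments and rectangles.")
	idx.add(location{"b.pdf", 7}, "Quadratic curve fitting.")
	fmt.Println(idx.search("cubic Bézier curve"))
}
```

The sketch only answers which pages contain all the query terms. A real full-text engine such as bleve adds ranking, analysis and on-disk storage, and marking up a PDF also requires the position of each match on the page, which this toy index does not track.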

examples/index.go

Usage: ./index <file pattern>

Example: ./index ~/climate/**/*.pdf

The example creates an on-disk index over the PDFs in ~/climate/ and its subdirectories.
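
Whether `**` matches subdirectories can depend on the shell: bash only expands it recursively when the globstar option is enabled, and Go's standard filepath.Glob does not support `**` at all. As a hedged sketch of one way to gather the same set of files without relying on the shell, the program below walks a directory tree with filepath.WalkDir. It is not taken from examples/index.go; isPDF and findPDFs are hypothetical helper names.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

// isPDF reports whether path has a .pdf extension, ignoring case.
func isPDF(path string) bool {
	return strings.EqualFold(filepath.Ext(path), ".pdf")
}

// findPDFs walks root recursively and returns the paths of all
// non-directory entries under it that have a .pdf extension.
func findPDFs(root string) ([]string, error) {
	var paths []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err // stop at the first entry that cannot be read
		}
		if !d.IsDir() && isPDF(path) {
			paths = append(paths, path)
		}
		return nil
	})
	return paths, err
}

func main() {
	root := "." // default to the current directory
	if len(os.Args) > 1 {
		root = os.Args[1]
	}
	paths, err := findPDFs(root)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, p := range paths {
		fmt.Println(p)
	}
}
```

The resulting list could then be passed to ./index as explicit file arguments, assuming it accepts more than one path, as it must when the shell expands the pattern itself.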

examples/search.go

Usage: ./search <search term>

Example: ./search integrated assessment model

The example searches the on-disk index created by examples/index.go for integrated assessment model.

Libraries

index_search.go uses UniDoc for PDF parsing and bleve for search.

Talks about this library

GopherCon AU 2019

