PaperCutSoftware / pdfsearch

Licence: other
A full text search library for PDFs.

Pure Go Full Text Search of PDF Files

This library implements full text search for PDFs.

There are three command-line programs that demonstrate the library's functionality.

Binary versions (executables) of these three programs are available in releases. There are 64-bit binaries for Windows, Mac and Linux. The binaries do not require a UniDoc license.

Installation

git clone https://github.com/PaperCutSoftware/pdfsearch

Before building, replace the uniDocLicenseKey and companyName values in unidoc_glue.go with your own valid UniDoc license details.

cd pdfsearch/examples
go build pdf_search_demo.go
go build index.go
go build search.go

examples/pdf_search_demo.go

Usage: ./pdf_search_demo -f <PDF path> <search term>

Example: ./pdf_search_demo -f PDF32000_2008.pdf cubic Bézier curve

The example will search PDF32000_2008.pdf for cubic Bézier curve.

pdf_search_demo.go shows how to use the APIs in index_search.go to

  • create indexes over PDFs,
  • search those indexes using full-text search, and
  • mark up PDFs with the locations of the search matches on pages.
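The first two steps in the list above (index, then search) can be illustrated with a toy inverted index. This is only a sketch built on the Go standard library: it is not the API in index_search.go, which relies on UniDoc and bleve. The pageRef and index types and the tokenize, add and search functions are hypothetical names made up for this illustration, and the page text is hard-coded instead of being extracted from a PDF. Marking up matches on the page is not covered.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode"
)

// pageRef identifies one page of one document (hypothetical type for this sketch).
type pageRef struct {
	Path string
	Page int
}

// index maps a lower-cased term to the set of pages that contain it.
type index map[string]map[pageRef]bool

// tokenize splits text into lower-cased runs of letters and digits.
func tokenize(text string) []string {
	return strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// add records every term of the page text under the given page.
func (ix index) add(ref pageRef, text string) {
	for _, term := range tokenize(text) {
		if ix[term] == nil {
			ix[term] = map[pageRef]bool{}
		}
		ix[term][ref] = true
	}
}

// search returns the pages that contain every term of the query,
// sorted by path and then by page number. It does no ranking.
func (ix index) search(query string) []pageRef {
	terms := tokenize(query)
	if len(terms) == 0 {
		return nil
	}
	var hits []pageRef
	for ref := range ix[terms[0]] {
		all := true
		for _, t := range terms[1:] {
			if !ix[t][ref] {
				all = false
				break
			}
		}
		if all {
			hits = append(hits, ref)
		}
	}
	sort.Slice(hits, func(i, j int) bool {
		if hits[i].Path != hits[j].Path {
			return hits[i].Path < hits[j].Path
		}
		return hits[i].Page < hits[j].Page
	})
	return hits
}

func main() {
	ix := index{}
	// In a real program the page text would come from a PDF parser.
	ix.add(pageRef{"spec.pdf", 1}, "A cubic Bézier curve is defined by four points.")
	ix.add(pageRef{"spec.pdf", 2}, "Line segments and rectangles.")
	ix.add(pageRef{"spec.pdf", 3}, "Appending a Bézier curve to the path: cubic form.")
	for _, hit := range ix.search("cubic Bézier curve") {
		fmt.Printf("%s page %d\n", hit.Path, hit.Page)
	}
}
```

A full-text engine such as bleve adds relevance ranking, analyzers and persistent storage on top of this basic idea.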

examples/index.go

Usage: ./index <file pattern>

Example: ./index ~/climate/**/*.pdf

The example creates an on-disk index over the PDFs in ~/climate/ and its subdirectories.

examples/search.go

Usage: ./search <search term>

Example: ./search integrated assessment model

The example searches the on-disk index created by examples/index.go for integrated assessment model.

Libraries

index_search.go uses UniDoc for PDF parsing and bleve for search.

Talks about this library

GopherCon AU 2019
