vaites / Php Apache Tika

Licence: mit
Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats

PHP Apache Tika

This tool provides Apache Tika bindings for PHP, letting you extract text and metadata from documents, images and other formats.

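The following modes are supported:

  • App mode: runs the Tika app JAR through the command line
  • Server mode: makes HTTP requests to a running Tika server

Server mode is recommended because it is 5 times faster, but some shared hosts don't allow running processes in the background.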

Although the library contains a list of supported versions, any version of Apache Tika should be compatible as long as the Tika team maintains backward compatibility. You therefore don't need to wait for a library update to work with new versions of Tika.

Features

  • Simple class interface to Apache Tika features:
    • Text and HTML extraction
    • Metadata extraction
    • OCR recognition
  • Standardized metadata for documents
  • Support for local and remote resources
  • No heavyweight library dependencies
  • Compatible with Apache Tika 1.15 or greater
    • Tested up to 1.25
  • Works on Linux, macOS, Windows and probably on FreeBSD

Requirements

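  • PHP 7.2 or greater
  • Apache Tika 1.15 or greater
  • Java, to run Apache Tika
  • Tesseract (optional, only for OCR)

NOTE: the supported PHP version will remain synced with the latest supported by the PHP team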

Installation

Install using Composer:

composer require vaites/php-apache-tika

If you want to use OCR you must install Tesseract:

  • Fedora/CentOS: sudo yum install tesseract (use dnf instead of yum on Fedora 22 or greater)
  • Debian/Ubuntu: sudo apt-get install tesseract-ocr
  • macOS: brew install tesseract (using Homebrew)
  • Windows: scoop install tesseract (using Scoop)

The library assumes the tesseract binary is in the PATH, so you can compile it yourself or install it using any other method.

Usage

Start Apache Tika server with caution:

java -jar tika-server-x.xx.jar

If you are using a JRE instead of a JDK and have Java 9 or greater, you must run:

java --add-modules java.se.ee -jar tika-server-x.xx.jar

Instantiate the class, checking if JAR exists or server is running:

$client = \Vaites\ApacheTika\Client::make('localhost', 9998);           // server mode (default)
$client = \Vaites\ApacheTika\Client::make('/path/to/tika-app.jar');     // app mode 

If you want to use dependency injection, serialize the class or just delay the check:

$client = \Vaites\ApacheTika\Client::prepare('localhost', 9998);
$client = \Vaites\ApacheTika\Client::prepare('/path/to/tika-app.jar'); 

You can use a URL too:

$client = \Vaites\ApacheTika\Client::make('http://localhost:9998');
$client = \Vaites\ApacheTika\Client::prepare('http://localhost:9998');
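As a rough illustration of the dependency injection case, the sketch below passes a client into a small service class instead of creating it inside. DocumentTextExtractor is a hypothetical name, not part of the library, and the constructor is left untyped so that any object exposing getText() can be injected (for example a test double).

```php
<?php
// Hypothetical service class (not part of the library) that receives the
// client through its constructor instead of building it itself.
final class DocumentTextExtractor
{
    /** @var object any object exposing getText(), e.g. \Vaites\ApacheTika\Client */
    private $client;

    public function __construct($client)
    {
        $this->client = $client;
    }

    public function extract(string $path): string
    {
        // Delegates to the injected client
        return $this->client->getText($path);
    }
}

// Wiring with the real library (needs Composer's autoloader, and a reachable
// Tika server once extract() is called):
// $extractor = new DocumentTextExtractor(\Vaites\ApacheTika\Client::prepare('localhost', 9998));
// echo $extractor->extract('/path/to/your/document');
```

Because prepare() delays the check, the client can be built in a container or bootstrap file even when the server is not up yet.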

Use the class to extract text from documents:

$language = $client->getLanguage('/path/to/your/document');
$metadata = $client->getMetadata('/path/to/your/document');

$html = $client->getHTML('/path/to/your/document');
$text = $client->getText('/path/to/your/document');

Or use it to extract text from images:

$client = \Vaites\ApacheTika\Client::make($host, $port);
$metadata = $client->getMetadata('/path/to/your/image');

$text = $client->getText('/path/to/your/image');

You can use a URL instead of a file path and the library will download the file and pass it to Apache Tika. There's no need to add -enableUnsecureFeatures -enableFileUrl to the command line when starting the server, as described here.
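A minimal sketch of the remote case follows. extractRemoteText() is a hypothetical helper, not part of the library, and the URL in the usage comment is a placeholder. It switches the internal downloader on explicitly with setDownloadRemote(true), on the assumption that this may be needed; the default value is not stated in this document.

```php
<?php
/**
 * Hypothetical helper (not part of the library): extract text from a remote
 * document by letting the library download it first.
 *
 * @param object $client a \Vaites\ApacheTika\Client instance
 */
function extractRemoteText($client, string $url): string
{
    // Reject anything that is not a well-formed URL before calling Tika
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        throw new InvalidArgumentException("Not a valid URL: $url");
    }

    // Use the internal downloader, so the Tika server does not need
    // -enableUnsecureFeatures -enableFileUrl
    $client->setDownloadRemote(true);

    return $client->getText($url);
}

// Usage with the real library (placeholder URL):
// $client = \Vaites\ApacheTika\Client::make('localhost', 9998);
// echo extractRemoteText($client, 'https://example.com/sample.pdf');
```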

Methods

Here is the full list of available methods:

Common

Tika file related methods:

$client->getMetadata($file);
$client->getRecursiveMetadata($file, 'text');
$client->getLanguage($file);
$client->getMIME($file);
$client->getHTML($file);
$client->getXHTML($file); // only CLI mode
$client->getText($file);
$client->getMainText($file);

Other Tika related methods:

$client->getSupportedMIMETypes();
$client->getIsMIMETypeSupported('application/pdf');
$client->getAvailableDetectors();
$client->getAvailableParsers();
$client->getVersion();

Encoding methods:

$client->getEncoding();
$client->setEncoding('UTF-8');

Supported versions related methods:

$client->getSupportedVersions();
$client->isVersionSupported($version);

Set/get a callback for sequential read of response:

$client->setCallback($callback);
$client->getCallback();

Set/get the chunk size for sequential read:

$client->setChunkSize($size);
$client->getChunkSize();
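The callback and chunk size can be combined to process a large response piece by piece. The sketch below is a hypothetical helper (streamText() is not part of the library) and rests on two assumptions that this document does not confirm: the callback receives each chunk as a single string argument, and the chunk size is expressed in bytes.

```php
<?php
/**
 * Hypothetical helper (not part of the library): extract text from a large
 * file while handling the response chunk by chunk.
 *
 * Assumption: the callback is invoked with one string argument per chunk.
 *
 * @param object $client a \Vaites\ApacheTika\Client instance
 * @return int total number of bytes passed to the callback
 */
function streamText($client, string $file, callable $onChunk, int $chunkSize = 65536): int
{
    $received = 0;

    $client->setChunkSize($chunkSize); // assumed to be in bytes
    $client->setCallback(function ($chunk) use (&$received, $onChunk) {
        $received += strlen($chunk); // keep a running total
        $onChunk($chunk);            // hand the chunk to the caller
    });

    $client->getText($file);

    return $received;
}

// Usage with the real library: print the text as it arrives
// $client = \Vaites\ApacheTika\Client::make('localhost', 9998);
// $bytes = streamText($client, '/path/to/your/document', function ($chunk) {
//     echo $chunk;
// });
```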

Enable/disable the internal remote file downloader:

$client->setDownloadRemote(true);
$client->getDownloadRemote();

Command line client

Set/get JAR/Java paths (only CLI mode):

$client->setPath($path);
$client->getPath();

$client->setJava($java);
$client->getJava();

$client->setJavaArgs('-JXmx4g');
$client->getJavaArgs();

$client->setEnvVars(['LANG' => 'es_ES.UTF-8']);
$client->getEnvVars();

Web client

Set/get host properties:

$client->setHost($host);
$client->getHost();

$client->setPort($port);
$client->getPort();

$client->setUrl($url);
$client->getUrl();

$client->setRetries($retries);
$client->getRetries();

Set/get cURL client options:

$client->setOptions($options);
$client->getOptions();
$client->setOption($option, $value);
$client->getOption($option);

Set/get cURL client common options:

$client->setTimeout($seconds);
$client->getTimeout();

Breaking changes

Since version 1.0 there are some breaking changes:

  • Apache Tika versions prior to 1.15 are not supported (use 0.x version for 1.14 and older)
  • PHP minimum requirement is 7.2 or greater (use 0.x version for 7.1 and older)
  • $client->getRecursiveMetadata() returns an array as expected
  • Client::getSupportedVersions() and Client::isVersionSupported() methods cannot be called statically
  • Values returned by Client::getAvailableDetectors() and Client::getAvailableParsers() are identical and have a new definition

See CHANGELOG.md for more details.

Troubleshooting

Empty responses or unexpected results

This library is only a proxy, so if you get an empty response or unexpected results the most common cause is Tika itself. A simple test is to use the GUI to check the response:

  1. Run the Tika app without arguments: java -jar tika-app-x.xx.jar
  2. Drop your file or select it using File -> Open
  3. Wait until the metadata appears
  4. Get the text or HTML using View menu

If the results are the same, check Tika's Jira and open an issue if necessary.

Encoding

By default the returned text is encoded in UTF-8, but there are some encoding issues when using app mode. The Client::setEncoding() method lets you set the expected encoding (this will be fixed in the upcoming 1.0 release).

Tests

Tests are designed to cover all features for all supported versions of Apache Tika in app mode and server mode. There are a few samples to test against:

  • sample1: document metadata and text extraction
  • sample2: image metadata
  • sample3: text recognition
  • sample4: unsupported media
  • sample5: huge text for callbacks
  • sample6: remote calls
  • sample7: text encoding
  • sample8: recursive metadata

Known issues

Some issues were found during tests that are not related to this library:

  • Tesseract slows down document parsing as described in TIKA-2359

Integrations
