shadiakiki1986 / Docker Fscrawler

Dockerfile for https://github.com/dadoonet/fscrawler

Projects that are alternatives of or similar to Docker Fscrawler

Demo Scene
👾Scripts and samples to support Confluent Demos and Talks. ⚠️Might be rough around the edges ;-) 👉For automated tutorials and QA'd code, see https://github.com/confluentinc/examples/
Stars: ✭ 806 (+3258.33%)
Mutual labels:  elasticsearch
Elasticsearch Query Builder
Build query for an ElasticSearch client using a fluent interface
Stars: ✭ 18 (-25%)
Mutual labels:  elasticsearch
Elasticsearch Readonlyrest Plugin
Free Elasticsearch security plugin and Kibana security plugin: super-easy Kibana multi-tenancy, Encryption, Authentication, Authorization, Auditing
Stars: ✭ 917 (+3720.83%)
Mutual labels:  elasticsearch
Szt Bigdata
Shenzhen Metro big-data passenger-flow analysis system 🚇🚄🌟
Stars: ✭ 826 (+3341.67%)
Mutual labels:  elasticsearch
Great Big Example Application
A full-stack example app built with JHipster, Spring Boot, Kotlin, Angular 4, ngrx, and Webpack
Stars: ✭ 899 (+3645.83%)
Mutual labels:  elasticsearch
Hugo Elasticsearch
Generate Elasticsearch indexes for Hugo static sites by parsing front matter
Stars: ✭ 19 (-20.83%)
Mutual labels:  elasticsearch
Springbootexamples
Spring Boot learning tutorials
Stars: ✭ 794 (+3208.33%)
Mutual labels:  elasticsearch
Kafka Connect Elastic Sink
Kafka connect Elastic sink connector, with just in time index/delete behaviour.
Stars: ✭ 23 (-4.17%)
Mutual labels:  elasticsearch
Laravel Docker Elasticsearch
This is a simple repo for practicing elasticsearch with laravel and docker.
Stars: ✭ 18 (-25%)
Mutual labels:  elasticsearch
Kafka Connect Elasticsearch Source
Kafka Connect Elasticsearch Source
Stars: ✭ 22 (-8.33%)
Mutual labels:  elasticsearch
Complete Guide To Elasticsearch
Contains all of the queries used within the Complete Guide to Elasticsearch course.
Stars: ✭ 829 (+3354.17%)
Mutual labels:  elasticsearch
Scalable Image Matching
This is an image matching system for scalable and efficient matching of images against a large database. The basic idea is to compute a perceptual hash (pHash) for each image and compare similarity based on the computed pHash. Search scales with elasticsearch as the backend database.
Stars: ✭ 17 (-29.17%)
Mutual labels:  elasticsearch
Odsc 2020 nlp
Repository for ODSC talk related to Deep Learning NLP
Stars: ✭ 20 (-16.67%)
Mutual labels:  elasticsearch
Datastream.io
An open-source framework for real-time anomaly detection using Python, ElasticSearch and Kibana
Stars: ✭ 814 (+3291.67%)
Mutual labels:  elasticsearch
Docker Kibana
Kibana Docker image including search-guard
Stars: ✭ 22 (-8.33%)
Mutual labels:  elasticsearch
Serverless Appsync Plugin
serverless plugin for appsync
Stars: ✭ 804 (+3250%)
Mutual labels:  elasticsearch
Elasticsearchdemo
An ElasticSearch + Spring Boot example that performs full-text search over local text and other files
Stars: ✭ 18 (-25%)
Mutual labels:  elasticsearch
Elastic Muto
Easy expressive search queries for Elasticsearch
Stars: ✭ 24 (+0%)
Mutual labels:  elasticsearch
Search Spring Boot Starter
An ElasticSearch wrapper based on ES 6.4.2 that greatly simplifies working with ES
Stars: ✭ 23 (-4.17%)
Mutual labels:  elasticsearch
Fscrawler
Elasticsearch File System Crawler (FS Crawler)
Stars: ✭ 906 (+3675%)
Mutual labels:  elasticsearch

docker-fscrawler Build Status

Dockerfile for fscrawler

Published on docker hub here.

Mostly inspired by elasticsearch's alpine dockerfile

Supported tags

  • 2.2 with fscrawler version 2.2 and alpine 3.5
  • 2.4 with fscrawler 2.4 and alpine 3.5
  • 2.5 with fscrawler 2.5 and ubuntu 16.04
  • 2.6 with fscrawler 2.6 and ubuntu 20.04
    • Note: the fscrawler-es5 binary is compatible with elasticsearch version 5, while fscrawler and fscrawler-es6 target version 6
  • (WIP) 2.7-SNAPSHOT-v20201204
    • Note: the fscrawler-es6 binary is compatible with elasticsearch version 6, while fscrawler and fscrawler-es7 target version 7

Dockerfile includes tesseract (via ubuntu 20.04)

Usage Instructions

stand-alone docker

Assuming you have good docker-fu skills, run the fscrawler docker image in folder-indexing mode with:

docker run \
  -it --rm --name my-fscrawler \
  -v <data folder>:/usr/share/fscrawler/data/:ro \
  -v <config folder>:/usr/share/fscrawler/config-mount/<project-name>:ro \
  shadiakiki1986/fscrawler \
  [CLI options]

where

  • data folder is the path to the folder with the files to index
  • config folder is the path to the host fscrawler config dir
    • make sure the URL reference in the config file points to the elasticsearch instance
      • e.g. localhost:9200 if elasticsearch is running locally (see the sketch after this list)
  • if the config folder is not mounted from the host, the container starts with an empty config folder and will prompt the user for Y/N confirmation to create the first project file
  • CLI options are documented here
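
For example, when elasticsearch runs locally, the elasticsearch section of the job's _settings.json might look as follows (a minimal sketch; compare the fuller settings shown in the docker-compose section below):

...
  "elasticsearch" : {
    "nodes" : [ {
      "host" : "localhost",
      "port" : 9200,
      "scheme" : "HTTP"
    } ]
  },
...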

An example set of CLI options is to run fscrawler in REST API mode:

docker run \
  ...
  -p <local port>:8080 \
  shadiakiki1986/fscrawler \
  --loop 0 --rest fscrawler_rest

with docker-compose (file 1)

Assuming you already have good docker-compose-fu skills, check docker-compose.yml.

To use it:

echo "vm.max_map_count=262144"| sudo tee -a /etc/sysctl.conf
docker-compose pull
docker-compose build
docker-compose up

with docker-compose (file 2)

Docker-fscrawler can be used in coordination with an elasticsearch docker container or an elasticsearch instance running natively on the host machine. To make coordination between the ES and fscrawler containers easy, it is recommended to use docker-compose, as described here.

Make sure you have set vm.max_map_count=262144, either by putting it in /etc/sysctl.conf and running sudo sysctl -p, or by whatever other means is convenient to you. This is necessary for elasticsearch (see Ref).
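
For example, to persist the setting and apply it immediately:

echo "vm.max_map_count=262144" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p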

Download

Download the following files from this git repository. Cloning the whole repository is not necessary.

  • docker-compose.yml (single-node) or docker-compose-deployment.yml (multi-node)
  • build/elasticsearch/docker-healthcheck

Make a new empty folder and put these two files in it. This directory will be the home of your configurations, and the location from which you can control your containers and make changes.

If you downloaded docker-compose-deployment.yml, rename it to docker-compose.yml.
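
A minimal sketch of the download steps, assuming the repository's default branch is master and the single-node compose file (the folder name fscrawler-home is arbitrary):

mkdir fscrawler-home && cd fscrawler-home
curl -LO https://raw.githubusercontent.com/shadiakiki1986/docker-fscrawler/master/docker-compose.yml
# keep the healthcheck at the path referenced by docker-compose.yml
mkdir -p build/elasticsearch
curl -Lo build/elasticsearch/docker-healthcheck \
  https://raw.githubusercontent.com/shadiakiki1986/docker-fscrawler/master/build/elasticsearch/docker-healthcheck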

Optional: Configure Containers
  • Make a file here called .env, in which you can configure the docker containers.
  • Add the line TARGET_DIR=/path/to/directory/you/want/to/index. If you don't add this line, it will default to ./data/.
  • Add the line JOB_NAME=name_to_give_your_index. This will be the name of the fscrawler job and the ES index. If you don't add this line, it will default to fscrawler_job. (An example .env follows this list.)
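
For example, a minimal .env with hypothetical values (both lines are optional, with the defaults noted above):

TARGET_DIR=/home/me/documents
JOB_NAME=my_docs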

Configure fscrawler

Now run

docker-compose run fscrawler

Respond with Y to the question of whether to create a new config.

Edit the newly created config/fscrawler_job/_settings.json file (you may need to use sudo; the folder name may differ if you are using .env). Change elasticsearch.nodes from 127.0.0.1 to elasticsearch1, so that it reads as follows.

...
  "elasticsearch" : {
    "nodes" : [ {
      "host" : "elasticsearch1",
      "port" : 9200,
      "scheme" : "HTTP"
    } ],
    "bulk_size" : 100,
    "flush_interval" : "5s"
  },
...

For the rest of the settings in this file, you can choose your own based on the options documented here. Do not change fs.url unless you also change the corresponding line in docker-compose.yml; otherwise fscrawler won't be able to find your files.

Test

Populate data/ or the directory you specified in .env with some files you would like to index.

Run the following.

docker-compose up -d elasticsearch1 elasticsearch2
docker-compose up -d fscrawler

fscrawler should then upload the test files you put in data/. To check that all is well, query elasticsearch over HTTP (substitute your own job name for fscrawler_job if you set one in .env):

curl http://localhost:9200/fscrawler_job/_search | jq

If you see all your documents here, you should be good to go!
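
To compare against the number of files you indexed, you can also ask elasticsearch for a document count (again substituting your own job name):

curl http://localhost:9200/fscrawler_job/_count | jq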

Troubleshooting

If you don't see all your documents, use the following command to get more detailed logs.

docker-compose run fscrawler --config_dir /usr/share/fscrawler/config fscrawler_job --restart --debug

Hopefully these logs will make it clear what went wrong. Failing that, you can use --trace instead of --debug for even more detailed logs. You can also use --restart whenever you want to re-index everything (otherwise files are only re-indexed when they are touched).

Additional options for docker-compose run fscrawler can be found here.

Additional Usage Examples

Example 1

Using docker-compose, start up elasticsearch and run fscrawler on the files in test/data every 15 minutes:

docker-compose up elasticsearch1 fscrawler

For the remaining examples, the default config depends on a running elasticsearch instance on localhost at port 9200. Start one with:

# [Ref](https://github.com/docker-library/elasticsearch/issues/111)
sudo sysctl -w vm.max_map_count=262144

docker-compose run -p 9200:9200 -d elasticsearch1

For the versions of the docker-compose file, docker-compose, and docker, check the travis builds.

Notice that the docker-compose fscrawler service is wired to wait for a healthcheck in elasticsearch. In the case of a manual launch of elasticsearch:

  • wait for around 15 seconds,
  • or watch the logs,
  • or check http://$host:9200/_cat/health?h=status, where you need to wait for yellow or green depending on your application (a minimal wait loop is sketched below)
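
A minimal wait loop for the last option, assuming elasticsearch is reachable on localhost:9200 and yellow is sufficient:

# poll cluster health until it reports yellow or green
until curl -s "http://localhost:9200/_cat/health?h=status" | grep -qE "yellow|green"; do
  sleep 2
done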

Example 2

To index the test files provided in this repo:

docker run -it --rm \
  --net="host" \
  --name my-fscrawler \
  -v $PWD/test/data/:/usr/share/fscrawler/data/:ro \
  shadiakiki1986/fscrawler

Example 3

Same as the example above, but with --loop 1 so that it runs only once:

docker run -it --rm \
  --net="host" \
  --name my-fscrawler \
  -v $PWD/test/data/:/usr/share/fscrawler/data/:ro \
  -v $PWD/config/myjob:/usr/share/fscrawler/config-mount/myjob:ro \
  shadiakiki1986/fscrawler \
    --config_dir /usr/share/fscrawler/config \
    --loop 1 \
    --trace \
    myjob

Building locally

To build the docker image:

git clone https://github.com/shadiakiki1986/docker-fscrawler
docker build -t shadiakiki1986/fscrawler:local . # or use version instead of "local"

To test against elasticsearch locally, follow the steps in .travis.yml.

Updating

To update fscrawler in this docker container:

  • install docker (instructions for linux: link)

  • install docker-compose (instructions for linux: link)

  • update the version numbers used in Dockerfile

    • (deprecated) also update the URL to the maven zip file to download
  • test that the image builds

    • docker build -t shadiakiki1986/fscrawler:2.6 .
    • docker build -t shadiakiki1986/fscrawler:2.7-SNAPSHOT-20201204 .
  • test that it runs (check the section "Usage / with docker-compose (file 1)" above, or run the tests in the .travis.yml file)

  • commit, tag, push to github

To update the automated build on hub.docker.com

  • the "latest" tag will get re-built automatically with the push above
  • to add a new version tag, go to the build settings and add it manually, then click save and trigger

To update elasticsearch in the docker-compose file for testing purposes (e.g. .travis.yml):

  • edit build/elasticsearch/Dockerfile by changing the FROM image
  • follow the steps in .travis.yml

Changelog

Version 2.6 (2020-12-04)

  • update fscrawler from 2.6-SNAPSHOT to 2.6
  • update the ubuntu base image from 16.04 to 20.04, etc.
  • support fscrawler{,-es5,-es6}

Version 2.6-SNAPSHOT (2018-10-08)

  • update fscrawler from 2.5 to 2.6-SNAPSHOT (master branch as of today)

Version 2.5.2 (2018-10-08)

  • docker-compose.yml updates
    • update base elasticsearch image to be 6.4 from 6.1
    • bring back the file crawl service
    • make the elasticsearch healthcheck target yellow as a "minimum", now that 6.4 shows green instead of yellow even with only 1 node

Version 2.5.1 (2018-10-08)

  • using fscrawler 2.5

Version 2.4.2 (2018-10-04)

  • change the main base image to ubuntu instead of alpine linux
    • move the alpine linux image into an "alpine" folder
    • move the ubuntu linux image out of the "ubuntu" folder

Version 2.4 (2017-12-27)

Version 2.2 (2017-02-22)

  • use fscrawler 2.2