shadiakiki1986 / Docker Fscrawler

Dockerfile for https://github.com/dadoonet/fscrawler

docker-fscrawler

Dockerfile for fscrawler

Published on Docker Hub as shadiakiki1986/fscrawler.

Mostly inspired by elasticsearch's alpine dockerfile

Supported tags

2.2 with fscrawler version 2.2 and alpine 3.5
2.4 with fscrawler 2.4 and alpine 3.5
2.5 with fscrawler 2.5 and ubuntu 16.04
2.6 with fscrawler 2.6 and ubuntu 20.04
- Note: the binary name fscrawler-es5 is compatible with elasticsearch version 5, versus fscrawler and fscrawler-es6 with version 6
(WIP) 2.7-SNAPSHOT-v20201204
- Note: the binary name fscrawler-es6 is compatible with elasticsearch version 6, versus fscrawler and fscrawler-es7 with version 7
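
The two notes above can be summarised as a small lookup. The following is a minimal sketch only: the function name fscrawler_binary is hypothetical and not part of this repository, and it simply encodes the tag and Elasticsearch-version pairs listed above.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with the image): print the fscrawler
# binary name that matches an image tag and an Elasticsearch major version,
# following the notes in the tag list above.
fscrawler_binary() {
  tag="$1"       # image tag, e.g. 2.6
  es_major="$2"  # Elasticsearch major version, e.g. 5, 6 or 7
  case "$tag:$es_major" in
    2.6:5)           echo "fscrawler-es5" ;;
    2.6:6)           echo "fscrawler-es6" ;;  # plain "fscrawler" also targets version 6
    2.7-SNAPSHOT*:6) echo "fscrawler-es6" ;;
    2.7-SNAPSHOT*:7) echo "fscrawler-es7" ;;  # plain "fscrawler" also targets version 7
    *)
      echo "unsupported combination: tag $tag, elasticsearch $es_major" >&2
      return 1
      ;;
  esac
}
```

For example, fscrawler_binary 2.6 5 prints fscrawler-es5. How a specific binary is selected inside the container depends on the image's entrypoint, which this sketch does not cover.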

Dockerfile includes tesseract (via ubuntu 20.04)

Usage Instructions

stand-alone docker

Assuming you are comfortable with docker, run the fscrawler docker image in folder-indexing mode as follows:

docker run \
  -it --rm --name my-fscrawler \
  -v <data folder>:/usr/share/fscrawler/data/:ro \
  -v <config folder>:/usr/share/fscrawler/config-mount/<project-name>:ro \
  shadiakiki1986/fscrawler \
  [CLI options]

where

<data folder> is the path to the folder with the files to index
<config folder> is the path to the fscrawler config directory on the host
- make sure the config file uses the proper URL to point to the elasticsearch instance
  - e.g. localhost:9200 if elasticsearch is running locally
- if the config folder is not mounted from the host, the container starts with an empty config folder and prompts for confirmation (Y/N) before creating the first project file
[CLI options] are the fscrawler command-line options, documented here

An example set of CLI options is to run fscrawler in REST API mode:

docker run \
  ...
  -p <local port>:8080 \
  shadiakiki1986/fscrawler \
  --loop "0" --rest fscrawler_rest

with docker-compose (file 1)

Assuming you are comfortable with docker-compose, see docker-compose.yml.

To use it:

echo "vm.max_map_count=262144" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p  # reload /etc/sysctl.conf so the new value takes effect now
docker-compose pull
docker-compose build
docker-compose up

with docker-compose (file 2)

Docker-fscrawler can be used in coordination with an elasticsearch docker container or an elasticsearch instance running natively on the host machine. To make coordination between the ES and fscrawler containers easy, it is recommended to use docker-compose, as described here.

Make sure vm.max_map_count=262144 is set, either by adding it to /etc/sysctl.conf and running sudo sysctl -p, or by whatever other means is convenient. Elasticsearch requires this (see https://github.com/docker-library/elasticsearch/issues/111).
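
As a minimal sketch of that check, the helper below raises the value only when the current one is too low. The function names are hypothetical; sysctl -n and sysctl -w are the standard Linux commands for reading and setting a kernel parameter.

```shell
#!/bin/sh
# Succeeds (exit status 0) when the current value is below the required one.
needs_map_count_bump() {
  current="$1"
  required="${2:-262144}"
  [ "$current" -lt "$required" ]
}

# Hypothetical wrapper: read the live value and raise it only if needed.
# Assumes a Linux host where sysctl and sudo are available.
ensure_map_count() {
  current="$(sysctl -n vm.max_map_count)"
  if needs_map_count_bump "$current"; then
    sudo sysctl -w vm.max_map_count=262144
  fi
}
```

Calling ensure_map_count changes the running kernel only. A value set with sysctl -w is lost on reboot, so the /etc/sysctl.conf entry is still needed for a permanent setting.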

Download

Download the following files from this git repository. Cloning the whole repository is not necessary.

docker-compose.yml (single-node) or docker-compose-deployment.yml (multi-node)
build/elasticsearch/docker-healthcheck

Make a new empty folder and put these two files in it. This directory will be the home of your configurations, and the location from which you can control your containers and make changes.

If you downloaded docker-compose-deployment.yml, rename it to docker-compose.yml.

Optional: Configure Containers

Create a file named .env in this folder. It configures the docker containers.
Add the line TARGET_DIR=/path/to/directory/you/want/to/index. If you don't add this line, it will default to ./data/.
Add the line JOB_NAME=name_to_give_your_index. This will be the name of the fscrawler job and of the ES index. If you don't add this line, it will default to fscrawler_job.
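
The two lines can also be written from the shell. This is a minimal sketch: write_env_file is a hypothetical helper, and the values in the usage example are placeholders to replace with your own.

```shell
#!/bin/sh
# Hypothetical helper: write a .env file with the two variables described
# above into the given folder (the one that holds docker-compose.yml).
write_env_file() {
  dir="$1"         # folder containing docker-compose.yml
  target_dir="$2"  # directory to index
  job_name="$3"    # fscrawler job name, also used as the ES index name
  {
    printf 'TARGET_DIR=%s\n' "$target_dir"
    printf 'JOB_NAME=%s\n' "$job_name"
  } > "$dir/.env"
}
```

For example, write_env_file . /srv/documents my_index creates ./.env with those two entries, overwriting any existing .env in that folder.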

Configure fscrawler

Now run

docker-compose run fscrawler

Respond with Y to the question of whether to create a new config.

Edit the newly created config/fscrawler_job/_settings.json file (you may need to use sudo, and the folder name may be different if you are using .env). Change elasticsearch.nodes from 127.0.0.1 to elasticsearch1, so that it reads as follows.

...
  "elasticsearch" : {
    "nodes" : [ {
      "host" : "elasticsearch1",
      "port" : 9200,
      "scheme" : "HTTP"
    } ],
    "bulk_size" : 100,
    "flush_interval" : "5s"
  },
...

For the rest of the settings in this file, you can choose your own values based on the options documented here. Do not change fs.url unless you also change the corresponding line in docker-compose.yml, or else fscrawler won't be able to find your files.
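
The host change can also be scripted instead of made by hand. The sketch below assumes the generated file contains the host as a "host" : "127.0.0.1" pair, as in the excerpt above, and that the new host is a plain hostname without / or & characters. set_es_host is a hypothetical helper, not something the image provides.

```shell
#!/bin/sh
# Hypothetical helper: replace the default 127.0.0.1 host in an fscrawler
# settings file with another hostname. Writes to a temporary file first so
# the original is only replaced when sed succeeds.
set_es_host() {
  settings="$1"  # e.g. config/fscrawler_job/_settings.json
  new_host="$2"  # e.g. elasticsearch1
  tmp="$settings.tmp"
  sed 's/\("host"[[:space:]]*:[[:space:]]*\)"127\.0\.0\.1"/\1"'"$new_host"'"/' \
    "$settings" > "$tmp" && mv "$tmp" "$settings"
}
```

For example, set_es_host config/fscrawler_job/_settings.json elasticsearch1. If the file was created by the container and belongs to root, run the helper from a root shell or adjust the file ownership first.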

Test

Populate data/ or the directory you specified in .env with some files you would like to index.

Run the following.

docker-compose up -d elasticsearch1 elasticsearch2
docker-compose up -d fscrawler

fscrawler should then upload the test files you put in data/. To check that all is well, query elasticsearch over HTTP (substitute your own job name for fscrawler_job if you set one in .env):

curl http://localhost:9200/fscrawler_job/_search | jq

If you see all your documents here, you should be good to go!
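
If you only want the number of indexed documents, Elasticsearch's _count endpoint returns it directly. The following is a minimal sketch that avoids jq. The function names are hypothetical, and the host and port are assumed to be the localhost:9200 used above.

```shell
#!/bin/sh
# Pull the numeric "count" field out of a _count response read from stdin.
extract_count() {
  sed -n 's/.*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Hypothetical wrapper: ask elasticsearch how many documents the index holds.
# Assumes elasticsearch is reachable on localhost:9200.
indexed_count() {
  index="${1:-fscrawler_job}"
  curl -s "http://localhost:9200/$index/_count" | extract_count
}
```

Run indexed_count (or indexed_count my_index for a custom job name) and compare the result with the number of files you placed in the data folder.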

Troubleshooting

If you don't see all your documents, use the following command to get more detailed logs.

docker-compose run fscrawler --config_dir /usr/share/fscrawler/config fscrawler_job --restart --debug

Hopefully these logs will make it clear what went wrong. Failing that, you can use --trace instead of --debug for even more detailed logs. You can also use --restart whenever you want to re-index everything (otherwise files are only re-indexed when they are touched).

Additional options for docker-compose run fscrawler can be found here.

Additional Usage Examples

Example 1

Using docker-compose, start up elasticsearch and run fscrawler on the files in test/data every 15 minutes:

docker-compose up elasticsearch1 fscrawler

For the remaining examples, the default config depends on having a running elasticsearch instance on the localhost at port 9200. Start one with:

# Ref: https://github.com/docker-library/elasticsearch/issues/111
sudo sysctl -w vm.max_map_count=262144

docker-compose run -p 9200:9200 -d elasticsearch1

For the versions of the docker-compose file, docker-compose, and docker, check the travis builds

Notice that the docker-compose fscrawler service is wired to wait for a healthcheck in elasticsearch. In the case of a manual launch of elasticsearch:

wait for around 15 seconds,
or watch the logs,
or check http://$host:9200/_cat/health?h=status where you need to wait for yellow or green, depending on your application
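
The third option can be automated with a small polling loop. This is a minimal sketch: the function names, the default of 30 attempts and the 2-second pause are assumptions for illustration, and it accepts both yellow and green, so tighten is_ready_status if your application needs green.

```shell
#!/bin/sh
# Succeeds when the given cluster status is good enough to start crawling.
is_ready_status() {
  case "$1" in
    green|yellow) return 0 ;;
    *)            return 1 ;;
  esac
}

# Hypothetical helper: poll the health endpoint until the status is ready
# or the attempts run out.
wait_for_es() {
  host="${1:-localhost}"
  tries="${2:-30}"
  while [ "$tries" -gt 0 ]; do
    # _cat/health?h=status prints only the status column; strip whitespace.
    status="$(curl -s "http://$host:9200/_cat/health?h=status" | tr -d '[:space:]')"
    if is_ready_status "$status"; then
      echo "elasticsearch is $status"
      return 0
    fi
    tries=$((tries - 1))
    sleep 2
  done
  echo "elasticsearch did not become ready" >&2
  return 1
}
```

A typical use is wait_for_es localhost && docker run ... so that fscrawler only starts once elasticsearch answers.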

Example 2

To index the test files provided in this repo

docker run -it --rm \
  --net="host" \
  --name my-fscrawler \
  -v $PWD/test/data/:/usr/share/fscrawler/data/:ro \
  shadiakiki1986/fscrawler

Example 3

Same as the example above, but with --loop 1 so that it runs only once:

docker run -it --rm \
  --net="host" \
  --name my-fscrawler \
  -v $PWD/test/data/:/usr/share/fscrawler/data/:ro \
  -v $PWD/config/myjob:/usr/share/fscrawler/config-mount/myjob:ro \
  shadiakiki1986/fscrawler \
    --config_dir /usr/share/fscrawler/config \
    --loop 1 \
    --trace \
    myjob

Building locally

To build the docker image

git clone https://github.com/shadiakiki1986/docker-fscrawler
cd docker-fscrawler  # the Dockerfile is in the cloned repository
docker build -t shadiakiki1986/fscrawler:local . # or use version instead of "local"

To test against elasticsearch locally, follow steps in .travis.yml

Updating

To update fscrawler in this docker container:

install docker (instructions for linux: link)
install docker-compose (instructions for linux: link)
update the version numbers used in Dockerfile
- (deprecated) also update the URL to the maven zip file to download
test that the image builds
- docker build -t shadiakiki1986/fscrawler:2.6 .
- docker build -t shadiakiki1986/fscrawler:2.7-SNAPSHOT-20201204 .
test that the image runs (see the section "with docker-compose (file 1)" above, or run the tests in the .travis.yml file)
commit, tag, push to github

To update the automated build on hub.docker.com

the "latest" tag will get re-built automatically with the push above
to add a new version tag, go to the build settings and add it manually, then click save and trigger

To update elasticsearch in the docker-compose for the purpose of testing (e.g. .travis.yml)

edit build/elasticsearch/Dockerfile by changing FROM image
follow steps in .travis.yml

Changelog

Version 2.6 (2020-12-04)

update fscrawler from 2.6-SNAPSHOT to 2.6
update ubuntu base image from 16.04 to 20.04, etc
support fscrawler{,-es5,-es6}

Version 2.6-SNAPSHOT (2018-10-08)

update fscrawler from 2.5 to 2.6-SNAPSHOT (master branch as of today)

Version 2.5.2 (2018-10-08)

docker-compose.yml updates
- update base elasticsearch image to be 6.4 from 6.1
- bring back the file crawl service
- elasticsearch healthcheck now targets yellow as a "minimum", since 6.4 shows green instead of yellow even with a single node

Version 2.5.1 (2018-10-08)

using fscrawler 2.5

Version 2.4.2 (2018-10-04)

change the main base image to be ubuntu instead of alpine linux
- move the alpine linux image into a "alpine" folder
- move the ubuntu linux image out of the "ubuntu" folder

Version 2.4 (2017-12-27)

update fscrawler from 2.2 to 2.4
use config-mount for mounting config folder into fscrawler docker container
update elasticsearch service from 5.1.2 to 6.1.1
- elasticsearch 5.1.2 was not working with fscrawler 2.4 anyway because of https://github.com/dadoonet/fscrawler/issues/472
replace git submodule of my fork of elasticsearch-docker with just build/elasticsearch/Dockerfile
- the purpose of the fork was to push healthchecks into upstream, but my PR was rejected
- fork was at https://github.com/shadiakiki1986/elasticsearch-docker
- PR was at https://github.com/elastic/elasticsearch-docker/pull/27
- argumentation at https://github.com/elastic/elasticsearch-docker/issues/60
- proposed solution of just using docker-compose healthcheck would be too long in order to wait for "green" status

Version 2.2 (2017-02-22)

use fscrawler 2.2
