miku / Esbulk

Licence: gpl-3.0

Bulk indexing command line tool for elasticsearch

esbulk

Fast parallel command line bulk loading utility for elasticsearch. Data is read from a newline delimited JSON file or stdin and indexed into elasticsearch in bulk and in parallel. The shortest command would be:

$ esbulk -index my-index-name < file.ldj

Caveat: If indexing pressure on the bulk API is too high (dozens or hundreds of parallel workers, large batch sizes, depending on your setup), esbulk will halt and report an error:

$ esbulk -index my-index-name -w 100 file.ldj
2017/01/02 16:25:25 error during bulk operation, try less workers (lower -w value) or
                    increase thread_pool.bulk.queue_size in your nodes

Please note that, in such a case, some documents are indexed and some are not. Your index will be in an inconsistent state, since there is no transactional bracket around the indexing process.

However, using the defaults (parallelism: number of cores) on a single-node setup will just work. For larger clusters, increase the number of workers until you see full CPU utilization. After that, more workers won't buy any more speed.

Installation

$ go get github.com/miku/esbulk/cmd/esbulk

For deb or rpm packages, see: https://github.com/miku/esbulk/releases

intenthq also provides a Docker image at intenthq/esbulk-docker (thanks @albertpastrana, #25).

Run:

$ docker run -it --rm intenthq/esbulk-docker esbulk -v
0.5.1

Since 0.5.2 (May 2019) the repo includes a Dockerfile. It uses a multi-stage build and a FROM scratch base, which allows for a lightweight 7.85MB image.

$ git clone https://github.com/miku/esbulk.git
$ cd esbulk
$ make image # use make rmi to cleanup
$ docker run -it --rm esbulk:0.5.2 -v
0.5.2

Or, via hub/cloud:

$ docker run -it --rm tirtir/esbulk -v
0.5.2

Usage

$ esbulk -h
Usage of esbulk:
  -0    set the number of replicas to 0 during indexing
  -cpuprofile string
        write cpu profile to file
  -id string
        name of field to use as id field, by default ids are autogenerated
  -index string
        index name
  -mapping string
        mapping string or filename to apply before indexing
  -memprofile string
        write heap profile to file
  -purge
        purge any existing index before indexing
  -r string
        Refresh interval after import (default "1s")
  -server value
        elasticsearch server, this works with https as well
  -size int
        bulk batch size (default 1000)
  -skipbroken
        skips broken json lines
  -type string
        elasticsearch doc type (default "default")
  -u string
        http basic auth username:password, like curl -u
  -v    prints current program version
  -verbose
        output basic progress
  -w int
        number of workers to use (default 4)
  -z    unzip gz'd file on the fly
  -p string
        pipeline to use to preprocess documents

To index a JSON file that contains one document per line, just run:

$ esbulk -index example file.ldj

Where file.ldj is line delimited JSON, like:

{"name": "esbulk", "version": "0.2.4"}
{"name": "estab", "version": "0.1.3"}
...
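For orientation, the Elasticsearch bulk API expects every document to be preceded by an action line. The following Python sketch shows how line-delimited JSON could be turned into such a request body. It is an illustration only, not esbulk's actual (Go) implementation, and the function name bulk_body is made up for this example.

```python
import json


def bulk_body(lines, index):
    """Build a _bulk request body from line-delimited JSON (hypothetical helper)."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        json.loads(line)  # raises ValueError if the line is not valid JSON
        parts.append(json.dumps({"index": {"_index": index}}))  # action line
        parts.append(line)  # document source, passed through unchanged
    # The bulk API requires the body to end with a newline.
    return "\n".join(parts) + "\n" if parts else ""


docs = [
    '{"name": "esbulk", "version": "0.2.4"}',
    '{"name": "estab", "version": "0.1.3"}',
]
body = bulk_body(docs, "example")
print(body, end="")
```

Each input document becomes two lines in the body: the action line and the unchanged source.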

By default esbulk will use as many parallel workers as there are cores. To tweak the indexing process, adjust the -size and -w parameters.
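To make the roles of -size and -w more concrete, here is a rough Python sketch of the general pattern: input lines are grouped into batches, and a fixed number of workers send those batches concurrently. This is not how esbulk itself is written (it is a Go program). The names batches, index_all and send are assumptions for this example, and send is a stand-in for a real bulk request.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batches(iterable, size):
    """Yield lists of at most `size` items (the role of -size)."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def index_all(lines, size, workers, send):
    """Send batches with `workers` threads (the role of -w); return the total sent."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(send, batches(lines, size)))


# Stand-in sender: just reports how many documents a batch holds.
lines = ('{"n": %d}' % i for i in range(2500))
sent = index_all(lines, size=1000, workers=4, send=len)
print(sent)
```

Larger batches mean fewer requests, while more workers mean more requests in flight at once. Both increase pressure on the bulk queue of the cluster.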

You can index from gzipped files as well, using the -z flag:

$ esbulk -z -index example file.ldj.gz
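Decompressing on the fly means the file is never unpacked to disk: lines are decoded as they are read. A minimal Python sketch of that idea follows. The helper read_ldj is hypothetical and only illustrates the concept, not esbulk's implementation.

```python
import gzip
import io
import json


def read_ldj(fileobj, gzipped=False):
    """Yield documents from a binary line-delimited JSON stream (hypothetical helper)."""
    if gzipped:
        # Wrap the stream so decompression happens while iterating.
        fileobj = gzip.GzipFile(fileobj=fileobj)
    for raw in fileobj:
        raw = raw.strip()
        if raw:
            yield json.loads(raw)


# In-memory stand-in for file.ldj.gz
compressed = gzip.compress(b'{"name": "esbulk"}\n{"name": "estab"}\n')
docs = list(read_ldj(io.BytesIO(compressed), gzipped=True))
print(docs)
```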

Starting with 0.3.7 the preferred method to set a non-default server hostport is via -server, e.g.

$ esbulk -server https://0.0.0.0:9201

This way, you can use https as well, which was not possible before. Options -host and -port are gone as of esbulk 0.5.0.

Reusing IDs

Since version 0.3.8: If you want to reuse IDs from your documents in elasticsearch, you can specify the ID field via the -id flag:

$ cat file.json
{"x": "doc-1", "db": "mysql"}
{"x": "doc-2", "db": "mongo"}

Here, we would like to reuse the ID from field x.

$ esbulk -id x -index throwaway -verbose file.json
...

$ curl -s http://localhost:9200/throwaway/_search | jq
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "throwaway",
        "_type": "default",
        "_id": "doc-2",
        "_score": 1,
        "_source": {
          "x": "doc-2",
          "db": "mongo"
        }
      },
      {
        "_index": "throwaway",
        "_type": "default",
        "_id": "doc-1",
        "_score": 1,
        "_source": {
          "x": "doc-1",
          "db": "mysql"
        }
      }
    ]
  }
}
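In bulk API terms, reusing an ID means the action line that precedes each document carries an explicit _id instead of leaving it to elasticsearch to generate one. A small Python sketch of that difference is below. The function action_line is an assumption for illustration, and the document type is left out for brevity.

```python
import json


def action_line(doc, index, id_field=None):
    """Return the bulk action line for doc, optionally with an explicit _id."""
    meta = {"_index": index}
    if id_field is not None:
        meta["_id"] = str(doc[id_field])  # take the ID from the document itself
    return json.dumps({"index": meta})


doc = {"x": "doc-1", "db": "mysql"}
print(action_line(doc, "throwaway"))                # ID will be autogenerated
print(action_line(doc, "throwaway", id_field="x"))  # ID taken from field x
```

Indexing the same file twice with an explicit ID overwrites the existing documents, whereas autogenerated IDs would create duplicates.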

Nested ID fields

Version 0.4.3 adds support for nested ID fields:

$ cat fixtures/pr-8-1.json
{"a": {"b": 1}}
{"a": {"b": 2}}
{"a": {"b": 3}}

$ esbulk -index throwaway -id a.b < fixtures/pr-8-1.json
...

Concatenated ID

Version 0.4.3 adds support for IDs that are the concatenation of multiple fields:

$ cat fixtures/pr-8-2.json
{"a": {"b": 1}, "c": "a"}
{"a": {"b": 2}, "c": "b"}
{"a": {"b": 3}, "c": "c"}

$ esbulk -index throwaway -id a.b,c < fixtures/pr-8-2.json
...

      {
        "_index": "xxx",
        "_type": "default",
        "_id": "1a",
        "_score": 1,
        "_source": {
          "a": {
            "b": 1
          },
          "c": "a"
        }
      },
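The two -id forms, a dotted path for nested fields and a comma-separated list for concatenation, can be pictured with the following Python sketch. It is a simplified model under stated assumptions, not esbulk's code: the helpers lookup and make_id are made up, and converting non-string values with str() is an assumption.

```python
def lookup(doc, path):
    """Follow a dotted path such as "a.b" into nested dictionaries."""
    value = doc
    for key in path.split("."):
        value = value[key]
    return value


def make_id(doc, spec):
    """Concatenate the values of all comma-separated paths in spec."""
    return "".join(str(lookup(doc, path)) for path in spec.split(","))


doc = {"a": {"b": 1}, "c": "a"}
print(make_id(doc, "a.b"))    # nested field only
print(make_id(doc, "a.b,c"))  # nested field followed by field c
```

For the first document of the fixture, the spec a.b,c yields 1a, which matches the _id in the search result above.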

Using X-Pack

Since 0.4.2: support for secured elasticsearch nodes:

$ esbulk -u elastic:changeme -index myindex file.ldj
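HTTP basic auth, as used by curl -u, sends the username:password pair base64-encoded in an Authorization header. The Python sketch below shows what such a header looks like. The helper name is an assumption for illustration and says nothing about how esbulk builds its requests internally.

```python
import base64


def basic_auth_header(userpass):
    """Build an HTTP basic auth header from a "username:password" string."""
    token = base64.b64encode(userpass.encode("utf-8")).decode("ascii")
    return {"Authorization": "Basic " + token}


headers = basic_auth_header("elastic:changeme")
print(headers)
```

Base64 is an encoding, not encryption, so credentials sent this way are only protected when the server is reached over https.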

A similar project has been started for solr, called solrbulk.

Contributors

and others.

Measurements

$ csvlook -I measurements.csv
| es    | esbulk | docs      | avg_b | nodes | cores | total_heap_gb | t_s   | docs_per_s | repl |
|-------|--------|-----------|-------|-------|-------|---------------|-------|------------|------|
| 6.1.2 | 0.4.8  | 138000000 | 2000  | 1     | 32    |  64           |  6420 |  22100     | 1    |
| 6.1.2 | 0.4.8  | 138000000 | 2000  | 1     |  8    |  30           | 27360 |   5100     | 1    |
| 6.1.2 | 0.4.8  |   1000000 | 2000  | 1     |  4    |   1           |   300 |   3300     | 1    |
| 6.1.2 | 0.4.8  |  10000000 |   26  | 1     |  4    |   8           |   122 |  81000     | 1    |
| 6.1.2 | 0.4.8  |  10000000 |   26  | 1     | 32    |  64           |    32 | 307000     | 1    |
| 6.2.3 | 0.4.10 | 142944530 | 2000  | 2     | 64    | 128           | 26253 |   5444     | 1    |
| 6.2.3 | 0.4.10 | 142944530 | 2000  | 2     | 64    | 128           | 11113 |  12831     | 0    |
| 6.2.3 | 0.4.13 |  15000000 | 6000  | 2     | 64    | 128           |  2460 |   6400     | 0    |

Why not add a row?
