miku / Esbulk

Licence: gpl-3.0
Bulk indexing command line tool for elasticsearch

Programming Languages

go

Projects that are alternatives of or similar to Esbulk

Xapiand
Xapiand: A RESTful Search Engine
Stars: ✭ 347 (+47.66%)
Mutual labels:  indexing, elasticsearch
Openwisp Monitoring
Network monitoring system written in Python and Django, designed to be extensible, programmable, scalable and easy to use by end users: once the system is configured, monitoring checks, alerts and metric collection happens automatically.
Stars: ✭ 37 (-84.26%)
Mutual labels:  hacktoberfest, elasticsearch
Elasticsearch
The missing elasticsearch ORM for Laravel, Lumen and Native php applications
Stars: ✭ 375 (+59.57%)
Mutual labels:  indexing, elasticsearch
Kibana
Your window into the Elastic Stack
Stars: ✭ 16,820 (+7057.45%)
Mutual labels:  hacktoberfest, elasticsearch
Grafana
The open and composable observability and data visualization platform. Visualize metrics, logs, and traces from multiple sources like Prometheus, Loki, Elasticsearch, InfluxDB, Postgres and many more.
Stars: ✭ 45,930 (+19444.68%)
Mutual labels:  hacktoberfest, elasticsearch
Vue Storefront Api
Vue.js storefront for Magento2 (and not only) - data backend
Stars: ✭ 328 (+39.57%)
Mutual labels:  hacktoberfest, elasticsearch
Hugo Elasticsearch
Generate Elasticsearch indexes for Hugo static sites by parsing front matter
Stars: ✭ 19 (-91.91%)
Mutual labels:  indexing, elasticsearch
Toshi
A full-text search engine in rust
Stars: ✭ 3,373 (+1335.32%)
Mutual labels:  indexing, elasticsearch
Elasticsearch Analysis Openkoreantext
Korean analysis plugin that integrates open-korean-text module into elasticsearch.
Stars: ✭ 101 (-57.02%)
Mutual labels:  hacktoberfest, elasticsearch
Dataengineeringproject
Example end to end data engineering project.
Stars: ✭ 82 (-65.11%)
Mutual labels:  hacktoberfest, elasticsearch
Yii2 Elasticsearch
Yii 2 Elasticsearch extension
Stars: ✭ 401 (+70.64%)
Mutual labels:  hacktoberfest, elasticsearch
Operators
Collection of Kubernetes Operators built with KUDO.
Stars: ✭ 175 (-25.53%)
Mutual labels:  hacktoberfest, elasticsearch
Kafka Elasticsearch Injector
Golang app to read records from a set of kafka topics and write them to an elasticsearch cluster
Stars: ✭ 70 (-70.21%)
Mutual labels:  hacktoberfest, elasticsearch
Exceptionless
Exceptionless server and jobs
Stars: ✭ 2,107 (+796.6%)
Mutual labels:  hacktoberfest, elasticsearch
Elasticsearch Comrade
Elasticsearch admin panel built for ops and monitoring
Stars: ✭ 214 (-8.94%)
Mutual labels:  hacktoberfest, elasticsearch
Rhino3dm
Libraries based on OpenNURBS with a RhinoCommon style
Stars: ✭ 232 (-1.28%)
Mutual labels:  hacktoberfest
Eland
Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
Stars: ✭ 235 (+0%)
Mutual labels:  elasticsearch
Docker Starter
🏗️ A skeleton to start a new web project with PHP, Docker and Invoke
Stars: ✭ 233 (-0.85%)
Mutual labels:  hacktoberfest
Doc2pen
An open source project aimed at making your student life easier!
Stars: ✭ 226 (-3.83%)
Mutual labels:  hacktoberfest
Training
🐝 A fast, easy and collaborative open source image annotation tool for teams and individuals.
Stars: ✭ 2,615 (+1012.77%)
Mutual labels:  hacktoberfest

esbulk

Fast parallel command line bulk loading utility for elasticsearch. Data is read from a newline delimited JSON file or stdin and indexed into elasticsearch in bulk and in parallel. The shortest command would be:

$ esbulk -index my-index-name < file.ldj
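
For orientation, the Elasticsearch bulk API expects a newline-delimited body in which every document is preceded by an action line. The Python sketch below builds such a body from line-delimited JSON. It only illustrates the request format and is not esbulk's actual code; the helper name `bulk_body` and its parameters are made up for this example, and the document type used by older Elasticsearch versions is left out.

```python
import json

def bulk_body(lines, index, id_field=None):
    """Build a bulk request body from JSON lines (illustrative helper)."""
    out = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        doc = json.loads(line)
        action = {"index": {"_index": index}}
        if id_field is not None:
            # Reuse a value from the document as its ID.
            action["index"]["_id"] = str(doc[id_field])
        out.append(json.dumps(action))  # action line
        out.append(json.dumps(doc))     # source line
    # The bulk API requires the body to end with a newline.
    return "\n".join(out) + "\n"
```

Each call produces one batch; a tool like esbulk would send many such batches to the `_bulk` endpoint.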

Caveat: If indexing pressure on the bulk API is too high (dozens or hundreds of parallel workers, large batch sizes, depending on your setup), esbulk will halt and report an error:

$ esbulk -index my-index-name -w 100 file.ldj
2017/01/02 16:25:25 error during bulk operation, try less workers (lower -w value) or
                    increase thread_pool.bulk.queue_size in your nodes

Please note that, in such a case, some documents are indexed and some are not. Your index will be in an inconsistent state, since there is no transactional bracket around the indexing process.
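
One way to detect such a partial import is to compare the number of non-empty input lines with the document count reported by Elasticsearch's `_count` endpoint. A minimal Python sketch, assuming a server at http://localhost:9200, an index that was empty before the run, and autogenerated IDs (with -id, duplicate IDs would collapse into one document). The function names are hypothetical.

```python
import json
import urllib.request

def count_lines(path):
    """Count non-empty lines in a line-delimited JSON file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())

def count_docs(index, server="http://localhost:9200"):
    """Ask Elasticsearch how many documents an index holds."""
    with urllib.request.urlopen(f"{server}/{index}/_count") as resp:
        return json.load(resp)["count"]

def missing_docs(expected, indexed):
    """Number of documents that did not make it into the index."""
    return max(expected - indexed, 0)
```

For example, `missing_docs(count_lines("file.ldj"), count_docs("my-index-name"))` should be 0 after a clean run. The count may lag behind until the index has been refreshed.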

However, using the defaults (parallelism: number of cores) on a single-node setup will just work. For larger clusters, increase the number of workers until you see full CPU utilization. Beyond that point, more workers won't buy any more speed.

Project Status: Active. The project has reached a stable, usable state and is being actively developed.

Installation

$ go get github.com/miku/esbulk/cmd/esbulk

For deb or rpm packages, see: https://github.com/miku/esbulk/releases

intenthq also made a Docker image available at intenthq/esbulk-docker (thanks @albertpastrana), #25.

Run:

$ docker run -it --rm intenthq/esbulk-docker esbulk -v
0.5.1

Since 0.5.2 (May 2019), a Dockerfile is included in the repo. It uses a multi-stage build and a FROM scratch base, which allows for a lightweight 7.85MB image.

$ git clone https://github.com/miku/esbulk.git
$ cd esbulk
$ make image # use make rmi to cleanup
$ docker run -it --rm esbulk:0.5.2 -v
0.5.2

Or, via hub/cloud:

$ docker run -it --rm tirtir/esbulk -v
0.5.2

Usage

$ esbulk -h
Usage of esbulk:
  -0    set the number of replicas to 0 during indexing
  -cpuprofile string
        write cpu profile to file
  -id string
        name of field to use as id field, by default ids are autogenerated
  -index string
        index name
  -mapping string
        mapping string or filename to apply before indexing
  -memprofile string
        write heap profile to file
  -purge
        purge any existing index before indexing
  -r string
        Refresh interval after import (default "1s")
  -server value
        elasticsearch server, this works with https as well
  -size int
        bulk batch size (default 1000)
  -skipbroken
        skips broken json lines
  -type string
        elasticsearch doc type (default "default")
  -u string
        http basic auth username:password, like curl -u
  -v    prints current program version
  -verbose
        output basic progress
  -w int
        number of workers to use (default 4)
  -z    unzip gz'd file on the fly
  -p string
        pipeline to use to preprocess documents

To index a JSON file that contains one document per line, just run:

$ esbulk -index example file.ldj

Where file.ldj is line delimited JSON, like:

{"name": "esbulk", "version": "0.2.4"}
{"name": "estab", "version": "0.1.3"}
...
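
If your data does not yet exist in this format, it is easy to produce. The following Python sketch writes a list of dictionaries as line-delimited JSON; the function name `write_ldj` and the output filename are placeholders for this example.

```python
import json

def write_ldj(docs, path):
    """Write an iterable of dicts as line-delimited JSON, one document per line."""
    with open(path, "w", encoding="utf-8") as f:
        for doc in docs:
            # json.dumps escapes newlines inside strings, so every
            # document stays on a single line.
            f.write(json.dumps(doc) + "\n")

if __name__ == "__main__":
    write_ldj(
        [
            {"name": "esbulk", "version": "0.2.4"},
            {"name": "estab", "version": "0.1.3"},
        ],
        "file.ldj",
    )
```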

By default, esbulk uses as many parallel workers as there are cores. To tweak the indexing process, adjust the -size and -w parameters.
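
To make the roles of the two parameters concrete, here is a rough Python sketch of the general pattern: lines are grouped into batches of a fixed size, and a pool of workers sends the batches concurrently. This is not how esbulk is implemented (esbulk is written in Go); `send` is a hypothetical callable that takes one batch and returns the number of documents it handled.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def batches(lines, size=1000):
    """Yield lists of up to `size` items (the role of -size)."""
    it = iter(lines)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def index_parallel(lines, send, size=1000, workers=4):
    """Send batches via `send` using `workers` threads (the role of -w)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Note: Executor.map collects all batches up front, so this
        # simple version holds the whole input in memory.
        return sum(pool.map(send, batches(lines, size)))
```

Larger batches mean fewer requests with bigger payloads, while more workers mean more requests in flight at once; both raise the pressure on the bulk queue.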

You can index from gzipped files as well, using the -z flag:

$ esbulk -z -index example file.ldj.gz
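
To inspect such a file before indexing, it can be read in the same streaming fashion. A small Python sketch that picks the opener by file extension; that choice is an assumption of this example (esbulk itself relies on the -z flag), and the function name `read_ldj` is made up.

```python
import gzip
import json

def read_ldj(path):
    """Yield documents from a line-delimited JSON file, gzipped or plain."""
    opener = gzip.open if path.endswith(".gz") else open
    # Text mode ("rt") decompresses and decodes on the fly, line by line.
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```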

Starting with 0.3.7 the preferred method to set a non-default server hostport is via -server, e.g.

$ esbulk -server https://0.0.0.0:9201

This way, you can use https as well, which was not possible before. Options -host and -port are gone as of esbulk 0.5.0.

Reusing IDs

Since version 0.3.8: If you want to reuse IDs from your documents in elasticsearch, you can specify the ID field via the -id flag:

$ cat file.json
{"x": "doc-1", "db": "mysql"}
{"x": "doc-2", "db": "mongo"}

Here, we would like to reuse the ID from field x.

$ esbulk -id x -index throwaway -verbose file.json
...

$ curl -s http://localhost:9200/throwaway/_search | jq
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "throwaway",
        "_type": "default",
        "_id": "doc-2",
        "_score": 1,
        "_source": {
          "x": "doc-2",
          "db": "mongo"
        }
      },
      {
        "_index": "throwaway",
        "_type": "default",
        "_id": "doc-1",
        "_score": 1,
        "_source": {
          "x": "doc-1",
          "db": "mysql"
        }
      }
    ]
  }
}

Nested ID fields

Version 0.4.3 adds support for nested ID fields:

$ cat fixtures/pr-8-1.json
{"a": {"b": 1}}
{"a": {"b": 2}}
{"a": {"b": 3}}

$ esbulk -index throwaway -id a.b < fixtures/pr-8-1.json
...

Concatenated ID

Version 0.4.3 adds support for IDs that are the concatenation of multiple fields:

$ cat fixtures/pr-8-2.json
{"a": {"b": 1}, "c": "a"}
{"a": {"b": 2}, "c": "b"}
{"a": {"b": 3}, "c": "c"}

$ esbulk -index throwaway -id a.b,c < fixtures/pr-8-2.json
...

      {
        "_index": "xxx",
        "_type": "default",
        "_id": "1a",
        "_score": 1,
        "_source": {
          "a": {
            "b": 1
          },
          "c": "a"
        }
      },
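
The ID in the output above ("1a") is consistent with looking up each listed field, following dots into nested objects, and joining the values without a separator. The Python sketch below mimics that behaviour as inferred from the examples; it is not esbulk's code, and how esbulk treats missing fields or unusual value types is not covered here. The function names are hypothetical.

```python
def lookup(doc, path):
    """Follow a dotted path such as "a.b" into nested dicts."""
    value = doc
    for key in path.split("."):
        value = value[key]
    return value

def make_id(doc, spec):
    """Build an ID from a comma-separated list of (possibly nested) fields."""
    return "".join(str(lookup(doc, field)) for field in spec.split(","))

# Mirrors the example document shown above.
print(make_id({"a": {"b": 1}, "c": "a"}, "a.b,c"))  # prints 1a
```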

Using X-Pack

Since 0.4.2: support for secured elasticsearch nodes:

$ esbulk -u elastic:changeme -index myindex file.ldj
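
The -u value uses the same username:password form as curl. With HTTP basic auth, such a pair travels as a base64-encoded Authorization header. The Python sketch below builds that header, which can be handy for checking credentials against a secured node with your own scripts. It shows the standard scheme only; the helper name is made up, and nothing here is taken from esbulk's source.

```python
import base64

def basic_auth_header(userpass):
    """Turn "username:password" into an HTTP basic auth header."""
    token = base64.b64encode(userpass.encode("utf-8")).decode("ascii")
    return {"Authorization": "Basic " + token}
```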

A similar project has been started for solr, called solrbulk.

Contributors

and others.

Measurements

$ csvlook -I measurements.csv
| es    | esbulk | docs      | avg_b | nodes | cores | total_heap_gb | t_s   | docs_per_s | repl |
|-------|--------|-----------|-------|-------|-------|---------------|-------|------------|------|
| 6.1.2 | 0.4.8  | 138000000 | 2000  | 1     | 32    |  64           |  6420 |  22100     | 1    |
| 6.1.2 | 0.4.8  | 138000000 | 2000  | 1     |  8    |  30           | 27360 |   5100     | 1    |
| 6.1.2 | 0.4.8  |   1000000 | 2000  | 1     |  4    |   1           |   300 |   3300     | 1    |
| 6.1.2 | 0.4.8  |  10000000 |   26  | 1     |  4    |   8           |   122 |  81000     | 1    |
| 6.1.2 | 0.4.8  |  10000000 |   26  | 1     | 32    |  64           |    32 | 307000     | 1    |
| 6.2.3 | 0.4.10 | 142944530 | 2000  | 2     | 64    | 128           | 26253 |   5444     | 1    |
| 6.2.3 | 0.4.10 | 142944530 | 2000  | 2     | 64    | 128           | 11113 |  12831     | 0    |
| 6.2.3 | 0.4.13 |  15000000 | 6000  | 2     | 64    | 128           |  2460 |   6400     | 0    |

Why not add a row?
