
lucidworks / Hadoop Solr

License: apache-2.0
Code to index HDFS to Solr using MapReduce

Programming Languages

java
68154 projects - #9 most used programming language

Projects that are alternatives of or similar to Hadoop Solr

pulse
phData Pulse application log aggregation and monitoring
Stars: ✭ 13 (-74.51%)
Mutual labels:  hadoop, solr
wasp
WASP is a framework to build complex real time big data applications. It relies on a kind of Kappa/Lambda architecture mainly leveraging Kafka and Spark. If you need to ingest huge amount of heterogeneous data and analyze them through complex pipelines, this is the framework for you.
Stars: ✭ 19 (-62.75%)
Mutual labels:  hadoop, solr
Pdf
Programming e-books and books, covering C, C#, Docker, Elasticsearch, Git, Hadoop, HeadFirst, Java, Javascript, jvm, Kafka, Linux, Maven, MongoDB, MyBatis, MySQL, Netty, Nginx, Python, RabbitMQ, Redis, Scala, Solr, Spark, Spring, SpringBoot, SpringCloud, TCPIP, Tomcat, Zookeeper, artificial intelligence, big data, concurrent programming, databases, data mining, new interview questions, architecture design, algorithms, computer science, design patterns, software testing, refactoring and optimization, and more.
Stars: ✭ 12,009 (+23447.06%)
Mutual labels:  hadoop, solr
bigdata-fun
A complete (distributed) BigData stack, running in containers
Stars: ✭ 14 (-72.55%)
Mutual labels:  hadoop, solr
Nagios Plugins
450+ AWS, Hadoop, Cloud, Kafka, Docker, Elasticsearch, RabbitMQ, Redis, HBase, Solr, Cassandra, ZooKeeper, HDFS, Yarn, Hive, Presto, Drill, Impala, Consul, Spark, Jenkins, Travis CI, Git, MySQL, Linux, DNS, Whois, SSL Certs, Yum Security Updates, Kubernetes, Cloudera etc...
Stars: ✭ 1,000 (+1860.78%)
Mutual labels:  hadoop, solr
Devops Python Tools
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Function, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
Stars: ✭ 406 (+696.08%)
Mutual labels:  hadoop, solr
Dockerfiles
50+ DockerHub public images for Docker & Kubernetes - Hadoop, Kafka, ZooKeeper, HBase, Cassandra, Solr, SolrCloud, Presto, Apache Drill, Nifi, Spark, Consul, Riak, TeamCity and DevOps tools built on the major Linux distros: Alpine, CentOS, Debian, Fedora, Ubuntu
Stars: ✭ 847 (+1560.78%)
Mutual labels:  hadoop, solr
Storm Camel Example
Real-time analysis and visualization with Storm-AMQ-Camel-Websockets-Highcharts integration.
Stars: ✭ 28 (-45.1%)
Mutual labels:  hadoop
Weblogsanalysissystem
A big data platform for analyzing web access logs
Stars: ✭ 37 (-27.45%)
Mutual labels:  hadoop
Cdc Kafka Hadoop
MySQL to NoSQL real time dataflow
Stars: ✭ 13 (-74.51%)
Mutual labels:  hadoop
Wukong
An ORM Client library for SolrCloud http://wukong.readthedocs.io/en/latest/
Stars: ✭ 10 (-80.39%)
Mutual labels:  solr
Data Algorithms Book
MapReduce, Spark, Java, and Scala for Data Algorithms Book
Stars: ✭ 949 (+1760.78%)
Mutual labels:  hadoop
Shutterstock Heatmap Toolkit
Shutterstock's interactive heatmap toolkit powered by heatmap.js and Solr
Stars: ✭ 37 (-27.45%)
Mutual labels:  solr
Interview Questions Collection
Interview questions organized by knowledge area, including C++, Java, Hadoop, machine learning, and more.
Stars: ✭ 21 (-58.82%)
Mutual labels:  hadoop
Moosefs
MooseFS – Open Source, Petabyte, Fault-Tolerant, Highly Performing, Scalable Network Distributed File System (Software-Defined Storage)
Stars: ✭ 1,025 (+1909.8%)
Mutual labels:  hadoop
Bigdata Interview
🎯 🌟 [Big data interview questions] A collection of big-data-related interview questions gathered from the web, together with the author's own answer summaries. Currently covers Hadoop/Hive/Spark/Flink/HBase/Kafka/Zookeeper.
Stars: ✭ 857 (+1580.39%)
Mutual labels:  hadoop
Base
https://www.researchgate.net/profile/Rajah_Iyer
Stars: ✭ 48 (-5.88%)
Mutual labels:  hadoop
Ik Analyzer Solr
ik-analyzer for solr 7.x-8.x
Stars: ✭ 1,017 (+1894.12%)
Mutual labels:  solr
Learning Spark
Learn Spark from scratch; big data learning.
Stars: ✭ 37 (-27.45%)
Mutual labels:  hadoop
Jsr203 Hadoop
A Java NIO file system provider for HDFS
Stars: ✭ 35 (-31.37%)
Mutual labels:  hadoop

:packageUser: solr
:connectorVersion: 4.0.0

= Hadoop Job Jar

This project includes tools to build a Hadoop job jar that can index documents from HDFS to Solr.

== Contents

  • <<Features>>
  • <<Build Jars>>
  • <<How it Works>>
  • <<Ingest Mappers>>
  • <<Using the Job Jar>>
    ** <<Job Jar Arguments>>
    ** <<Example Job>>
  • <<How to Contribute>>

== Features

  • Uses MapReduce to process large content sets.
  • Supports several types of content including SequenceFiles and CSV files.
  • Fully SolrCloud compatible.

Tested versions:

  • Solr 6.x and higher
  • Hadoop 2.4 and higher, 3.1 and higher

See the wiki page https://github.com/lucidworks/hadoop-solr/wiki/TestedDistros[Tested Distros] for more details on Hadoop distributions tested with this job jar.

// tag::build[]
== Build Jars

WARNING: You must use Java 8 or higher to build the .jar.

This repository uses Gradle to build the job jar. You do not need to install Gradle before starting the build; the build uses the bundled Gradle wrapper (./gradlew).

The compiled job jar relies on code in a separate repository that is included with hadoop-solr as a submodule named solr-hadoop-common.

After cloning hadoop-solr, but before building the job jar, you must first initialize the solr-hadoop-common submodule by running the following commands from the top level of your hadoop-solr clone:

  • git submodule init to initialize your local configuration file for the submodule.

  • git submodule update to fetch all the data from the remote repository and check out the appropriate commit listed in the super-project.

  • If building from a branch, make sure that solr-hadoop-common points to the correct SHA (see https://github.com/blog/2104-working-with-submodules for more details):

[source]
hadoop-solr $ git checkout <branch>
hadoop-solr $ cd solr-hadoop-common
hadoop-solr/solr-hadoop-common $ git checkout <SHA>
hadoop-solr/solr-hadoop-common $ cd ..

Once the submodule has been pulled, you can build the job jar with this command:

./gradlew clean shadowJar --info

This will produce one jar, named {packageUser}-hadoop-job-{connectorVersion}.jar, located in the solr-hadoop-job/build/libs directory.

If you rebuild the job jar at a later time (to take advantage of updates), remember to update the submodule with git submodule update.

NOTE: The Gradle task will also produce {packageUser}-hadoop-tika-{connectorVersion}.jar, located in the solr-hadoop-common/solr-hadoop-tika/build/libs directory. This jar is needed if you use the -Dlw.tika.process parameter described below. Using Tika is recommended when using the DirectoryIngestMapper or the ZipIngestMapper.

=== Troubleshooting

If GitHub + SSH is not configured, the following exception will be thrown:

[source]

Cloning into 'solr-hadoop-common'...
Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
fatal: clone of 'git@github.com:LucidWorks/solr-hadoop-common.git' into submodule path 'solr-hadoop-common' failed
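
GitHub's standard SSH connectivity test (documented by GitHub, not specific to this project) is a quick way to confirm whether SSH access is configured before retrying the submodule update:

[source,bash]
# Prints a greeting with your GitHub username when the SSH key is set up correctly
ssh -T git@github.com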

You can use the https://help.github.com/articles/generating-an-ssh-key/[Generating an SSH Key] tutorial to fix the problem.
// end::build[]

// tag::how-it-works[]
== How it Works

The job jar runs a series of MapReduce-enabled jobs to convert raw content into documents for indexing to Solr.

WARNING: You must use the job jar with a user that has permissions to write to the hadoop.tmp.dir. The /tmp directory in HDFS must also be writable.
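
A simple way to confirm that the /tmp directory in HDFS is writable for the user submitting the job is to create and remove a scratch directory (the directory name here is arbitrary):

[source,bash]
# Create and then delete a test directory under /tmp in HDFS to verify write access
hdfs dfs -mkdir -p /tmp/hadoop-solr-write-test
hdfs dfs -rm -r /tmp/hadoop-solr-write-test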

The Hadoop job jar works in three stages designed to take in raw content and output results to Solr. These stages are:

. Create one or more SequenceFiles from the raw content. This is done in one of two ways:
.. If the source files are available in a shared Hadoop filesystem, prepare a list of source files and their locations as a SequenceFile. The raw contents of each file are not processed until step 2.
.. If the source files are not available, prepare a list of source files and the raw content. This process is done sequentially and can take a significant amount of time if there are a large number of documents and/or if they are very large.
. Run a MapReduce job to extract text and metadata from the raw content.
** This process uses ingest mappers to parse documents and prepare them to be added to a Solr index. The section <<Ingest Mappers>> below provides a list of available mappers.
** Apache Tika can also be used for additional document parsing and metadata extraction.
. Run a MapReduce job to send the extracted content from HDFS to Solr using the SolrJ client. This implementation works with SolrJ's CloudServer Java client, which is aware of where Solr is running via ZooKeeper.

NOTE: Incremental indexing, where only changes to a document or directory are processed on successive job jar runs, is not supported. All three steps will be completed each time the job jar is run, regardless of whether the original content has changed.

The first step of the process converts the input content into a SequenceFile. In order to do this, the entire contents of that file must be read into memory so that it can be written out as a LWDocument in the SequenceFile. Thus, you should be careful to ensure that the system does not load into memory a file that is larger than the Java heap size of the process.
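
As a rough pre-flight check (the input path below is illustrative), you can compare the largest input files against the heap available to the map tasks; the heap itself is controlled by standard Hadoop settings rather than by this job jar:

[source,bash]
# Show per-file sizes of the input directory in HDFS in human-readable units
hdfs dfs -du -h /data/CSV
# If needed, the map task heap can be raised with a standard Hadoop system argument,
# e.g. -Dmapreduce.map.java.opts=-Xmx2g, supplied with the job's other -D arguments.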

// tag::ingest-mappers[]
== Ingest Mappers

Ingest mappers in the job jar parse documents and prepare them for indexing to Solr.

There are several available ingest mappers:

  • CSVIngestMapper
  • DirectoryIngestMapper
  • GrokIngestMapper
  • RegexIngestMapper
  • SequenceFileIngestMapper
  • SolrXMLIngestMapper
  • XMLIngestMapper
  • WarcIngestMapper
  • ZipIngestMapper

The ingest mapper is added to the job arguments with the -cls parameter. However, many mappers require additional arguments. Please refer to the wiki page https://github.com/lucidworks/hadoop-solr/wiki/IngestMappers[Ingest Mapper Arguments] for each mapper's required and optional arguments.
// end::ingest-mappers[]
// end::how-it-works[]
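
For instance, combining the Tika note from the build section with the DirectoryIngestMapper might look like the sketch below; the collection name, input path, and Solr URL are placeholders, the value for -Dlw.tika.process is assumed, and each mapper's exact arguments are listed on the wiki page above:

[source,bash,subs="verbatim,attributes"]
# -cls selects the ingest mapper; -Dlw.tika.process=true (assumed value) enables Tika parsing,
# which requires the {packageUser}-hadoop-tika jar produced by the build
bin/hadoop jar /path/to/{packageUser}-hadoop-job-{connectorVersion}.jar \
  com.lucidworks.hadoop.ingest.IngestJob \
  -Dlw.tika.process=true \
  -cls com.lucidworks.hadoop.ingest.DirectoryIngestMapper \
  -c my_collection -i /data/docs \
  -of com.lucidworks.hadoop.io.LWMapRedOutputFormat \
  -s http://localhost:8983/solr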

// tag::how-to-use[]
== Using the Job Jar

The job jar allows you to index many different types of content stored in HDFS to Solr. It uses MapReduce to leverage the scaling qualities of http://hadoop.apache.org[Apache Hadoop] while indexing content to Solr.

To use the job jar, you will need to initiate a job in your Hadoop cluster (using the hadoop jar command). Additional parameters (arguments) will be required. These arguments define the location of your data, how to parse your content, and the location of your Solr instance for indexing.

The job jar takes three types of arguments. These must be defined in the proper order, as shown below:

  • the main class
  • system and mapper-specific arguments
  • key-value pair arguments

These are discussed in more detail in the section <<Job Jar Arguments>> below.

[IMPORTANT]

The job jar can be run from any location, but requires a Hadoop client if used on a server where Hadoop (bin/hadoop) is not installed. A properly configured client allows the job jar to be submitted to Hadoop to run the job.

The specific client you need will vary depending on the Hadoop distribution vendor. Speak to your vendor for more information about how to download and configure a client for your distribution.
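
A quick way to confirm that a usable client is present on the machine submitting the job is the standard version check:

[source,bash]
# Prints the Hadoop client version, confirming that bin/hadoop is installed and on the PATH
hadoop version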

// tag::job-jar-args[]
=== Job Jar Arguments

The job jar arguments allow you to define the type of content in HDFS, choose the ingest mappers appropriate for that content, and set other job parameters as needed.

There are three main sections to the job jar arguments:

  • the main class
  • system and mapper-specific arguments
  • key-value pair arguments

WARNING: The arguments must be supplied in the above order.

The available arguments and parameters are described in the following sections.

// tag::main-class[]
==== Main Class

The main class must be specified. For all of the mappers available, it is always defined as com.lucidworks.hadoop.ingest.IngestJob.
// end::main-class[]

// tag::mapper-args[]
==== System and Mapper-specific Arguments

System or Mapper-specific arguments, defined with a pattern of -Dargument=value, are supplied after the class name. In many cases, the arguments chosen depend on the ingest mapper chosen. The ingest mapper will be defined later in the argument string.

The order of system-level or mapper-specific arguments does not matter, but they must be after the class name and before the key-value pair arguments.

For available system arguments, see https://github.com/lucidworks/hadoop-solr/wiki/SystemArguments[System Arguments].

For ingest mapper arguments, see https://github.com/lucidworks/hadoop-solr/wiki/IngestMappers[Ingest Mapper Arguments].

Other arguments not described in this repo's documentation (such as Hadoop-specific system arguments) can be supplied as needed and they will be added to the Hadoop configuration. These arguments should be defined with the -Dargument=value syntax.
// end::mapper-args[]

// tag::key-value-pairs[]
==== Key-Value Pair Arguments

Key-value pair arguments apply to the ingest job generally. These arguments are expressed as -argument value. They are the last arguments supplied before the jar name is defined.

For more information see https://github.com/lucidworks/hadoop-solr/wiki/KeyValuePairArguments[Key-Value Pair Arguments].
// end::key-value-pairs[]
// end::job-jar-args[]

// tag::example[]
=== Example Job

This is a simple job request to index a CSV file which demonstrates the order of the arguments:

[source,bash,subs="verbatim,attributes"]

bin/hadoop jar /path/to/{packageUser}-hadoop-job-{connectorVersion}.jar <1>
  com.lucidworks.hadoop.ingest.IngestJob <2>
  -Dlww.commit.on.close=true -DcsvDelimiter=| <3>
  -cls com.lucidworks.hadoop.ingest.CSVIngestMapper -c gettingstarted -i /data/CSV -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -s http://localhost:8888/solr <4>

We can summarize the proper order as follows:

<1> The Hadoop command to run a job. This includes the path to the job jar (as necessary). <2> The main ingest class. <3> Mapper arguments, which vary depending on the Mapper class chosen, in the format of -Dargument=value. <4> Key-value arguments, which include the ingest mapper, Solr collection name, and other parameters, in the format of -argument value. // end::example[] // end::how-to-use[]

// tag::contribute[]
== How to Contribute

. Fork this repo, i.e. <username|organization>/hadoop-solr, following the http://help.github.com/fork-a-repo/[fork a repo] tutorial. Then clone the forked repo on your local machine:
+
[source]
$ git clone https://github.com/<username|organization>/hadoop-solr.git

. Configure remotes with the https://help.github.com/articles/configuring-a-remote-for-a-fork/[configuring remotes] tutorial.

. Create a new branch:
+
[source]
$ git checkout -b new_branch
$ git push origin new_branch
+
Use the https://help.github.com/articles/creating-and-deleting-branches-within-your-repository/[creating branches] tutorial to create the branch from the GitHub UI if you prefer.

. Develop on the new_branch branch only; do not merge new_branch into your master. Commit changes to new_branch as often as you like:
+
[source]
$ git add <files>
$ git commit -m 'commit message'

. Push your changes to GitHub:
+
[source]
$ git push origin new_branch

. Repeat the commit and push steps until your development is complete.

. Before submitting a pull request, fetch upstream changes that were made by other contributors:
+
[source]
$ git fetch upstream

. And update master locally:
+
[source]
$ git checkout master
$ git pull upstream master

. Merge the master branch into new_branch in order to avoid conflicts:
+
[source]
$ git checkout new_branch
$ git merge master

. If conflicts happen, use the https://help.github.com/articles/resolving-a-merge-conflict-from-the-command-line/[resolving merge conflicts] tutorial to fix them.

. Push master changes to the new_branch branch:
+
[source]
$ git push origin new_branch

. Add jUnits, as appropriate, to test your changes.

. When all testing is done, use the https://help.github.com/articles/creating-a-pull-request/[create a pull request] tutorial to submit your change to the repo.

[NOTE]

Please be sure that your pull request sends only your changes, and no others. Check it using the command:

[source]
git diff new_branch upstream/master

// end::contribute[]
