mocobeta / lucene-postings-format

Licence: other
At-a-glance overview diagrams of Apache Lucene's default PostingsFormat (inverted index binary format).

Lucene PostingsFormat At-a-Glance

English / 日本語

Last updated: 2022-05-28 (commit f5c1f11)

This is an at-a-glance overview of Apache Lucene's default PostingsFormat, which encodes inverted indices into a low-level binary format. It is written for advanced users (and myself).

NOTE: The contents are tied to a specific revision (commit), not to any Lucene release version. They are updated on an irregular basis, and very fine details are often omitted. Please refer to the official documentation or the source code (the latter is best) for more detailed and/or up-to-date information.

Overview

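Basically, PostingsFormat (i.e., the inverted index) consists of two components: the term dictionary and the postings list.

  1. The term dictionary consists of:
       • Term Metadata (.tmd file)
       • Term Dictionary (.tim file)
       • Term Index (.tip file)
  2. The postings list consists of:
       • Frequencies and Skip data (.doc file)
       • Positions (.pos file)
       • Payloads and Offsets (.pay file)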

Term Metadata

Overview of the term metadata format (.tmd file).

+--------+-----------+------------+------------+-----+-----------------+----------------+--------+
| Header | NumFields | FieldStats | FieldStats | ... | TermIndexLength | TermDictLength | Footer |
+--------+-----------+------------+------------+-----+-----------------+----------------+--------+
                     |------- ( # of fields ) -------|
  • Header (CodecHeader)
  • NumFields (VInt) : Number of fields in this index.
  • FieldStats : The field level statistics and metadata.
  • TermIndexLength (Long) : Whole length of Term Index.
  • TermDictLength (Long) : Whole length of Term Dictionary.
  • Footer (CodecFooter)
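
Many fields in this and the following sections are typed VInt or VLong. As a reminder, Lucene's variable-length integers store seven bits per byte, low-order group first, with the high bit of each byte signalling that another byte follows. The following is a minimal Python sketch of a decoder; the function name `read_vint` is hypothetical and is not part of any Lucene API.

```python
def read_vint(buf, pos=0):
    """Decode one VInt/VLong starting at buf[pos]; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift  # the low 7 bits carry the payload
        if b & 0x80 == 0:             # high bit clear: this was the last byte
            return value, pos
        shift += 7


# 300 is stored as the two bytes 0xAC 0x02.
value, next_pos = read_vint(bytes([0xAC, 0x02]))
```

Returning the next read position alongside the value makes it easy to chain reads over consecutive fields such as NumFields followed by FieldStats.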

FieldStats

+-------------+----------+----------------+----------+-------------------+--
| FieldNumber | NumTerms | RootCodeLength | RootCode | SumTotalTermFreq? |
+-------------+----------+----------------+----------+-------------------+--

--+------------+----------+---------------+---------+---------------+---------+--
  | SumDocFreq | DocCount | MinTermLength | MinTerm | MaxTermLength | MaxTerm |
--+------------+----------+---------------+---------+---------------+---------+--

--+--------------+-----------+-------------+
  | IndexStartFP | FSTHeader | FSTMetadata |
--+--------------+-----------+-------------+
  • FieldNumber (VInt) : Field number.
  • NumTerms (VLong) : Number of unique terms for the field.
  • RootCodeLength (VInt): The length of following RootCode.
  • RootCode (Bytes) :
  • SumTotalTermFreq (VLong): Sum of total term frequencies; omitted when only documents are indexed.
  • SumDocFreq (VLong) : Sum of document frequencies for terms in the field.
  • DocCount (VInt) : Number of documents which have the field.
  • MinTermLength (VInt) : The length of following MinTerm.
  • MinTerm (Bytes): Minimum (first) term for the field.
  • MaxTermLength (VInt): The length of following MaxTerm.
  • MaxTerm (Bytes): Maximum (last) term for the field.
  • IndexStartFP (VLong): The file pointer to Term Index for the field.
  • FSTHeader (CodecHeader)
  • FSTMetadata

Term Dictionary

Overview of the term dictionary format. (.tim file)

+--------+-----------+-----------+-----------+-----+--------+
| Header | NodeBlock | NodeBlock | NodeBlock | ... | Footer |
+--------+-----------+-----------+-----------+-----+--------+
         |------------ ( # of blocks ) ------------|
  • Header (CodecHeader)
  • NodeBlock : Block-packed terms data.
  • Footer (CodecFooter)

NodeBlock

+-------------+--------------+--------+---------------------+---------------
| BlockHeader | SuffixLength | Suffix | SuffixLengthsLength | SuffixLengths 
+-------------+--------------+--------+---------------------+---------------

--+-------------+-----------+-----------+-----------+-----
  | StatsLength | TermStats | TermStats | TermStats | ... 
--+-------------+-----------+-----------+-----------+-----
                |------------- ( # of terms ) ------------

--+----------------+--------------+--------------+--------------+-----+
  | MetadataLength | TermMetadata | TermMetadata | TermMetadata | ... |
--+----------------+--------------+--------------+--------------+-----+
--|                |----------------- ( # of terms ) -----------------|
  • BlockHeader (VInt) : Block metadata (e.g. number of entries the block contains).
  • SuffixLength (VLong) : The length of following Suffix.
  • Suffix (Bytes): Concatenated suffixes of all the terms that are packed in the block.
  • SuffixLengthsLength (VInt) : The length of following SuffixLengths.
  • SuffixLengths (Byte or Bytes) : The suffix lengths of all the terms that are packed in the block.
  • StatsLength (VInt) : Total length of following TermStats.
  • TermStats : The term level statistics.
  • MetadataLength (VInt) : Total length of following TermMetadata.
  • TermMetadata
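
To illustrate how Suffix and SuffixLengths relate, the sketch below splits a concatenated suffix buffer back into per-term byte strings. It assumes the lengths have already been decoded into a list of integers and that the terms in a block share a common prefix supplied by the caller. `split_suffixes` and its parameters are hypothetical names, and details such as sub-block entries or compression of the suffix bytes are ignored.

```python
def split_suffixes(suffix_bytes, suffix_lengths, prefix=b""):
    """Rebuild the terms of one block from its concatenated suffixes."""
    terms = []
    offset = 0
    for length in suffix_lengths:
        # Each term is the (assumed) shared prefix plus its own suffix slice.
        terms.append(prefix + suffix_bytes[offset:offset + length])
        offset += length
    if offset != len(suffix_bytes):
        raise ValueError("suffix lengths do not cover the suffix buffer")
    return terms


terms = split_suffixes(b"pleplyricot", [3, 3, 5], prefix=b"ap")
```

Storing the lengths separately from the bytes keeps both arrays homogeneous, which is presumably what makes them easy to compress independently.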

TermStats

+-----------------+---------+----------------+
| SingletonCount? | DocFreq | TotalTermFreq? |
+-----------------+---------+----------------+
  • SingletonCount (VInt) : Number of singleton terms (having DF==1, TTF==1) preceding the term.
  • DocFreq (VInt) : Document frequency of the term.
  • TotalTermFreq (VLong) : Total term frequency of the term; omitted when only documents are indexed.

TermMetadata

+------------+-----------------+-------------+-------------+---------------------
| DocStartFP | SingletonDocID? | PosStartFP? | PayStartFP? | LastPosBlockOffset?
+------------+-----------------+-------------+-------------+---------------------

--+-------------+
  | SkipOffset? |
--+-------------+
  • DocStartFP (VLong) : The file pointer to the start of the doc ids for this term in .doc file.
  • SingletonDocID (VInt) : Document id if there is only one posting for the term.
  • PosStartFP (VLong) : The file pointer to the start of the positions for this term in .pos file; omitted when positions are not indexed.
  • PayStartFP (VLong) : The file pointer to the start of the payloads for this term in .pay file; omitted when neither offsets nor payloads are indexed.
  • LastPosBlockOffset (VLong) : The file offset for the last position for the last block; omitted when there are fewer positions than the block size.
  • SkipOffset (VLong) : The file offset of the start of the skip list, relative to DocStartFP; omitted when there are fewer docs than the block size.

Term Index

Overview of the term index file format. (.tip file)

+--------+----------+----------+----------+-----+--------+
| Header | FSTIndex | FSTIndex | FSTIndex | ... | Footer |
+--------+----------+----------+----------+-----+--------+
         |---------- ( # of fields ) -----------|
  • Header (CodecHeader)
  • FSTIndex (Bytes) : Binary encoded FST index for a field.
  • Footer (CodecFooter)

Frequencies and Skip data

Overview of the document and term frequencies file format. (.doc file)

+--------+------------------------+------------------------+-----+--------+
| Header | (TermFreqs, SkipData?) | (TermFreqs, SkipData?) | ... | Footer |
+--------+------------------------+------------------------+-----+--------+
         |------------------- ( # of terms ) --------------------|
  • Header (CodecHeader)
  • TermFreqs : The document id deltas and term frequencies.
  • SkipData : The skip list data for faster retrieval.
  • Footer (CodecFooter)

TermFreqs

+-------------------------------+-------------------------------+-----
| (PackedDocDelta, PackedFreq?) | (PackedDocDelta, PackedFreq?) | ... 
+-------------------------------+-------------------------------+-----
|------------------------- ( # of doc blocks ) -----------------------

--+-------------------+-------------------+-------------------+-----+
  | (DocDelta, Freq?) | (DocDelta, Freq?) | (DocDelta, Freq?) | ... |
--+-------------------+-------------------+-------------------+-----+
--|------------------- (# of remaining docs ) ----------------------|

  • PackedDocDelta (PackedInts) : Block compressed document id deltas in each block.
  • PackedFreq (PackedInts) : Block compressed term frequency deltas in each block; omitted when term frequencies are not indexed.
  • DocDelta (VInt) : Document id delta.
  • Freq (VInt) : Term frequency delta; omitted when term frequencies are not indexed.
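
As a rough illustration of the layout above, the sketch below computes how a term's postings split into full packed blocks plus a VInt-encoded remainder, and turns document id deltas back into absolute ids. The block size of 128 is an assumption (this overview does not state it), and `split_blocks` and `decode_doc_ids` are hypothetical helper names.

```python
from itertools import accumulate

BLOCK_SIZE = 128  # assumed block size; not stated in this overview


def split_blocks(doc_freq, block_size=BLOCK_SIZE):
    """Return (# of full packed doc blocks, # of remaining docs)."""
    return divmod(doc_freq, block_size)


def decode_doc_ids(doc_deltas, base=0):
    """Prefix-sum the deltas to recover absolute document ids."""
    return [base + total for total in accumulate(doc_deltas)]


num_blocks, num_remaining = split_blocks(300)
doc_ids = decode_doc_ids([3, 2, 10])
```

Delta coding keeps the stored numbers small for dense postings, which is what makes both the packed blocks and the VInt tail compact.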

SkipData

+------------------------------+------------------------------+-----+-----------+
| (SkipLevelLength, SkipLevel) | (SkipLevelLength, SkipLevel) | ... | SkipDatum |
+------------------------------+------------------------------+-----+-----------+
|------------------- ( # of skip levels - 1 ) ----------------------|
  • SkipLevelLength (VLong) : The length of following SkipLevel.
  • SkipLevel
  • SkipDatum : Skip datum for level 0.

SkipLevel

+-----------+-------------------------------+-------------------------------+-----+
| SkipDatum | (SkipDatum, ChildSkipLevelFP) | (SkipDatum, ChildSkipLevelFP) | ... | 
+-----------+-------------------------------+-------------------------------+-----+
            |--- ( maximum number of skip level for the number of docs seen ) ----|
  • SkipDatum
  • ChildSkipLevelFP (VLong) : The file pointer of its direct child skip level data.

SkipDatum

+--------------+----------------+-----------------+---------------------+---------------------+-----------------+--
| SkipDocDelta | SkipDocFPDelta | SkipPosFPDelta? | SkipPosBlockOffset? | SkipPayBlockLength? | SkipPayFPDelta? |
+--------------+----------------+-----------------+---------------------+---------------------+-----------------+--

--+--------------+--------+--------+--------+-----+
  | ImpactLength | Impact | Impact | Impact | ... |
--+--------------+--------+--------+--------+-----+
                 |------- ( # of impacts ) -------|
  • SkipDocDelta (VInt) : The last document id delta in each block.
  • SkipDocFPDelta (VLong) : The file pointer of each block in .doc file.
  • SkipPosFPDelta (VLong) : The file pointer of each related block in .pos file; omitted when positions are not indexed.
  • SkipPosBlockOffset (VInt) : The offset value inside the related block in .pos file; omitted when positions are not indexed.
  • SkipPayBlockLength (VInt) : The sum of the payload lengths of each related block in .pay file; omitted when payloads are not indexed.
  • SkipPayFPDelta (VLong) : The file pointer of each related block in .pay file; omitted when offsets/payloads are not indexed.
  • ImpactLength (VInt) : The total length of following Impacts.
  • Impact : The competitive frequency and norm data for faster top-k retrieval.

Impact

+----------------------+-----------------------+
| CompetitiveFreqDelta | CompetitiveNormDelta? |
+----------------------+-----------------------+
  • CompetitiveFreqDelta (VInt)
  • CompetitiveNormDelta (ZLong)
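
CompetitiveNormDelta is typed ZLong, i.e. a zig-zag encoded variable-length long, which keeps small negative values short. The sketch below shows only the zig-zag mapping for 64-bit values; the subsequent variable-length byte encoding is omitted, and the function names are hypothetical.

```python
MASK64 = (1 << 64) - 1


def zigzag_encode(n):
    """Map a signed 64-bit integer to an unsigned one: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return ((n << 1) ^ (n >> 63)) & MASK64


def zigzag_decode(z):
    """Inverse of zigzag_encode for values in the unsigned 64-bit range."""
    return (z >> 1) ^ -(z & 1)


encoded = [zigzag_encode(n) for n in (0, -1, 1, -2)]
```

Because the sign ends up in the lowest bit, values close to zero on either side need only a single byte once variable-length encoding is applied.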

Positions

Overview of the positions file format. (.pos file)

+--------+---------------+---------------+---------------+-----+--------+
| Header | TermPositions | TermPositions | TermPositions | ... | Footer |
+--------+---------------+---------------+---------------+-----+--------+
         |------------------- ( # of terms ) ------------------|
  • Header (CodecHeader)
  • TermPositions : The term positions data, which consists of a fixed-size block-compressed part and a residual part.
  • Footer (CodecFooter)

TermPositions

+----------------+----------------+-----+------------------+------------------+-----+
| PackedPosDelta | PackedPosDelta | ... | ResidualPosDelta | ResidualPosDelta | ... |
+----------------+----------------+-----+------------------+------------------+-----+
|---------- ( # of pos blocks ) --------|------- ( # of remaining positions ) ------|
  • PackedPosDelta (PackedInts) : Block compressed position deltas in each block.
  • ResidualPosDelta : The residual position deltas that are encoded as VInt.

ResidualPosDelta

+----------+----------------+----------+--------------+---------------+
| PosDelta | PayloadLength? | Payload? | OffsetDelta? | OffsetLength? |
+----------+----------------+----------+--------------+---------------+
  • PosDelta (VInt) : Position delta.
  • PayloadLength (VInt) : The length of following Payload; omitted when payloads are not indexed.
  • Payload (Bytes) : Payload data; omitted when payloads are not indexed.
  • OffsetDelta (VInt) : Start offset delta; omitted when offsets are not indexed.
  • OffsetLength (VInt) : The length of the offset (end offset - start offset); omitted when offsets are not indexed.
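
The delta fields above can be turned back into absolute values with running sums. The sketch below does this for the residual entries of a single document, assuming both the position and the start offset begin at zero for each document (an assumption not spelled out in this overview) and that payloads are not indexed. `decode_residual` is a hypothetical name.

```python
def decode_residual(entries):
    """Decode (pos_delta, offset_delta, offset_length) tuples for one document.

    Returns a list of (position, start_offset, end_offset) tuples.
    """
    position = 0
    start_offset = 0
    decoded = []
    for pos_delta, offset_delta, offset_length in entries:
        position += pos_delta         # positions are stored as deltas
        start_offset += offset_delta  # start offsets are stored as deltas too
        # OffsetLength is (end offset - start offset), so add it back.
        decoded.append((position, start_offset, start_offset + offset_length))
    return decoded


decoded = decode_residual([(2, 4, 5), (3, 12, 6)])
```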

Payloads and Offsets

Overview of the payloads and offsets file format. (.pay file)

+--------+-------------------------------+-------------------------------+-----+--------+
| Header | (TermPayloads?, TermOffsets?) | (TermPayloads?, TermOffsets?) | ... | Footer |
+--------+-------------------------------+-------------------------------+-----+--------+
         |----------------------- ( # of terms ) ------------------------------|
  • Header (CodecHeader)
  • TermPayloads : Payload data; omitted when payloads are not indexed.
  • TermOffsets : Offsets data; omitted when offsets are not indexed.
  • Footer (CodecFooter)

TermPayloads

+--------------------------------------------------------+--------------------------------------------------------+-----+
| (PackedPayloadLengths, SumPayloadLengths, PayloadData) | (PackedPayloadLengths, SumPayloadLengths, PayloadData) | ... |
+--------------------------------------------------------+--------------------------------------------------------+-----+
|------------------------------------------- ( # of pos blocks ) -------------------------------------------------------|
  • PackedPayloadLengths (PackedInts) : Block compressed payload lengths in each block.
  • SumPayloadLengths (VInt) : The sum of payload lengths in each block.
  • PayloadData (Bytes) : Concatenated payload data in each block.

TermOffsets

+------------------------------------------+------------------------------------------+-----+
| (PackedOffsetDelta, PackedOffsetLengths) | (PackedOffsetDelta, PackedOffsetLengths) | ... |
+------------------------------------------+------------------------------------------+-----+
|----------------------------------- ( # of pos blocks ) -----------------------------------|
  • PackedOffsetDelta (PackedInts) : Block compressed start offset deltas in each block.
  • PackedOffsetLengths (PackedInts) : Block compressed offset lengths (end offset - start offset) in each block.