A Codec is the component that controls how every part of a Lucene segment is written to and read from disk. Swapping the codec changes the binary format of the index without touching any search or analysis logic. Lucene discovers codec implementations through Java’s ServiceLoader SPI mechanism. The codec name is written into every segment file so Lucene can load the right implementation when the segment is opened later.
Built-in codecs
Lucene104Codec
The current default. Used automatically by every new IndexWriterConfig unless you override it. Combines Lucene104PostingsFormat, Lucene90DocValuesFormat, Lucene90StoredFieldsFormat, and more.
SimpleTextCodec
Writes every index structure as human-readable plain text. Extremely slow and large — intended only for debugging and understanding the index format.
Codec sub-formats
Codec is an abstract class that delegates each responsibility to a dedicated format object. You can extend FilterCodec to override only the sub-formats you care about.
| Method | Responsibility |
|---|---|
| postingsFormat() | Term dictionary and postings lists (doc IDs, positions, offsets, payloads) |
| docValuesFormat() | Per-document numeric, binary, sorted, and sorted-set doc values |
| storedFieldsFormat() | Stored field values retrieved at search time |
| termVectorsFormat() | Term vectors (per-document term/position/offset data) |
| normsFormat() | Per-field length normalization factors |
| liveDocsFormat() | Bitset of non-deleted documents within a segment |
| compoundFormat() | Optional bundling of segment files into a single .cfs compound file |
| pointsFormat() | BKD-tree encoded numeric and geo points |
| knnVectorsFormat() | HNSW-indexed dense float vectors for k-NN search |
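The delegation idea behind FilterCodec can be sketched without any Lucene classes. The snippet below is a simplified model, not Lucene's implementation: SketchCodec, FilterSketchCodec, CustomStoredFieldsCodec, and the format name MyStoredFieldsFormat are hypothetical, and sub-formats are represented by plain strings. It shows a wrapper that forwards every sub-format call to a delegate, so a subclass overrides only the one it wants to change.

```java
import java.util.Objects;

class FilterCodecSketch {

    /** Hypothetical stand-in for a codec: each method names the sub-format it would use. */
    interface SketchCodec {
        String postingsFormat();
        String docValuesFormat();
        String storedFieldsFormat();
    }

    /** Hypothetical default codec; the format names are the ones listed for Lucene104Codec. */
    static final class DefaultSketchCodec implements SketchCodec {
        public String postingsFormat() { return "Lucene104PostingsFormat"; }
        public String docValuesFormat() { return "Lucene90DocValuesFormat"; }
        public String storedFieldsFormat() { return "Lucene90StoredFieldsFormat"; }
    }

    /** Forwards every call to the delegate, so subclasses override only what they need. */
    static abstract class FilterSketchCodec implements SketchCodec {
        protected final SketchCodec delegate;

        protected FilterSketchCodec(SketchCodec delegate) {
            this.delegate = Objects.requireNonNull(delegate);
        }

        public String postingsFormat() { return delegate.postingsFormat(); }
        public String docValuesFormat() { return delegate.docValuesFormat(); }
        public String storedFieldsFormat() { return delegate.storedFieldsFormat(); }
    }

    /** Replaces only the stored fields format; everything else comes from the delegate. */
    static final class CustomStoredFieldsCodec extends FilterSketchCodec {
        CustomStoredFieldsCodec(SketchCodec delegate) {
            super(delegate);
        }

        @Override
        public String storedFieldsFormat() {
            return "MyStoredFieldsFormat"; // hypothetical custom format name
        }
    }

    public static void main(String[] args) {
        SketchCodec codec = new CustomStoredFieldsCodec(new DefaultSketchCodec());
        System.out.println("postings:      " + codec.postingsFormat());
        System.out.println("doc values:    " + codec.docValuesFormat());
        System.out.println("stored fields: " + codec.storedFieldsFormat());
    }
}
```

In this model only storedFieldsFormat() differs from the default; the other two calls fall through to the wrapped codec unchanged.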
Setting a codec on IndexWriterConfig
IndexWriterConfig.setCodec() accepts any Codec instance. The codec is applied to all new segments flushed or merged by that writer.
To trade retrieval speed for a better compression ratio, pass Mode.BEST_COMPRESSION to the Lucene104Codec constructor.
PerFieldPostingsFormat
The default Lucene104Codec uses PerFieldPostingsFormat internally to route each field to a specific PostingsFormat. This lets you use, for example, a memory-mapped format for a high-traffic field while keeping others on the default disk-based format.
Override getPostingsFormatForField in a Lucene104Codec subclass to apply per-field routing.
The format name written into the index must match a registered PostingsFormat implementation at read time. If you use a custom format, register it via Java SPI (META-INF/services/org.apache.lucene.codecs.PostingsFormat) in your JAR.
Similarly, override getDocValuesFormatForField to apply per-field doc values formats, and getKnnVectorsFormatForField to tune HNSW vector parameters per field.
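To illustrate why the recorded format name has to resolve at read time, here is a minimal JDK-only model of per-field routing. It is a sketch under stated assumptions rather than Lucene code: writeSegment, openSegment, routeField, and the format names FastTitleFormat and DefaultFormat are all hypothetical, and a Set of strings stands in for the implementations that SPI would discover.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class PerFieldRoutingSketch {

    // Write side: ask the router for a format name per field and record it,
    // mirroring how a segment remembers which format wrote each field.
    static Map<String, String> writeSegment(List<String> fields, Function<String, String> router) {
        Map<String, String> formatByField = new LinkedHashMap<>();
        for (String field : fields) {
            formatByField.put(field, router.apply(field));
        }
        return formatByField;
    }

    // Read side: every recorded name must resolve to a registered implementation.
    static void openSegment(Map<String, String> formatByField, Set<String> registeredFormats) {
        for (Map.Entry<String, String> entry : formatByField.entrySet()) {
            if (!registeredFormats.contains(entry.getValue())) {
                throw new IllegalArgumentException(
                    "No format named '" + entry.getValue()
                        + "' is registered (needed by field '" + entry.getKey() + "')");
            }
        }
    }

    // Hypothetical router: one field gets a special format, the rest use the default.
    static String routeField(String field) {
        return field.equals("title") ? "FastTitleFormat" : "DefaultFormat";
    }

    public static void main(String[] args) {
        Map<String, String> segment =
            writeSegment(List.of("title", "body"), PerFieldRoutingSketch::routeField);
        System.out.println("recorded formats: " + segment);

        // Succeeds: both recorded names are registered.
        openSegment(segment, Set.of("DefaultFormat", "FastTitleFormat"));

        // Fails: the custom format was used at write time but is missing at read time.
        try {
            openSegment(segment, Set.of("DefaultFormat"));
        } catch (IllegalArgumentException ex) {
            System.out.println("open failed: " + ex.getMessage());
        }
    }
}
```

The second openSegment call models a reader whose classpath lacks the JAR that registers the custom format, which is the situation the SPI registration step is meant to prevent.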
Writing a custom PostingsFormat
Extend PostingsFormat
Subclass org.apache.lucene.codecs.PostingsFormat and supply a unique name that will be written into the index.
Register via SPI
Create META-INF/services/org.apache.lucene.codecs.PostingsFormat in your JAR and add your fully-qualified class name on a line of its own.
Checking the default codec
You can inspect or change the process-wide default codec that new IndexWriterConfig instances receive by calling Codec.getDefault() and Codec.setDefault().