rfc8878xml2.original.xml   rfc8878.xml 
<?xml version="1.0" encoding="US-ASCII"?> <?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [ <!-- updated by Chris 05/06/20 -->
<!ENTITY RFC1952 PUBLIC "" "http://xml2rfc.tools.ietf.org/public/rfc/bibxml/re
ference.RFC.1952.xml">
]>
<rfc ipr="trust200902" category="info" obsoletes="8478" <!DOCTYPE rfc SYSTEM "rfc2629-xhtml.ent">
docName="draft-kucherawy-rfc8478bis-05">
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?> <rfc xmlns:xi="http://www.w3.org/2001/XInclude" ipr="trust200902"
obsoletes="8478" docName="draft-kucherawy-rfc8478bis-05" number="8878"
updates="" submissionType="IETF" category="info" consensus="true"
xml:lang="en" tocInclude="true" tocDepth="4" symRefs="true"
sortRefs="true" version="3">
<?rfc toc="yes" ?> <!-- xml2rfc v2v3 conversion 2.44.0 -->
<?rfc tocdepth="4" ?>
<?rfc symrefs="yes" ?>
<?rfc sortrefs="yes"?>
<?rfc compact="yes" ?>
<?rfc subcompact="no"?>
<front> <front>
<title abbrev="application/zstd">
Zstandard Compression and the application/zstd Media Type
</title>
<author initials="Y." surname="Collet"
fullname="Yann Collet">
<organization>
Facebook
</organization>
<address>
<postal>
<street>1 Hacker Way</street>
<city>Menlo Park</city>
<region>CA</region>
<code>94025</code>
<country>United States of America</country>
</postal>
<email>cyan@fb.com</email>
</address>
</author>
<author initials="M. S." surname="Kucherawy" <title abbrev="application/zstd">
fullname="Murray S. Kucherawy" Zstandard Compression and the 'application/zstd' Media Type
role="editor"> </title>
<seriesInfo name="RFC" value="8878"/>
<organization> <author initials="Y." surname="Collet" fullname="Yann Collet">
Facebook <organization>Facebook</organization>
</organization> <address>
<postal>
<address> <street>1 Hacker Way</street>
<postal> <city>Menlo Park</city>
<street>1 Hacker Way</street> <region>CA</region>
<city>Menlo Park</city> <code>94025</code>
<region>CA</region> <country>United States of America</country>
<code>94025</code> </postal>
<country>United States of America</country> <email>cyan@fb.com</email>
</postal> </address>
</author>
<email>msk@fb.com</email> <author initials="M." surname="Kucherawy" fullname="Murray S. Kucherawy" rol
</address> e="editor">
</author> <organization>Facebook</organization>
<address>
<date year="2020"/> <postal>
<street>1 Hacker Way</street>
<area>General</area> <city>Menlo Park</city>
<region>CA</region>
<keyword>compression</keyword> <code>94025</code>
<country>United States of America</country>
</postal>
<email>msk@fb.com</email>
</address>
</author>
<date year="2020" month="September" />
<area>General</area>
<keyword>compression</keyword>
<abstract>
<abstract> <t>Zstandard, or "zstd" (pronounced "zee standard"), is a lossless data
<t> Zstandard, or "zstd" (pronounced "zee standard"), is a compression mechanism. This document describes the mechanism and
data compression mechanism. This document describes the registers a media type, content encoding, and a structured syntax
mechanism and registers a media type and content suffix to be used when transporting zstd-compressed content via MIME.</t>
encoding to be used when transporting zstd-compressed
content via Multipurpose Internet Mail Extensions
(MIME). It also registers a corresponding media
type, content encoding, and structured syntax suffix. </t>
<t> Despite use of the word "standard" as part of its name, <t> Despite use of the word "standard" as part of Zstandard,
readers are advised that this document is not an readers are advised that this document is not an
Internet Standards Track specification; it is being Internet Standards Track specification; it is being
published for informational purposes only. </t> published for informational purposes only. </t>
<t> This document replaces and obsoletes RFC 8478. </t>
<t> This document replaces and obsoletes RFC 8478. </t> </abstract>
</abstract> </front>
<middle>
</front> <section anchor="intro" numbered="true" toc="default">
<name>Introduction</name>
<middle> <t> Zstandard, or "zstd" (pronounced "zee standard"), is a
<section anchor="intro" title="Introduction">
<t> Zstandard, or "zstd" (pronounced "zee standard"), is a
data compression mechanism, akin to gzip data compression mechanism, akin to gzip
<xref target="RFC1952"/>. </t> <xref target="RFC1952" format="default"/>. </t>
<t> Despite use of the word "standard" as part of its name,
<t> Despite use of the word "standard" as part of its name,
readers are advised that this document is not an readers are advised that this document is not an
Internet Standards Track specification; it is being Internet Standards Track specification; it is being
published for informational purposes only. </t> published for informational purposes only. </t>
<t>This document describes the Zstandard format. Also, to enable the
<t> This document describes the Zstandard format. Also, transport of a data object compressed with Zstandard, this document
to enable the transport of a data object compressed with registers a media type, content encoding, and structured syntax suffix
Zstandard, this document registers a media type that can be that can be used to identify such content when it is used in a
used to identify such content when it is used in a payload payload.</t>
encoded using Multipurpose Internet Mail Extensions </section>
(MIME). </t> <section anchor="definitions" numbered="true" toc="default">
</section> <name>Definitions</name>
<t>Some terms used elsewhere in this document are defined
<section anchor="definitions" title="Definitions"> here for clarity. </t>
<t> Some terms used elsewhere in this document are defined <dl newline="false" spacing="normal">
here for clarity. </t> <dt>uncompressed:</dt>
<dd>Describes an arbitrary
<t> <list style="hanging">
<t hangText="uncompressed:"> Describes an arbitrary
set of bytes in their original form, prior to being set of bytes in their original form, prior to being
subjected to compression. </t> subjected to compression. </dd>
<t hangText="compress, compression:"> The act of processing
a set of bytes via the compression mechanism
described here. </t>
<t hangText="compressed:"> Describes the result of passing <dt>compressed:</dt>
<dd>Describes the result of passing
a set of bytes through this mechanism. The original a set of bytes through this mechanism. The original
input has thus been compressed. </t> input has thus been compressed. </dd>
<dt>decompressed:</dt>
<t hangText="decompress, decompression:"> The act of <dd>Describes the result of passing
processing a set of bytes through the inverse
of the compression mechanism described here, in an
attempt to recover the original set of bytes prior
to compression. </t>
<t hangText="decompressed:"> Describes the result of passing
a set of bytes through the reverse of this mechanism. a set of bytes through the reverse of this mechanism.
When this is successful, the decompressed payload and When this is successful, the decompressed payload and
the uncompressed payload are indistinguishable. </t> the uncompressed payload are indistinguishable. </dd>
<dt>encode:</dt>
<t hangText="encode:"> The process of translating data <dd>The process of translating data
from one form to another; this may include compression from one form to another; this may include compression,
or it may refer to other translations done as part or it may refer to other translations done as part
of this specification. </t> of this specification. </dd>
<dt>decode:</dt>
<t hangText="decode:"> The reverse of "encode"; describes <dd>The reverse of "encode"; describes
a process of reversing a prior encoding to recover a process of reversing a prior encoding to recover
the original content. </t> the original content. </dd>
<dt>frame:</dt>
<t hangText="frame:"> Content compressed by Zstandard is <dd>Content compressed by Zstandard is
transformed into a Zstandard frame. Multiple frames transformed into a Zstandard frame. Multiple frames
can be appended into a single file or stream. A frame can be appended into a single file or stream. A frame
is completely independent, has a defined beginning is completely independent, has a defined beginning
and end, and has a set of parameters that tells the and end, and has a set of parameters that tells the
decoder how to decompress it. </t> decoder how to decompress it. </dd>
<dt>block:</dt>
<t hangText="block:"> A frame encapsulates one or multiple <dd>A frame encapsulates one or multiple
blocks. Each block contains arbitrary content, which blocks. Each block contains arbitrary content, which
is described by its header, and has a guaranteed is described by its header, and has a guaranteed
maximum content size that depends upon frame maximum content size that depends upon frame
parameters. Unlike frames, each block depends parameters. Unlike frames, each block depends
on previous blocks for proper decoding. However, each on previous blocks for proper decoding. However, each
block can be decompressed without waiting for its block can be decompressed without waiting for its
successor, allowing streaming operations. </t> successor, allowing streaming operations. </dd>
<dt>natural order:</dt>
<t hangText="natural order:"> A sequence or ordering of <dd>A sequence or ordering of
objects or values that is typical of that type of objects or values that is typical of that type of
object or value. A set of unique integers, for object or value. A set of unique integers, for
example, is in "natural order" if when progressing example, is in "natural order" if, when progressing
from one element in the set or sequence to the next, from one element in the set or sequence to the next,
there is never a decrease in value. </t> there is never a decrease in value. </dd>
</list> </t> </dl>
<t>The naming convention for identifiers within the
<t> The naming convention for identifiers within the
specification is Mixed_Case_With_Underscores. specification is Mixed_Case_With_Underscores.
Identifiers inside square brackets indicate that the Identifiers inside square brackets indicate that the
identifier is optional in the presented context. </t> identifier is optional in the presented context. </t>
</section> </section>
<section anchor="compression" numbered="true" toc="default">
<section anchor="compression" title="Compression Algorithm"> <name>Compression Algorithm</name>
<t> This section describes the Zstandard algorithm. </t> <t> This section describes the Zstandard algorithm. </t>
<t> The purpose of this document is to define a lossless <t> The purpose of this document is to define a lossless
compressed data format that is a) independent of the CPU compressed data format that is a) independent of the CPU
type, operating system, file system, and character set and type, operating system, file system, and character set and
b) is suitable for file compression and pipe and streaming b) suitable for file compression and pipe and streaming
compression, using the Zstandard algorithm. The text of compression, using the Zstandard algorithm. The text of
the specification assumes a basic background in the specification assumes a basic background in
programming at the level of bits and other primitive data programming at the level of bits and other primitive data
representations. </t> representations. </t>
<t> The data can be produced or consumed, even for an
<t> The data can be produced or consumed, even for an
arbitrarily long sequentially presented input data stream, arbitrarily long sequentially presented input data stream,
using only an a priori bounded amount of intermediate using only an a priori bounded amount of intermediate
storage, and hence can be used in data communications. storage; hence, it can be used in data communications.
The format uses the Zstandard compression method, and The format uses the Zstandard compression method, and
an optional xxHash-64 checksum method an optional xxHash-64 checksum method
<xref target="XXHASH"/>, for detection of data <xref target="XXHASH" format="default"/>, for detection of da ta
corruption. </t> corruption. </t>
<t> The data format defined by this specification does not
<t> The data format defined by this specification does not
attempt to allow random access to compressed data. </t> attempt to allow random access to compressed data. </t>
<t> Unless otherwise indicated below, a compliant compressor
<t> Unless otherwise indicated below, a compliant compressor
must produce data sets that conform to the specifications must produce data sets that conform to the specifications
presented here. However, it does not need to support all presented here. However, it does not need to support all
options. </t> options. </t>
<t> A compliant decompressor must be able to decompress at
<t> A compliant decompressor must be able to decompress at
least one working set of parameters that conforms to the least one working set of parameters that conforms to the
specifications presented here. It may also ignore specifications presented here. It may also ignore
informative fields, such as the checksum. Whenever it does informative fields, such as the checksum. Whenever it does
not support a parameter defined in the compressed stream, not support a parameter defined in the compressed stream,
it must produce a non-ambiguous error code and associated it must produce an unambiguous error code and associated
error message explaining which parameter is error message explaining which parameter is
unsupported. </t> unsupported. </t>
<t> This specification is intended for use by implementers
<t> This specification is intended for use by implementers
of software to compress data into Zstandard format and/or of software to compress data into Zstandard format and/or
decompress data from Zstandard format. The Zstandard decompress data from Zstandard format. The Zstandard
format is supported by an open source reference format is supported by an open-source reference
implementation, written in portable C, and available at implementation, written in portable C, and available at
<xref target="ZSTD"/>. </t> <xref target="ZSTD" format="default"/>. </t>
<section anchor="comp_frames" numbered="true" toc="default">
<section anchor="comp_frames" title="Frames"> <name>Frames</name>
<t> Zstandard compressed data is made up of one <t> Zstandard compressed data is made up of one
or more frames. Each frame is independent and can or more frames. Each frame is independent and can
be decompressed independently of other frames. The be decompressed independently of other frames. The
decompressed content of multiple concatenated decompressed content of multiple concatenated
frames is the concatenation of each frame's frames is the concatenation of each frame's
decompressed content. </t> decompressed content. </t>
<t> There are two frame formats defined for
<t> There are two frame formats defined for
Zstandard: Zstandard frames and skippable frames. Zstandard: Zstandard frames and skippable frames.
Zstandard frames contain compressed data, while Zstandard frames contain compressed data, while
skippable frames contain custom user metadata. </t> skippable frames contain custom user metadata. </t>
<section anchor="comp_zstd_frames" numbered="true" toc="default">
<section anchor="comp_zstd_frames" <name>Zstandard Frames</name>
title="Zstandard Frames"> <t> The structure of a single Zstandard frame
<t> The structure of a single Zstandard frame
is as follows: is as follows:
<figure><artwork> </t>
+--------------------+------------+
| Magic_Number | 4 bytes |
+--------------------+------------+
| Frame_Header | 2-14 bytes |
+--------------------+------------+
| Data_Block | n bytes |
+--------------------+------------+
| [More Data_Blocks] | |
+--------------------+------------+
| [Content_Checksum] | 4 bytes |
+--------------------+------------+
</artwork></figure> </t>
<t> <list style="hanging">
<t hangText="Magic_Number:">
4 bytes, little-endian
format. Value: 0xFD2FB528.
</t>
<t hangText="Frame_Header:">
2 to 14 bytes, detailed in
<xref target="comp_frame_hdr"/>.
</t>
<t hangText="Data_Block:">
Detailed in
<xref target="blocks"/>.
This is where
data appears.
</t>
<t hangText="Content_Checksum:">
An optional 32-bit checksum,
only present if
Content_Checksum_Flag is set.
The content checksum is the
result of the XXH64() hash
function
<xref target="XXHASH"/>
digesting the original
(decoded) data as input, and a
seed of zero. The low 4
bytes of the checksum are
stored in little-endian format.
</t>
</list> </t>
<t> The magic number was selected to be less
probable to find at the beginning of an
arbitrary file. It avoids trivial
patterns (0x00, 0xFF, repeated bytes,
increasing bytes, etc.), contains byte
values outside of ASCII range, and doesn't
map into UTF-8 space, all of which reduce
the likelihood of its appearance at the
top of a text file. </t>
<section anchor="comp_frame_hdr"
title="Frame Header">
<t> The frame header has a variable
size, with a minimum of 2 bytes
and up to 14 bytes depending on
optional parameters. The structure
of Frame_Header is as follows:
<figure><artwork> <table anchor="single-frame">
+-------------------------+-----------+ <name>The Structure of a Single Zstandard Frame</name>
| Frame_Header_Descriptor | 1 byte | <tbody>
+-------------------------+-----------+ <tr>
| [Window_Descriptor] | 0-1 byte | <td>Magic_Number</td>
+-------------------------+-----------+ <td>4 bytes</td>
| [Dictionary_ID] | 0-4 bytes | </tr>
+-------------------------+-----------+ <tr>
| [Frame_Content_Size] | 0-8 bytes | <td>Frame_Header</td>
+-------------------------+-----------+ <td>2-14 bytes</td>
</artwork></figure> </t> </tr>
<tr>
<td>Data_Block</td>
<td>n bytes</td>
</tr>
<tr>
<td>[More Data_Blocks]</td>
<td></td>
</tr>
<tr>
<td>[Content_Checksum]</td>
<td>4 bytes</td>
</tr>
</tbody>
</table>
<dl newline="false" spacing="normal">
<dt>Magic_Number:</dt>
<dd>
4 bytes, little-endian format. Value: 0xFD2FB528.
</dd>
<dt>Frame_Header:</dt>
<dd>
2 to 14 bytes, detailed in
<xref target="comp_frame_hdr" format="default"/>.
</dd>
<dt>Data_Block:</dt>
<dd>
Detailed in <xref target="blocks" format="default"/>.
This is where data appears.
</dd>
<dt>Content_Checksum:</dt>
<dd>
An optional 32-bit checksum,
only present if
Content_Checksum_Flag is set.
The content checksum is the
result of the XXH64() hash
function
<xref target="XXHASH" format="default"/>
digesting the original
(decoded) data as input, and a
seed of zero. The low 4
bytes of the checksum are
stored in little-endian format.
</dd>
</dl>
<t> The magic number was selected to be less
probable to find at the beginning of an
arbitrary file. It avoids trivial
patterns (0x00, 0xFF, repeated bytes,
increasing bytes, etc.), contains byte
values outside of the ASCII range, and doesn't
map into UTF-8 space, all of which reduce
the likelihood of its appearance at the
top of a text file. </t>
<section anchor="comp_frame_hdr" numbered="true" toc="default">
<name>Frame Header</name>
<t> The frame header has a variable
size, with a minimum of 2 bytes
up to a maximum of 14 bytes depending on
optional parameters. The structure
of Frame_Header is as follows:</t>
<section anchor="comp_frame_header_desc" <table anchor="frame-header">
title="Frame_Header_Descriptor"> <name>The Structure of Frame_Header</name>
<t> The first header's byte is <tbody>
called the <tr>
Frame_Header_Descriptor. <td>Frame_Header_Descriptor </td>
It describes <td>1 byte</td>
which other fields are </tr>
present. Decoding this <tr>
byte is enough to tell the <td>[Window_Descriptor]</td>
size of Frame_Header. <td>0-1 byte</td>
</tr>
<tr>
<td>[Dictionary_ID]</td>
<td>0-4 bytes</td>
</tr>
<tr>
<td>[Frame_Content_Size]</td>
<td>0-8 bytes</td>
</tr>
</tbody>
</table>
<figure><artwork> <section anchor="comp_frame_header_desc" numbered="true" toc="defaul
+------------+-------------------------+ t">
| Bit Number | Field Name | <name>Frame_Header_Descriptor</name>
+------------+-------------------------+ <t> The first header's byte is called the
| 7-6 | Frame_Content_Size_Flag | Frame_Header_Descriptor. It describes
+------------+-------------------------+ which other fields are present. Decoding this
| 5 | Single_Segment_Flag | byte is enough to tell the size of Frame_Header.
+------------+-------------------------+ </t>
| 4 | (unused) |
+------------+-------------------------+
| 3 | (reserved) |
+------------+-------------------------+
| 2 | Content_Checksum_Flag |
+------------+-------------------------+
| 1-0 | Dictionary_ID_Flag |
+------------+-------------------------+
</artwork></figure> </t>
<t> In this table, bit 7 is <table anchor="Frame-Header-Descriptor">
the highest bit, while bit <name>The Frame_Header_Descriptor</name>
0 is the lowest one. </t> <thead>
<tr>
<th>Bit Number</th>
<th>Field Name</th>
</tr>
</thead>
<tbody>
<tr>
<td>7-6</td>
<td>Frame_Content_Size_Flag </td>
</tr>
<tr>
<td>5</td>
<td>Single_Segment_Flag </td>
</tr>
<tr>
<td>4</td>
<td>(unused)</td>
</tr>
<tr>
<td>3</td>
<td>(reserved)</td>
</tr>
<tr>
<td>2</td>
<td>Content_Checksum_Flag</td>
</tr>
<tr>
<td>1-0</td>
<td>Dictionary_ID_Flag</td>
</tr>
<section title="Frame_Content_Siz </tbody>
e_Flag"> </table>
<t> <t> In <xref target="Frame-Header-Descriptor" />, bit 7 is
This is a 2-bit flag the highest bit, while bit
(equivalent to 0 is the lowest one. </t>
Frame_Header_Descriptor <section numbered="true" toc="default">
right-shifted 6 bits) <name>Frame_Content_Size_Flag</name>
specifying whether <t>
Frame_Content_Size This is a 2-bit flag (equivalent to Frame_Header_Descriptor
(the decompressed right-shifted 6 bits) specifying whether Frame_Content_Size
data size) is provided (the decompressed data size) is provided within the header.
within the header. Frame_Content_Size_Flag provides FCS_Field_Size, which is the
Frame_Content_Size_Flag number of bytes used by Frame_Content_Size according to
provides FCS_Field_Size, <xref target="frame-content-size-flag"/>:
which is the </t>
number of bytes
used by
Frame_Content_Size
according to
the following table:
<figure><artwork>
+-------------------------+--------+---+---+---+
| Frame_Content_Size_Flag | 0 | 1 | 2 | 3 |
+-------------------------+--------+---+---+---+
| FCS_Field_Size | 0 or 1 | 2 | 4 | 8 |
+-------------------------+--------+---+---+---+
</artwork></figu
re></t>
<t> <table anchor="frame-content-size-flag">
When <name>Frame_Content_Size_Flag Provides FCS_Field_Size</name>
Frame_Content_Size_Flag <tbody>
is 0, FCS_Field_Size <tr>
depends on <td>Frame_Content_Size_Flag</td>
Single_Segment_Flag: <td align="center">0</td>
If Single_Segment_Flag <td>1</td>
is set, FCS_Field_Size <td>2</td>
is 1. Otherwise, <td>3</td>
FCS_Field_Size is 0; </tr>
Frame_Content_Size <tr>
is not provided. <td>FCS_Field_Size</td>
</t> <td align="center">0 or 1</td>
</section> <td>2</td>
<td>4</td>
<td>8</td>
</tr>
</tbody>
</table>
<section title="Single_Segment_Fl <t>
ag"> When Frame_Content_Size_Flag
<t> is 0, FCS_Field_Size
depends on
Single_Segment_Flag:
if Single_Segment_Flag
is set, FCS_Field_Size
is 1. Otherwise,
FCS_Field_Size is 0;
Frame_Content_Size
is not provided.
</t>
</section>
<section numbered="true" toc="default">
<name>Single_Segment_Flag</name>
<t>
If this flag is If this flag is
set, data must set, data must
be regenerated be regenerated
within a single within a single
continuous continuous
memory segment. memory segment.
</t> </t>
<t>
<t>
In this case, In this case,
Window_Descriptor Window_Descriptor
byte is skipped, but byte is skipped, but
Frame_Content_Size Frame_Content_Size
is necessarily is necessarily
present. As a present. As a
consequence, consequence,
the decoder the decoder
must allocate a must allocate a
memory segment memory segment
of size equal of a size equal to
or larger than or larger than
Frame_Content_Size. Frame_Content_Size.
</t> </t>
<t>
<t>
In order to protect the In order to protect the
decoder from decoder from
unreasonable memory unreasonable memory
requirements, requirements,
a decoder is a decoder is
allowed to reject a allowed to reject a
compressed frame that compressed frame that
requests a memory size requests a memory size
beyond the decoder's beyond the decoder's
authorized range. authorized range.
</t> </t>
<t>
<t>
For broader For broader
compatibility, compatibility,
decoders are decoders are
recommended to recommended to
support memory support memory
sizes of at least 8 MB. sizes of at least 8 MB.
This is only a This is only a
recommendation; recommendation;
each decoder is each decoder is
free to support free to support
higher or lower higher or lower
limits, depending on limits, depending on
local limitations. local limitations.
</t> </t>
</section> </section>
<section numbered="true" toc="default">
<section title="Unused Bit"> <name>Unused Bit</name>
<t> <t>
A decoder compliant A decoder compliant
with this specification with this specification
version shall not version shall not
interpret this bit. interpret this bit.
It might It might
be used in a future be used in a future
version, to signal a version to signal a
property that is not property that is not
mandatory to properly mandatory to properly
decode the frame. decode the frame.
An encoder compliant An encoder compliant
with this specification with this specification
must set this bit to must set this bit to
zero. zero.
</t> </t>
</section> </section>
<section numbered="true" toc="default">
<section title="Reserved Bit"> <name>Reserved Bit</name>
<t> <t>
This bit is reserved This bit is reserved
for some future for some future
feature. Its value must feature. Its value must
be zero. A decoder be zero. A decoder
compliant with this compliant with this
specification version specification version
must ensure it is not must ensure it is not
set. This bit may be set. This bit may be
used in a future used in a future
revision, to signal a revision to signal a
feature that must be feature that must be
interpreted to decode interpreted to decode
the frame correctly. the frame correctly.
</t> </t>
</section> </section>
<section numbered="true" toc="default">
<section title="Content_Checksum_ <name>Content_Checksum_Flag</name>
Flag"> <t>
<t>
If this flag is set, a If this flag is set, a
32-bit 32-bit
Content_Checksum will Content_Checksum will
be present at the be present at the
frame's end. frame's end.
See the description of See the description of
Content_Checksum above. Content_Checksum above.
</t> </t>
</section> </section>
<section numbered="true" toc="default">
<section title="Dictionary_ID_Fla <name>Dictionary_ID_Flag</name>
g"> <t>
<t> This is a 2-bit flag
This is a 2-bit flag (= Frame_Header_Descriptor &amp; 0x3) indicating
(= Frame_Header_Descripto whether a dictionary ID
r &amp; 0x3) indicating is provided within the
whether a dictionary ID header. It also
is provided within the specifies the size of
header. It also this field as
specifies the size of DID_Field_Size:
this field as </t>
DID_Field_Size:
<figure><artwork>
+--------------------+---+---+---+---+
| Dictionary_ID_Flag | 0 | 1 | 2 | 3 |
+--------------------+---+---+---+---+
| DID_Field_Size | 0 | 1 | 2 | 4 |
+--------------------+---+---+---+---+
</artwork></figu
re></t>
</section>
</section>
<section anchor="comp_window_descr"
title="Window Descriptor">
<t>
This provides guarantees about
the minimum
memory buffer required to
decompress a frame. This
information is important for
decoders to allocate enough
memory.
</t>
<t>
The Window_Descriptor byte is
optional. When
Single_Segment_Flag is set,
Window_Descriptor is not
present. In this case,
Window_Size is
Frame_Content_Size, which can
be any value from 0 to
2^64-1 bytes (16 ExaBytes).
<figure><artwork> <table anchor="Dictionary-ID-Flag">
+------------+----------+----------+ <name>Dictionary_ID_Flag</name>
| Bit Number | 7-3 | 2-0 | <tbody>
+------------+----------+----------+ <tr>
| Field Name | Exponent | Mantissa | <td>Dictionary_ID_Flag</td>
+------------+----------+----------+ <td>0</td>
</artwork></figure></t> <td>1</td>
<td>2</td>
<td>3</td>
</tr>
<tr>
<td>DID_Field_Size</td>
<td>0</td>
<td>1</td>
<td>2</td>
<td>4</td>
</tr>
</tbody>
</table>
</section>
</section>
<section anchor="comp_window_descr" numbered="true" toc="default">
<name>Window Descriptor</name>
<t>
This provides guarantees about
the minimum
memory buffer required to
decompress a frame. This
information is important for
decoders to allocate enough
memory.
</t>
<t>
The Window_Descriptor byte is
optional. When
Single_Segment_Flag is set,
Window_Descriptor is not
present. In this case,
Window_Size is
Frame_Content_Size, which can
be any value from 0 to
2<sup>64</sup> - 1 bytes (16 ExaBytes).
</t>
<t> <table anchor="window-descriptor">
The minimum memory buffer size <name>Window_Descriptor</name>
is called Window_Size. It is <tbody>
described by the following <tr>
formulae: <td>Bit Number</td>
<td align="center">7-3</td>
<td align="center">2-0</td>
</tr>
<tr>
<td>Field Name</td>
<td>Exponent</td>
<td>Mantissa</td>
</tr>
</tbody>
</table>
<figure><artwork> <t>
The minimum memory buffer size
is called Window_Size. It is
described by the following
formulas:
</t>
<artwork name="" type="" align="left" alt=""><![CDATA[
windowLog = 10 + Exponent; windowLog = 10 + Exponent;
windowBase = 1 <&lt; windowLog; windowBase = 1 <&lt; windowLog;
windowAdd = (windowBase / 8) * Mantissa; windowAdd = (windowBase / 8) * Mantissa;
Window_Size = windowBase + windowAdd; Window_Size = windowBase + windowAdd;
</artwork></figure></t> ]]></artwork>
<t>
<t>
The minimum Window_Size is The minimum Window_Size is
1 KB. The maximum Window_Size 1 KB. The maximum Window_Size
is (1&lt;&lt;41) + is (1&lt;&lt;41) +
7*(1&lt;&lt;38) bytes, 7*(1&lt;&lt;38) bytes,
which is 3.75 TB. which is 3.75 TB.
</t> </t>
<t> In general, larger
<t> In general, larger
Window_Size values tend to Window_Size values tend to
improve the compression ratio, bu t improve the compression ratio, bu t
at the cost of increased memory at the cost of increased memory
usage. </t> usage. </t>
<t>
<t>
To properly decode compressed To properly decode compressed
data, a decoder will need to data, a decoder will need to
allocate a buffer of at least allocate a buffer of at least
Window_Size bytes. Window_Size bytes.
</t> </t>
<t>
<t>
In order to protect In order to protect
decoders from unreasonable decoders from unreasonable
memory requirements, a decoder memory requirements, a decoder
is allowed to reject a is allowed to reject a
compressed frame that compressed frame that
requests a memory size beyond requests a memory size beyond the
decoder's authorized range. decoder's authorized range.
</t> </t>
<t>
<t>
For improved interoperability, For improved interoperability,
it's recommended for decoders it's recommended for decoders
to support values of to support values of
Window_Size up to 8 MB and Window_Size up to 8 MB and
for encoders not to generate for encoders not to generate
frames requiring a Window_Size frames requiring a Window_Size
larger than 8 MB. larger than 8 MB.
It's merely a recommendation It's merely a recommendation
though, and decoders are free though, and decoders are free
to support larger or lower to support higher or lower
limits, depending on local limits, depending on local
limitations. limitations.
</t> </t>
</section> </section>
<section anchor="comp_dictionary_id" numbered="true" toc="default">
<name>Dictionary_ID</name>
<section anchor="comp_dictionary_id" <t>
title="Dictionary_ID"> This is a field of variable size,
<t>
This is a variable size field,
which contains the ID of the which contains the ID of the
dictionary required to properly dictionary required to properly
decode the frame. This field decode the frame. This field
is optional. When it's not is optional. When it's not
present, it's up to the decoder present, it's up to the decoder
to know which dictionary to know which dictionary
to use. </t> to use. </t>
<t>
<t>
Dictionary_ID field size is Dictionary_ID field size is
provided by DID_Field_Size. provided by DID_Field_Size.
DID_Field_Size is directly DID_Field_Size is directly
derived from the value derived from the value
of Dictionary_ID_Flag. One byte of Dictionary_ID_Flag. One byte
can represent an ID 0-255; 2 can represent an ID 0-255; 2
bytes can represent an ID bytes can represent an ID
0-65535; 4 bytes can 0-65535; 4 bytes can
represent an ID 0-4294967295. represent an ID 0-4294967295.
Format is little-endian. Format is little-endian.
</t> </t>
<t>
<t>
It is permitted to represent a It is permitted to represent a
small ID (for example, 13) with small ID (for example, 13) with
a large 4-byte dictionary a large 4-byte dictionary
ID, even if it is less ID, even if it is less
efficient. efficient.
</t> </t>
<t>
<t> Within private environments,
Within private environments, any dictionary ID
any dictionary ID can be used. However, for
can be used. However, for frames and dictionaries
frames and dictionaries distributed in public space,
distributed in public space, Dictionary_ID must be
Dictionary_ID must be attributed carefully.
attributed carefully. The following
The following ranges are reserved for use
ranges are reserved for use only with dictionaries that
only with dictionaries that have been registered with
have been registered with IANA (see <xref target="iana_dict" format="default"/>):
IANA (see <xref target="iana_dict </t>
"/>): <dl newline="false" spacing="normal">
<dt>low range:</dt>
<list style="hanging"> <dd>
<?rfc subcompact="yes"?>
<t hangText="low range:">
&lt;= 32767 &lt;= 32767
</t> </dd>
<t hangText="high range:" <dt>high range:</dt>
> <dd>
>= (1 &lt;&lt; 31 &gt;= (1 &lt;&lt;
) 31)
</t> </dd>
<?rfc subcompact="no"?> </dl>
</list> </t> <t> Any other value for
<t> Any other value for
Dictionary_ID can be used Dictionary_ID can be used
by private arrangement by private arrangement
between participants. </t> between participants. </t>
<t> Any payload presented for
<t> Any payload presented for
decompression that decompression that
references an unregistered references an unregistered
reserved dictionary ID reserved dictionary ID
results in an error. </t> results in an error. </t>
</section> </section>
<section anchor="comp_frame_content_size" numbered="true" toc="defau
<section anchor="comp_frame_content_size" lt">
title="Frame Content Size">
<t>
This is the original
(uncompressed) size. This
information is optional.
Frame_Content_Size uses a
variable number of bytes,
provided by FCS_Field_Size.
FCS_Field_Size is provided by
the value of
Frame_Content_Size_Flag.
FCS_Field_Size can be equal to
0 (not present), 1, 2, 4, or
8 bytes.
<figure><artwork>
+----------------+--------------+
| FCS Field Size | Range |
+----------------+--------------+
| 0 | unknown |
+----------------+--------------+
| 1 | 0 - 255 |
+----------------+--------------+
| 2 | 256 - 65791 |
+----------------+--------------+
| 4 | 0 - 2^32 - 1 |
+----------------+--------------+
| 8 | 0 - 2^64 - 1 |
+----------------+--------------+
</artwork></figure></t>
<t>
Frame_Content_Size format is
little-endian. When
FCS_Field_Size is 1, 4, or 8
bytes, the value is read
directly. When FCS_Field_Size
is 2, the offset of 256 is
added. It's allowed to
represent a small size (for
example 18) using any
compatible variant.
</t>
</section>
</section>
<section anchor="blocks" title="Blocks">
<t> After Magic_Number and
Frame_Header, there are some number of
blocks. Each frame must have at least
1 block, but there is no upper
limit on the number of blocks per
frame. </t>
<t> The structure of a block is as
follows:
<figure><artwork>
+--------------+---------------+
| Block_Header | Block_Content |
+--------------+---------------+
| 3 bytes | n bytes |
+--------------+---------------+
</artwork></figure></t>
<t> Block_Header uses 3 bytes,
written using little-endian
convention. It contains three fields:
<figure><artwork>
+------------+------------+------------+
| Last_Block | Block_Type | Block_Size |
+------------+------------+------------+
| bit 0 | bits 1-2 | bits 3-23 |
+------------+------------+------------+
</artwork></figure></t>
<section title="Last_Block">
<t> The lowest bit (Last_Block)
signals whether
this block is the last one.
The frame will end after this
last block. It may be followed
by an optional Content_Checksum
(see
<xref target="comp_zstd_frames"/>
).
</t>
</section>
<section title="Block_Type">
<t> The next 2 bits represent
the Block_Type. There are four
block types:
<figure><artwork>
+-----------+------------------+
| Value | Block_Type |
+-----------+------------------+
| 0 | Raw_Block |
+-----------+------------------+
| 1 | RLE_Block |
+-----------+------------------+
| 2 | Compressed_Block |
+-----------+------------------+
| 3 | Reserved |
+-----------+------------------+
</artwork></figure></t>
<t> <list style="hanging">
<t hangText="Raw_Block:">
This is an
uncompressed block.
Block_Content contains
Block_Size bytes. </t>
<t hangText="RLE_Block:"> <name>Frame_Content_Size</name>
This is a single byte, <t>
repeated Block_Size This is the original
times. Block_Content (uncompressed) size. This
consists of a single information is optional.
byte. On the Frame_Content_Size uses a
decompression side, variable number of bytes,
this byte must be provided by FCS_Field_Size.
repeated Block_Size FCS_Field_Size is provided by
times. </t> the value of
Frame_Content_Size_Flag.
FCS_Field_Size can be equal to
0 (not present), 1, 2, 4, or
8 bytes.
</t>
<t hangText="Compressed_B <table anchor="Frame-Content-Size">
lock:"> <name>Frame_Content_Size</name>
This is a compressed <thead>
block as described in <tr>
<xref target="comp_blocks <th align="center">FCS Field Size</th>
"/>. <th>Range</th>
Block_Size is the </tr>
length of </thead>
Block_Content, namely <tbody>
the compressed data. <tr>
The decompressed size <td align="center">0</td>
is not known, but its <td>unknown</td>
maximum possible value </tr>
is guaranteed (see <tr>
below). </t> <td align="center">1</td>
<td>0 - 255</td>
</tr>
<tr>
<td align="center">2</td>
<td>256 - 65791</td>
</tr>
<tr>
<td align="center">4</td>
<td>0 - 2<sup>32</sup> - 1</td>
</tr>
<tr>
<td align="center">8</td>
<td>0 - 2<sup>64</sup> - 1</td>
</tr>
<t hangText="Reserved:"> </tbody>
This is not a block. </table>
This value cannot be
used with the current
specification. If such
a value is present,
it is considered to be
corrupt data, and a
compliant decoder must
reject it. </t>
</list> </t>
</section>
<section title="Block_Size"> <t>
<t> The upper 21 bits of Frame_Content_Size format is
Block_Header represent the little-endian. When
Block_Size. </t> FCS_Field_Size is 1, 4, or 8
bytes, the value is read
directly. When FCS_Field_Size
is 2, the offset of 256 is
added. It's allowed to
represent a small size (for
example, 18) using any
compatible variant.
</t>
</section>
</section>
<section anchor="blocks" numbered="true" toc="default">
<name>Blocks</name>
<t> After Magic_Number and
Frame_Header, there are some number of
blocks. Each frame must have at least
1 block, but there is no upper
limit on the number of blocks per
frame. </t>
<t> The structure of a block is as
follows:
</t>
<t> When Block_Type is <table anchor="block">
Compressed_Block or Raw_Block, <name>The Structure of a Block</name>
Block_Size is the size of <thead>
Block_Content (hence excluding <tr>
Block_Header). </t> <th align="center">Block_Header</th>
<th align="center">Block_Content</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">3 bytes</td>
<td align="center">n bytes</td>
</tr>
</tbody>
</table>
<t> Block_Header uses 3 bytes,
written using little-endian
convention. It contains three fields:
</t>
<t> When Block_Type is <table anchor="block-header">
RLE_Block, since Block_Content's <name>Block_Header</name>
size is always 1, Block_Size <thead>
represents the number of times <tr>
this byte must be repeated. </t> <th align="center">Last_Block</th>
<th align="center">Block_Type</th>
<th align="center">Block_Size</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">bit 0</td>
<td align="center">bits 1-2</td>
<td align="center">bits 3-23</td>
</tr>
</tbody>
</table>
<t> Block_Size is limited by <section numbered="true" toc="default">
Block_Maximum_Size (see below). <name>Last_Block</name>
</t> <t> The lowest bit (Last_Block)
</section> signals whether
this block is the last one.
The frame will end after this
last block. It may be followed
by an optional Content_Checksum
(see
<xref target="comp_zstd_frames" format="default"/>).
</t>
</section>
<section numbered="true" toc="default">
<name>Block_Type</name>
<t> The next 2 bits represent
the Block_Type. There are four
block types:
</t>
<section <table anchor="block-types">
title="Block_Content and Block_Ma <name>The Four Block Types</name>
ximum_Size"> <thead>
<t> The size of Block_Content <tr>
is limited by <th align="center">Value</th>
Block_Maximum_Size, which is <th>Block_Type</th>
the smallest of: </tr>
</thead>
<tbody>
<tr>
<td align="center">0</td>
<td align="center">Raw_Block</td>
</tr>
<tr>
<td align="center">1</td>
<td align="center">RLE_Block</td>
</tr>
<tr>
<td align="center">2</td>
<td align="center">Compressed_Block</td>
</tr>
<tr>
<td align="center">3</td>
<td align="center">Reserved</td>
</tr>
</tbody>
</table>
<list style="symbols"> <dl newline="false" spacing="normal">
<t> Window_Size </t> <dt>Raw_Block:</dt>
<t> 128 KB </t> <dd>
</list> </t> This is an
uncompressed block.
Block_Content contains
Block_Size bytes. </dd>
<dt>RLE_Block:</dt>
<dd>
This is a single byte,
repeated Block_Size
times. Block_Content
consists of a single
byte. On the
decompression side,
this byte must be
repeated Block_Size
times. </dd>
<dt>Compressed_Block:</dt>
<dd>
This is a compressed
block as described in
<xref target="comp_blocks" format="default"/>.
Block_Size is the
length of
Block_Content, namely
the compressed data.
The decompressed size
is not known, but its
maximum possible value
is guaranteed (see
below). </dd>
<dt>Reserved:</dt>
<dd>
This is not a block.
This value cannot be
used with the current
specification. If such
a value is present,
it is considered to be
corrupt data, and a
compliant decoder must
reject it. </dd>
</dl>
</section>
<section numbered="true" toc="default">
<t> Block_Maximum_Size is <name>Block_Size</name>
<t> The upper 21 bits of
Block_Header represent the
Block_Size. </t>
<t> When Block_Type is
Compressed_Block or Raw_Block,
Block_Size is the size of
Block_Content (hence excluding
Block_Header). </t>
<t> When Block_Type is
RLE_Block, since Block_Content's
size is always 1, Block_Size
represents the number of times
this byte must be repeated. </t>
<t> Block_Size is limited by
Block_Maximum_Size (see below).
</t>
</section>
<section numbered="true" toc="default">
<name>Block_Content and Block_Maximum_Size</name>
<t> The size of Block_Content
is limited by
Block_Maximum_Size, which is
the smallest of:
</t>
<ul spacing="normal">
<li> Window_Size </li>
<li> 128 KB </li>
</ul>
<t> Block_Maximum_Size is
constant for a given frame. constant for a given frame.
This maximum is applicable to This maximum is applicable to
both the decompressed size both the decompressed size
and the compressed size of any and the compressed size of any
block in the frame. </t> block in the frame. </t>
<t> The reasoning for this
<t> The reasoning for this
limit is that a decoder can limit is that a decoder can
read this information at the read this information at the
beginning of a frame and use beginning of a frame and use
it to allocate buffers. it to allocate buffers.
The guarantees on the size of The guarantees on the size of
blocks ensure that the buffers blocks ensure that the buffers
will be large enough for any will be large enough for any
following block of the valid following block of the valid
frame. </t> frame. </t>
<t> If the compressed block
<t> If the compressed block is larger than the uncompressed,
is larger than the uncompressed
sending the uncompressed sending the uncompressed
block (i.e., a Raw_Block) is block (i.e., a Raw_Block) is
recommended instead. </t> recommended instead. </t>
</section> </section>
</section> </section>
<section anchor="comp_blocks" <section anchor="comp_blocks" numbered="true" toc="default">
title="Compressed Blocks"> <name>Compressed Blocks</name>
<t> To decompress a compressed block, <t> To decompress a compressed block,
the compressed size must be provided the compressed size must be provided
from the Block_Size field within from the Block_Size field within
Block_Header. </t> Block_Header. </t>
<t> A compressed block consists of two
<t> A compressed block consists of two sections: a Literals_Section
sections: a Literals Section (<xref target="comp_literals" format="def
(<xref target="comp_literals"/>) and ault"/>) and
a Sequences_Section a Sequences_Section
(<xref target="comp_sequences"/>). (<xref target="comp_sequences" format="de fault"/>).
The results of the two sections are The results of the two sections are
then combined to produce the then combined to produce the
decompressed data in Sequence decompressed data in Sequence
Execution Execution
(<xref target="comp_sequence_exec"/>). (<xref target="comp_sequence_exec" format
</t> ="default"/>).
</t>
<t> To decode a compressed block, the <t> To decode a compressed block, the
following elements are necessary: following elements are necessary:
<list style="symbols"> </t>
<t> Previous decoded data, up <ul spacing="normal">
<li> Previous decoded data, up
to a distance of Window_Size, to a distance of Window_Size,
or the beginning of the Frame, or the beginning of the Frame,
whichever is smaller. whichever is smaller.
Single_Segment_Flag Single_Segment_Flag
will be set in the latter will be set in the latter
case. </t> case. </li>
<li> List of "recent offsets"
<t> List of "recent offsets"
from the previous from the previous
Compressed_Block. Compressed_Block.
</t> </li>
<li> The previous Huffman tree,
<t> The previous Huffman tree,
required by required by
Treeless_Literals_Block type. Treeless_Literals_Block type.
</t> </li>
<t> Previous Finite State Entropy (FSE) decoding <li> Previous Finite State Entropy (FSE) decoding
tables, required by tables, required by
Repeat_Mode, for each symbol Repeat_Mode, for each symbol
type (literals lengths, type (literals lengths,
match lengths, offsets). </t> match lengths, offsets). </li>
</list> </t> </ul>
<t> Note that decoding tables are not
<t> Note that decoding tables are not
always from the previous always from the previous
Compressed_Block: Compressed_Block:
<list style="symbols"> </t>
<t> Every decoding table can <ul spacing="normal">
come from a dictionary. </t> <li> Every decoding table can
come from a dictionary. </li>
<t> The Huffman tree comes from <li> The Huffman tree comes from
the previous the previous
Compressed_Literals_Block. </ Compressed_Literals_Block. </
t> li>
</list></t> </ul>
<section anchor="comp_literals" numbered="true" toc="default">
<section anchor="comp_literals" <name>Literals_Section_Header</name>
title="Literals_Section_Header"> <t> All literals are regrouped
<t> All literals are regrouped
in the first part of the in the first part of the
block. They can be decoded block. They can be decoded
first and then copied during first and then copied during
Sequence Execution (see Sequence Execution (see
<xref target="comp_sequence_exec" />), <xref target="comp_sequence_exec" format="default"/>),
or they can be decoded on the or they can be decoded on the
flow during Sequence flow during Sequence
Execution. </t> Execution. </t>
<t> Literals can be stored
<t> Literals can be stored
uncompressed or compressed uncompressed or compressed
using Huffman prefix codes. using Huffman prefix codes.
When compressed, an optional When compressed, an optional
tree description can be tree description can be
present, followed by 1 or present, followed by 1 or
4 streams. 4 streams.
<figure><artwork> </t>
+----------------------------+ <table anchor="compressed-literals">
| Literals_Section_Header | <name>Compressed Literals</name>
+----------------------------+ <tbody>
| [Huffman_Tree_Description] | <tr>
+----------------------------+ <td align="center">Literals_Section_Header</td>
| [Jump_Table] | </tr>
+----------------------------+ <tr>
| Stream_1 | <td align="center">[Huffman_Tree_Description]</td>
+----------------------------+ </tr>
| [Stream_2] | <tr>
+----------------------------+ <td align="center">[Jump_Table]</td>
| [Stream_3] | </tr>
+----------------------------+ <tr>
| [Stream_4] | <td align="center">Stream_1</td>
+----------------------------+ </tr>
</artwork></figure></t> <tr>
<td align="center">[Stream_2]</td>
<section title="Literals_Section_ </tr>
Header"> <tr>
<t> <td align="center">[Stream_3]</td>
This field describes </tr>
how literals are <tr>
packed. It's a <td align="center">[Stream_4]</td>
byte-aligned </tr>
variable-size bit field, </tbody>
ranging from 1 to </table>
5 bytes, using <section numbered="true" toc="default">
little-endian <name>Literals_Section_Header</name>
convention. <t>
This field describes
<figure><artwork> how literals are
+---------------------+-----------+ packed. It's a
| Literals_Block_Type | 2 bits | byte-aligned
+---------------------+-----------+ variable-size bit field,
| Size_Format | 1-2 bits | ranging from 1 to
+---------------------+-----------+ 5 bytes, using
| Regenerated_Size | 5-20 bits | little-endian
+---------------------+-----------+ convention.
| [Compressed_Size] | 0-18 bits | </t>
+---------------------+-----------+
</artwork></figure><
/t>
<t> In this
representation, bits at
the top are the lowest
bits. </t>
<t> The
Literals_Block_Type
field uses the two
lowest bits of the
first byte, describing
four different block
types:
<figure><artwork>
+---------------------------+-------+
| Literals_Block_Type | Value |
+---------------------------+-------+
| Raw_Literals_Block | 0 |
+---------------------------+-------+
| RLE_Literals_Block | 1 |
+---------------------------+-------+
| Compressed_Literals_Block | 2 |
+---------------------------+-------+
| Treeless_Literals_Block | 3 |
+---------------------------+-------+
</artwork></figure><
/t>
<t> <list style="hanging"
>
<t hangText="Raw_Lite
rals_Block:">
Literals are
stored
uncompressed.
Literals_Section_
Content
is
Regenerated_Size.
</t>
<t hangText="RLE_Lite
rals_Block:">
Literals
consist of a
single-byte
value repeated
Regenerated_Size
times.
Literals_Section_
Content
is 1. </t>
<t hangText="Compress
ed_Literals_Block:">
This is a
standard
Huffman-compresse
d
block, starting
with a Huffman
tree
description.
See details
below. Literals_
Section_Content
is
Compressed_Size.
</t>
<t hangText="Treeless
_Literals_Block:">
This is a
Huffman-compresse
d
block, using the
Huffman tree
from the previous
Compressed_Litera
ls_Block,
or a dictionary
if there is no pr
evious
Huffman-compresse
d
literals block.
Huffman_Tree_Desc
ription
will be
skipped. Note
that if this
mode is
triggered
without any
previous
Huffman-table
in the frame
(or
dictionary, per
<xref target="com
p_dict"/>),
it should be
treated as data
corruption.
Literals_Section_
Content
is Compressed_Siz
e.
</t>
</list> </t>
<t> The Size_Format is
divided into two
families:
<list style="symbols"
>
<t>
For
Raw_Literals_Bloc
k
and
RLE_Literals_Bloc
k,
it's only
necessary to
decode
Regenerated_Size.
There is no
Compressed_Size
field. </t>
<t>
For
Compressed_Block
and
Treeless_Literals
_Block,
it's required
to decode both
Compressed_Size
and
Regenerated_Size
(the
decompressed
size). It's
also necessary
to decode the
number of
streams
(1 or 4). </t>
</list> </t>
<t> For values
spanning several bytes,
the convention is
little endian. </t>
<t> Size_Format for
Raw_Literals_Block
and RLE_Literals_Block
uses 1 or 2 bits. Its
value is (Literals_Sectio
n_Header[0]>>2) &amp; 0x3.
<list style="hanging"
>
<t hangText="Size
_Format == 00 or 10:">
Size_Format
uses 1 bit.
Regenerated_Size
uses 5 bits
(value 0-31).
Literals_Section_
Header
uses 1 byte.
Regenerated_Size
= Literal_Section
_Header[0]>>3.
</t>
<t hangText="Size
_Format == 01:">
Size_Format
uses 2 bits.
Regenerated_Size
uses 12 bits
(values 0-4095).
Literals_Section_
Header
uses 2 bytes.
Regenerated_Size
=
(Literals_Section
_Header[0]>>4)
+
(Literals_Section
_Header[1]&lt;&lt;4).
</t>
<t hangText="Size
_Format == 11:">
Size_Format
uses 2 bits.
Regenerated_Size
uses 20 bits
(values
0-1048575).
Literals_Section_
Header
uses 3
bytes.
Regenerated_Size
=
(Literals_Section
_Header[0]>>4)
+
(Literals_Section
_Header[1]&lt;&lt;4)
+
(Literals_Section
_Header[2]&lt;&lt;12)
</t>
</list> </t> <table anchor="Literals_Section_Header">
<name>Literals_Section_Header</name>
<tbody>
<tr>
<td align="center">Literals_Block_Type</td>
<td align="center">2 bits</td>
</tr>
<tr>
<td align="center">Size_Format</td>
<td align="center">1-2 bits</td>
</tr>
<tr>
<td align="center">Regenerated_Size</td>
<td align="center">5-20 bits</td>
</tr>
<tr>
<td align="center">[Compressed_Size] </td>
<td align="center">0-18 bits</td>
</tr>
</tbody>
</table>
<t> Only Stream_1 is <t> In this
present for these representation, bits at
cases. Note that it is the top are the lowest
permitted to represent bits. </t>
a short value (for <t> The
example, 13) using a Literals_Block_Type
long format, even if field uses the two
it's less efficient. lowest bits of the
</t> first byte, describing
four different block
types:
</t>
<t> Size_Format for <table anchor="Literals_Block_Type">
Compressed_Literals_Block <name>Literals_Block_Type</name>
and <thead>
Treeless_Literals_Block <tr>
always uses 2 bits. <th align="center">Literals_Block_Type</th>
<list style="hanging" <th align="center">Value</th>
> </tr>
<t hangText="Size </thead>
_Format == 00:"> <tbody>
<tr>
<td align="center">Raw_Literals_Block</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">RLE_Literals_Block</td>
<td align="center">1</td>
</tr>
<tr>
<td align="center">Compressed_Literals_Block</td>
<td align="center">2</td>
</tr>
<tr>
<td align="center">Treeless_Literals_Block</td>
<td align="center">3</td>
</tr>
</tbody>
</table>
<dl newline="false" spacing="normal">
<dt>Raw_Literals_Block:</dt>
<dd>
Literals are
stored
uncompressed.
Literals_Section_Content
is
Regenerated_Size.
</dd>
<dt>RLE_Literals_Block:</dt>
<dd>
Literals consist of a single-byte value repeated
Regenerated_Size times. Literals_Section_Content is
1. </dd>
<dt>Compressed_Literals_Block:</dt>
<dd>
This is a standard Huffman-compressed
block, starting with a Huffman tree description.
See details below. Literals_Section_Content
is Compressed_Size.
</dd>
<dt>Treeless_Literals_Block:</dt>
<dd>
This is a
Huffman-compressed
block, using the
Huffman tree
from the previous
Compressed_Literals_Block
or a dictionary
if there is no previous
Huffman-compressed
literals block.
Huffman_Tree_Description
will be
skipped. Note that if this mode is triggered without any previous Huffman table
in the frame (or dictionary, per <xref target="comp_dict" format="default"/>),
it should be treated as data corruption. Literals_Section_Content is
Compressed_Size.
</dd>
</dl>
<t> The Size_Format is divided into two families:
</t>
<ul spacing="normal">
<li>
For Raw_Literals_Block and RLE_Literals_Block,
it's only necessary to decode Regenerated_Size.
There is no Compressed_Size field.</li>
<li>
For Compressed_Block and Treeless_Literals_Block,
it's required to decode both Compressed_Size
and Regenerated_Size (the decompressed
size). It's also necessary to decode the
number of streams (1 or 4). </li>
</ul>
<t> For values
spanning several bytes,
the convention is
little endian. </t>
<t> Size_Format for
Raw_Literals_Block
and RLE_Literals_Block
uses 1 or 2 bits. Its
value is (Literals_Section_Header[0]&gt;&gt;2) &amp; 0x3.
</t>
<dl newline="false" spacing="normal">
<dt>Size_Format == 00 or 10:</dt>
<dd>
Size_Format
uses 1 bit.
Regenerated_Size
uses 5 bits
(value 0-31).
Literals_Section_Header
uses 1 byte.
Regenerated_Size
= Literal_Section_Header[0]&gt;&gt;3.
</dd>
<dt>Size_Format == 01:</dt>
<dd>
Size_Format
uses 2 bits.
Regenerated_Size
uses 12 bits
(values 0-4095).
Literals_Section_Header
uses 2 bytes.
Regenerated_Size
=
(Literals_Section_Header[0]&gt;&gt;4)
+
(Literals_Section_Header[1]&lt;&lt;4).
</dd>
<dt>Size_Format == 11:</dt>
<dd>
Size_Format
uses 2 bits.
Regenerated_Size
uses 20 bits
(values
0-1048575).
Literals_Section_Header
uses 3
bytes.
Regenerated_Size
=
(Literals_Section_Header[0]&gt;&gt;4)
+
(Literals_Section_Header[1]&lt;&lt;4)
+
(Literals_Section_Header[2]&lt;&lt;12).
</dd>
</dl>
<t> Only Stream_1 is
present for these
cases. Note that it is
permitted to represent
a short value (for
example, 13) using a
long format, even if
it's less efficient.
</t>
<t> Size_Format for
Compressed_Literals_Block
and
Treeless_Literals_Block
always uses 2 bits.
</t>
<dl newline="false" spacing="normal">
<dt>Size_Format == 00:</dt>
<dd>
A single A single
stream. Both stream. Both
Regenerated_Size Regenerated_Size
and Compressed_Si ze and Compressed_Si ze
use 10 bits use 10 bits
(values (values
0-1023). 0-1023).
Literals_Section_ Header Literals_Section_ Header
uses 3 uses 3
bytes. bytes.
</t> </dd>
<dt>Size_Format == 01:</dt>
<t hangText="Size <dd>
_Format == 01:">
4 streams. 4 streams.
Both Both
Regenerated_Size Regenerated_Size
and and
Compressed_Size Compressed_Size
use 10 bits use 10 bits
(values (values
0-1023). 0-1023).
Literals_Section_ Header Literals_Section_ Header
uses 3 uses 3
bytes. bytes.
</t> </dd>
<dt>Size_Format == 10:</dt>
<t hangText="Size <dd>
_Format == 10:">
4 streams. 4 streams.
Both Both
Regenerated_Size Regenerated_Size
and and
Compressed_Size Compressed_Size
use 14 bits use 14 bits
(values (values
0-16383). 0-16383).
Literals_Section_ Header Literals_Section_ Header
uses 4 uses 4
bytes. bytes.
</t> </dd>
<dt>Size_Format == 11:</dt>
<t hangText="Size <dd>
_Format == 11:">
4 streams. 4 streams.
Both Both
Regenerated_Size Regenerated_Size
and and
Compressed_Size Compressed_Size
use 18 bits use 18 bits
(values (values
0-262143). 0-262143).
Literals_Section_ Header Literals_Section_ Header
uses 5 uses 5
bytes. bytes.
</t> </dd>
</list> </t> </dl>
<t> Both the
<t> Both the
Compressed_Size and Compressed_Size and
Regenerated_Size fields Regenerated_Size fields
follow little-endian follow little-endian
convention. Note that convention. Note that
Compressed_Size Compressed_Size
includes the size of includes the size of
the Huffman_Tree_Descript ion the Huffman_Tree_Descript ion
when it is when it is
present. </t> present. </t>
</section> </section>
<section numbered="true" toc="default">
<section title="Raw_Literals_Bloc <name>Raw_Literals_Block</name>
k"> <t>
<t>
The data in Stream_1 is The data in Stream_1 is
Regenerated_Size bytes Regenerated_Size bytes
long. It contains long. It contains
the raw literals data the raw literals data
to be used during to be used during
Sequence Execution Sequence Execution
(<xref target="comp_seque (<xref target="comp_seque
nces"/>). nces" format="default"/>).
</t> </t>
</section> </section>
<section numbered="true" toc="default">
<section title="RLE_Literals_Bloc <name>RLE_Literals_Block</name>
k"> <t>
<t>
Stream_1 consists of a Stream_1 consists of a
single byte that single byte that
should be repeated should be repeated
Regenerated_Size times Regenerated_Size times
to generate the to generate the
decoded literals. decoded literals.
</t> </t>
</section> </section>
<section numbered="true" toc="default">
<name>Compressed_Literals_Block and Treeless_Literals_Block</nam
e>
<t>
<section title="Compressed_Litera
ls_Block and Treeless_Literals_Block">
<t>
Both of these modes Both of these modes
contain Huffman-encoded contain Huffman-coded
data. data.
For Treeless_Literals_Blo ck, For Treeless_Literals_Blo ck,
the Huffman table comes f rom the Huffman table comes f rom
the previously the previously
compressed literals compressed literals
block, or from a block, or from a
dictionary; dictionary;
see <xref target="comp_di see <xref target="comp_di
ct"/>. ct" format="default"/>.
</t> </t>
</section> </section>
<section numbered="true" toc="default">
<section title="Huffman_Tree_Desc <name>Huffman_Tree_Description</name>
ription"> <t>
<t>
This section is This section is
only present only present
when the when the
Literals_Block_Type type Literals_Block_Type type
is is
Compressed_Literals_Block Compressed_Literals_Block
(2). The format (2). The format
of Huffman_Tree_Descripti on of Huffman_Tree_Descripti on
can be found in can be found in
<xref target="huffman_tre e_desc"/>. <xref target="huffman_tre e_desc" format="default"/>.
The size of The size of
Huffman_Tree_Description Huffman_Tree_Description
is determined is determined
during the during the
decoding process. It decoding process. It
must be used must be used
to determine to determine
where streams where streams
begin. begin.
<figure><artwork> </t>
<artwork name="" type="" align="left" alt=""><![CDATA[
Total_Streams_Size = Compressed_Size Total_Streams_Size = Compressed_Size
- Huffman_Tree_Description_Size - Huffman_Tree_Description_Size
</artwork></figu ]]></artwork>
re></t> </section>
</section> <section numbered="true" toc="default">
<name>Jump_Table</name>
<section title="Jump_Table"> <t> The Jump_Table
<t> The Jump_Table
is only present is only present
when there are when there are
4 Huffman-coded 4 Huffman-coded
streams. </t> streams. </t>
<t> (Reminder:
<t> (Reminder:
Huffman-compresse d Huffman-compresse d
data data
consists of consists of
either 1 or either 1 or
4 4
Huffman-coded Huffman-coded
streams.) streams.)
</t> </t>
<t> If only 1
<t> If only 1
stream is stream is
present, it is present, it is
a single a single
bitstream bitstream
occupying the occupying the
entire entire
remaining remaining
portion of the portion of the
literals literals
block, encoded block, encoded
as described as described
within within
<xref target="huf <xref target="huf
fman_coded_streams"/>. fman_coded_streams" format="default"/>.
</t> </t>
<t> If there are
<t> If there are
4 streams, 4 streams,
Literals_Section_Header Literals_Section_Header
only provides only provides
enough enough
information to information to
know the know the
decompressed decompressed
and compressed and compressed
sizes of all sizes of all
4 streams 4 streams
skipping to change at line 1410 skipping to change at line 1438
last stream, last stream,
which may be which may be
up to 3 up to 3
bytes smaller, bytes smaller,
to reach a to reach a
total total
decompressed decompressed
size as size as
specified in specified in
Regenerated_Size. </t> Regenerated_Size. </t>
<t>
<t>
The compressed The compressed
size of each size of each
stream is stream is
provided provided
explicitly in explicitly in
the Jump_Table. the Jump_Table.
The Jump_Table The Jump_Table
is 6 bytes long and is 6 bytes long and
consists of three consists of three
2-byte 2-byte
skipping to change at line 1433 skipping to change at line 1460
fields, fields,
describing the describing the
compressed compressed
sizes of the sizes of the
first 3 first 3
streams. streams.
Stream4_Size Stream4_Size
is computed is computed
from from
Total_Streams_Siz e Total_Streams_Siz e
minus sizes of minus the sizes o f
other streams. other streams.
<figure><artwork> </t>
<artwork name="" type="" align="left" alt=""><![CDATA[
Stream4_Size = Total_Streams_Size - 6 Stream4_Size = Total_Streams_Size - 6
- Stream1_Size - Stream2_Size - Stream1_Size - Stream2_Size
- Stream3_Size - Stream3_Size
</artwork></figu ]]></artwork>
re></t> <t>
<t>
Note that if Note that if
Stream1_Size + Stream1_Size +
Stream2_Size + Stream2_Size +
Stream3_Size Stream3_Size
exceeds exceeds
Total_Streams_Siz e, Total_Streams_Siz e,
the data are the data are
considered considered
corrupted. </t> corrupted. </t>
<t> <t>
Each of these Each of these
4 bitstreams 4 bitstreams
is then decoded is then decoded
independently independently
as a as a
Huffman-Coded Huffman-coded
stream, as stream, as
described in described in
<xref target="huf <xref target="huf
fman_coded_streams"/>. fman_coded_streams" format="default"/>.
</t> </t>
</section> </section>
</section> </section>
<section anchor="comp_sequences" numbered="true" toc="default">
<section anchor="comp_sequences" <name>Sequences_Section</name>
title="Sequences_Section"> <t> A compressed block is a
<t> A compressed block is a
succession of sequences. succession of sequences.
A sequence is a literal A sequence is a literal
copy command, followed by copy command, followed by
a match copy command. A a match copy command. A
literal copy command literal copy command
specifies a length. It is specifies a length. It is
the number of bytes to be the number of bytes to be
copied (or extracted) from copied (or extracted) from
the Literals Section. the Literals_Section.
A match copy command A match copy command
specifies an offset and a specifies an offset and a
length. </t> length. </t>
<t> When all sequences are
<t> When all sequences are
decoded, if there are decoded, if there are
literals left in the literals left in the
literals section, these Literals_Section, these
bytes are added at the bytes are added at the
end of the block. </t> end of the block. </t>
<t> This is described in more
<t> This is described in more
detail in detail in
<xref target="comp_sequence_e <xref target="comp_sequence_e
xec"/>. </t> xec" format="default"/>. </t>
<t> The Sequences_Section
<t> The Sequences_Section
regroups all symbols regroups all symbols
required to decode required to decode
commands. There are three commands. There are three
symbol types: literals symbol types: literals
lengths, offsets, and match lengths, offsets, and match
lengths. They are encoded lengths. They are encoded
together, interleaved, in together, interleaved, in
a single "bitstream". </t> a single "bitstream". </t>
<t> The Sequences_Section
<t> The Sequences_Section
starts by a header, starts by a header,
followed by optional followed by optional
probability tables for probability tables for
each symbol type, followed each symbol type, followed
by the bitstream. </t> by the bitstream. </t>
<artwork name="" type="" align="left" alt=""><![CDATA[
<t> <figure><artwork>
Sequences_Section_Header Sequences_Section_Header
[Literals_Length_Table] [Literals_Length_Table]
[Offset_Table] [Offset_Table]
[Match_Length_Table] [Match_Length_Table]
bitStream bitStream
</artwork></figure> </t> ]]></artwork>
<t> To decode the
<t> To decode the
Sequences_Section, it's Sequences_Section, it's
necessary to know its necessary to know its
size. This size is deduced size. This size is deduced
from the size of the Literals _Section: from the size of the Literals _Section:
Sequences_Section_Size = Bloc Sequences_Section_Size = Bloc
k_Size - Literals_Section_Header - Literals_Section_Content </t> k_Size - Literals_Section_Header - Literals_Section_Content. </t>
<section anchor="seq_sec_hdr" numbered="true" toc="default">
<section title="Sequences_Section <name>Sequences_Section_Header</name>
_Header" anchor="seq_sec_hdr"> <t> This header
<t> This header
consists of two consists of two
items: items:
<list style="symbols" </t>
> <ul spacing="normal">
<t> Number_of_Seq <li> Number_of_Sequences </li>
uences </t> <li> Symbol_Compression_Modes </li>
<t> Symbol_Compre </ul>
ssion_Modes </t> <t> Number_of_Sequences
</list> </t> is a variable size
field using between
<t> Number_of_Sequences 1 and 3
is a variable size bytes. If the
field using between first byte is
1 and 3 "byte0":
bytes. If the
first byte is
"byte0":
<list style="symbols" </t>
> <ul spacing="normal">
<t> if <li> if
(byte0 == (byte0 ==
0): there 0): there
are no are no
sequences. sequences.
The The
sequence sequence
section section
stops here. stops here.
Decompressed Decompressed
content is content is
defined defined
entirely as entirely as
Literals Literals_Sect
Section ion
content. content.
The FSE The FSE
tables used tables used
in Repeat_Mod e in Repeat_Mod e
are not are not
updated. </t> updated. </li
>
<t> if (byte0 <li> if (byte0
&lt; 128): &lt; 128):
Number_of_Seq uences Number_of_Seq uences
= byte0. = byte0.
Uses 1 Uses 1
byte. </t> byte. </li>
<li> if (byte0
<t> if (byte0
&lt; 255): &lt; 255):
Number_of_Seq uences Number_of_Seq uences
= =
((byte0 - 128 ) ((byte0 - 128 )
&lt;&lt; 8) + &lt;&lt; 8) +
byte1. Uses byte1. Uses
2 bytes. </t> 2 bytes. </li
>
<t> if (byte0 <li> if (byte0
== 255): == 255):
Number_of_Seq uences Number_of_Seq uences
= byte1 + = byte1 +
(byte2 &lt;&l t; 8) (byte2 &lt;&l t; 8)
+ 0x7F00. + 0x7F00.
Uses 3 Uses 3
bytes. </t> bytes. </li>
</list> </t> </ul>
<t> Symbol_Compression_Modes
<t> Symbol_Compression_Mo is a single byte,
des defining the
is a single byte, compression mode of
defining the each symbol type.
compression mode of </t>
each symbol type.
<figure><artwork>
+-------------+----------------------+
| Bit Number | Field Name |
+-------------+----------------------+
| 7-6 | Literal_Lengths_Mode |
+-------------+----------------------+
| 5-4 | Offsets_Mode |
+-------------+----------------------+
| 3-2 | Match_Lengths_Mode |
+-------------+----------------------+
| 1-0 | Reserved |
+-------------+----------------------+
</artwork></figure>
</t>
<t> The last field, <table anchor="Symbol_Compression_Modes">
Reserved, must be <name>Symbol_Compression_Modes</name>
all zeroes. </t> <thead>
<tr>
<th align="center">Bit Number</th>
<th align="center">Field Name</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">7-6</td>
<td align="center">Literal_Lengths_Mode</td>
</tr>
<tr>
<td align="center">5-4</td>
<td align="center">Offsets_Mode</td>
</tr>
<tr>
<td align="center">3-2</td>
<td align="center">Match_Lengths_Mode</td>
</tr>
<tr>
<td align="center">1-0</td>
<td align="center">Reserved</td>
</tr>
</tbody>
</table>
<t> Literals_Lengths_Mode <t> The last field,
, Reserved, must be
Offsets_Mode, and all zeroes. </t>
Match_Lengths_Mode <t> Literals_Lengths_Mode,
define the Offsets_Mode, and
Compression_Mode of Match_Lengths_Mode
literals lengths, define the
offsets, and match Compression_Mode of
lengths symbols, literals lengths,
respectively. They offsets, and match
follow the same lengths symbols,
enumeration: respectively. They
follow the same
enumeration:
</t>
<figure><artwork> <table anchor="literals">
+-------+---------------------+ <name>Literals_Lengths_Mode, Offsets_Mode, and Match_Lengths_Mode</name>
| Value | Compression_Mode | <thead>
+-------+---------------------+ <tr>
| 0 | Predefined_Mode | <th align="center">Value</th>
+-------+---------------------+ <th align="center">Compression_Mode</th>
| 1 | RLE_Mode | </tr>
+-------+---------------------+ </thead>
| 2 | FSE_Compressed_Mode | <tbody>
+-------+---------------------+ <tr>
| 3 | Repeat_Mode | <td align="center">0</td>
+-------+---------------------+ <td align="center">Predefined_Mode</td>
</artwork></figure>< </tr>
/t> <tr>
<td align="center">1</td>
<td align="center">RLE_Mode</td>
</tr>
<tr>
<td align="center">2</td>
<td align="center">FSE_Compressed_Mode</td>
</tr>
<tr>
<td align="center">3</td>
<td align="center">Repeat_Mode</td>
</tr>
</tbody>
</table>
<t> <list style="hanging" <dl newline="false" spacing="normal">
> <dt>Predefined_Mode:</dt>
<t hangText="Predefin <dd>
ed_Mode:">
A predefined A predefined
FSE FSE
(see <xref target ="comp_fse"/>) (see <xref target ="comp_fse" format="default"/>)
distribution distribution
table is used, as table is used, as
defined in defined in
<xref target="def ault_dist"/>. <xref target="def ault_dist" format="default"/>.
No distribution No distribution
table will be table will be
present. </t> present. </dd>
<dt>RLE_Mode:</dt>
<t hangText="RLE_Mode <dd>
:">
The table The table
description description
consists of a consists of a
single byte, single byte,
which contains which contains
the symbol's the symbol's
value. This value. This
symbol will symbol will
be used for be used for
all sequences. all sequences.
</t> </dd>
<dt>FSE_Compressed_Mode:</dt>
<t hangText="FSE_Comp <dd>
ressed_Mode:">
Standard FSE Standard FSE
compression. A compression. A
distribution distribution
table will be table will be
present. The present. The
format of this format of this
distribution distribution
table is table is
described in described in
<xref target="com p_fse_table"/>. <xref target="com p_fse_table" format="default"/>.
Note that the Note that the
maximum allowed maximum allowed
accuracy log accuracy log
for literals for literals
length and length and
match length match length
tables is 9, tables is 9,
and the and the
maximum maximum
accuracy log accuracy log
skipping to change at line 1701 skipping to change at line 1747
offsets table offsets table
is 8. is 8.
This mode must This mode must
not be used when not be used when
only one symbol only one symbol
is present; is present;
RLE_Mode should RLE_Mode should
be used instead be used instead
(although any (although any
other mode other mode
will work). </t> will work). </dd>
<dt>Repeat_Mode:</dt>
<t hangText="Repeat_M <dd>
ode:">
The table used The table used
in the previous in the previous
Compressed_Block Compressed_Block
with with
Number_Of_Sequenc es > 0 Number_Of_Sequenc es &gt; 0
will be will be
used again, or used again, or
if this is the if this is the
first block, first block,
the table in the table in
the dictionary the dictionary
will be used. will be used.
Note that this Note that this
includes includes
RLE_Mode, RLE_Mode,
skipping to change at line 1745 skipping to change at line 1791
No distribution No distribution
table will be table will be
present. present.
If this mode is If this mode is
used without used without
any previous any previous
sequence table sequence table
in the frame in the frame
(or dictionary; (or dictionary;
see see
<xref target="com p_dict"/>) <xref target="com p_dict" format="default"/>)
to repeat, this to repeat, this
should be should be
treated as treated as
corruption. </t> corruption. </dd>
</dl>
</list></t> <section anchor="codes_lengths_offsets" numbered="true" toc="def
ault">
<section title="Sequence <name>Sequence Codes for Lengths and Offsets</name>
Codes for Lengths and Offsets" anchor="codes_lengths_offsets"> <t> Each symbol is a
<t> Each symbol is a
code in its own code in its own
context, which context, which
specifies Baseline specifies Baseline
and Number_of_Bits and Number_of_Bits
to add. Codes are to add. Codes are
FSE compressed FSE compressed
and interleaved and interleaved
with raw additional with raw additional
bits in the same bits in the same
bitstream. </t> bitstream. </t>
<t> Literals length
<t> Literals length
codes are values codes are values
ranging from 0 to ranging from 0 to
35 inclusive. They 35, inclusive. They
define lengths from define lengths from
0 to 131071 bytes. 0 to 131071 bytes.
The literals length The literals length
is equal to the is equal to the
decoded Baseline decoded Baseline
plus the result of plus the result of
reading reading
Number_of_Bits bits Number_of_Bits bits
from the bitstream, from the bitstream,
as a little-endian as a little-endian
value. value.
<figure><artwork> </t>
+----------------------+----------+----------------+
| Literals_Length_Code | Baseline | Number_of_Bits |
+----------------------+----------+----------------+
| 0-15 | length | 0 |
+----------------------+----------+----------------+
| 16 | 16 | 1 |
+----------------------+----------+----------------+
| 17 | 18 | 1 |
+----------------------+----------+----------------+
| 18 | 20 | 1 |
+----------------------+----------+----------------+
| 19 | 22 | 1 |
+----------------------+----------+----------------+
| 20 | 24 | 2 |
+----------------------+----------+----------------+
| 21 | 28 | 2 |
+----------------------+----------+----------------+
| 22 | 32 | 3 |
+----------------------+----------+----------------+
| 23 | 40 | 3 |
+----------------------+----------+----------------+
| 24 | 48 | 4 |
+----------------------+----------+----------------+
| 25 | 64 | 6 |
+----------------------+----------+----------------+
| 26 | 128 | 7 |
+----------------------+----------+----------------+
| 27 | 256 | 8 |
+----------------------+----------+----------------+
| 28 | 512 | 9 |
+----------------------+----------+----------------+
| 29 | 1024 | 10 |
+----------------------+----------+----------------+
| 30 | 2048 | 11 |
+----------------------+----------+----------------+
| 31 | 4096 | 12 |
+----------------------+----------+----------------+
| 32 | 8192 | 13 |
+----------------------+----------+----------------+
| 33 | 16384 | 14 |
+----------------------+----------+----------------+
| 34 | 32768 | 15 |
+----------------------+----------+----------------+
| 35 | 65536 | 16 |
+----------------------+----------+----------------+
</artwork></figure><
/t>
<t> Match length codes <table anchor="length">
<name>Literals Length Codes</name>
<thead>
<tr>
<th align="center">Literals_Length_Code</th>
<th align="center">Baseline</th>
<th align="center">Number_of_Bits</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">0-15</td>
<td align="center">length</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">16</td>
<td align="center">16</td>
<td align="center">1</td>
</tr>
<tr>
<td align="center">17</td>
<td align="center">18</td>
<td align="center">1</td>
</tr>
<tr>
<td align="center">18</td>
<td align="center">20</td>
<td align="center">1</td>
</tr>
<tr>
<td align="center">19</td>
<td align="center">22</td>
<td align="center">1</td>
</tr>
<tr>
<td align="center">20</td>
<td align="center">24</td>
<td align="center">2</td>
</tr>
<tr>
<td align="center">21</td>
<td align="center">28</td>
<td align="center">2</td>
</tr>
<tr>
<td align="center">22</td>
<td align="center">32</td>
<td align="center">3</td>
</tr>
<tr>
<td align="center">23</td>
<td align="center">40</td>
<td align="center">3</td>
</tr>
<tr>
<td align="center">24</td>
<td align="center">48</td>
<td align="center">4</td>
</tr>
<tr>
<td align="center">25</td>
<td align="center">64</td>
<td align="center">6</td>
</tr>
<tr>
<td align="center">26</td>
<td align="center">128</td>
<td align="center">7</td>
</tr>
<tr>
<td align="center">27</td>
<td align="center">256</td>
<td align="center">8</td>
</tr>
<tr>
<td align="center">28</td>
<td align="center">512</td>
<td align="center">9</td>
</tr>
<tr>
<td align="center">29</td>
<td align="center">1024</td>
<td align="center">10</td>
</tr>
<tr>
<td align="center">30</td>
<td align="center">2048</td>
<td align="center">11</td>
</tr>
<tr>
<td align="center">31</td>
<td align="center">4096</td>
<td align="center">12</td>
</tr>
<tr>
<td align="center">32</td>
<td align="center">8192</td>
<td align="center">13</td>
</tr>
<tr>
<td align="center">33</td>
<td align="center">16384</td>
<td align="center">14</td>
</tr>
<tr>
<td align="center">34</td>
<td align="center">32768</td>
<td align="center">15</td>
</tr>
<tr>
<td align="center">35</td>
<td align="center">65536</td>
<td align="center">16</td>
</tr>
</tbody>
</table>
<t> Match length codes
are values ranging are values ranging
from 0 to 52 from 0 to 52,
inclusive. They inclusive. They
define lengths from define lengths from
3 to 131074 bytes. 3 to 131074 bytes.
The match length is The match length is
equal to the equal to the
decoded Baseline decoded Baseline
plus the result of plus the result of
reading reading
Number_of_Bits bits Number_of_Bits bits
from the bitstream, from the bitstream,
as a little-endian as a little-endian
value. value.
<figure><artwork> </t>
+-------------------+-----------------------+----------------+
| Match_Length_Code | Baseline | Number_of_Bits |
+-------------------+-----------------------+----------------+
| 0-31 | Match_Length_Code + 3 | 0 |
+-------------------+-----------------------+----------------+
| 32 | 35 | 1 |
+-------------------+-----------------------+----------------+
| 33 | 37 | 1 |
+-------------------+-----------------------+----------------+
| 34 | 39 | 1 |
+-------------------+-----------------------+----------------+
| 35 | 41 | 1 |
+-------------------+-----------------------+----------------+
| 36 | 43 | 2 |
+-------------------+-----------------------+----------------+
| 37 | 47 | 2 |
+-------------------+-----------------------+----------------+
| 38 | 51 | 3 |
+-------------------+-----------------------+----------------+
| 39 | 59 | 3 |
+-------------------+-----------------------+----------------+
| 40 | 67 | 4 |
+-------------------+-----------------------+----------------+
| 41 | 83 | 4 |
+-------------------+-----------------------+----------------+
| 42 | 99 | 5 |
+-------------------+-----------------------+----------------+
| 43 | 131 | 7 |
+-------------------+-----------------------+----------------+
| 44 | 259 | 8 |
+-------------------+-----------------------+----------------+
| 45 | 515 | 9 |
+-------------------+-----------------------+----------------+
| 46 | 1027 | 10 |
+-------------------+-----------------------+----------------+
| 47 | 2051 | 11 |
+-------------------+-----------------------+----------------+
| 48 | 4099 | 12 |
+-------------------+-----------------------+----------------+
| 49 | 8195 | 13 |
+-------------------+-----------------------+----------------+
| 50 | 16387 | 14 |
+-------------------+-----------------------+----------------+
| 51 | 32771 | 15 |
+-------------------+-----------------------+----------------+
| 52 | 65539 | 16 |
+-------------------+-----------------------+----------------+
</artwork></figure><
/t>
<t> Offset codes are <table anchor="Match_Length_Code">
<name>Match Length Codes</name>
<thead>
<tr>
<th align="center">Match_Length_Code</th>
<th align="center">Baseline</th>
<th align="center">Number_of_Bits</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">0-31</td>
<td align="center">Match_Length_Code + 3</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">32</td>
<td align="center">35</td>
<td align="center">1</td>
</tr>
<tr>
<td align="center">33</td>
<td align="center">37</td>
<td align="center">1</td>
</tr>
<tr>
<td align="center">34</td>
<td align="center">39</td>
<td align="center">1</td>
</tr>
<tr>
<td align="center">35</td>
<td align="center">41</td>
<td align="center">1</td>
</tr>
<tr>
<td align="center">36</td>
<td align="center">43</td>
<td align="center">2</td>
</tr>
<tr>
<td align="center">37</td>
<td align="center">47</td>
<td align="center">2</td>
</tr>
<tr>
<td align="center">38</td>
<td align="center">51</td>
<td align="center">3</td>
</tr>
<tr>
<td align="center">39</td>
<td align="center">59</td>
<td align="center">3</td>
</tr>
<tr>
<td align="center">40</td>
<td align="center">67</td>
<td align="center">4</td>
</tr>
<tr>
<td align="center">41</td>
<td align="center">83</td>
<td align="center">4</td>
</tr>
<tr>
<td align="center">42</td>
<td align="center">99</td>
<td align="center">5</td>
</tr>
<tr>
<td align="center">43</td>
<td align="center">131</td>
<td align="center">7</td>
</tr>
<tr>
<td align="center">44</td>
<td align="center">259</td>
<td align="center">8</td>
</tr>
<tr>
<td align="center">45</td>
<td align="center">515</td>
<td align="center">9</td>
</tr>
<tr>
<td align="center">46</td>
<td align="center">1027</td>
<td align="center">10</td>
</tr>
<tr>
<td align="center">47</td>
<td align="center">2051</td>
<td align="center">11</td>
</tr>
<tr>
<td align="center">48</td>
<td align="center">4099</td>
<td align="center">12</td>
</tr>
<tr>
<td align="center">49</td>
<td align="center">8195</td>
<td align="center">13</td>
</tr>
<tr>
<td align="center">50</td>
<td align="center">16387</td>
<td align="center">14</td>
</tr>
<tr>
<td align="center">51</td>
<td align="center">32771</td>
<td align="center">15</td>
</tr>
<tr>
<td align="center">52</td>
<td align="center">65539</td>
<td align="center">16</td>
</tr>
</tbody>
</table>
<t> Offset codes are
values ranging from values ranging from
0 to N. </t> 0 to N. </t>
<t> A decoder is free
<t> A decoder is free
to limit its to limit its
maximum supported maximum supported
value for N. value for N.
Support for values Support for values
of at least 22 is of at least 22 is
recommended. recommended.
At the time of this At the time of this
writing, the writing, the
reference decoder reference decoder
supports a maximum supports a maximum
N value of 31. </t> N value of 31. </t>
<t> An offset code is
<t> An offset code is
also the number of also the number of
additional bits to additional bits to
read in read in
little-endian little-endian
fashion and can fashion and can
be translated into be translated into
an Offset_Value an Offset_Value
using the following using the following
formulas: formulas:
<figure><artwork> </t>
Offset_Value = (1 &lt;&lt; offsetCode) + readNBits(offsetCode); <artwork name="" type="" align="left" alt=""><![CDATA[
Offset_Value = (1 << offsetCode) + readNBits(offsetCode);
if (Offset_Value > 3) Offset = Offset_Value - 3; if (Offset_Value > 3) Offset = Offset_Value - 3;
</artwork></figure>< ]]></artwork>
/t> <t> This means that
maximum
<t> This means that Offset_Value is
maximum (2<sup>N+1</sup>) - 1,
Offset_Value is supporting
(2^(N+1))-1, back-reference
supporting distance up to
back-reference (2<sup>N+1</sup>) - 4, but it is
distance up to limited by the
(2^(N+1))-4, but it i maximum
s back-reference
limited by the distance (see
maximum <xref target="comp_window_descr" format="default"/>). </t>
back-reference <t> Offset_Value from
distance (see
<xref target="comp_wi
ndow_descr"/>). </t>
<t> Offset_Value from
1 to 3 are special: 1 to 3 are special:
they define "repeat they define "repeat
codes". This is codes". This is
described in more described in more
detail in detail in
<xref target="repeat_ <xref target="repeat_
offsets"/>. </t> offsets" format="default"/>. </t>
</section> </section>
<section numbered="true" toc="default">
<section title="Decoding <name>Decoding Sequences</name>
Sequences"> <t> FSE bitstreams are
<t> FSE bitstreams are
read in reverse of read in reverse of
the direction they the direction they
are written. In zstd, are written. In zstd,
the compressor the compressor
writes bits forward writes bits forward
into a block, and into a block, and
the decompressor the decompressor
must read the must read the
bitstream bitstream
backwards. </t> backwards. </t>
<t> To find the start
<t> To find the start
of the bitstream, it of the bitstream, it
is therefore is therefore
necessary to know necessary to know
the offset of the the offset of the
last byte of the last byte of the
block, which can be block, which can be
found by counting found by counting
Block_Size bytes Block_Size bytes
after the block after the block
header. </t> header. </t>
<t> After writing the
<t> After writing the
last bit containing last bit containing
information, the information, the
compressor writes a compressor writes a
single 1 bit and single 1 bit and
then fills the rest then fills the rest
of the byte with of the byte with
zero bits. The zero bits. The
last byte of the last byte of the
compressed compressed
bitstream cannot be bitstream cannot be
zero for that zero for that
reason. </t> reason. </t>
<t> When decompressing,
<t> When decompressing,
the last byte the last byte
containing the containing the
padding is the padding is the
first byte to read. first byte to read.
The decompressor The decompressor
needs to skip the needs to skip the
up to 7 bits of up to 7 bits of
0-padding as well 0-padding as well
as the the first 1 as the first 1
bit that occurs. bit that occurs.
Afterwards, the Afterwards, the
useful part of the useful part of the
bitstream bitstream
begins. </t> begins. </t>
<t> FSE decoding <t> FSE decoding
requires a 'state' requires a 'state'
to be carried from to be carried from
symbol to symbol. symbol to symbol.
For more For more
explanation on FSE explanation on FSE
decoding, see decoding, see
<xref target="comp_fs <xref target="comp_fse" format="default"/>. </t>
e"/>. </t> <t> For sequence
<t> For sequence
decoding, a decoding, a
separate state separate state
keeps track of keeps track of
each literal each literals
lengths, offsets, length, offsets,
and match lengths and match lengths
symbols. Some FSE symbol. Some FSE
primitives are primitives are
also used. For also used. For
more details on more details on
the operation of the operation of
these primitives, these primitives,
see see
<xref target="comp_fs <xref target="comp_fs
e"/>. </t> e" format="default"/>. </t>
<t> The bitstream
<t> The bitstream
starts with initial starts with initial
FSE state values, FSE state values,
each using the each using the
required number of required number of
bits in their bits in their
respective respective
accuracy, decoded accuracy, decoded
previously from previously from
their normalized their normalized
distribution. It distribution. It
starts with starts with
Literals_Length_State , Literals_Length_State ,
followed by followed by
Offset_State, and Offset_State, and
finally finally
Match_Length_State. < /t> Match_Length_State. < /t>
<t> Note that all
<t> Note that all
values are read values are read
backward, so the backward, so the
'start' of the 'start' of the
bitstream is at the bitstream is at the
highest position in highest position in
memory, immediately memory, immediately
before the last before the last
1 bit for 1 bit for
padding. </t> padding. </t>
<t> After decoding the
<t> After decoding the
starting states, a starting states, a
single sequence is single sequence is
decoded decoded
Number_Of_Sequences Number_Of_Sequences
times. These times. These
sequences are sequences are
decoded in order decoded in order
from first to last. from first to last.
Since the Since the
compressor writes compressor writes
the bitstream in the bitstream in
the forward the forward
direction, this direction, this
means the means the
compressor must compressor must
encode the encode the
sequences starting sequences starting
with the last one with the last one
and ending with the and ending with the
first. </t> first. </t>
<t> For each of the
<t> For each of the
symbol types, the symbol types, the
FSE state can be FSE state can be
used to determine used to determine
the appropriate the appropriate
code. The code then code. The code then
defines the defines the
Baseline and Number_o f_Bits Baseline and Number_o f_Bits
to read for to read for
each type. The each type. The
description of the description of the
codes for how to codes for how to
determine these determine these
values can be values can be
found in found in
<xref target="seq_sec <xref target="seq_sec
_hdr"/>. </t> _hdr" format="default"/>. </t>
<t> Decoding starts by
<t> Decoding starts by
reading the reading the
Number_of_Bits Number_of_Bits
required to decode required to decode
offset. It offset. It
does the same for does the same for
Match_Length and Match_Length and
then for then for
Literals_Length. Literals_Length.
This sequence is This sequence is
then used for then used for
Sequence Execution Sequence Execution
(see (see
<xref target="comp_se <xref target="comp_se
quence_exec"/>). </t> quence_exec" format="default"/>). </t>
<t> If it is not the
<t> If it is not the
last sequence in last sequence in
the block, the next the block, the next
operation is to operation is to
update states. update states.
Using the rules Using the rules
pre-calculated in precalculated in
the decoding the decoding
tables, tables,
Literals_Length_State Literals_Length_State
is updated, is updated,
followed by followed by
Match_Length_State, Match_Length_State,
and then and then
Offset_State. Offset_State.
See See
<xref target="comp_fs e"/> <xref target="comp_fs e" format="default"/>
for details on how for details on how
to update states to update states
from the from the
bitstream. </t> bitstream. </t>
<t> This operation will
<t> This operation will
be repeated be repeated
Number_of_Sequences Number_of_Sequences
times. At the end, times. At the end,
the bitstream shall the bitstream shall
be entirely be entirely
consumed; otherwise, consumed; otherwise,
the bitstream is the bitstream is
considered considered
corrupted. </t> corrupted. </t>
</section> </section>
</section> </section>
<section anchor="default_dist" numbered="true" toc="default">
<section anchor="default_dist" <name>Default Distributions</name>
title="Default Distribu <t> If Predefined_Mode
tions">
<t> If Predefined_Mode
is selected for a is selected for a
symbol type, its symbol type, its
FSE decoding table FSE decoding table
is generated from a is generated from a
predefined predefined
distribution table distribution table
defined here. For defined here. For
details on how to details on how to
convert this convert this
distribution into distribution into
a decoding table, a decoding table,
see <xref target="com see <xref target="com
p_fse"/>. </t> p_fse" format="default"/>. </t>
<section numbered="true" toc="default">
<section title="Literals <name>Literals Length</name>
Length"> <t> The decoding
<t> The decoding
table uses an table uses an
accuracy log of accuracy log of
6 bits (64 6 bits (64
states). states).
<figure><artwork> </t>
<artwork name="" type="" align="left" alt=""><![CDATA[
short literalsLength_defaultDistribution[36] = short literalsLength_defaultDistribution[36] =
{ 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, { 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,
2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1,
-1,-1,-1,-1 -1,-1,-1,-1
}; };
</artwork></figu ]]></artwork>
re></t> </section>
</section> <section numbered="true" toc="default">
<name>Match Length</name>
<section title="Match Len <t> The decoding
gth">
<t> The decoding
table uses an table uses an
accuracy log of accuracy log of
6 bits (64 6 bits (64
states). states).
<figure><artwork> </t>
<artwork name="" type="" align="left" alt=""><![CDATA[
short matchLengths_defaultDistribution[53] = short matchLengths_defaultDistribution[53] =
{ 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, { 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,
-1,-1,-1,-1,-1 -1,-1,-1,-1,-1
}; };
</artwork></figu ]]></artwork>
re></t> </section>
</section> <section numbered="true" toc="default">
<name>Offset Codes</name>
<section title="Offset Co <t> The decoding
des">
<t> The decoding
table uses an table uses an
accuracy log of accuracy log of
5 bits (32 5 bits (32
states), and states) and
supports a supports a
maximum N value maximum N value
of 28, allowing of 28, allowing
offset values offset values
up to up to
536,870,908. </t> 536,870,908. </t>
<t> If any sequence
<t> If any sequence
in the in the
compressed compressed
block requires block requires
a larger offset a larger offset
than this, it's than this, it's
not possible to not possible to
use the default use the default
distribution to distribution to
represent it. represent it.
<figure><artwork> </t>
<artwork name="" type="" align="left" alt=""><![CDATA[
short offsetCodes_defaultDistribution[29] = short offsetCodes_defaultDistribution[29] =
{ 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, { 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1
}; };
</artwork></figu ]]></artwork>
re></t> </section>
</section> </section>
</section> </section>
</section> </section>
</section> <section anchor="comp_sequence_exec" numbered="true" toc="default">
<name>Sequence Execution</name>
<section anchor="comp_sequence_exec" <t> Once literals and sequences have been decoded,
title="Sequence Execution">
<t> Once literals and sequences have been decoded,
they are combined to produce the decoded content they are combined to produce the decoded content
of a block. </t> of a block. </t>
<t> Each sequence consists of a tuple of
<t> Each sequence consists of a tuple of
(literals_length, offset_value, match_length), (literals_length, offset_value, match_length),
decoded as described in the Sequences_Section decoded as described in the Sequences_Section
(<xref target="comp_sequences"/>). To execute a (<xref target="comp_sequences" format="default"/>). T o execute a
sequence, first copy literals_length bytes from sequence, first copy literals_length bytes from
the decoded literals to the output. </t> the decoded literals to the output. </t>
<t> Then, match_length bytes are copied from previous
<t> Then, match_length bytes are copied from previous
decoded data. The offset to copy from is decoded data. The offset to copy from is
determined by offset_value: determined by offset_value:
<list style="symbols"> </t>
<t> if Offset_Value > 3, then the offset is <ul spacing="normal">
Offset_Value - 3; </t> <li> if Offset_Value &gt; 3, then the offset is
Offset_Value - 3; </li>
<t> if Offset_Value is from 1-3, the offset is <li> if Offset_Value is from 1-3, the offset is
a special repeat offset value. See a special repeat offset value. See
<xref target="repeat_offsets"/> for how <xref target="repeat_offsets" format="default
the offset is determined in this case. </t> "/> for how
</list> </t> the offset is determined in this case. </li>
</ul>
<t> The offset is defined as from the current <t> The offset is defined as from the current
position (after copying the literals), so an position (after copying the literals), so an
offset of 6 and a match length of offset of 6 and a match length of
3 means that 3 bytes should be copied from 6 bytes 3 means that 3 bytes should be copied from 6 bytes
back. Note that all offsets leading to previously back. Note that all offsets leading to previously
decoded data must be smaller than Window_Size decoded data must be smaller than Window_Size
defined in Frame_Header_Descriptor defined in Frame_Header_Descriptor
(<xref target="comp_frame_header_desc"/>). </t> (<xref target="comp_frame_header_desc" format="defaul
</section> t"/>). </t>
</section>
<section title="Repeat Offsets" <section anchor="repeat_offsets" numbered="true" toc="default">
anchor="repeat_offsets"> <name>Repeat Offsets</name>
<t> As seen above, the first three values <t> As seen above, the first three values
define a repeated offset; we will call define a repeated offset; we will call
them Repeated_Offset1, Repeated_Offset2, them Repeated_Offset1, Repeated_Offset2,
and Repeated_Offset3. They are sorted in and Repeated_Offset3. They are sorted in
recency order, with Repeated_Offset1 recency order, with Repeated_Offset1
meaning "most recent one". </t> meaning "most recent one". </t>
<t> If offset_value is 1, then the offset used
<t> If offset_value is 1, then the offset used
is Repeated_Offset1, etc. </t> is Repeated_Offset1, etc. </t>
<t> There is one exception: when the current
<t> There is one exception: When the current
sequence's literals_length is 0, repeated sequence's literals_length is 0, repeated
offsets are shifted by 1, so an offsets are shifted by 1, so an
offset_value of 1 means Repeated_Offset2, offset_value of 1 means Repeated_Offset2,
an offset_value of 2 means Repeated_Offset3, an offset_value of 2 means Repeated_Offset3,
and an offset_value of 3 means and an offset_value of 3 means
Repeated_Offset1 - 1_byte. </t> Repeated_Offset1 - 1_byte. </t>
<t> For the first block, the starting offset
<t> For the first block, the starting offset
history is populated with the following history is populated with the following
values: Repeated_Offset1 (1), values: Repeated_Offset1 (1),
Repeated_Offset2 (4), and Repeated_Offset2 (4), and
Repeated_Offset3 (8), unless Repeated_Offset3 (8), unless
a dictionary is used, in which case they a dictionary is used, in which case they
come from the dictionary. </t> come from the dictionary. </t>
<t> Then each block gets its starting offset
<t> Then each block gets its starting offset
history from the ending values of the most history from the ending values of the most
recent Compressed_Block. Note that blocks recent Compressed_Block. Note that blocks
that are not Compressed_Block are skipped; that are not Compressed_Block are skipped;
they do not contribute to offset they do not contribute to offset
history. </t> history. </t>
<t> The newest offset takes the lead in offset
<t> The newest offset takes the lead in offset
history, shifting others back (up to its history, shifting others back (up to its
previous place if it was already previous place if it was already
present). This means that when present). This means that when
Repeated_Offset1 (most recent) is used, Repeated_Offset1 (most recent) is used,
history is unmodified. When history is unmodified. When
Repeated_Offset2 is used, it is swapped Repeated_Offset2 is used, it is swapped
with Repeated_Offset1. If any other offset with Repeated_Offset1. If any other offset
is used, it becomes Repeated_Offset1, and is used, it becomes Repeated_Offset1, and
the rest are shifted back by 1. </t> the rest are shifted back by 1. </t>
</section> </section>
</section> </section>
<section anchor="comp_skippable" numbered="true" toc="default">
<name>Skippable Frames</name>
<section anchor="comp_skippable" <table anchor="skippable">
title="Skippable Frames"> <name>Skippable Frames</name>
<t> <figure><artwork> <thead>
+--------------+------------+-----------+ <tr>
| Magic_Number | Frame_Size | User_Data | <th align="center">Magic_Number</th>
+--------------+------------+-----------+ <th align="center">Frame_Size</th>
| 4 bytes | 4 bytes | n bytes | <th align="center">User_Data</th>
+--------------+------------+-----------+ </tr>
</artwork></figure> </t> </thead>
<tbody>
<tr>
<td align="center">4 bytes</td>
<td align="center">4 bytes</td>
<td align="center">n bytes</td>
</tr>
</tbody>
</table>
<t> Skippable frames allow the insertion of <t> Skippable frames allow the insertion of
user-defined metadata into a flow of concatenated user-defined metadata into a flow of concatenated
frames. </t> frames. </t>
<t> Skippable frames defined in this specification are
<t> Skippable frames defined in this specification are
compatible with skippable frames in compatible with skippable frames in
<xref target="LZ4"/>. </t> <xref target="LZ4" format="default"/>. </t>
<t> From a compliant decoder perspective, skippable
<t> From a compliant decoder perspective, skippable
frames simply need to be skipped, and their frames simply need to be skipped, and their
content ignored, resuming decoding after the content ignored, resuming decoding after the
skippable frame. </t> skippable frame. </t>
<t> It should be noted that a skippable frame can be
<t> It should be noted that a skippable frame can be
used to watermark a stream of concatenated frames used to watermark a stream of concatenated frames
embedding any kind of tracking information (even embedding any kind of tracking information (even
just a Universally Unique Identifier (UUID)). Users w ary of such possibility just a Universally Unique Identifier (UUID)). Users w ary of such possibility
should scan the stream of concatenated frames in should scan the stream of concatenated frames in
an attempt to detect such frames for analysis or an attempt to detect such frames for analysis or
removal. </t> removal. </t>
<t> The fields are:
</t>
<dl newline="false" spacing="normal">
<dt>Magic_Number:</dt>
<t> The fields are: <dd>4 bytes, little-endian format. Value:
<list style="hanging">
<t hangText="Magic_Number:">
4 bytes, little-endian format. Value:
0x184D2A5?, which means any value from 0x184D2A5?, which means any value from
0x184D2A50 to 0x184D2A5F. All 16 0x184D2A50 to 0x184D2A5F. All 16
values are valid to identify a values are valid to identify a
skippable frame. This specification skippable frame. This specification
does not detail any specific tagging does not detail any specific tagging
methods for skippable frames. methods for skippable frames.
</t> </dd>
<dt>Frame_Size:</dt>
<t hangText="Frame_Size:"> <dd>
This is the size, in bytes, of the This is the size, in bytes, of the
following User_Data (without including following User_Data (without including
the magic number nor the size field the magic number nor the size field
itself). This field is represented itself). This field is represented
using 4 bytes, little-endian format, using 4 bytes, little-endian format,
unsigned 32 bits. This means User_Data unsigned 32 bits. This means User_Data
can't be bigger than (2^32-1) bytes. can't be bigger than
</t> (2<sup>32</sup> -1) bytes.
</dd>
<t hangText="User_Data:"> <dt>User_Data:</dt>
<dd>
This field can be anything. Data will This field can be anything. Data will
just be skipped by the decoder. just be skipped by the decoder.
</t> </dd>
</list> </t> </dl>
</section> </section>
</section> </section>
</section> </section>
<section anchor="comp_entropy" numbered="true" toc="default">
<section anchor="comp_entropy" title="Entropy Encoding"> <name>Entropy Encoding</name>
<t> Two types of entropy encoding are used by the <t> Two types of entropy encoding are used by the
Zstandard format: FSE and Huffman coding. Zstandard format: FSE and Huffman coding.
Huffman is used to compress literals, while FSE Huffman is used to compress literals, while FSE
is used for all other symbols is used for all other symbols
(Literals_Length_Code, Match_Length_Code, and offset (Literals_Length_Code, Match_Length_Code, and offset
codes) and to compress Huffman headers.</t> codes) and to compress Huffman headers.</t>
<section anchor="comp_fse" numbered="true" toc="default">
<section anchor="comp_fse" title="FSE"> <name>FSE</name>
<t> FSE, short for Finite State Entropy, is <t> FSE, short for Finite State Entropy, is
an entropy codec based on an entropy codec based on
<xref target="ANS"/>. <xref target="ANS" format="default"/>.
FSE encoding/decoding involves FSE encoding/decoding involves
a state that is carried over between a state that is carried over between
symbols, so decoding must be done in the symbols, so decoding must be done in the
opposite direction as encoding. Therefore, opposite direction as encoding. Therefore,
all FSE bitstreams are read from end to all FSE bitstreams are read from end to
beginning. Note that the order of the beginning. Note that the order of the
bits in the stream is not reversed; bits in the stream is not reversed;
they are simply read in the reverse they are simply read in the reverse
order from which they were written. </t> order from which they were written. </t>
<t> For additional details on FSE, see
"FiniteStateEntropy" <xref target="FSE" format="default"/>. </t>
<t> For additional details on FSE, see <t> FSE decoding involves a decoding table
Finite State Entropy that has a power-of-2 size and contains
<xref target="FSE"/>. </t>
<t> FSE decoding involves a decoding table
that has a power of 2 size and contains
three elements: Symbol, Num_Bits, and three elements: Symbol, Num_Bits, and
Baseline. The base 2 logarithm Baseline. The base 2 logarithm
of the table size is its Accuracy_Log. of the table size is its Accuracy_Log.
An FSE state value represents an index in An FSE state value represents an index in
this table. </t> this table. </t>
<t> To obtain the initial state value,
<t> To obtain the initial state value,
consume Accuracy_Log bits from the stream consume Accuracy_Log bits from the stream
as a little-endian value. The next symbol as a little-endian value. The next symbol
in the stream is the Symbol indicated in in the stream is the Symbol indicated in
the table for that state. To obtain the the table for that state. To obtain the
next state value, the decoder should next state value, the decoder should
consume Num_Bits bits from the stream as a consume Num_Bits bits from the stream as a
little-endian value and add it to little-endian value and add it to
Baseline. </t> Baseline. </t>
<section anchor="comp_fse_table" numbered="true" toc="default">
<section anchor="comp_fse_table" <name>FSE Table Description</name>
title="FSE Table Description"> <t> To decode FSE streams, it is necessary
<t> To decode FSE streams, it is necessary
to construct the decoding table. The to construct the decoding table. The
Zstandard format encodes FSE table Zstandard format encodes FSE table
descriptions as described here. </t> descriptions as described here. </t>
<t> An FSE distribution table describes the
<t> An FSE distribution table describes the
probabilities of all symbols from 0 to probabilities of all symbols from 0 to
the last present one (included) on a the last present one (included) on a
normalized scale of normalized scale of
(1&nbsp;&lt;&lt; Accuracy_Log). (1&nbsp;&lt;&lt; Accuracy_Log).
Note that there must be two or
more symbols with non-zero probability.
</t>
<t> A bitstream is read forward, in Note that there must be two or
more symbols with nonzero probability.
</t>
<t> A bitstream is read forward, in
little-endian fashion. It is not little-endian fashion. It is not
necessary to know its exact size, necessary to know its exact size,
since the size will be discovered and since the size will be discovered and
reported by the decoding process. The reported by the decoding process. The
bitstream starts by reporting on which bitstream starts by reporting on which
scale it operates. If low4bits scale it operates. If low4bits
designates the lowest 4 bits of designates the lowest 4 bits of
the first byte, then the first byte, then
Accuracy_Log = low4bits + 5. </t> Accuracy_Log = low4bits + 5. </t>
<t> This is followed by each symbol value,
<t> This is followed by each symbol value,
from 0 to the last present one. The from 0 to the last present one. The
number of bits used by each field is number of bits used by each field is
variable and depends on: variable and depends on:
<list style="hanging"> </t>
<t hangText="Remaining probabilities <dl newline="false" spacing="normal">
+ 1:"> <dt>Remaining probabilities + 1:</dt>
<dd>
For example, presuming an For example, presuming an
Accuracy_Log of 8, and Accuracy_Log of 8, and
presuming 100 probabilities presuming 100 probabilities
points have already been points have already been
distributed, the decoder may distributed, the decoder may
read any value from 0 to read any value from 0 to
(256 - 100 + 1) == 157, (256 - 100 + 1) == 157,
inclusive. Therefore, it must inclusive. Therefore, it must
read log2sup(157) == 8 read log2sup(157) == 8
bits. </t> bits. </dd>
<dt>Value decoded:</dt>
<t hangText="Value decoded:"> <dd>
<t>
Small values use 1 fewer bit. Small values use 1 fewer bit.
For example, presuming values For example, presuming values
from 0 to 157 (inclusive) are from 0 to 157, inclusive, are
possible, 255 - 157 = 98 values possible, 255 - 157 = 98 values
are remaining in an 8-bit are remaining in an 8-bit
field. The first 98 values field. The first 98 values
(hence from 0 to 97) use only (hence, from 0 to 97) use only
7 bits, and values from 98 to 7 bits, and values from 98 to
157 use 8 bits. This is 157 use 8 bits. This is
achieved through this scheme: achieved through the scheme in
<figure><artwork> <xref target="value" />:
+------------+---------------+-----------+ </t>
| Value Read | Value Decoded | Bits Used |
+------------+---------------+-----------+
| 0 - 97 | 0 - 97 | 7 |
+------------+---------------+-----------+
| 98 - 127 | 98 - 127 | 8 |
+------------+---------------+-----------+
| 128 - 225 | 0 - 97 | 7 |
+------------+---------------+-----------+
| 226 - 255 | 128 - 157 | 8 |
+------------+---------------+-----------+
</artwork></figure> </t>
</list></t>
<t> Symbol probabilities are read <table anchor="value">
<name>Values Decoded</name>
<thead>
<tr>
<th align="center">Value Read</th>
<th align="center">Value Decoded</th>
<th align="center">Bits Used</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">0 - 97</td>
<td align="center">0 - 97</td>
<td align="center">7</td>
</tr>
<tr>
<td align="center">98 - 127</td>
<td align="center">98 - 127</td>
<td align="center">8</td>
</tr>
<tr>
<td align="center">128 - 225</td>
<td align="center">0 - 97</td>
<td align="center">7</td>
</tr>
<tr>
<td align="center">226 - 255</td>
<td align="center">128 - 157</td>
<td align="center">8</td>
</tr>
</tbody>
</table>
</dd>
</dl>
<t> Symbol probabilities are read
one by one, in order. The one by one, in order. The
probability is obtained from probability is obtained from
Value decoded using the Value Decoded using the
formula P = Value - 1. This formula P = Value - 1. This
means the value 0 becomes the means the value 0 becomes the
negative probability -1. This negative probability -1. This
is a special probability that is a special probability that
means "less than 1". Its means "less than 1". Its
effect on the distribution effect on the distribution
table is described below. For table is described below. For
the purpose of calculating the purpose of calculating
total allocated probability total allocated probability
points, it counts as 1. </t> points, it counts as 1. </t>
<t> When a symbol has a
<t> When a symbol has a
probability of zero, it is probability of zero, it is
followed by a 2-bit repeat followed by a 2-bit repeat
flag. This repeat flag tells flag. This repeat flag tells
how many probabilities of how many probabilities of
zeroes follow the current one. zeroes follow the current one.
It provides a number ranging It provides a number ranging
from 0 to 3. If it is a 3, from 0 to 3. If it is a 3,
another 2-bit repeat flag another 2-bit repeat flag
follows, and so on. </t> follows, and so on. </t>
<t> When the last symbol reaches
<t> When the last symbol reaches
a cumulated total of a cumulated total of
(1&nbsp;&lt;&lt;&nbsp;Accuracy_Lo g), (1&nbsp;&lt;&lt;&nbsp;Accuracy_Lo g),
decoding is complete. If the decoding is complete. If the
last symbol makes the cumulated last symbol makes the cumulated
total go above total go above
(1 &lt;&lt; Accuracy_Log), (1 &lt;&lt; Accuracy_Log),
distribution is considered distribution is considered
corrupted. </t> corrupted. </t>
<t> Finally, the decoder can tell
<t> Finally, the decoder can tell
how many bytes were used in how many bytes were used in
this process and how many this process and how many
symbols are present. The symbols are present. The
bitstream consumes a round bitstream consumes a round
number of bytes. Any remaining number of bytes. Any remaining
bit within the last byte is bit within the last byte is
simply unused. </t> simply unused. </t>
<t> The context in which the table
<t> The context in which the table
is to be used specifies an expected is to be used specifies an expected
number of symbols. That expected number of symbols. That expected
number of symbols never exceeds 256. number of symbols never exceeds 256.
If the number of symbols decoded If the number of symbols decoded
is not equal to the expected, the is not equal to the expected, the
header should be considered header should be considered
corrupt. </t> corrupt. </t>
<t> The distribution of normalized
<t> The distribution of normalized
probabilities is enough to probabilities is enough to
create a unique decoding create a unique decoding
table. The table has a size table. The table has a size
of (1 &lt;&lt; Accuracy_Log). of (1 &lt;&lt; Accuracy_Log).
Each cell describes the symbol Each cell describes the symbol
decoded and instructions to decoded and instructions to
get the next state. </t> get the next state. </t>
<t> Symbols are scanned in their
<t> Symbols are scanned in their
natural order for "less than 1" natural order for "less than 1"
probabilities as described probabilities as described
above. Symbols with this above. Symbols with this
probability are being probability are being
attributed a single cell, attributed a single cell,
starting from the end of the starting from the end of the
table and retreating. These table and retreating. These
symbols define a symbols define a
full state reset, reading full state reset, reading
Accuracy_Log bits. </t> Accuracy_Log bits. </t>
<t> All remaining symbols are
<t> All remaining symbols are
allocated in their natural allocated in their natural
order. Starting from symbol 0 order. Starting from symbol 0
and table position 0, each and table position 0, each
symbol gets allocated as many symbol gets allocated as many
cells as its probability. Cell cells as its probability. Cell
allocation is spread, not allocation is spread, not
linear; each successor linear; each successor
position follows this rule: position follows this rule:
<figure><artwork> </t>
<artwork name="" type="" align="left" alt=""><![CDATA[
position += (tableSize >> 1) + (tableSize >> 3) + 3; position += (tableSize >> 1) + (tableSize >> 3) + 3;
position &amp;= tableSize - 1; position &= tableSize - 1;
</artwork></figure></t> ]]></artwork>
<t> A position is skipped if it is
<t> A position is skipped if it is
already occupied by a "less already occupied by a "less
than 1" probability symbol. than 1" probability symbol.
Position does not reset between Position does not reset between
symbols; it simply iterates symbols; it simply iterates
through each position in the through each position in the
table, switching to the next table, switching to the next
symbol when enough states have symbol when enough states have
been allocated to the current been allocated to the current
one. </t> one. </t>
<t> The result is a list of state
<t> The result is a list of state
values. Each state will decode values. Each state will decode
the current symbol. </t> the current symbol. </t>
<t> To get the Number_of_Bits and
<t> To get the Number_of_Bits and
Baseline required for the next Baseline required for the next
state, it is first necessary state, it is first necessary
to sort all states in their to sort all states in their
natural order. The lower natural order. The lower
states will need 1 more bit states will need 1 more bit
than higher ones. The process than higher ones. The process
is repeated for each symbol. is repeated for each symbol.
</t> </t>
<t> For example, presuming a symbol
<t> For example, presuming a symbol
has a probability of 5, it has a probability of 5, it
receives five state values. receives five state values.
States are sorted in natural States are sorted in natural
order. The next power of order. The next power of
2 is 8. The space of 2 is 8. The space of
probabilities is divided into probabilities is divided into
8 equal parts. Presuming the 8 equal parts. Presuming the
Accuracy_Log is 7, this Accuracy_Log is 7, this
defines 128 states, and each defines 128 states, and each
share (divided by 8) is 16 share (divided by 8) is 16
in size. In order to reach in size. In order to reach
8, 8 - 5 = 3 lowest states will 8, 8 - 5 = 3 lowest states will
count "double", doubling the count "double", doubling the
number of shares (32 in width), number of shares (32 in width),
requiring 1 more bit in the requiring 1 more bit in the
process. </t> process. </t>
<t> Baseline is assigned starting
<t> Baseline is assigned starting
from the higher states using from the higher states using
fewer bits, and proceeding fewer bits, and proceeding
naturally, then resuming at naturally, then resuming at
the first state, each taking the first state, each taking
its allocated width from its allocated width from
Baseline. </t> Baseline. </t>
<t> <figure><artwork> <table anchor="state">
+----------------+-------+-------+--------+------+-------+ <name>Baseline Assignments</name>
| state order | 0 | 1 | 2 | 3 | 4 | <tbody>
+----------------+-------+-------+--------+------+-------+ <tr>
| width | 32 | 32 | 32 | 16 | 16 | <td align="center">state order</td>
+----------------+-------+-------+--------+------+-------+ <td align="center">0</td>
| Number_of_Bits | 5 | 5 | 5 | 4 | 4 | <td align="center">1</td>
+----------------+-------+-------+--------+------+-------+ <td align="center">2</td>
| range number | 2 | 4 | 6 | 0 | 1 | <td align="center">3</td>
+----------------+-------+-------+--------+------+-------+ <td align="center">4</td>
| Baseline | 32 | 64 | 96 | 0 | 16 | </tr>
+----------------+-------+-------+--------+------+-------+ <tr>
| range | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 | <td align="center">width</td>
+----------------+-------+-------+--------+------+-------+ <td align="center">32</td>
</artwork></figure> </t> <td align="center">32</td>
<td align="center">32</td>
<td align="center">16</td>
<td align="center">16</td>
</tr>
<tr>
<td align="center">Number_of_Bits</td>
<td align="center">5</td>
<td align="center">5</td>
<td align="center">5</td>
<td align="center">4</td>
<td align="center">4</td>
</tr>
<tr>
<td align="center">range number</td>
<td align="center">2</td>
<td align="center">4</td>
<td align="center">6</td>
<td align="center">0</td>
<td align="center">1</td>
</tr>
<tr>
<td align="center">Baseline</td>
<td align="center">32</td>
<td align="center">64</td>
<td align="center">96</td>
<td align="center">0</td>
<td align="center">16</td>
</tr>
<tr>
<td align="center">range</td>
<td align="center">32-63</td>
<td align="center">64-95</td>
<td align="center">96-127</td>
<td align="center">0-15</td>
<td align="center">16-31</td>
</tr>
</tbody>
</table>
<t> The next state is determined <t> The next state is determined
from the current state by from the current state by
reading the required reading the required
Number_of_Bits and adding the Number_of_Bits and adding the
specified Baseline. </t> specified Baseline. </t>
<t> See <xref target="app_tables" format="default"/>
<t> See <xref target="app_tables"/>
for the results of this for the results of this
process that are applied to the process that are applied to the
default distributions. </t> default distributions. </t>
</section> </section>
</section> </section>
<section anchor="comp_huffman" numbered="true" toc="default">
<section anchor="comp_huffman" title="Huffman Coding"> <name>Huffman Coding</name>
<t> Zstandard Huffman-coded streams are read <t> Zstandard Huffman-coded streams are read
backwards, similar to the FSE bitstreams. backwards, similar to the FSE bitstreams.
Therefore, to find the start of the Therefore, to find the start of the
bitstream, it is necessary to know the bitstream, it is necessary to know the
offset of the last byte of the offset of the last byte of the
Huffman-coded stream. </t> Huffman-coded stream. </t>
<t> After writing the last bit containing
<t> After writing the last bit containing
information, the compressor writes a information, the compressor writes a
single 1 bit and then fills the rest of single 1 bit and then fills the rest of
the byte with 0 bits. The last byte of the byte with 0 bits. The last byte of
the compressed bitstream cannot be 0 for the compressed bitstream cannot be 0 for
that reason. </t> that reason. </t>
<t> When decompressing, the last byte
<t> When decompressing, the last byte
containing the padding is the first byte containing the padding is the first byte
to read. The decompressor needs to skip to read. The decompressor needs to skip
the up to 7 bits of 0-padding as well as the up to 7 bits of 0-padding as well as
the first 1 bit that occurs. Afterwards, the first 1 bit that occurs. Afterwards,
the useful part of the bitstream the useful part of the bitstream
begins. </t> begins. </t>
<t> The bitstream contains Huffman-coded
<t> The bitstream contains Huffman-coded
symbols in little-endian order, with the symbols in little-endian order, with the
codes defined by the method below. </t> codes defined by the method below. </t>
<section anchor="huffman_tree_desc" numbered="true" toc="default">
<section anchor="huffman_tree_desc" <name>Huffman Tree Description</name>
title="Huffman Tree Description"> <t> Prefix coding represents symbols
<t> Prefix coding represents symbols
from an a priori known alphabet by from an a priori known alphabet by
bit sequences (codewords), one bit sequences (codewords), one
codeword for each symbol, in a codeword for each symbol, in a
manner such that different symbols manner such that different symbols
may be represented by bit sequences may be represented by bit sequences
of different lengths, but a parser of different lengths, but a parser
can always parse an encoded string can always parse an encoded string
unambiguously unambiguously,
symbol by symbol. </t> symbol by symbol. </t>
<t> Given an alphabet with known symbol
<t> Given an alphabet with known symbol
frequencies, the Huffman algorithm frequencies, the Huffman algorithm
allows the construction of an allows the construction of an
optimal prefix code using the optimal prefix code using the
fewest bits of any possible prefix fewest bits of any possible prefix
codes for that alphabet. </t> codes for that alphabet. </t>
<t> The prefix code must not exceed a
<t> The prefix code must not exceed a
maximum code length. More bits maximum code length. More bits
improve accuracy but yield a larger improve accuracy but yield a larger
header size and require more header size and require more
memory or more complex decoding memory or more complex decoding
operations. This specification operations. This specification
limits the maximum code length to limits the maximum code length to
11 bits. </t> 11 bits. </t>
<t> All literal values from zero
<t> All literal values from zero
(included) to the last present one (included) to the last present one
(excluded) are represented by (excluded) are represented by
Weight with values from 0 to Weight with values from 0 to
Max_Number_of_Bits. Transformation Max_Number_of_Bits. Transformation
from Weight to Number_of_Bits from Weight to Number_of_Bits
follows this pseudocode: follows this pseudocode:
<figure><artwork> </t>
<sourcecode type="pseudocode">
if Weight == 0 if Weight == 0
Number_of_Bits = 0 Number_of_Bits = 0
else else
Number_of_Bits = Max_Number_of_Bits + 1 - Weight Number_of_Bits = Max_Number_of_Bits + 1 - Weight
</artwork></figure> </t> </sourcecode>
<t> The last symbol's Weight is
<t> The last symbol's Weight is
deduced from previously decoded deduced from previously decoded
ones, by completing to the nearest ones, by completing to the nearest
power of 2. This power of 2 gives power of 2. This power of 2 gives
Max_Number_of_Bits the depth of Max_Number_of_Bits the depth of
the current tree. </t> the current tree. </t>
<t> For example, presume the following
<t> For example, presume the following
Huffman tree must be described: Huffman tree must be described:
<figure><artwork> </t>
+---------------+----------------+
| Literal Value | Number_of_Bits |
+---------------+----------------+
| 0 | 1 |
+---------------+----------------+
| 1 | 2 |
+---------------+----------------+
| 2 | 3 |
+---------------+----------------+
| 3 | 0 |
+---------------+----------------+
| 4 | 4 |
+---------------+----------------+
| 5 | 4 |
+---------------+----------------+
</artwork></figure> </t>
<t> The tree depth is 4, since its <table anchor="Huffman">
<name>Huffman Tree</name>
<thead>
<tr>
<th align="center">Literal Value</th>
<th align="center">Number_of_Bits</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">0</td>
<td align="center">1</td>
</tr>
<tr>
<td align="center">1</td>
<td align="center">2</td>
</tr>
<tr>
<td align="center">2</td>
<td align="center">3</td>
</tr>
<tr>
<td align="center">3</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">4</td>
<td align="center">4</td>
</tr>
<tr>
<td align="center">5</td>
<td align="center">4</td>
</tr>
</tbody>
</table>
<t> The tree depth is 4, since its
longest element uses 4 bits. longest element uses 4 bits.
(The longest elements are those (The longest elements are those
with the smallest frequencies.) with the smallest frequencies.)
Value 5 will not be listed as it Value 5 will not be listed as it
can be determined from the values can be determined from the values
for 0-4, nor will values above 5 for 0-4, nor will values above 5
as they are all 0. Values from 0 as they are all 0. Values from 0
to 4 will be listed using Weight to 4 will be listed using Weight
instead of Number_of_Bits. The instead of Number_of_Bits. The
pseudocode to determine Weight is: pseudocode to determine Weight is:
<figure><artwork> </t>
<sourcecode type="pseudocode">
if Number_of_Bits == 0 if Number_of_Bits == 0
Weight = 0 Weight = 0
else else
Weight = Max_Number_of_Bits + 1 - Number_of_Bits Weight = Max_Number_of_Bits + 1 - Number_of_Bits
</artwork></figure> </t> </sourcecode>
<t> It gives the following series of
<t> It gives the following series of
weights: weights:
<figure><artwork> </t>
+---------------+--------+
| Literal Value | Weight |
+---------------+--------+
| 0 | 4 |
+---------------+--------+
| 1 | 3 |
+---------------+--------+
| 2 | 2 |
+---------------+--------+
| 3 | 0 |
+---------------+--------+
| 4 | 1 |
+---------------+--------+
</artwork></figure> </t>
<t> The decoder will do the inverse <table anchor="weights">
<name>Weights</name>
<thead>
<tr>
<th align="center">Literal Value</th>
<th align="center">Weight</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">0</td>
<td align="center">4</td>
</tr>
<tr>
<td align="center">1</td>
<td align="center">3</td>
</tr>
<tr>
<td align="center">2</td>
<td align="center">2</td>
</tr>
<tr>
<td align="center">3</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">4</td>
<td align="center">1</td>
</tr>
</tbody>
</table>
<t> The decoder will do the inverse
operation: having collected weights operation: having collected weights
of literals from 0 to 4, it knows of literals from 0 to 4, it knows
the last literal, 5, is present the last literal, 5, is present
with a non-zero Weight. The Weight with a nonzero Weight. The Weight
of 5 can be determined by advancing of 5 can be determined by advancing
to the next power of 2. The sum of to the next power of 2. The sum of
2^(Weight-1) (excluding 0's) is 15. 2<sup>(Weight-1)</sup> (excluding 0's ) is 15.
The nearest power of 2 is 16. The nearest power of 2 is 16.
Therefore, Max_Number_of_Bits = 4 Therefore, Max_Number_of_Bits = 4
and Weight[5] = 16 - 15 = 1. </t> and Weight[5] = 16 - 15 = 1. </t>
<section anchor="huffman_tree_header" numbered="true" toc="default">
<section anchor="huffman_tree_header" <name>Huffman Tree Header</name>
title="Huffman Tree Header"> <t> This is a single byte value
<t> This is a single byte value
(0-255), which describes how (0-255), which describes how
the series of weights is the series of weights is
encoded. encoded.
<list style="hanging"> </t>
<t hangText="headerByte &lt; <dl newline="false" spacing="normal">
128:"> <dt>headerByte &lt; 128:</dt>
<dd>
The series of weights The series of weights
is compressed using is compressed using
FSE (see below). The FSE (see below). The
length of the length of the
FSE-compressed series FSE-compressed series
is equal to headerByte is equal to headerByte
(0-127). </t> (0-127). </dd>
<dt>headerByte &gt;= 128:</dt>
<t hangText="headerByte >= 12 <dd>
8:"> <t>
This is a direct This is a direct
representation, where representation, where
each Weight is written each Weight is written
directly as a 4-bit directly as a 4-bit
field (0-15). They are field (0-15). They are
encoded forward, 2 encoded forward, 2
weights to a byte with weights to a byte with
the first weight taking the first weight taking
the top 4 bits and the top 4 bits and
the second taking the the second taking the
bottom 4; for example, th e bottom 4; for example, th e
following operations following operations
could be used to read could be used to read
the weights: the weights:
<figure><artwork> </t>
<artwork name="" type="" align="left" alt=""><![CDATA[
Weight[0] = (Byte[0] >> 4) Weight[0] = (Byte[0] >> 4)
Weight[1] = (Byte[0] &amp; 0xf), Weight[1] = (Byte[0] & 0xf),
etc. etc.
</artwork></figure> ]]></artwork>
<t>
The full representation The full representation
occupies occupies
ceiling(Number_of_Symbols /2) ceiling(Number_of_Symbols /2)
bytes, meaning it uses bytes, meaning it uses
only full bytes even only full bytes even
if Number_of_Symbols is if Number_of_Symbols is
odd. Number_of_Symbols odd. Number_of_Symbols
= headerByte - 127. = headerByte - 127.
Note that maximum Note that maximum
Number_of_Symbols is Number_of_Symbols is
255 - 127 = 128. If any 255 - 127 = 128. If any
literal literal
has a value over 128, has a value over 128,
raw header mode is not raw header mode is not
possible, and it is possible, and it is
necessary to use FSE necessary to use FSE
compression. </t> compression. </t>
</list></t> </dd>
</section> </dl>
</section>
<section anchor="huffman_tree_fse" <section anchor="huffman_tree_fse" numbered="true" toc="default">
title="FSE Compression of Huffman <name>FSE Compression of Huffman Weights</name>
Weights"> <t> In this case, the series of
<t> In this case, the series of
Huffman weights is compressed Huffman weights is compressed
using FSE compression. It is a using FSE compression. It is a
single bitstream with two single bitstream with two
interleaved states, sharing a interleaved states, sharing a
single distribution table. </t> single distribution table. </t>
<t> To decode an FSE bitstream, it
<t> To decode an FSE bitstream, it
is necessary to know its is necessary to know its
compressed size. Compressed compressed size. Compressed
size is provided by headerByte. size is provided by headerByte.
It's also necessary to know its It's also necessary to know its
maximum possible decompressed maximum possible decompressed
size, which is 255, since size, which is 255, since
literal values span from 0 to literal values span from 0 to
255, and the last symbol's 255, and the last symbol's
Weight is not represented. </t> Weight is not represented. </t>
<t> An FSE bitstream starts by
<t> An FSE bitstream starts by
a header, describing a header, describing
probabilities distribution. It probabilities distribution. It
will create a decoding table. will create a decoding table.
For a list of Huffman weights, For a list of Huffman weights,
the maximum accuracy log is 6 the maximum accuracy log is 6
bits. For more details, see bits. For more details, see
<xref target="comp_fse_table"/>. <xref target="comp_fse_table" for
</t> mat="default"/>.
</t>
<t> The Huffman header compression <t> The Huffman header compression
uses two states, which share uses two states, which share
the same FSE distribution the same FSE distribution
table. table.
The first state (State1) The first state (State1)
encodes the even-numbered index encodes the even-numbered index
symbols, and the second symbols, and the second
(State2) encodes the odd-numbered (State2) encodes the odd-numbered
index symbols. State1 is initiali zed index symbols. State1 is initiali zed
first, and then State2, and first, and then State2, and
they take turns decoding a they take turns decoding a
single symbol and updating single symbol and updating
their state. their state.
For more details For more details
on these FSE operations, see on these FSE operations, see
<xref target="comp_fse"/>. </t> <xref target="comp_fse" format="d
efault"/>. </t>
<t> The number of symbols to be <t> The number of symbols to be
decoded is determined by decoded is determined by
tracking the bitStream overflow tracking the bitStream overflow
condition: If updating state condition: if updating state
after decoding a symbol would after decoding a symbol would
require more bits than remain require more bits than remain
in the stream, it is assumed in the stream, it is assumed
that extra bits are zero. Then, that extra bits are zero. Then,
symbols for each of the symbols for each of the
final states are decoded and final states are decoded and
the process is complete.</t> the process is complete.</t>
</section> </section>
<section anchor="huffman_tree_conv" numbered="true" toc="default">
<section anchor="huffman_tree_conv" <name>Conversion from Weights to Huffman Prefix Codes</name>
title="Conversion from Weights to <t> All present symbols will now
Huffman Prefix Codes">
<t> All present symbols will now
have a Weight value. It is have a Weight value. It is
possible to transform weights possible to transform weights
into Number_of_Bits, using into Number_of_Bits, using
this formula: this formula:
<figure><artwork> </t>
<sourcecode type="pseudocode">
if Weight > 0 if Weight > 0
Number_of_Bits = Max_Number_of_Bits + 1 - Weight Number_of_Bits = Max_Number_of_Bits + 1 - Weight
else else
Number_of_Bits = 0 Number_of_Bits = 0
</artwork></figure></t> </sourcecode>
<t> Symbols are sorted by Weight.
<t> Symbols are sorted by Weight.
Within the same Weight, symbols Within the same Weight, symbols
keep natural sequential keep natural sequential
order. Symbols order. Symbols
with a Weight of zero are with a Weight of zero are
removed. Then, starting from removed. Then, starting from
the the
lowest Weight, prefix codes lowest Weight, prefix codes
are distributed in sequential are distributed in sequential
order. </t> order. </t>
<t> For example, assume the
<t> For example, assume the
following list of weights following list of weights
has been decoded: has been decoded:
<figure><artwork> </t>
+---------+--------+
| Literal | Weight |
+---------+--------+
| 0 | 4 |
+---------+--------+
| 1 | 3 |
+---------+--------+
| 2 | 2 |
+---------+--------+
| 3 | 0 |
+---------+--------+
| 4 | 1 |
+---------+--------+
| 5 | 1 |
+---------+--------+
</artwork></figure> </t>
<t> Sorting by weight and then <table anchor="decoded-weights">
<name>Decoded Weights</name>
<thead>
<tr>
<th align="center">Literal</th>
<th align="center">Weight</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">0</td>
<td align="center">4</td>
</tr>
<tr>
<td align="center">1</td>
<td align="center">3</td>
</tr>
<tr>
<td align="center">2</td>
<td align="center">2</td>
</tr>
<tr>
<td align="center">3</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">4</td>
<td align="center">1</td>
</tr>
<tr>
<td align="center">5</td>
<td align="center">1</td>
</tr>
</tbody>
</table>
<t> Sorting by weight and then
the natural sequential order the natural sequential order
yields the following yields the following
distribution: distribution:
<figure><artwork> </t>
+---------+--------+----------------+--------------+
| Literal | Weight | Number_Of_Bits | Prefix Codes |
+---------+--------+----------------|--------------+
| 3 | 0 | 0 | N/A |
+---------+--------+----------------|--------------+
| 4 | 1 | 4 | 0000 |
+---------+--------+----------------|--------------+
| 5 | 1 | 4 | 0001 |
+---------+--------+----------------|--------------+
| 2 | 2 | 3 | 001 |
+---------+--------+----------------|--------------+
| 1 | 3 | 2 | 01 |
+---------+--------+----------------|--------------+
| 0 | 4 | 1 | 1 |
+---------+--------+----------------|--------------+
</artwork></figure> </t>
</section>
</section>
<section anchor="huffman_coded_streams" <table anchor="sorting-by-weight">
title="Huffman-Coded Streams"> <name>Sorting by Weight</name>
<t> Given a Huffman decoding table, it is <thead>
<tr>
<th align="center">Literal</th>
<th align="center">Weight</th>
<th align="center">Number_Of_Bits</th>
<th align="center">Prefix Codes</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">3</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">N/A</td>
</tr>
<tr>
<td align="center">4</td>
<td align="center">1</td>
<td align="center">4</td>
<td align="center">0000</td>
</tr>
<tr>
<td align="center">5</td>
<td align="center">1</td>
<td align="center">4</td>
<td align="center">0001</td>
</tr>
<tr>
<td align="center">2</td>
<td align="center">2</td>
<td align="center">3</td>
<td align="center">001</td>
</tr>
<tr>
<td align="center">1</td>
<td align="center">3</td>
<td align="center">2</td>
<td align="center">01</td>
</tr>
<tr>
<td align="center">0</td>
<td align="center">4</td>
<td align="center">1</td>
<td align="center">1</td>
</tr>
</tbody>
</table>
</section>
</section>
<section anchor="huffman_coded_streams" numbered="true" toc="default">
<name>Huffman-Coded Streams</name>
<t> Given a Huffman decoding table, it is
possible to decode a Huffman-coded possible to decode a Huffman-coded
stream. </t> stream. </t>
<t> Each bitstream must be read backward, starting from the end and go
<t> Each bitstream must be read backward, ing up to
which starts from the end and goes up to
the beginning. Therefore, it is the beginning. Therefore, it is
necessary to know the size of each necessary to know the size of each
bitstream. </t> bitstream. </t>
<t> It is also necessary to know exactly
<t> It is also necessary to know exactly
which bit is the last. This is which bit is the last. This is
detected by a final bit flag: the detected by a final bit flag: the
highest bit of the last byte is a highest bit of the last byte is a
final-bit-flag. Consequently, a last final-bit-flag. Consequently, a last
byte of 0 is not possible. And the byte of 0 is not possible. And the
final-bit-flag itself is not part of final-bit-flag itself is not part of
the useful bitstream. Hence, the last the useful bitstream. Hence, the last
byte contains between 0 and 7 useful byte contains between 0 and 7 useful
bits. </t> bits. </t>
<t> Starting from the end, it is possible
<t> Starting from the end, it is possible
to read the bitstream in a to read the bitstream in a
little-endian fashion, keeping track little-endian fashion, keeping track
of already used bits. Since the of already used bits. Since the
bitstream is encoded in reverse order, bitstream is encoded in reverse order,
starting from the end, read symbols in starting from the end, read symbols in
forward order. </t> forward order. </t>
<t> For example, if the literal sequence
<t> For example, if the literal sequence
"0145" was encoded using the above prefix "0145" was encoded using the above prefix
code, it would be encoded (in reverse code, it would be encoded (in reverse
order) as: order) as:
<figure><artwork> </t>
+---------+----------+
| Symbol | Encoding |
+---------+----------+
| 5 | 0000 |
+---------+----------+
| 4 | 0001 |
+---------+----------+
| 1 | 01 |
+---------+----------+
| 0 | 1 |
+---------+----------+
| Padding | 00001 |
+---------+----------+
</artwork></figure></t>
<t> This results in the following 2-byte <table anchor="coded-example">
<name>Literal Sequence "0145"</name>
<thead>
<tr>
<th align="center">Symbol</th>
<th align="center">Encoding</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">5</td>
<td align="center">0000</td>
</tr>
<tr>
<td align="center">4</td>
<td align="center">0001</td>
</tr>
<tr>
<td align="center">1</td>
<td align="center">01</td>
</tr>
<tr>
<td align="center">0</td>
<td align="center">1</td>
</tr>
<tr>
<td align="center">Padding</td>
<td align="center">00001</td>
</tr>
</tbody>
</table>
<t> This results in the following 2-byte
bitstream: bitstream:
<figure><artwork> </t>
<artwork name="" type="" align="left" alt=""><![CDATA[
00010000 00001101 00010000 00001101
</artwork></figure></t> ]]></artwork>
<t> Here is an alternative representation
<t> Here is an alternative representation
with the symbol codes separated by with the symbol codes separated by
underscores: underscores:
<figure><artwork> </t>
<artwork name="" type="" align="left" alt=""><![CDATA[
0001_0000 00001_1_01 0001_0000 00001_1_01
</artwork></figure></t> ]]></artwork>
<t> Reading the highest Max_Number_of_Bits
<t> Reading the highest Max_Number_of_Bits
bits, it's possible to compare the bits, it's possible to compare the
extracted value to the decoding table, extracted value to the decoding table,
determining the symbol to decode and determining the symbol to decode and
number of bits to discard. </t> number of bits to discard. </t>
<t> The process continues reading up to
<t> The process continues reading up to
the required number of symbols per the required number of symbols per
stream. If a bitstream is not entirely stream. If a bitstream is not entirely
and exactly consumed, hence reaching and exactly consumed, hence reaching
exactly its beginning position with exactly its beginning position with
all bits consumed, the decoding process all bits consumed, the decoding process
is considered faulty. </t> is considered faulty. </t>
</section> </section>
</section> </section>
</section> </section>
<section anchor="comp_dict" numbered="true" toc="default">
<section anchor="comp_dict" title="Dictionary Format"> <name>Dictionary Format</name>
<t> Zstandard is compatible with "raw content" <t> Zstandard is compatible with "raw content"
dictionaries, free of any format restriction, dictionaries, free of any format restriction,
except that they must be at least 8 bytes. except that they must be at least 8 bytes.
These dictionaries function as if they were just These dictionaries function as if they were just
the content part of a formatted dictionary. </t> the content part of a formatted dictionary. </t>
<t> However, dictionaries created by "zstd --train"
<t> However, dictionaries created by "zstd --train"
in the reference implementation follow a specific in the reference implementation follow a specific
format, described here. </t> format, described here. </t>
<t> Dictionaries are not included in the compressed
<t> Dictionaries are not included in the compressed
content but rather are provided out of band. content but rather are provided out of band.
That is, the Dictionary_ID identifies which should That is, the Dictionary_ID identifies which should
be used, but this specification does not describe be used, but this specification does not describe
the mechanism by which the dictionary is obtained the mechanism by which the dictionary is obtained
prior to use during compression or prior to use during compression or
decompression. </t> decompression. </t>
<t> A dictionary has a size, defined either by a
<t> A dictionary has a size, defined either by a
buffer limit or a file size. The general format buffer limit or a file size. The general format
is: is:
<figure><artwork> </t>
+--------------+---------------+----------------+---------+
| Magic_Number | Dictionary_ID | Entropy_Tables | Content |
+--------------+---------------+----------------+---------+
</artwork></figure> </t>
<t> <list style="hanging"> <table anchor="dictionary">
<t hangText="Magic_Number:"> 4 bytes ID, <name>Dictionary General Format</name>
value 0xEC30A437, little-endian <thead>
format. </t> <tr>
<th>Magic_Number</th>
<th>Dictionary_ID</th>
<th>Entropy_Tables</th>
<th>Content</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
<t hangText="Dictionary_ID:"> 4 bytes, stored <dl newline="false" spacing="normal">
<dt>Magic_Number:</dt>
<dd> 4 bytes ID,
value 0xEC30A437, little-endian
format. </dd>
<dt>Dictionary_ID:</dt>
<dd>
<t> 4 bytes, stored
in little-endian format. Dictionary_ID in little-endian format. Dictionary_ID
can be any value, except 0 (which can be any value, except 0 (which
means no Dictionary_ID). It is used by means no Dictionary_ID). It is used by
decoders to check if they use the decoders to check if they use the
correct dictionary. If the frame is correct dictionary. If the frame is
going to be distributed in a private going to be distributed in a private
environment, any Dictionary_ID can be environment, any Dictionary_ID can be
used. However, for public distribution used. However, for public distribution
of compressed frames, the following of compressed frames, the following
ranges are reserved and shall not be ranges are reserved and shall not be
used: used:
<list style="hanging">
<?rfc subcompact="yes"?>
<t>low range: &lt;= 32767</t>
<t>high range: >= (2^31)</t>
<?rfc subcompact="no"?>
</list>
</t> </t>
<dl newline="false" spacing="normal">
<t hangText="Entropy_Tables:"> Follow the <dt/>
<dd>low range: &lt;= 32767</dd>
<dt/>
<dd>high range: &gt;= (2<sup>31</sup>)</dd>
</dl>
</dd>
<dt>Entropy_Tables:</dt>
<dd> Follow the
same format as the tables in same format as the tables in
compressed blocks. See the relevant compressed blocks. See the relevant
FSE and Huffman sections for how to FSE and Huffman sections for how to
decode these tables. They are stored decode these tables. They are stored
in the following order: Huffman table for in the following order: Huffman table for
literals, FSE table for offsets, FSE literals, FSE table for offsets, FSE
table for match lengths, and FSE table table for match lengths, and FSE table
for literals lengths. These tables for literals lengths. These tables
populate the Repeat Stats literals populate the Repeat Stats literals
mode and Repeat distribution mode for mode and Repeat distribution mode for
sequence decoding. It is finally sequence decoding. It is finally
followed by 3 offset values, followed by 3 offset values,
populating repeat offsets (instead of populating repeat offsets (instead of
using {1,4,8}), stored in order, using {1,4,8}), stored in order,
4-bytes little-endian each, for a 4 bytes little-endian each, for a
total of 12 bytes. Each repeat offset total of 12 bytes. Each repeat offset
must have a value less than the must have a value less than the
dictionary size. </t> dictionary size. </dd>
<dt>Content:</dt>
<t hangText="Content:"> The rest of the <dd> The rest of the
dictionary is its content. The content dictionary is its content. The content
acts as a "past" in front of data to be acts as a "past" in front of data to be
compressed or decompressed, so it can be compressed or decompressed, so it can be
referenced in sequence commands. As referenced in sequence commands. As
long as the amount of data decoded long as the amount of data decoded
from this frame is less than or equal from this frame is less than or equal
to Window_Size, sequence commands may to Window_Size, sequence commands may
specify offsets longer than the total specify offsets longer than the total
length of decoded output so far to length of decoded output so far to
reference back to the dictionary, reference back to the dictionary,
even parts of the dictionary with even parts of the dictionary with
offsets larger than Window_Size. offsets larger than Window_Size.
After the total output has surpassed After the total output has surpassed
Window_Size, however, this is no longer Window_Size, however, this is no longer
allowed, and the dictionary is no allowed, and the dictionary is no
longer accessible. </t> longer accessible. </dd>
</list> </t> </dl>
</section> </section>
<section anchor="dict_future" numbered="true" toc="default">
<section anchor="dict_future" title="Use of Dictionaries"> <name>Use of Dictionaries</name>
<t> Provisioning for use of dictionaries with zstd is being <t> Provisioning for use of dictionaries with zstd is being
explored. See, for example, <xref target="DICT-SEC"/>. explored. See, for example, <xref target="I-D.handte-httpbis
-dict-sec" format="default"/>.
The likely outcome will be a registry of well-tested The likely outcome will be a registry of well-tested
dictionaries optimized for different use cases and dictionaries optimized for different use cases and
identifiers for each, possibly with a private negotiation identifiers for each, possibly with a private negotiation
mechanism for use of unregistered dictionaries. </t> mechanism for use of unregistered dictionaries. </t>
<t> To ensure compatibility with the
<t> To ensure compatibility with the
future specification of use of dictionaries with zstd future specification of use of dictionaries with zstd
payloads, especially with MIME, content encoded with the payloads, especially with MIME, content encoded with the
media type registered here should not use a dictionary. media type registered here should not use a dictionary.
The exception to this requirement might be a private The exception to this requirement might be a private
dictionary negotiaton, suggested above, which is not part dictionary negotiation, suggested above, which is not part
of this specification. </t> of this specification. </t>
</section> </section>
<section anchor="iana" numbered="true" toc="default">
<section anchor="iana" title="IANA Considerations"> <name>IANA Considerations</name>
<t> IANA has updated two previously existing registrations and <t> IANA has updated two previously existing registrations and
made one new registration as described below. </t> made one new registration as described below. </t>
<section anchor="iana_media_type" numbered="true" toc="default">
<section anchor="iana_media_type" <name>The 'application/zstd' Media Type</name>
title="The 'application/zstd' Media Type"> <t> The 'application/zstd' media type identifies a
<t> The 'application/zstd' media type identifies a
block of data that is compressed using zstd block of data that is compressed using zstd
compression. The data is a stream of bytes as compression. The data is a stream of bytes as
described in this document. IANA has described in this document. IANA has
added the following to the "Media Types" added the following to the "Media Types"
registry: registry:
<list style="hanging"> </t>
<t hangText="Type name:"> application </t> <dl>
<dt>Type name:</dt>
<t hangText="Subtype name:"> zstd </t> <dd>application</dd>
<dt>Subtype name:</dt>
<t hangText="Required parameters:"> N/A </t> <dd>zstd</dd>
<dt>Required parameters:</dt>
<t hangText="Optional parameters:"> N/A </t> <dd>N/A</dd>
<dt>Optional parameters:</dt>
<t hangText="Encoding considerations:"> <dd>N/A</dd>
<dt>Encoding considerations:</dt>
<dd>
binary binary
</t> </dd>
<dt>Security considerations:</dt>
<t hangText="Security considerations:"> <dd>
See <xref target="security"/> of See <xref target="security" format="default"/> of
[this document] RFC 8878.
</t> </dd>
<dt>Interoperability considerations:</dt>
<t hangText="Interoperability considerations:"> <dd>
N/A N/A
</t> </dd>
<dt>Published specification:</dt>
<t hangText="Published specification:"> <dd>
[this document] RFC 8878
</t> </dd>
<dt>Applications which use this media type:</dt>
<t hangText="Applications that use this media typ <dd>
e:">
anywhere data size is an issue anywhere data size is an issue
</t> </dd>
<dt>Fragment identifier considerations:</dt>
<t hangText="Fragment identifier considerations:" <dd>
>
No fragment identifiers are defined No fragment identifiers are defined
for this type. for this type.
</t> </dd>
<dt>Additional information:</dt>
<dd>
<t><br/></t>
<dl spacing="compact">
<dt>Deprecated alias names for this type:</dt>
<dd>
N/A
</dd>
<dt>Magic number(s):</dt>
<dd>
4 bytes, little-endian format. Value:&nbsp;0xFD2FB528
</dd>
<t hangText="Additional information:"> <dt>File extension(s):</dt>
<list style="hanging"> <dd>
<t hangText="Magic number(s):"> zst
4 bytes, little-endian fo </dd>
rmat. Value:&nbsp;0xFD2FB528 <dt>Macintosh file type code(s):</dt>
</t> <dd>
<!--note: changed 'zstd' to 'zst' per author request--> N/A
<t hangText="File extension(s):"> </dd>
zst </dl>
</t> </dd>
<t hangText="Macintosh file type
code(s):">
N/A
</t>
</list>
</t>
<t hangText="For further information:"> <dt>Person &amp; email address to contact for further
See <xref target="ZSTD"/> information:</dt><dd>Yann Collet &lt;cyan@fb.com&gt;</dd>
</t>
<t hangText="Intended usage:"> <dt>Intended usage:</dt>
<dd>
common common
</t> </dd>
<dt>Restrictions on usage:</dt>
<t hangText="Restrictions on usage:"> <dd>
N/A N/A
</t>