<!DOCTYPE rfc SYSTEM 'rfc2629.dtd'>
<?rfc toc="yes" symrefs="yes" ?>
-<rfc ipr="trust200902" category="std" docName="draft-ietf-codec-opus-13">
+<rfc ipr="trust200902" category="std" docName="draft-ietf-codec-opus-14">
<front>
<title abbrev="Interactive Audio Codec">Definition of the Opus Audio Codec</title>
</address>
</author>
-<date day="15" month="May" year="2012" />
+<date day="17" month="May" year="2012" />
<area>General</area>
When low-latency transmission is required over a relatively slow connection, then
constrained VBR can also be used. This uses VBR in a way that simulates a
-"bit reservoir" and is equivalent to what MP3 (MPEG 1, Layer 3) and
+"bit reservoir" and is equivalent to what MP3 (MPEG 1, Layer 3) and
AAC (Advanced Audio Coding) call CBR (i.e., not true
CBR due to the bit reservoir).
</t>
This section describes the possible combinations of these parameters and the
internal framing used to pack multiple frames into a single packet.
This framing is not self-delimiting.
-Instead, it assumes that a higher layer (such as UDP or RTP <xref target='RFC3550'/>
+Instead, it assumes that a higher layer (such as UDP or RTP <xref target='RFC3550'/>
or Ogg <xref target='RFC3533'/> or Matroska <xref target='Matroska-website'/>)
will communicate the length, in bytes, of the packet, and it uses this
information to reduce the framing overhead in the packet itself.
</t>
<section anchor="toc_byte" title="The TOC Byte">
-<t>
-An Opus packet begins with a single-byte table-of-contents (TOC) header that
- signals which of the various modes and configurations a given packet uses.
+<t anchor="R1">
+A well-formed Opus packet MUST contain at least one byte [R1].
+This byte forms a table-of-contents (TOC) header that signals which of the
+ various modes and configurations a given packet uses.
It is composed of a configuration number, "config", a stereo flag, "s", and a
frame count code, "c", arranged as illustrated in
<xref target="toc_byte_fig"/>.
A description of each of these fields follows.
</t>
-<figure anchor="toc_byte_fig" title="The TOC byte">
+<figure anchor="toc_byte_fig" title="The TOC Byte">
<artwork align="center"><![CDATA[
0
0 1 2 3 4 5 6 7
the value of "c".
</t>
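<t>
The field layout described above can be sketched in C as follows.
This is a minimal illustration, not text from the reference implementation:
the names toc_fields and parse_toc are hypothetical, and the sketch assumes
that "config" occupies the five most significant bits of the TOC byte,
followed by the one-bit "s" flag and the two-bit "c" code.
</t>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for the three TOC fields (illustration only). */
typedef struct {
    unsigned config; /* configuration number, 0..31 */
    unsigned s;      /* stereo flag, 0 or 1 */
    unsigned c;      /* frame count code, 0..3 */
} toc_fields;

/* Splits the first byte of a packet into config, s, and c.
 * Returns 0 on success, or -1 for an empty packet, which violates [R1]. */
static int parse_toc(const unsigned char *packet, size_t len, toc_fields *out)
{
    assert(out != NULL);
    if (packet == NULL || len < 1)
        return -1;
    out->config = packet[0] >> 3;     /* five most significant bits */
    out->s = (packet[0] >> 2) & 0x1;  /* next bit */
    out->c = packet[0] & 0x3;         /* two least significant bits */
    return 0;
}
```

<t>
Under these assumptions, a TOC byte of 0x7D splits into config 15, s 1, and
c 1, while a zero-length packet is rejected.
</t>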
-<t anchor="R1">
-A well-formed Opus packet MUST contain at least one byte with the TOC
- information [R1], though the frame(s) within a packet MAY be zero bytes
- long.
-</t>
</section>
<section title="Frame Packing">
The special length 0 indicates that no frame is available, either because it
was dropped during transmission by some intermediary or because the encoder
chose not to transmit it.
-A length of 0 is valid for any Opus frame in any mode.
+Any Opus frame in any mode MAY have a length of 0.
</t>
<t>
which is itself a rediscovery of the FIFO arithmetic code introduced by <xref target="coding-thesis"></xref>.
It is very similar to arithmetic encoding, except that encoding is done with
digits in any base instead of with bits,
-so it is faster when using larger bases (i.e., an octet). All of the
+so it is faster when using larger bases (i.e., a byte). All of the
calculations in the range coder must use bit-exact integer arithmetic.
</t>
<t>
<section anchor="range-decoder-init" title="Range Decoder Initialization">
<t>
-Let b0 be the first input octet (or zero if there are no octets in this Opus
+Let b0 be the first input byte (or zero if there are no bytes in this Opus
frame).
The decoder initializes rng to 128 and initializes val to
(127 - (b0>>1)), where (b0>>1) is the top 7 bits of the
- first input octet.
+ first input byte.
It saves the remaining bit, (b0&1), for use in the renormalization
procedure described in <xref target="range-decoder-renorm"/>, which the
decoder invokes immediately after initialization to read additional bits and
by ec_dec_normalize() (entdec.c), until rng > 2**23.
If rng is already greater than 2**23, the entire process is skipped.
First, it sets rng to (rng<<8).
-Then it reads the next octet of the Opus frame and forms an 8-bit value sym,
- using the left-over bit buffered from the previous octet as the high bit
- and the top 7 bits of the octet just read as the other 7 bits of sym.
-The remaining bit in the octet just read is buffered for use in the next
+Then it reads the next byte of the Opus frame and forms an 8-bit value sym,
+ using the left-over bit buffered from the previous byte as the high bit
+ and the top 7 bits of the byte just read as the other 7 bits of sym.
+The remaining bit in the byte just read is buffered for use in the next
iteration.
-If no more input octets remain, it uses zero bits instead.
+If no more input bytes remain, it uses zero bits instead.
See <xref target="range-decoder-init"/> for the initialization used to process
- the first octet.
+ the first byte.
Then, it sets
<figure align="center">
<artwork align="center"><![CDATA[
2: Coded parameters
3: Pulses, LSBs, and signs
4: Pitch lags, Long-Term Prediction (LTP) coefficients
-5: Linear Prediction Coefficients (LPC) and gains
+5: Linear Predictive Coding (LPC) coefficients and gains
6: Decoded signal (mono or mid-side stereo)
7: Unmixed signal (mono or left-right stereo)
8: Resampled signal
When switching from 20 ms to 10 ms, the 10 ms Opus frame can
contain an LBRR frame covering at most half the prior 20 ms Opus frame,
potentially leaving a hole that needs to be concealed from even a single
- packet loss.
+ packet loss (see <xref target="Packet Loss Concealment"/>).
When switching from mono to stereo, the LBRR frames in the first stereo Opus
frame MAY contain a non-trivial side channel.
</t>
The first VQ stage uses a 32-element codebook, coded with one of the PDFs in
<xref target="silk_nlsf_stage1_pdfs"/>, depending on the audio bandwidth and
the signal type of the current SILK frame.
-This yields a single index, I1, for the entire frame.
-This indexes an element in a coarse codebook, selects the PDFs for the
- second stage of the VQ, and selects the prediction weights used to remove
- intra-frame redundancy from the second stage.
+This yields a single index, I1, for the entire frame, which
+<list style="numbers">
+<t>Indexes an element in a coarse codebook,</t>
+<t>Selects the PDFs for the second stage of the VQ, and</t>
+<t>Selects the prediction weights used to remove intra-frame redundancy from
+ the second stage.</t>
+</list>
The actual codebook elements are listed in
<xref target="silk_nlsf_nbmb_codebook"/> and
<xref target="silk_nlsf_wb_codebook"/>, but they are not needed until the last
<xref target="silk_ltp_params"/> to produce an LPC residual.
The LTP filter requires LPC residual values from before the current subframe as
input.
-However, since the LPCs may have changed, it obtains this residual by
- "rewhitening" the corresponding output signal using the LPCs from the current
- subframe.
+However, since the LPC coefficients may have changed, it obtains this residual
+ by "rewhitening" the corresponding output signal using the LPC coefficients
+ from the current subframe.
Let out[i] for
(j - pitch_lags[s] - d_LPC - 2) <= i < j
be the fully reconstructed output signal from the last
<xref target='MDCT'/> with partially overlapping windows of 5 to 22.5 ms.
The main principle behind CELT is that the MDCT spectrum is divided into
bands that (roughly) follow the Bark scale, i.e., the scale of the ear's
-critical bands. The normal CELT layer uses 21 of those bands, though Opus
+critical bands <xref target="Zwicker61"/>. The normal CELT layer uses 21 of those bands, though Opus
Custom (see <xref target="opus-custom"/>) may use a different number of bands.
+In Hybrid mode, the first 17 bands (up to 8 kHz) are not coded.
A band can contain as little as one MDCT bin per channel, and as many as 176
bins per channel, as detailed in <xref target="celt_band_sizes"/>.
In each band, the gain (energy) is coded separately from
Often this control is only indirect, and must be exercised carefully to
achieve the desired rate constraints.
The CELT layer, however, can adapt over a very wide range of rates, and thus
- has a large number of codebooks sizes to choose from for each band.
+ has a large number of codebook sizes to choose from for each band.
Explicitly signaling the size of each of these codebooks would impose
considerable overhead, even though the allocation is relatively static from
frame to frame.
at very high rates.</t>
<t>
-The "static" bit allocation (in 1/8 bits) for a quality q, excluding the minimums, maximums,
-tilt and boosts, is equal to channels*N*alloc[band][q]<<LM>>2, where
+The "static" bit allocation (in 1/8 bits) for a quality q, excluding the minimums, maximums,
+tilt and boosts, is equal to channels*N*alloc[band][q]<<LM>>2, where
alloc[][] is given in <xref target="static_alloc"/> and LM=log2(frame_size/120). The allocation
is obtained by linearly interpolating between two values of q (in steps of 1/64) to find the
highest allocation that does not exceed the number of bits remaining.
may result in waste: bitstream capacity available at the end
of the frame which cannot be put to any use. The maximums
specified by the codec reflect the average maximum. In the reference
-implementation, the maximums in bit/sample are precomputed in a static table
+implementation, the maximums in bits/sample are precomputed in a static table
(see cache_caps50[] in static_modes_float.h) for each band,
-for each value of LM, and for both mono and stereo.
+for each value of LM, and for both mono and stereo.
Implementations are expected
to simply use the same table data, but the procedure for generating
size of the frame in 8th bits, 'total_boost' to zero, and 'tell' to the total number
of 8th bits decoded
so far. For each band from the coding start (0 normally, but 17 in Hybrid mode)
-to the coding end (which changes depending on the signaled bandwidth): set 'width'
-to the number of MDCT bins in this band for all channels. Take the larger of width
-and 64, then the minimum of that value and the width times eight and set 'quanta'
-to the result. This represents a boost step size of six bits subject to limits
-of 1/bit/sample and 1/8th bit/sample. Set 'boost' to zero and 'dynalloc_loop_logp'
+to the coding end (which changes depending on the signaled bandwidth), the boost quanta
+in units of 1/8 bit is calculated as quanta = min(8*N, max(48, N)).
+This represents a boost step size of six bits, subject to a lower limit of
+1/8th bit/sample and an upper limit of 1 bit/sample.
+Set 'boost' to zero and 'dynalloc_loop_logp'
to dynalloc_logp. While dynalloc_loop_logp (the current worst case symbol cost) in
8th bits plus tell is less than total_bits plus total_boost and boost is less than cap[] for this
band: Decode a bit from the bitstream with dynalloc_loop_logp as the cost
<t>
The range encoder maintains an internal state vector composed of the four-tuple
(val, rng, rem, ext) representing the low end of the current
- range, the size of the current range, a single buffered output octet, and a
- count of additional carry-propagating output octets.
-Both val and rng are 32-bit unsigned integer values, rem is an octet value or
+ range, the size of the current range, a single buffered output byte, and a
+ count of additional carry-propagating output bytes.
+Both val and rng are 32-bit unsigned integer values, rem is a byte value
less than 255 or the special value -1, and ext is an unsigned integer with at
least 11 bits.
This state vector is initialized at the start of each frame to the value
These are used to perform carry propagation in the renormalization loop below.
Each iteration of this loop produces 9 bits of output, consisting of 8 data
bits and a carry flag.
-The encoder cannot determine the final value of the output octets until it
+The encoder cannot determine the final value of the output bytes until it
propagates these carry flags.
Therefore the reference implementation buffers a single non-propagating output
- octet (i.e., one less than 255) in rem and keeps a count of additional
- propagating (i.e., 255) output octets in ext.
+ byte (i.e., one less than 255) in rem and keeps a count of additional
+ propagating (i.e., 255) output bytes in ext.
An implementation may choose to use any mathematically equivalent scheme to
perform carry propagation.
</t>
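<t>
One possible way to realize this buffering is sketched below.
It is a simplified illustration modeled loosely on ec_enc_carry_out()
(entenc.c), not a copy of it: the carry_buf type and the put_byte helper are
hypothetical, and the sketch assumes that each 9-bit value c handed to the
carry buffer is either exactly 255 (a propagating byte, which is only
counted) or is split into a carry flag b = c>>8 and a new buffered
byte c&255.
</t>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical carry buffer state (illustration only). */
typedef struct {
    int rem;            /* buffered byte, or -1 if nothing is buffered */
    unsigned ext;       /* number of pending propagating (255) bytes */
    unsigned char *buf; /* output storage */
    size_t len;         /* bytes written so far */
    size_t cap;         /* capacity of buf */
    int overflow;       /* set if the output did not fit in buf */
} carry_buf;

/* Appends one byte to the output, recording overflow instead of writing
 * past the end of the storage. */
static void put_byte(carry_buf *cb, unsigned value)
{
    if (cb->len < cb->cap)
        cb->buf[cb->len++] = (unsigned char)(value & 0xFF);
    else
        cb->overflow = 1;
}

/* Feeds one 9-bit value (8 data bits plus a carry flag) to the buffer. */
static void carry_out(carry_buf *cb, unsigned c)
{
    unsigned b;
    assert(c < 512);
    if (c == 255) {
        /* A carry could still ripple through this byte: just count it. */
        cb->ext++;
        return;
    }
    b = c >> 8; /* carry flag */
    if (cb->rem >= 0)
        put_byte(cb, (unsigned)cb->rem + b);
    while (cb->ext > 0) {
        /* Pending 255 bytes become 0 if the carry is set, else stay 255. */
        put_byte(cb, b ? 0u : 255u);
        cb->ext--;
    }
    cb->rem = (int)(c & 0xFF); /* buffer the new byte */
}
```

<t>
Starting from rem = -1 and ext = 0, feeding the values 100, 255, 261, 255, 7
to this sketch emits the bytes 101, 0, 5, 255 and leaves 7 buffered in rem.
</t>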
Then,
<list style="symbols">
<t>
-If the buffered octet rem contains a value other than -1, the encoder outputs
- the octet (rem + b).
-Otherwise, if rem is -1, no octet is output.
+If the buffered byte rem contains a value other than -1, the encoder outputs
+ the byte (rem + b).
+Otherwise, if rem is -1, no byte is output.
</t>
<t>
-If ext is non-zero, then the encoder outputs ext octets---all with a value of 0
+If ext is non-zero, then the encoder outputs ext bytes---all with a value of 0
if b is set, or 255 if b is unset---and sets ext to 0.
</t>
<t>
ec_enc_bits() (entenc.c).
Because the raw bits may continue into the last byte output by the range coder
if there is room in the low-order bits, the encoder must be prepared to merge
- these values into a single octet.
+ these values into a single byte.
The procedure in <xref target="encoder-finalizing"/> does this in a way that
ensures both the range coded data and the raw bits can be decoded
successfully.
end = (end<<8) & 0x7FFFFFFF .
]]></artwork>
</figure>
-Finally, if the buffered output octet, rem, is neither zero nor the special
+Finally, if the buffered output byte, rem, is neither zero nor the special
value -1, or the carry count, ext, is greater than zero, then 9 zero bits are
sent to the carry buffer to flush it to the output buffer.
When outputting the final byte from the range coder, if it would overlap any
The LTP coefficients are quantized using the method described in
<xref target='ltp_quantizer_overview_section'/>, and the quantized LTP
coefficients are used to compute the LTP residual signal.
- This LTP residual signal is the input to an LPC analysis where the LPCs are
+ This LTP residual signal is the input to an LPC analysis where the LPC coefficients are
estimated using Burg's method <xref target="Burg"/>, such that the residual energy is minimized.
- The estimated LPCs are converted to a Line Spectral Frequency (LSF) vector
+ The estimated LPC coefficients are converted to a Line Spectral Frequency (LSF) vector
and quantized as described in <xref target='lsf_quantizer_overview_section'/>.
After quantization, the quantized LSF vector is converted back to LPC
coefficients using the full procedure in <xref target="silk_nlsfs"/>.
</t>
<section title="Burg's Method">
<t>
-The main purpose of LPC coding in SILK is to reduce the bitrate by
+The main purpose of linear prediction in SILK is to reduce the bitrate by
minimizing the residual energy.
At least at high bitrates, perceptual aspects are handled
independently by the noise shaping filter.
<section title="Bit Allocation">
<t>The encoder must use exactly the same bit allocation process as used by the decoder
and described in <xref target="allocation"/>. The three mechanisms that can be used by the
-encoder to adjust the bitrate on a frame-by-frame basis are band boost, allocation trim,
+encoder to adjust the bitrate on a frame-by-frame basis are band boost, allocation trim,
and band skipping.
</t>
a decoder's output MUST also be
within the thresholds specified by the opus_compare.c tool (included
with the code) when compared to the reference implementation for each of the
- test vectors provided (see <xref target="test-vectors"></xref>) and for each output
+ test vectors provided (see <xref target="test-vectors"></xref>) and for each output
sampling rate and channel count supported. In addition, a compliant
decoder implementation MUST have the same final range decoder state as that of the
- reference decoder. It is therefore RECOMMENDED that the
+ reference decoder. It is therefore RECOMMENDED that the
decoder implement the same functional behavior as the reference.
-
+
A decoder implementation is not required to support all output sampling
- rates or all output channel counts.
+ rates or all output channel counts.
</t>
<section title="Testing">
<t>In addition to indicating whether the test vector comparison passes, the opus_compare tool
outputs an "Opus quality metric" that indicates how well the tested decoder matches the
reference implementation. A quality of 0 corresponds to the passing threshold, while
-a quality of 100 means that the output of the tested decoder is identical to the reference
-implementation. The passing threshold was calibrated in such a way that it corresponds to
+a quality of 100 is the highest possible value and means that the output of the tested decoder is identical to the reference
+implementation. The passing threshold (quality 0) was calibrated in such a way that it corresponds to
additive white noise with a 48 dB SNR (similar to what can be obtained on a cassette deck).
It is still possible for an implementation to sound very good with such a low quality measure
(e.g. if the deviation is due to inaudible phase distortion), but unless this is verified by
-listening tests, it is RECOMMENDED that implementations achive a quality above 90 for 48 kHz
-decoding. For other sampling rates, it is normal for the quality metric to be lower
+listening tests, it is RECOMMENDED that implementations achieve a quality above 90 for 48 kHz
+decoding. For other sampling rates, it is normal for the quality metric to be lower
(typically as low as 50 even for a good implementation) because of harmless mismatch with
the delay and phase of the internal sampling rate conversion.
</t>
needed (for either complexity or latency reasons). Because Opus Custom is
optional, streams encoded using Opus Custom cannot be expected to be decodable by all Opus
implementations. Also, because no in-band mechanism exists for specifying the sampling
-rate and frame size of Opus Custom streams, out-of-band signaling is required.
+rate and frame size of Opus Custom streams, out-of-band signaling is required.
In Opus Custom operation, only the CELT layer is available, using the opus_custom_* function
calls in opus_custom.h.
</t>
</section>
<section title="Copying Conditions">
-<t>The authors agree to grant third parties the irrevocable right to copy, use and distribute
-the work (excluding Code Components available under the simplified BSD license), with or
-without modification, in any medium, without royalty, provided that, unless separate
-permission is granted, redistributed modified works do not contain misleading author, version,
+<t>The authors agree to grant third parties the irrevocable right to copy, use and distribute
+the work (excluding Code Components available under the simplified BSD license), with or
+without modification, in any medium, without royalty, provided that, unless separate
+permission is granted, redistributed modified works do not contain misleading author, version,
name of work, or endorsement information.</t>
</section>
</front>
</reference>
-
+
<reference anchor="Opus-git" target="git://git.xiph.org/opus.git">
<front>
<title>Opus Git Repository</title>
<author initials="G." surname="Maxwell" fullname="Gregory Maxwell"><organization/></author>
</front>
<seriesInfo name="IEEE Trans. on Audio, Speech and Language Processing, Vol. 18, No. 1, pp. 58-67" value="2010" />
-</reference>
+</reference>
+
+
+<reference anchor="Zwicker61">
+<front>
+<title>Subdivision of the audible frequency range into critical bands</title>
+<author initials="E." surname="Zwicker" fullname="E. Zwicker"><organization/></author>
+<date month="February" year="1961" />
+</front>
+<seriesInfo name="The Journal of the Acoustical Society of America, Vol. 33, No. 2" value="p. 248" />
+</reference>
</references>
this standard is available in a
<xref target='Opus-git'>Git repository</xref>.
Releases and other resources are available at
- <xref target='Opus-website'/>. However, although that implementation is expected to
+ <xref target='Opus-website'/>. However, although that implementation is expected to
remain conformant with the standard, it is the code in this document that shall
- remain normative.
+ remain normative.
</t>
</section>