1 <?xml version="1.0" encoding="utf-8"?>
2 <!DOCTYPE rfc SYSTEM 'rfc2629.dtd'>
3 <?rfc toc="yes" symrefs="yes" ?>
5 <rfc ipr="trust200902" category="std" docName="draft-ietf-codec-opus-11">
8 <title abbrev="Interactive Audio Codec">Definition of the Opus Audio Codec</title>
11 <author initials="JM" surname="Valin" fullname="Jean-Marc Valin">
12 <organization>Mozilla Corporation</organization>
15 <street>650 Castro Street</street>
16 <city>Mountain View</city>
19 <country>USA</country>
21 <phone>+1 650 903-0800</phone>
22 <email>jmvalin@jmvalin.ca</email>
26 <author initials="K." surname="Vos" fullname="Koen Vos">
27 <organization>Skype Technologies S.A.</organization>
30 <street>Soder Malarstrand 43</street>
31 <city>Stockholm</city>
36 <phone>+46 73 085 7619</phone>
37 <email>koen.vos@skype.net</email>
41 <author initials="T." surname="Terriberry" fullname="Timothy B. Terriberry">
42 <organization>Mozilla Corporation</organization>
45 <street>650 Castro Street</street>
46 <city>Mountain View</city>
49 <country>USA</country>
51 <phone>+1 650 903-0800</phone>
52 <email>tterriberry@mozilla.com</email>
56 <date day="17" month="February" year="2012" />
60 <workgroup></workgroup>
64 This document defines the Opus interactive speech and audio codec.
65 Opus is designed to handle a wide range of interactive audio applications,
66 including Voice over IP, videoconferencing, in-game chat, and even live,
67 distributed music performances.
68 It scales from low bitrate narrowband speech at 6 kb/s to very high quality
69 stereo music at 510 kb/s.
70 Opus uses both linear prediction (LP) and the Modified Discrete Cosine
71 Transform (MDCT) to achieve good compression of both speech and music.
78 <section anchor="introduction" title="Introduction">
80 The Opus codec is a real-time interactive audio codec designed to meet the requirements
81 described in <xref target="requirements"></xref>.
It is composed of a linear prediction (LP)-based layer and a Modified Discrete
Cosine Transform (MDCT)-based layer.
85 The main idea behind using two layers is that in speech, linear prediction
86 techniques (such as CELP) code low frequencies more efficiently than transform
87 (e.g., MDCT) domain techniques, while the situation is reversed for music and
88 higher speech frequencies.
89 Thus a codec with both layers available can operate over a wider range than
either one alone and, by combining them, achieve better quality than either
one alone.
95 The primary normative part of this specification is provided by the source code
96 in <xref target="ref-implementation"></xref>.
97 Only the decoder portion of this software is normative, though a
98 significant amount of code is shared by both the encoder and decoder.
99 <xref target="conformance"/> provides a decoder conformance test.
100 The decoder contains a great deal of integer and fixed-point arithmetic which
101 must be performed exactly, including all rounding considerations, so any
102 useful specification requires domain-specific symbolic language to adequately
103 define these operations.
Additionally, any conflict between the symbolic representation and the included reference
106 implementation must be resolved. For the practical reasons of compatibility and
107 testability it would be advantageous to give the reference implementation
108 priority in any disagreement. The C language is also one of the most
widely understood human-readable symbolic representations for machine behavior.
111 For these reasons this RFC uses the reference implementation as the sole
112 symbolic representation of the codec.
115 <t>While the symbolic representation is unambiguous and complete it is not
116 always the easiest way to understand the codec's operation. For this reason
117 this document also describes significant parts of the codec in English and
118 takes the opportunity to explain the rationale behind many of the more
119 surprising elements of the design. These descriptions are intended to be
120 accurate and informative, but the limitations of common English sometimes
121 result in ambiguity, so it is expected that the reader will always read
122 them alongside the symbolic representation. Numerous references to the
123 implementation are provided for this purpose. The descriptions sometimes
124 differ from the reference in ordering or through mathematical simplification
125 wherever such deviation makes an explanation easier to understand.
126 For example, the right shift and left shift operations in the reference
127 implementation are often described using division and multiplication in the text.
128 In general, the text is focused on the "what" and "why" while the symbolic
129 representation most clearly provides the "how".
132 <section anchor="notation" title="Notation and Conventions">
134 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD",
135 "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be
136 interpreted as described in RFC 2119 <xref target="rfc2119"></xref>.
139 Even when using floating-point, various operations in the codec require
140 bit-exact fixed-point behavior.
141 The notation "Q<n>", where n is an integer, denotes the number of binary
digits to the right of the binary point in a fixed-point number.
143 For example, a signed Q14 value in a 16-bit word can represent values from
144 -2.0 to 1.99993896484375, inclusive.
145 This notation is for informational purposes only.
146 Arithmetic, when described, always operates on the underlying integer.
E.g., the text will explicitly indicate any shifts required after a
multiplication.
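<t>
As an illustration of the Q notation, the following minimal sketch treats a
Q14 value as a plain integer and applies the shift explicitly after a product.
The helper names (to_q14, from_q14, mul_q14) are hypothetical and are not taken
from the reference implementation.
</t>

```python
Q14_ONE = 2 ** 14  # the integer that represents 1.0 in Q14


def to_q14(value):
    """Convert a real number to its underlying Q14 integer (round to nearest)."""
    return int(round(value * Q14_ONE))


def from_q14(raw):
    """Interpret an integer as a Q14 value."""
    return raw / Q14_ONE


def mul_q14(a, b):
    """Multiply two Q14 integers.

    The raw product is a Q28 value, so it is shifted right by 14 bits
    to bring the result back to Q14.
    """
    return (a * b) >> 14


# Limits of a signed Q14 value held in a 16-bit word.
Q14_MIN = from_q14(-(2 ** 15))
Q14_MAX = from_q14(2 ** 15 - 1)
```

<t>
With these helpers, Q14_MIN and Q14_MAX evaluate to the two limits quoted
above for a signed 16-bit word.
</t>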
151 Expressions, where included in the text, follow C operator rules and
precedence, with the exception that the syntax "x**y" indicates x raised to
the power y.
154 The text also makes use of the following functions:
157 <section anchor="min" toc="exclude" title="min(x,y)">
The smaller of the two values x and y.
163 <section anchor="max" toc="exclude" title="max(x,y)">
The larger of the two values x and y.
169 <section anchor="clamp" toc="exclude" title="clamp(lo,x,hi)">
170 <figure align="center">
171 <artwork align="center"><![CDATA[
172 clamp(lo,x,hi) = max(lo,min(x,hi))
With this definition, if lo > hi, the lower bound is the one that
is enforced.
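<t>
The definition translates directly into code.
The short sketch below is only an illustration; it shows the normal case and
the degenerate case where lo is greater than hi.
</t>

```python
def clamp(lo, x, hi):
    """clamp(lo,x,hi) = max(lo,min(x,hi))"""
    return max(lo, min(x, hi))


in_range = clamp(0, 5, 10)    # x already lies between the bounds
too_low = clamp(0, -3, 10)    # raised to the lower bound
too_high = clamp(0, 42, 10)   # lowered to the upper bound
crossed = clamp(8, 5, 3)      # lo > hi: min() yields 3, then max() yields lo
```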
181 <section anchor="sign" toc="exclude" title="sign(x)">
184 <figure align="center">
185 <artwork align="center"><![CDATA[
          ( -1,  x < 0 ,
sign(x) = <  0,  x == 0 ,
          (  1,  x > 0 .
194 <section anchor="log2" toc="exclude" title="log2(f)">
196 The base-two logarithm of f.
200 <section anchor="ilog" toc="exclude" title="ilog(n)">
The minimum number of bits required to store a positive integer n in binary,
or 0 for a non-positive integer n.
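<t>
A minimal sketch of ilog(n), assuming Python integers: the built-in
int.bit_length() already returns floor(log2(n))+1 for positive n, so only the
non-positive case needs special handling.
The cross-check against the floating-point formula is included purely for
illustration.
</t>

```python
import math


def ilog(n):
    """Bits needed to write a positive integer n in binary, else 0."""
    return n.bit_length() if n > 0 else 0


# Cross-check against floor(log2(n))+1 for a range of positive values.
matches_formula = all(
    ilog(n) == math.floor(math.log2(n)) + 1 for n in range(1, 1025)
)
```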
204 <figure align="center">
205 <artwork align="center"><![CDATA[
          ( 0,                 n <= 0,
ilog(n) = <
          ( floor(log2(n))+1,  n > 0
212 <list style="symbols">
228 <section anchor="overview" title="Opus Codec Overview">
231 The Opus codec scales from 6 kb/s narrowband mono speech to 510 kb/s
232 fullband stereo music, with algorithmic delays ranging from 5 ms to
234 At any given time, either the LP layer, the MDCT layer, or both, may be active.
235 It can seamlessly switch between all of its various operating modes, giving it
236 a great deal of flexibility to adapt to varying content and network
237 conditions without renegotiating the current session.
The codec allows input and output of various audio bandwidths, defined as
follows:
241 <texttable anchor="audio-bandwidth">
242 <ttcol>Abbreviation</ttcol>
243 <ttcol align="right">Audio Bandwidth</ttcol>
244 <ttcol align="right">Sample Rate (Effective)</ttcol>
245 <c>NB (narrowband)</c> <c>4 kHz</c> <c>8 kHz</c>
246 <c>MB (medium-band)</c> <c>6 kHz</c> <c>12 kHz</c>
247 <c>WB (wideband)</c> <c>8 kHz</c> <c>16 kHz</c>
248 <c>SWB (super-wideband)</c> <c>12 kHz</c> <c>24 kHz</c>
249 <c>FB (fullband)</c> <c>20 kHz (*)</c> <c>48 kHz</c>
252 (*) Although the sampling theorem allows a bandwidth as large as half the
253 sampling rate, Opus never codes audio above 20 kHz, as that is the
254 generally accepted upper limit of human hearing.
258 Opus defines super-wideband (SWB) with an effective sample rate of 24 kHz,
259 unlike some other audio coding standards that use 32 kHz.
260 This was chosen for a number of reasons.
261 The band layout in the MDCT layer naturally allows skipping coefficients for
262 frequencies over 12 kHz, but does not allow cleanly dropping just those
263 frequencies over 16 kHz.
264 A sample rate of 24 kHz also makes resampling in the MDCT layer easier,
265 as 24 evenly divides 48, and when 24 kHz is sufficient, it can save
266 computation in other processing, such as Acoustic Echo Cancellation (AEC).
267 Experimental changes to the band layout to allow a 16 kHz cutoff
268 (32 kHz effective sample rate) showed potential quality degradations at
269 other sample rates, and at typical bitrates the number of bits saved by using
270 such a cutoff instead of coding in fullband (FB) mode is very small.
271 Therefore, if an application wishes to process a signal sampled at 32 kHz,
272 it should just use FB.
276 The LP layer is based on the
277 <eref target='http://developer.skype.com/silk'>SILK</eref> codec
278 <xref target="SILK"></xref>.
279 It supports NB, MB, or WB audio and frame sizes from 10 ms to 60 ms,
280 and requires an additional 5 ms look-ahead for noise shaping estimation.
A small additional delay (up to 1.5 ms) may be required for sampling rate
conversion.
283 Like Vorbis and many other modern codecs, SILK is inherently designed for
284 variable-bitrate (VBR) coding, though the encoder can also produce
285 constant-bitrate (CBR) streams.
286 The version of SILK used in Opus is substantially modified from, and not
287 compatible with, the stand-alone SILK codec previously deployed by Skype.
288 This document does not serve to define that format, but those interested in the
289 original SILK codec should see <xref target="SILK"/> instead.
293 The MDCT layer is based on the
294 <eref target='http://www.celt-codec.org/'>CELT</eref> codec
295 <xref target="CELT"></xref>.
296 It supports NB, WB, SWB, or FB audio and frame sizes from 2.5 ms to
297 20 ms, and requires an additional 2.5 ms look-ahead due to the
298 overlapping MDCT windows.
299 The CELT codec is inherently designed for CBR coding, but unlike many CBR
300 codecs it is not limited to a set of predetermined rates.
301 It internally allocates bits to exactly fill any given target budget, and an
302 encoder can produce a VBR stream by varying the target on a per-frame basis.
303 The MDCT layer is not used for speech when the audio bandwidth is WB or less,
304 as it is not useful there.
305 On the other hand, non-speech signals are not always adequately coded using
306 linear prediction, so for music only the MDCT layer should be used.
310 A "Hybrid" mode allows the use of both layers simultaneously with a frame size
311 of 10 or 20 ms and a SWB or FB audio bandwidth.
312 Each frame is split into a low frequency signal and a high frequency signal,
313 with a cutoff of 8 kHz.
314 The LP layer then codes the low frequency signal, followed by the MDCT layer
315 coding the high frequency signal.
316 In the MDCT layer, all bands below 8 kHz are discarded, so there is no
317 coding redundancy between the two layers.
321 The sample rate (in contrast to the actual audio bandwidth) can be chosen
322 independently on the encoder and decoder side, e.g., a fullband signal can be
323 decoded as wideband, or vice versa.
324 This approach ensures a sender and receiver can always interoperate, regardless
325 of the capabilities of their actual audio hardware.
326 Internally, the LP layer always operates at a sample rate of twice the audio
bandwidth, up to a maximum of 16 kHz, which it continues to use for SWB
and FB.
329 The decoder simply resamples its output to support different sample rates.
330 The MDCT layer always operates internally at a sample rate of 48 kHz.
Since all the supported sample rates evenly divide this rate, and since the
decoder may easily zero out the high frequency portion of the spectrum in
333 the frequency domain, it can simply decimate the MDCT layer output to achieve
334 the other supported sample rates very cheaply.
338 After conversion to the common, desired output sample rate, the decoder simply
339 adds the output from the two layers together.
340 To compensate for the different look-ahead required by each layer, the CELT
341 encoder input is delayed by an additional 2.7 ms.
342 This ensures that low frequencies and high frequencies arrive at the same time.
343 This extra delay may be reduced by an encoder by using less look-ahead for noise
shaping or using a simpler resampler in the LP layer, but this will reduce
quality.
346 However, the base 2.5 ms look-ahead in the CELT layer cannot be reduced in
the encoder because it is needed for the MDCT overlap, whose size is fixed by
the decoder.
Both layers use the same entropy coder, avoiding any waste from "padding bits"
between them.
354 The hybrid approach makes it easy to support both CBR and VBR coding.
355 Although the LP layer is VBR, the bit allocation of the MDCT layer can produce
356 a final stream that is CBR by using all the bits left unused by the LP layer.
359 <section title="Control Parameters">
361 The Opus codec includes a number of control parameters which can be changed dynamically during
362 regular operation of the codec, without interrupting the audio stream from the encoder to the decoder.
363 These parameters only affect the encoder since any impact they have on the bit-stream is signaled
364 in-band such that a decoder can decode any Opus stream without any out-of-band signaling. Any Opus
365 implementation can add or modify these control parameters without affecting interoperability. The most
366 important encoder control parameters in the reference encoder are listed below.
<section title="Bitrate" toc="exclude">
371 Opus supports all bitrates from 6 kb/s to 510 kb/s. All other parameters being
372 equal, higher bitrate results in higher quality. For a frame size of 20 ms, these
373 are the bitrate "sweet spots" for Opus in various configurations:
374 <list style="symbols">
375 <t>8-12 kb/s for NB speech,</t>
376 <t>16-20 kb/s for WB speech,</t>
377 <t>28-40 kb/s for FB speech,</t>
378 <t>48-64 kb/s for FB mono music, and</t>
379 <t>64-128 kb/s for FB stereo music.</t>
<section title="Number of Channels (Mono/Stereo)" toc="exclude">
386 Opus can transmit either mono or stereo frames within a single stream.
387 When decoding a mono frame in a stereo decoder, the left and right channels are
388 identical, and when decoding a stereo frame in a mono decoder, the mono output
389 is the average of the left and right channels.
390 In some cases, it is desirable to encode a stereo input stream in mono (e.g.,
391 because the bitrate is too low to encode stereo with sufficient quality).
392 The number of channels encoded can be selected in real-time, but by default the
reference encoder attempts to make the best decision possible given the
current bitrate.
<section title="Audio Bandwidth" toc="exclude">
400 The audio bandwidths supported by Opus are listed in
401 <xref target="audio-bandwidth"/>.
As with the number of channels, any decoder can decode audio encoded at
any bandwidth.
404 For example, any Opus decoder operating at 8 kHz can decode a FB Opus
405 frame, and any Opus decoder operating at 48 kHz can decode a NB frame.
Similarly, the reference encoder can take a 48 kHz input signal and
encode it as NB.
The higher the audio bandwidth, the higher the required bitrate to achieve
acceptable quality.
410 The audio bandwidth can be explicitly specified in real-time, but by default
411 the reference encoder attempts to make the best bandwidth decision possible
412 given the current bitrate.
<section title="Frame Duration" toc="exclude">
419 Opus can encode frames of 2.5, 5, 10, 20, 40 or 60 ms.
420 It can also combine multiple frames into packets of up to 120 ms.
421 For real-time applications, sending fewer packets per second reduces the
422 bitrate, since it reduces the overhead from IP, UDP, and RTP headers.
423 However, it increases latency and sensitivity to packet losses, as losing one
424 packet constitutes a loss of a bigger chunk of audio.
425 Increasing the frame duration also slightly improves coding efficiency, but the
426 gain becomes small for frame sizes above 20 ms.
427 For this reason, 20 ms frames are a good choice for most applications.
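<t>
The effect of packet rate on header overhead can be estimated with a short
calculation.
The sketch below assumes one packet per frame duration and header sizes of
20 bytes (IPv4 without options), 8 bytes (UDP), and 12 bytes (RTP without
extensions); these sizes are assumptions for illustration, and real
deployments may differ (e.g., IPv6 or header compression).
</t>

```python
# Assumed header sizes in bytes: IPv4 (20) + UDP (8) + RTP (12).
HEADER_BYTES = 20 + 8 + 12


def overhead_kbps(packet_ms):
    """Bitrate in kb/s spent on headers when one packet is sent every packet_ms."""
    packets_per_second = 1000 / packet_ms
    return HEADER_BYTES * 8 * packets_per_second / 1000


# Header overhead for each frame duration Opus can encode.
overhead_by_duration = {ms: overhead_kbps(ms) for ms in (2.5, 5, 10, 20, 40, 60)}
```

<t>
Under these assumptions the overhead halves each time the packet duration
doubles, which is why very short packets are costly at low bitrates.
</t>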
<section title="Complexity" toc="exclude">
433 There are various aspects of the Opus encoding process where trade-offs
434 can be made between CPU complexity and quality/bitrate. In the reference
435 encoder, the complexity is selected using an integer from 0 to 10, where
436 0 is the lowest complexity and 10 is the highest. Examples of
437 computations for which such trade-offs may occur are:
438 <list style="symbols">
439 <t>The order of the pitch analysis whitening filter,</t>
440 <t>The order of the short-term noise shaping filter,</t>
441 <t>The number of states in delayed decision quantization of the
442 residual signal, and</t>
443 <t>The use of certain bit-stream features such as variable time-frequency
444 resolution and the pitch post-filter.</t>
<section title="Packet Loss Resilience" toc="exclude">
451 Audio codecs often exploit inter-frame correlations to reduce the
452 bitrate at a cost in error propagation: after losing one packet
453 several packets need to be received before the decoder is able to
454 accurately reconstruct the speech signal. The extent to which Opus
455 exploits inter-frame dependencies can be adjusted on the fly to
456 choose a trade-off between bitrate and amount of error propagation.
<section title="Forward Error Correction (FEC)" toc="exclude">
462 Another mechanism providing robustness against packet loss is the in-band
463 Forward Error Correction (FEC). Packets that are determined to
464 contain perceptually important speech information, such as onsets or
465 transients, are encoded again at a lower bitrate and this re-encoded
466 information is added to a subsequent packet.
<section title="Constant/Variable Bitrate" toc="exclude">
472 Opus is more efficient when operating with variable bitrate (VBR), which is
473 the default. However, in some (rare) applications, constant bitrate (CBR)
474 is required. There are two main reasons to operate in CBR mode:
475 <list style="symbols">
476 <t>When the transport only supports a fixed size for each compressed frame</t>
<t>When security is important <spanx style="emph">and</spanx> the input audio
is not a normal conversation but is highly constrained (e.g., yes/no, recorded prompts)
479 <xref target="SRTP-VBR"></xref> </t>
482 When low-latency transmission is required over a relatively slow connection, then
483 constrained VBR can also be used. This uses VBR in a way that simulates a
484 "bit reservoir" and is equivalent to what MP3 and AAC call CBR (i.e. not true
485 CBR due to the bit reservoir).
<section title="Discontinuous Transmission (DTX)" toc="exclude">
491 Discontinuous Transmission (DTX) reduces the bitrate during silence
492 or background noise. When DTX is enabled, only one frame is encoded
493 every 400 milliseconds.
501 <section anchor="modes" title="Internal Framing">
504 The Opus encoder produces "packets", which are each a contiguous set of bytes
505 meant to be transmitted as a single unit.
506 The packets described here do not include such things as IP, UDP, or RTP
507 headers which are normally found in a transport-layer packet.
508 A single packet may contain multiple audio frames, so long as they share a
509 common set of parameters, including the operating mode, audio bandwidth, frame
510 size, and channel count (mono vs. stereo).
511 This section describes the possible combinations of these parameters and the
512 internal framing used to pack multiple frames into a single packet.
513 This framing is not self-delimiting.
514 Instead, it assumes that a higher layer (such as UDP or RTP or Ogg or Matroska)
515 will communicate the length, in bytes, of the packet, and it uses this
516 information to reduce the framing overhead in the packet itself.
517 A decoder implementation MUST support the framing described in this section.
518 An alternative, self-delimiting variant of the framing is described in
519 <xref target="self-delimiting-framing"/>.
520 Support for that variant is OPTIONAL.
523 <section anchor="toc_byte" title="The TOC Byte">
525 An Opus packet begins with a single-byte table-of-contents (TOC) header that
526 signals which of the various modes and configurations a given packet uses.
527 It is composed of a frame count code, "c", a stereo flag, "s", and a
528 configuration number, "config", arranged as illustrated in
529 <xref target="toc_byte_fig"/>.
530 A description of each of these fields follows.
533 <figure anchor="toc_byte_fig" title="The TOC byte">
534 <artwork align="center"><![CDATA[
544 The top five bits of the TOC byte, labeled "config", encode one of 32 possible
545 configurations of operating mode, audio bandwidth, and frame size.
As described, the LP layer and MDCT layer can be combined in three possible
operating modes:
548 <list style="numbers">
<t>An LP-only mode for use in low bitrate connections with an audio bandwidth
of WB or less,</t>
551 <t>A Hybrid (LP+MDCT) mode for SWB or FB speech at medium bitrates, and</t>
552 <t>An MDCT-only mode for very low delay speech transmission as well as music
553 transmission (NB to FB).</t>
555 The 32 possible configurations each identify which one of these operating modes
556 the packet uses, as well as the audio bandwidth and the frame size.
557 <xref target="config_bits"/> lists the parameters for each configuration.
559 <texttable anchor="config_bits" title="TOC Byte Configuration Parameters">
<ttcol>Configuration Number(s)</ttcol>
<ttcol>Mode</ttcol>
562 <ttcol>Bandwidth</ttcol>
563 <ttcol>Frame Sizes</ttcol>
564 <c>0...3</c> <c>SILK-only</c> <c>NB</c> <c>10, 20, 40, 60 ms</c>
565 <c>4...7</c> <c>SILK-only</c> <c>MB</c> <c>10, 20, 40, 60 ms</c>
566 <c>8...11</c> <c>SILK-only</c> <c>WB</c> <c>10, 20, 40, 60 ms</c>
567 <c>12...13</c> <c>Hybrid</c> <c>SWB</c> <c>10, 20 ms</c>
568 <c>14...15</c> <c>Hybrid</c> <c>FB</c> <c>10, 20 ms</c>
569 <c>16...19</c> <c>CELT-only</c> <c>NB</c> <c>2.5, 5, 10, 20 ms</c>
570 <c>20...23</c> <c>CELT-only</c> <c>WB</c> <c>2.5, 5, 10, 20 ms</c>
571 <c>24...27</c> <c>CELT-only</c> <c>SWB</c> <c>2.5, 5, 10, 20 ms</c>
572 <c>28...31</c> <c>CELT-only</c> <c>FB</c> <c>2.5, 5, 10, 20 ms</c>
575 The configuration numbers in each range (e.g., 0...3 for NB SILK-only)
576 correspond to the various choices of frame size, in the same order.
577 For example, configuration 0 has a 10 ms frame size and configuration 3
578 has a 60 ms frame size.
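<t>
The mapping from configuration number to parameters can be expressed as a
small lookup.
The sketch below is illustrative only; the names CONFIG_RANGES and
describe_config are hypothetical, and the table contents are copied from
the configuration table above.
</t>

```python
# (first config number, mode, bandwidth, frame sizes in ms)
CONFIG_RANGES = [
    (0,  "SILK-only", "NB",  (10, 20, 40, 60)),
    (4,  "SILK-only", "MB",  (10, 20, 40, 60)),
    (8,  "SILK-only", "WB",  (10, 20, 40, 60)),
    (12, "Hybrid",    "SWB", (10, 20)),
    (14, "Hybrid",    "FB",  (10, 20)),
    (16, "CELT-only", "NB",  (2.5, 5, 10, 20)),
    (20, "CELT-only", "WB",  (2.5, 5, 10, 20)),
    (24, "CELT-only", "SWB", (2.5, 5, 10, 20)),
    (28, "CELT-only", "FB",  (2.5, 5, 10, 20)),
]


def describe_config(config):
    """Return (mode, bandwidth, frame size in ms) for a config number 0...31."""
    if config not in range(32):
        raise ValueError("config must be in 0...31")
    for first, mode, bandwidth, sizes in CONFIG_RANGES:
        # Within a range, configs map to frame sizes in the listed order.
        if config in range(first, first + len(sizes)):
            return mode, bandwidth, sizes[config - first]
```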
582 One additional bit, labeled "s", signals mono vs. stereo, with 0 indicating
583 mono and 1 indicating stereo.
587 The remaining two bits of the TOC byte, labeled "c", code the number of frames
588 per packet (codes 0 to 3) as follows:
589 <list style="symbols">
590 <t>0: 1 frame in the packet</t>
591 <t>1: 2 frames in the packet, each with equal compressed size</t>
592 <t>2: 2 frames in the packet, with different compressed sizes</t>
593 <t>3: an arbitrary number of frames in the packet</t>
This draft refers to a packet as a code 0 packet, code 1 packet, etc., based on
the value of c.
600 A well-formed Opus packet MUST contain at least one byte with the TOC
601 information, though the frame(s) within a packet MAY be zero bytes long.
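<t>
Splitting a TOC byte into its three fields might look as follows.
The text places "config" in the top five bits; this sketch additionally
assumes that "s" is the bit directly below "config" and that "c" occupies the
two lowest bits, which is consistent with the packet diagrams in this section
but is stated here as an assumption.
The function name parse_toc is hypothetical.
</t>

```python
def parse_toc(toc):
    """Split a TOC byte into (config, s, c).

    Assumed layout: config in the top five bits, s in the next bit,
    c in the two lowest bits.
    """
    if toc not in range(256):
        raise ValueError("TOC must be a single byte value")
    config, low = divmod(toc, 8)  # top five bits, remaining three bits
    s, c = divmod(low, 4)         # stereo flag, frame count code
    return config, s, c


# One NB mono 20 ms SILK frame: config 1, s = 0, c = 0.
simple_packet_toc = parse_toc(0x08)
```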
605 <section title="Frame Packing">
608 This section describes how frames are packed according to each possible value
609 of "c" in the TOC byte.
612 <section anchor="frame-length-coding" title="Frame Length Coding">
614 When a packet contains multiple VBR frames (i.e., code 2 or 3), the compressed
615 length of one or more of these frames is indicated with a one- or two-byte
616 sequence, with the meaning of the first byte as follows:
617 <list style="symbols">
618 <t>0: No frame (discontinuous transmission (DTX) or lost packet)</t>
619 <t>1...251: Length of the frame in bytes</t>
620 <t>252...255: A second byte is needed. The total length is (len[1]*4)+len[0]</t>
625 The special length 0 indicates that no frame is available, either because it
626 was dropped during transmission by some intermediary or because the encoder
627 chose not to transmit it.
628 A length of 0 is valid for any Opus frame in any mode.
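<t>
The one- or two-byte length coding can be sketched as follows.
decode_frame_length follows the rules listed above; encode_frame_length is a
hypothetical inverse, included only to show that every length from 0 to 1275
round-trips, and should not be read as the reference encoder's routine.
</t>

```python
def decode_frame_length(data, pos=0):
    """Return (length, bytes_used) for the length field at data[pos]."""
    if pos >= len(data):
        raise ValueError("packet too short for a frame length")
    first = data[pos]
    if first >= 252:
        # 252...255: a second byte is needed.
        if pos + 1 >= len(data):
            raise ValueError("second length byte is missing")
        return data[pos + 1] * 4 + first, 2
    # 0 means no frame; 1...251 is the length in bytes.
    return first, 1


def encode_frame_length(length):
    """Hypothetical inverse of decode_frame_length."""
    if length not in range(1276):
        raise ValueError("frame length must be in 0...1275")
    if length >= 252:
        first = 252 + length % 4  # keeps (length - first) a multiple of 4
        return bytes([first, (length - first) // 4])
    return bytes([length])
```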
632 The maximum representable length is 255*4+255=1275 bytes.
633 For 20 ms frames, this represents a bitrate of 510 kb/s, which is
approximately the highest useful rate for lossily compressed fullband stereo
music.
636 Beyond this point, lossless codecs are more appropriate.
637 It is also roughly the maximum useful rate of the MDCT layer, as shortly
638 thereafter quality no longer improves with additional bits due to limitations
639 on the codebook sizes.
643 No length is transmitted for the last frame in a VBR packet, or for any of the
644 frames in a CBR packet, as it can be inferred from the total size of the
645 packet and the size of all other data in the packet.
646 However, the length of any individual frame MUST NOT exceed 1275 bytes, to
647 allow for repacketization by gateways, conference bridges, or other software.
651 <section title="Code 0: One Frame in the Packet">
654 For code 0 packets, the TOC byte is immediately followed by N-1 bytes
655 of compressed data for a single frame (where N is the size of the packet),
656 as illustrated in <xref target="code0_packet"/>.
658 <figure anchor="code0_packet" title="A Code 0 Packet" align="center">
659 <artwork align="center"><![CDATA[
661 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
662 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
665 | Compressed frame 1 (N-1 bytes)... :
668 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
673 <section title="Code 1: Two Frames in the Packet, Each with Equal Compressed Size">
675 For code 1 packets, the TOC byte is immediately followed by the
676 (N-1)/2 bytes of compressed data for the first frame, followed by
677 (N-1)/2 bytes of compressed data for the second frame, as illustrated in
678 <xref target="code1_packet"/>.
679 The number of payload bytes available for compressed data, N-1, MUST be even
680 for all code 1 packets.
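<t>
A minimal sketch of splitting a code 1 packet, assuming the caller has already
determined from the TOC byte that the packet uses code 1.
The function name split_code1 is hypothetical.
</t>

```python
def split_code1(packet):
    """Return the two equal-size frames of a code 1 packet."""
    payload = packet[1:]  # everything after the TOC byte (N-1 bytes)
    if not packet or len(payload) % 2:
        raise ValueError("code 1 payload (N-1 bytes) must be even")
    half = len(payload) // 2
    return payload[:half], payload[half:]
```

<t>
A packet consisting of the TOC byte alone is accepted by this sketch and
yields two zero-length frames.
</t>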
682 <figure anchor="code1_packet" title="A Code 1 Packet" align="center">
683 <artwork align="center"><![CDATA[
685 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
686 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
689 | Compressed frame 1 ((N-1)/2 bytes)... |
690 : +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
692 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ :
693 | Compressed frame 2 ((N-1)/2 bytes)... |
696 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
701 <section title="Code 2: Two Frames in the Packet, with Different Compressed Sizes">
703 For code 2 packets, the TOC byte is followed by a one- or two-byte sequence
704 indicating the length of the first frame (marked N1 in the figure below),
705 followed by N1 bytes of compressed data for the first frame.
The remaining N-N1-2 or N-N1-3 bytes are the compressed data for the
second frame.
708 This is illustrated in <xref target="code2_packet"/>.
709 A code 2 packet MUST contain enough bytes to represent a valid length.
710 For example, a 1-byte code 2 packet is always invalid, and a 2-byte code 2
711 packet whose second byte is in the range 252...255 is also invalid.
712 The length of the first frame, N1, MUST also be no larger than the size of the
713 payload remaining after decoding that length for all code 2 packets.
714 This makes, for example, a 2-byte code 2 packet with a second byte in the range
715 1...251 invalid as well (the only valid 2-byte code 2 packet is one where the
716 length of both frames is zero).
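<t>
The validity rules for code 2 packets can be sketched as below, again assuming
the caller has already identified the packet as code 2.
The function name split_code2 is hypothetical.
</t>

```python
def split_code2(packet):
    """Return (frame1, frame2) for a code 2 packet, or raise ValueError."""
    if len(packet) in (0, 1):
        raise ValueError("no room for the length of the first frame")
    first = packet[1]
    header = 2          # TOC byte plus a one-byte length
    n1 = first
    if first >= 252:    # two-byte length
        if len(packet) == 2:
            raise ValueError("second length byte is missing")
        n1 = packet[2] * 4 + first
        header = 3
    if n1 > len(packet) - header:
        raise ValueError("first frame is longer than the remaining payload")
    # The second frame takes the remaining N-N1-2 or N-N1-3 bytes.
    return packet[header:header + n1], packet[header + n1:]
```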
718 <figure anchor="code2_packet" title="A Code 2 Packet" align="center">
719 <artwork align="center"><![CDATA[
721 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
722 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
723 |0|1|s| config | N1 (1-2 bytes): |
724 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ :
725 | Compressed frame 1 (N1 bytes)... |
726 : +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
728 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
729 | Compressed frame 2... :
732 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
737 <section title="Code 3: An Arbitrary Number of Frames in the Packet">
739 Code 3 packets may encode an arbitrary number of frames, as well as additional
740 padding, called "Opus padding" to indicate that this padding is added at the
741 Opus layer, rather than at the transport layer.
742 Code 3 packets MUST have at least 2 bytes.
743 The TOC byte is followed by a byte encoding the number of frames in the packet
744 in bits 0 to 5 (marked "M" in the figure below), with bit 6 indicating whether
745 or not Opus padding is inserted (marked "p" in the figure below), and bit 7
746 indicating VBR (marked "v" in the figure below).
M MUST NOT be zero, and the audio duration contained within a packet MUST NOT
exceed 120 ms.
749 This limits the maximum frame count for any frame size to 48 (for 2.5 ms
750 frames), with lower limits for longer frame sizes.
<xref target="frame_count_byte"/> illustrates the layout of the frame count
byte.
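<t>
The frame count byte can be unpacked as in the following sketch.
The bit positions are those given in the text (bits 0 to 5 for M, bit 6 for p,
bit 7 for v); the 120 ms duration limit used here is the one implied by the
maximum of 48 frames of 2.5 ms.
The function name and the frame_ms parameter are hypothetical.
</t>

```python
def parse_frame_count_byte(value, frame_ms):
    """Return (M, padding_flag, vbr_flag) for a code 3 frame count byte.

    frame_ms is the frame size in milliseconds signaled by the TOC byte.
    """
    flags, m = divmod(value, 64)     # bits 0 to 5 hold M
    vbr, padding = divmod(flags, 2)  # bit 6 is p, bit 7 is v
    if m == 0:
        raise ValueError("M MUST NOT be zero")
    if m * frame_ms > 120:
        raise ValueError("packet would hold more than 120 ms of audio")
    return m, bool(padding), bool(vbr)
```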
754 <figure anchor="frame_count_byte" title="The frame count byte">
755 <artwork align="center"><![CDATA[
764 When Opus padding is used, the number of bytes of padding is encoded in the
765 bytes following the frame count byte.
766 Values from 0...254 indicate that 0...254 bytes of padding are included,
767 in addition to the byte(s) used to indicate the size of the padding.
768 If the value is 255, then the size of the additional padding is 254 bytes,
769 plus the padding value encoded in the next byte.
770 There MUST be at least one more byte in the packet in this case.
771 By using the value 255 multiple times, it is possible to create a packet of any
772 specific, desired size.
773 The additional padding bytes appear at the end of the packet, and MUST be set
774 to zero by the encoder to avoid creating a covert channel.
775 The decoder MUST accept any value for the padding bytes, however.
776 Let P be the total amount of padding, including both the trailing padding bytes
themselves and the header bytes used to indicate how many trailing bytes there
are.
779 Then P MUST be no more than N-2.
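<t>
The padding length coding can be sketched as follows, assuming the padding
flag is set and the padding length starts right after the frame count byte
(offset 2).
The function name parse_padding is hypothetical.
</t>

```python
def parse_padding(packet, pos=2):
    """Return (trailing_padding_bytes, length_bytes_used) for a code 3 packet."""
    trailing = 0
    used = 0
    while True:
        if pos + used >= len(packet):
            raise ValueError("padding length byte is missing")
        value = packet[pos + used]
        used += 1
        if value == 255:
            trailing += 254   # 255 means 254 bytes plus the next value
        else:
            trailing += value
            break
    # P counts the trailing bytes and the bytes that encode their number.
    if trailing + used > len(packet) - 2:
        raise ValueError("P MUST be no more than N-2")
    return trailing, used
```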
782 In the CBR case, the compressed length of each frame in bytes is equal to the
783 number of remaining bytes in the packet after subtracting the (optional)
784 padding, (N-2-P), divided by M.
The number of remaining bytes, N-2-P, MUST therefore be a non-negative integer multiple of M.
786 The compressed data for all M frames then follows, each of size
787 (N-2-P)/M bytes, as illustrated in <xref target="code3cbr_packet"/>.
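<t>
As a small worked example of the CBR rule, the sketch below computes the
per-frame size from the packet size N, the frame count M, and the total
padding P.
The function name cbr_frame_size is hypothetical, and M is assumed to have
been validated as nonzero beforehand.
</t>

```python
def cbr_frame_size(n, m, p=0):
    """Size in bytes of each of the M frames in a CBR code 3 packet."""
    payload = n - 2 - p  # bytes left after the TOC byte, count byte, and padding
    if payload >= 0 and payload % m == 0:
        return payload // m
    raise ValueError("N-2-P must be a non-negative multiple of M")
```

<t>
For instance, a 62-byte packet holding three frames and no padding carries
20 bytes per frame, while a 10-byte packet with three frames is invalid.
</t>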
790 <figure anchor="code3cbr_packet" title="A CBR Code 3 Packet" align="center">
791 <artwork align="center"><![CDATA[
793 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
794 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
795 |1|1|s| config | M |p|0| Padding length (Optional) :
796 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
798 : Compressed frame 1 ((N-2-P)/M bytes)... :
800 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
802 : Compressed frame 2 ((N-2-P)/M bytes)... :
804 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
808 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
810 : Compressed frame M ((N-2-P)/M bytes)... :
812 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
813 : Opus Padding (Optional)... |
814 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
819 In the VBR case, the (optional) padding length is followed by M-1 frame
820 lengths (indicated by "N1" to "N[M-1]" in the figure below), each encoded in a
821 one- or two-byte sequence as described above.
822 The packet MUST contain enough data for the M-1 lengths after removing the
823 (optional) padding, and the sum of these lengths MUST be no larger than the
824 number of bytes remaining in the packet after decoding them.
825 The compressed data for all M frames follows, each frame consisting of the
826 indicated number of bytes, with the final frame consuming any remaining bytes
827 before the final padding, as illustrated in <xref target="code3vbr_packet"/>.
828 The number of header bytes (TOC byte, frame count byte, padding length bytes,
829 and frame length bytes), plus the length of the first M-1 frames themselves,
830 plus the length of the padding MUST be no larger than N, the total size of the packet.
834 <figure anchor="code3vbr_packet" title="A VBR Code 3 Packet" align="center">
835 <artwork align="center"><![CDATA[
837 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
838 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
839 |1|1|s| config | M |p|1| Padding length (Optional) :
840 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
841 : N1 (1-2 bytes): N2 (1-2 bytes): ... : N[M-1] |
842 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
844 : Compressed frame 1 (N1 bytes)... :
846 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
848 : Compressed frame 2 (N2 bytes)... :
850 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
854 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
856 : Compressed frame M... :
858 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
859 : Opus Padding (Optional)... |
860 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
866 <section anchor="examples" title="Examples">
868 Simplest case, one NB mono 20 ms SILK frame:
874 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
875 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
876 |0|0|0| 1 | compressed data... :
877 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
882 Two FB mono 5 ms CELT frames of the same compressed size:
888 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
889 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
890 |1|0|0| 29 | compressed data... :
891 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
896 Two FB mono 20 ms Hybrid frames of different compressed size:
902 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
903 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
904 |1|1|0| 15 | 2 |0|1| N1 | |
905 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
906 | compressed data... :
907 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
912 Four FB stereo 20 ms CELT frames of the same compressed size:
918 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
919 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
920 |1|1|1| 31 | 4 |0|0| compressed data... :
921 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
926 <section title="Extending Opus">
928 A receiver MUST NOT process packets which violate any of the rules above as
930 They are reserved for future applications, such as in-band headers (containing
932 These constraints are summarized here for reference:
933 <list style="symbols">
934 <t>Packets are at least one byte.</t>
935 <t>No implicit frame length is larger than 1275 bytes.</t>
936 <t>Code 1 packets have an odd total length, N, so that (N-1)/2 is an
938 <t>Code 2 packets have enough bytes after the TOC for a valid frame length, and
939 that length is no larger than the number of bytes remaining in the packet.</t>
940 <t>Code 3 packets contain at least one frame, but no more than 120 ms of
942 <t>The length of a CBR code 3 packet, N, is at least two bytes, the size of the
943 padding, P (including both the padding length bytes in the header and the
944 trailing padding bytes) is no more than N-2, and the frame count, M, satisfies
945 the constraint that (N-2-P) is a non-negative integer multiple of M.</t>
946 <t>VBR code 3 packets are large enough to contain all the header bytes (TOC
947 byte, frame count byte, any padding length bytes, and any frame length bytes),
948 plus the length of the first M-1 frames, plus any trailing padding bytes.</t>
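The VBR code 3 size constraint in the last item can be sketched in C as follows.
This is a minimal illustration under stated assumptions: the function name vbr_last_frame_size() and its parameters are hypothetical, and the caller is assumed to have already parsed the header and the M-1 explicit frame lengths.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the reference implementation.
   n:                total packet length N in bytes
   header_bytes:     TOC byte, frame count byte, any padding length bytes,
                     and any frame length bytes
   lens[0..m-2]:     the M-1 explicitly coded frame lengths
   trailing_padding: number of trailing padding bytes
   Returns the implicit length of the final frame, or -1 if the packet is
   too small or the implicit length exceeds 1275 bytes. */
long vbr_last_frame_size(long n, long header_bytes, const long *lens,
                         int m, long trailing_padding) {
    assert(m >= 1 && (m == 1 || lens != NULL));
    long used = header_bytes + trailing_padding;
    for (int i = 0; i < m - 1; i++) {
        if (lens[i] < 0) return -1;
        used += lens[i];
    }
    if (used > n) return -1;       /* packet cannot hold what it declares */
    long last = n - used;          /* final frame takes the remaining bytes */
    if (last > 1275) return -1;    /* no implicit length above 1275 bytes */
    return last;
}
```

With N = 100, five header bytes, explicit lengths of 30 and 20 bytes, and 10 trailing padding bytes, the third frame is implicitly 35 bytes long.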
955 <section title="Opus Decoder">
957 The Opus decoder consists of two main blocks: the SILK decoder and the CELT decoder.
959 At any given time, one or both of the SILK and CELT decoders may be active.
960 The output of the Opus decoder is the sum of the outputs from the SILK and CELT
961 decoders with proper sample rate conversion and delay compensation on the SILK
962 side, and optional decimation (when decoding to sample rates less than
963 48 kHz) on the CELT side, as illustrated in the block diagram below.
968 +---------+ +------------+
970 +->| Decoder |--->| Rate |----+
971 Bit- +---------+ | | | | Conversion | v
972 stream | Range |---+ +---------+ +------------+ /---\ Audio
973 ------->| Decoder | | + |------>
974 | |---+ +---------+ +------------+ \---/
975 +---------+ | | CELT | | Decimation | ^
976 +->| Decoder |--->| (Optional) |----+
978 +---------+ +------------+
983 <section anchor="range-decoder" title="Range Decoder">
985 Opus uses an entropy coder based on <xref target="range-coding"></xref>,
986 which is itself a rediscovery of the FIFO arithmetic code introduced by <xref target="coding-thesis"></xref>.
987 It is very similar to arithmetic encoding, except that encoding is done with
988 digits in any base instead of with bits,
989 so it is faster when using larger bases (i.e., an octet). All of the
990 calculations in the range coder must use bit-exact integer arithmetic.
993 Symbols may also be coded as "raw bits" packed directly into the bitstream,
994 bypassing the range coder.
995 These are packed backwards starting at the end of the frame, as illustrated in
996 <xref target="rawbits-example"/>.
997 This reduces complexity and makes the stream more resilient to bit errors, as
998 corruption in the raw bits will not desynchronize the decoding process, unlike
999 corruption in the input to the range decoder.
1000 Raw bits are only used in the CELT layer.
1003 <figure anchor="rawbits-example" title="Illustrative example of packing range
1004 coder and raw bits data">
1005 <artwork align="center"><![CDATA[
1007 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0
1008 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1009 | Range coder data (packed MSB to LSB) -> :
1012 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1013 : | <- Boundary occurs at an arbitrary bit position :
1015 : <- Raw bits data (packed LSB to MSB) |
1016 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1021 Each symbol coded by the range coder is drawn from a finite alphabet and coded
1022 in a separate "context", which describes the size of the alphabet and the
1023 relative frequency of each symbol in that alphabet.
1026 Suppose there is a context with n symbols, identified with an index that ranges from 0 to n-1.
1028 The parameters needed to encode or decode symbol k in this context are
1029 represented by a three-tuple (fl[k], fh[k], ft), with
1030 0 <= fl[k] < fh[k] <= ft <= 65535.
1031 The values of this tuple are derived from the probability model for the
1032 symbol, represented by traditional "frequency counts".
1033 Because Opus uses static contexts these are not updated as symbols are decoded.
1034 Let f[i] be the frequency of symbol i.
1035 Then the three-tuple corresponding to symbol k is given by
1037 <figure align="center">
1038 <artwork align="center"><![CDATA[
1041 fl[k] = sum(f[i], i = 0..k-1), fh[k] = fl[k] + f[k], ft = sum(f[i], i = 0..n-1)
1047 The range decoder extracts the symbols and integers encoded using the range
1048 encoder in <xref target="range-encoder"/>.
1049 The range decoder maintains an internal state vector composed of the two-tuple
1050 (val, rng), representing the difference between the high end of the
1051 current range and the actual coded value, minus one, and the size of the
1052 current range, respectively.
1053 Both val and rng are 32-bit unsigned integer values.
1054 The decoder initializes rng to 128 and initializes val to 127 minus the top 7
1055 bits of the first input octet.
1056 It saves the remaining bit for use in the renormalization procedure described
1057 in <xref target="range-decoder-renorm"/>, which the decoder invokes
1058 immediately after initialization to read additional bits and establish the
1059 invariant that rng > 2**23.
1062 <section anchor="decoding-symbols" title="Decoding Symbols">
1064 Decoding a symbol is a two-step process.
1065 The first step determines a 16-bit unsigned value fs, which lies within the
1066 range of some symbol in the current context.
1067 The second step updates the range decoder state with the three-tuple
1068 (fl[k], fh[k], ft) corresponding to that symbol.
1071 The first step is implemented by ec_decode() (entdec.c), which computes
1072 <figure align="center">
1073 <artwork align="center"><![CDATA[
1075 fs = ft - min(val/(rng/ft) + 1, ft) .
1079 The divisions here are integer divisions, which discard any remainder.
1082 The decoder then identifies the symbol in the current context corresponding to
1083 fs; i.e., the value of k whose three-tuple (fl[k], fh[k], ft)
1084 satisfies fl[k] <= fs < fh[k].
1085 It uses this tuple to update val according to
1086 <figure align="center">
1087 <artwork align="center"><![CDATA[
1089 val = val - (rng/ft) * (ft - fh[k]) .
1093 If fl[k] is greater than zero, then the decoder updates rng using
1094 <figure align="center">
1095 <artwork align="center"><![CDATA[
1097 rng = (rng/ft) * (fh[k] - fl[k]) .
1101 Otherwise, it updates rng using
1102 <figure align="center">
1103 <artwork align="center"><![CDATA[
1105 rng = rng - (rng/ft) * (ft - fh[k]) .
1111 Using a special case for the first symbol (rather than the last symbol, as is
1112 commonly done in other arithmetic coders) ensures that all the truncation
1113 error from the finite precision arithmetic accumulates in symbol 0.
1114 This makes the cost of coding a 0 slightly smaller, on average, than its
1115 estimated probability indicates and makes the cost of coding any other symbol slightly larger.
1117 When contexts are designed so that 0 is the most probable symbol, which is
1118 often the case, this strategy minimizes the inefficiency introduced by the
1120 It also makes some of the special-case decoding routines in
1121 <xref target="decoding-alternate"/> particularly simple.
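The two decoding steps described in this section can be sketched in C as follows.
This is an illustration of the formulas in the text, not the code from entdec.c: the names rc_state, rc_decode(), rc_find_symbol(), and rc_update() are assumptions made for this example, and renormalization is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative decoder state; the names are assumptions for this sketch. */
typedef struct {
    uint32_t val;
    uint32_t rng;
} rc_state;

/* Step 1: fs = ft - min(val/(rng/ft) + 1, ft), using integer divisions. */
uint32_t rc_decode(const rc_state *s, uint32_t ft) {
    assert(s != NULL && ft >= 1 && ft <= 65535 && s->rng >= ft);
    uint32_t q = s->val / (s->rng / ft) + 1;
    return ft - (q < ft ? q : ft);
}

/* Finds k such that fl[k] <= fs < fh[k], given the frequency counts
   f[0..n-1]. Stores fl[k] and fh[k]; returns k, or -1 if fs >= ft. */
int rc_find_symbol(const uint16_t *f, int n, uint32_t fs,
                   uint32_t *fl, uint32_t *fh) {
    uint32_t lo = 0;
    for (int k = 0; k < n; k++) {
        uint32_t hi = lo + f[k];
        if (fs < hi) { *fl = lo; *fh = hi; return k; }
        lo = hi;
    }
    return -1;
}

/* Step 2: update val and rng from (fl[k], fh[k], ft).
   Symbol 0 (fl == 0) absorbs the truncation error. */
void rc_update(rc_state *s, uint32_t fl, uint32_t fh, uint32_t ft) {
    uint32_t r = s->rng / ft;
    uint32_t t = r * (ft - fh);
    s->val -= t;
    s->rng = (fl > 0) ? r * (fh - fl) : s->rng - t;
}
```

With the uniform context {4, 4, 4, 4}/16 and a state of val = 0x7FFFFFFF, rng = 0x80000000, this sketch decodes symbol 0 and leaves val = 0x1FFFFFFF, rng = 0x20000000.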
1124 After the updates, implemented by ec_dec_update() (entdec.c), the decoder
1125 normalizes the range using the procedure in the next section, and returns the
1129 <section anchor="range-decoder-renorm" title="Renormalization">
1131 To normalize the range, the decoder repeats the following process, implemented
1132 by ec_dec_normalize() (entdec.c), until rng > 2**23.
1133 If rng is already greater than 2**23, the entire process is skipped.
1134 First, it sets rng to (rng<<8).
1135 Then it reads the next octet of the payload and combines it with the left-over
1136 bit buffered from the previous octet to form the 8-bit value sym.
1137 It takes the left-over bit as the high bit (bit 7) of sym, and the top 7 bits
1138 of the octet it just read as the other 7 bits of sym.
1139 The remaining bit in the octet just read is buffered for use in the next iteration.
1141 If no more input octets remain, it uses zero bits instead.
1143 <figure align="center">
1144 <artwork align="center"><![CDATA[
1145 val = ((val<<8) + (255-sym)) & 0x7FFFFFFF .
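The initialization and renormalization procedures can be sketched together in C.
This is a minimal sketch that follows the prose: the structure and the names range_dec, rd_read_byte(), rd_normalize(), and rd_init() are assumptions made for this example, and the nbits_total bookkeeping follows the description in <xref target="decoder-tell"/>.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative decoder state; field names are assumptions for this sketch. */
typedef struct {
    const uint8_t *buf;  /* compressed frame data */
    size_t len;          /* frame length in bytes */
    size_t offs;         /* next octet to read from the front */
    uint32_t val;        /* (high end of range) - (coded value) - 1 */
    uint32_t rng;        /* size of the current range */
    unsigned rem;        /* last octet read; its low bit is the left-over bit */
    int nbits_total;     /* bits consumed so far, including buffered bits */
} range_dec;

/* Returns the next octet, or zero once the frame is exhausted. */
static unsigned rd_read_byte(range_dec *d) {
    return d->offs < d->len ? d->buf[d->offs++] : 0u;
}

/* Repeats until rng > 2**23; skipped entirely if that already holds. */
void rd_normalize(range_dec *d) {
    while (d->rng <= (1u << 23)) {
        d->nbits_total += 8;
        d->rng <<= 8;
        unsigned next = rd_read_byte(d);
        /* Left-over bit becomes bit 7; top 7 bits of the new octet follow. */
        unsigned sym = ((d->rem & 1u) << 7) | (next >> 1);
        d->rem = next;
        d->val = ((d->val << 8) + (255u - sym)) & 0x7FFFFFFFu;
    }
}

void rd_init(range_dec *d, const uint8_t *buf, size_t len) {
    assert(d != NULL && (buf != NULL || len == 0));
    d->buf = buf;
    d->len = len;
    d->offs = 0;
    d->rng = 128;
    d->rem = rd_read_byte(d);
    d->val = 127 - (d->rem >> 1);  /* 127 minus the top 7 bits */
    d->nbits_total = 9;
    rd_normalize(d);               /* establishes rng > 2**23 */
}
```

After rd_init(), three renormalization iterations have run, so rng is 2**31, four octets have been consumed, and nbits_total is 33.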
1150 It is normal and expected that the range decoder will read several bytes
1151 into the raw bits data (if any) at the end of the packet by the time the frame
1152 is completely decoded, as illustrated in <xref target="finalize-example"/>.
1153 This same data MUST also be returned as raw bits when requested.
1154 The encoder is expected to terminate the stream in such a way that the decoder
1155 will decode the intended values regardless of the data contained in the raw bits.
1157 <xref target="encoder-finalizing"/> describes a procedure for doing this.
1158 If the range decoder consumes all of the bytes belonging to the current frame,
1159 it MUST continue to use zero when any further input bytes are required, even
1160 if there is additional data in the current packet from padding or other
1164 <figure anchor="finalize-example" title="Illustrative example of raw bits
1165 overlapping range coder data">
1166 <artwork align="center"><![CDATA[
1168 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0
1169 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1170 : | <----------- Overlap region ------------> | :
1171 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1173 | End of data buffered by the range coder |
1174 ...-----------------------------------------------+
1176 | End of data consumed by raw bits
1177 +-------------------------------------------------------...
1183 <section anchor="decoding-alternate" title="Alternate Decoding Methods">
1185 The reference implementation uses three additional decoding methods that are
1186 exactly equivalent to the above, but make assumptions and simplifications that
1187 allow for a more efficient implementation.
1189 <section anchor="ec_decode_bin" title="ec_decode_bin()">
1191 The first is ec_decode_bin() (entdec.c), defined using the parameter ftb instead of ft.
1193 It is mathematically equivalent to calling ec_decode() with
1194 ft = (1<<ftb), but avoids one of the divisions.
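The saving can be seen in a small C sketch, in which the division rng/ft becomes a right shift.
The function name rc_decode_bin() and its signature are assumptions made for this example, which takes val and rng as plain values rather than a decoder structure.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative equivalent of the first decoding step with ft = (1<<ftb).
   Hypothetical signature; not the reference implementation's interface. */
uint32_t rc_decode_bin(uint32_t val, uint32_t rng, unsigned ftb) {
    assert(ftb >= 1 && ftb <= 15 && (rng >> ftb) > 0);
    uint32_t ft = (uint32_t)1 << ftb;
    uint32_t q = val / (rng >> ftb) + 1;  /* the shift replaces rng/ft */
    return ft - (q < ft ? q : ft);
}
```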
1197 <section anchor="ec_dec_bit_logp" title="ec_dec_bit_logp()">
1199 The next is ec_dec_bit_logp() (entdec.c), which decodes a single binary symbol,
1200 replacing both the ec_decode() and ec_dec_update() steps.
1201 The context is described by a single parameter, logp, which is the absolute
1202 value of the base-2 logarithm of the probability of a "1".
1203 It is mathematically equivalent to calling ec_decode() with
1204 ft = (1<<logp), followed by ec_dec_update() with
1205 the 3-tuple (fl[k] = 0,
1206 fh[k] = (1<<logp) - 1,
1207 ft = (1<<logp)) if the returned value
1208 of fs is less than (1<<logp) - 1 (a "0" was decoded), and with
1209 (fl[k] = (1<<logp) - 1,
1210 fh[k] = ft = (1<<logp)) otherwise (a "1" was decoded).
1212 The implementation requires no multiplications or divisions.
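A C sketch of this special case, derived from the equivalence stated above, is shown below.
The function name rc_dec_bit_logp() and its pointer-based signature are assumptions made for this example; renormalization is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes one binary symbol where a "1" has probability 2**(-logp).
   Hypothetical signature; updates *val and *rng in place. */
int rc_dec_bit_logp(uint32_t *val, uint32_t *rng, unsigned logp) {
    assert(val != NULL && rng != NULL && logp >= 1 && logp <= 15);
    uint32_t s = *rng >> logp;   /* rng/ft with ft = (1<<logp) */
    int bit = *val < s;          /* fs == ft - 1 exactly when val < s */
    if (bit) {
        *rng = s;                /* fl > 0: rng = (rng/ft)*(fh - fl) */
    } else {
        *val -= s;               /* val -= (rng/ft)*(ft - fh) */
        *rng -= s;               /* fl == 0: rng -= (rng/ft)*(ft - fh) */
    }
    return bit;
}
```

Only a shift, a comparison, and subtractions are needed, which is consistent with the statement that no multiplications or divisions are required.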
1215 <section anchor="ec_dec_icdf" title="ec_dec_icdf()">
1217 The last is ec_dec_icdf() (entdec.c), which decodes a single symbol with a
1218 table-based context of up to 8 bits, also replacing both the ec_decode() and
1219 ec_dec_update() steps, as well as the search for the decoded symbol in between.
1220 The context is described by two parameters, an icdf
1221 ("inverse" cumulative distribution function) table and ftb.
1222 As with ec_decode_bin(), (1<<ftb) is equivalent to ft.
1223 icdf[k], on the other hand, stores (1<<ftb)-fh[k], which is equal to
1224 (1<<ftb) - fl[k+1].
1225 fl[0] is assumed to be 0, and the table is terminated by a value of 0 (where
1226 fh[k] == ft).
1229 The function is mathematically equivalent to calling ec_decode() with
1230 ft = (1<<ftb), using the returned value fs to search the table
1231 for the first entry where fs < (1<<ftb)-icdf[k], and
1232 calling ec_dec_update() with
1233 fl[k] = (1<<ftb) - icdf[k-1] (or 0
1234 if k == 0), fh[k] = (1<<ftb) - icdf[k],
1235 and ft = (1<<ftb).
1236 Combining the search with the update allows the division to be replaced by a
1237 series of multiplications (which are usually much cheaper), and using an
1238 inverse CDF allows the use of an ftb as large as 8 in an 8-bit table without
1240 This is the primary interface with the range decoder in the SILK layer, though
1241 it is used in a few places in the CELT layer as well.
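The combined search and update can be sketched in C as follows.
This is an illustration written from the description above: the function name rc_dec_icdf() and its pointer-based signature are assumptions for this example, and renormalization is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes one symbol using an icdf table, where icdf[k] = (1<<ftb) - fh[k]
   and the table ends with 0. For the PDF {4, 4, 4, 4}/16 (ftb = 4), the
   table is {12, 8, 4, 0}. Hypothetical signature. */
int rc_dec_icdf(uint32_t *val, uint32_t *rng, const uint8_t *icdf,
                unsigned ftb) {
    assert(val != NULL && rng != NULL && icdf != NULL && ftb <= 8);
    uint32_t r = *rng >> ftb;   /* rng/ft */
    uint32_t d = *val;
    uint32_t s = *rng;          /* upper bound for symbol 0 is rng itself */
    uint32_t t;
    int k = -1;
    do {
        t = s;                  /* previous bound: r*(ft - fl[k]), or rng */
        s = r * icdf[++k];      /* r*(ft - fh[k]) */
    } while (d < s);            /* stops at the terminating 0 at the latest */
    *val = d - s;
    *rng = t - s;
    return k;
}
```

Each step of the search costs one multiplication, and the single division of the general method is replaced by the shift that computes r.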
1244 Although icdf[k] is more convenient for the code, the frequency counts, f[k],
1245 are a more natural representation of the probability distribution function
1246 (PDF) for a given symbol.
1247 Therefore this draft lists the latter, not the former, when describing the
1248 context in which a symbol is coded as a list, e.g., {4, 4, 4, 4}/16 for a
1249 uniform context with four possible values and ft = 16.
1250 The value of ft after the slash is always the sum of the entries in the PDF,
1251 but is included for convenience.
1252 Contexts with identical probabilities, f[k]/ft, but different values of ft
1253 (or equivalently, ftb) are not the same, and cannot, in general, be used in
1254 place of one another.
1255 An icdf table is also not capable of representing a PDF where the first symbol has zero probability.
1257 In such contexts, ec_dec_icdf() can decode the symbol by using a table that
1258 drops the entries for any initial zero-probability values and adding the
1259 constant offset of the first value with a non-zero probability to its return value.
1265 <section anchor="decoding-bits" title="Decoding Raw Bits">
1267 The raw bits used by the CELT layer are packed at the end of the packet, with
1268 the least significant bit of the first value packed in the least significant
1269 bit of the last byte, filling up to the most significant bit in the last byte,
1270 continuing on to the least significant bit of the penultimate byte, and so on.
1271 The reference implementation reads them using ec_dec_bits() (entdec.c).
1272 Because the range decoder must read several bytes ahead in the stream, as
1273 described in <xref target="range-decoder-renorm"/>, the input consumed by the
1274 raw bits MAY overlap with the input consumed by the range coder, and a decoder
1276 The format should render it impossible to attempt to read more raw bits than
1277 there are actual bits in the frame, though a decoder MAY check for
1278 this and report an error.
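The packing order of the raw bits can be illustrated with a short C sketch that reads them from the end of the frame.
The structure raw_reader and the function raw_read_bits() are assumptions made for this example, as is the choice to supply zero bits if a read runs past the start of the frame.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative raw bits reader; names are assumptions for this sketch. */
typedef struct {
    const uint8_t *buf;  /* frame data */
    size_t len;          /* frame length in bytes */
    size_t end_offs;     /* octets already consumed from the end */
    uint32_t window;     /* unread bits, least significant bit first */
    unsigned avail;      /* number of valid bits in window */
} raw_reader;

/* Reads up to 25 raw bits, starting at the LSB of the last octet. */
uint32_t raw_read_bits(raw_reader *rb, unsigned bits) {
    assert(rb != NULL && bits <= 25);
    while (rb->avail < bits) {
        uint32_t octet = 0;  /* assumption: zero once the frame is exhausted */
        if (rb->end_offs < rb->len) {
            rb->end_offs++;
            octet = rb->buf[rb->len - rb->end_offs];
        }
        rb->window |= octet << rb->avail;
        rb->avail += 8;
    }
    uint32_t value = rb->window & (((uint32_t)1 << bits) - 1);
    rb->window >>= bits;
    rb->avail -= bits;
    return value;
}
```

For a frame ending in the octets 0xBB 0xC5, reading 3 bits and then 5 bits returns 5 and 24 from the last octet, after which the next read continues with the least significant bits of 0xBB.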
1282 <section anchor="ec_dec_uint" title="Decoding Uniformly Distributed Integers">
1284 The function ec_dec_uint() (entdec.c) decodes one of ft equiprobable values in
1285 the range 0 to (ft - 1), inclusive, each with a frequency of 1,
1286 where ft may be as large as (2**32 - 1).
1287 Because ec_decode() is limited to a total frequency of (2**16 - 1),
1288 it splits up the value into a range coded symbol representing up to 8 of the
1289 high bits, and, if necessary, raw bits representing the remainder of the value.
1291 The limit of 8 bits in the range coded symbol is a trade-off between
1292 implementation complexity, modeling error (since the symbols no longer truly
1293 have equal coding cost), and rounding error introduced by the range coder
1294 itself (which gets larger as more bits are included).
1295 Using raw bits reduces the maximum number of divisions required in the worst
1296 case, but means that it may be possible to decode a value outside the range
1297 0 to (ft - 1), inclusive.
1301 ec_dec_uint() takes a single, positive parameter, ft, which is not necessarily
1302 a power of two, and returns an integer, t, whose value lies between 0 and
1303 (ft - 1), inclusive.
1304 Let ftb = ilog(ft - 1), i.e., the number of bits required
1305 to store (ft - 1) in two's complement notation.
1306 If ftb is 8 or less, then t is decoded with t = ec_decode(ft), and
1307 the range coder state is updated using the three-tuple (t, t + 1, ft).
1311 If ftb is greater than 8, then the top 8 bits of t are decoded using
1312 <figure align="center">
1313 <artwork align="center"><![CDATA[
1314 t = ec_decode(((ft - 1) >> (ftb - 8)) + 1) ,
1317 the decoder state is updated using the three-tuple
1318 (t, t + 1,
1319 ((ft - 1) >> (ftb - 8)) + 1),
1320 and the remaining bits are decoded as raw bits, setting
1321 <figure align="center">
1322 <artwork align="center"><![CDATA[
1323 t = (t << (ftb - 8)) | ec_dec_bits(ftb - 8) .
1326 If, at this point, t >= ft, then the current frame is corrupt.
1327 In that case, the decoder should assume there has been an error in the coding,
1328 decoding, or transmission and SHOULD take measures to conceal the
1329 error and/or report to the application that the error has occurred.
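The way ec_dec_uint() splits a value between the range coder and raw bits can be sketched in C.
This example covers only the arithmetic of the split and the final range check; the names uint_plan, uint_plan_for(), and uint_combine() are hypothetical, and the range-coded symbol and the raw bits are assumed to have been decoded elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of bits needed to store x: 0 for x == 0, else floor(log2(x)) + 1. */
int ilog(uint32_t x) {
    int n = 0;
    while (x) { n++; x >>= 1; }
    return n;
}

/* How a value in 0..ft-1 is split (hypothetical helper type). */
typedef struct {
    uint32_t ft_rc;  /* total frequency of the range-coded symbol */
    int raw_bits;    /* number of raw bits that follow (0 if none) */
} uint_plan;

uint_plan uint_plan_for(uint32_t ft) {
    assert(ft >= 1);
    uint_plan p;
    int ftb = ilog(ft - 1);
    if (ftb <= 8) {
        p.ft_rc = ft;                                /* t = ec_decode(ft) */
        p.raw_bits = 0;
    } else {
        p.ft_rc = ((ft - 1) >> (ftb - 8)) + 1;       /* top 8 bits */
        p.raw_bits = ftb - 8;
    }
    return p;
}

/* Combines the range-coded part with the raw bits.
   Returns -1 if the result is out of range, i.e. the frame is corrupt. */
int uint_combine(uint32_t t_hi, uint32_t raw, uint_plan p, uint32_t ft,
                 uint32_t *t) {
    assert(t != NULL);
    uint32_t v = (t_hi << p.raw_bits) | raw;
    if (v >= ft) return -1;
    *t = v;
    return 0;
}
```

For ft = 1001, the range-coded symbol has 251 possible values and is followed by 2 raw bits, so a symbol of 250 with raw bits equal to 1 gives 1001, which is out of range and therefore indicates a corrupt frame.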
1334 <section anchor="decoder-tell" title="Current Bit Usage">
1336 The bit allocation routines in the CELT decoder need a conservative upper bound
1337 on the number of bits that have been used from the current frame thus far,
1338 including both range coder bits and raw bits.
1339 This drives allocation decisions that must match those made in the encoder.
1340 The upper bound is computed in the reference implementation to whole-bit
1341 precision by the function ec_tell() (entcode.h) and to fractional 1/8th bit
1342 precision by the function ec_tell_frac() (entcode.c).
1343 Like all operations in the range coder, it must be implemented in a bit-exact
1344 manner, and must produce exactly the same value returned by the same functions
1345 in the encoder after encoding the same symbols.
1348 ec_tell() is guaranteed to return ceil(ec_tell_frac()/8.0).
1349 In various places the codec will check to ensure there is enough room to
1350 contain a symbol before attempting to decode it.
1351 In practice, although the number of bits used so far is an upper bound,
1352 decoding a symbol whose probability model suggests it has a worst-case cost of
1353 p 1/8th bits may actually advance the return value of ec_tell_frac() by
1354 p-1, p, or p+1 1/8th bits, due to approximation error in that upper bound,
1355 truncation error in the range coder, and for large values of ft, modeling
1356 error in ec_dec_uint().
1359 However, this error is bounded, and periodic calls to ec_tell() or
1360 ec_tell_frac() at precisely defined points in the decoding process prevent it from accumulating.
1362 For a range coder symbol that requires a whole number of bits (i.e.,
1363 for which ft/(fh[k] - fl[k]) is a power of two), where there are at
1364 least p 1/8th bits available, decoding the symbol will never cause ec_tell() or
1365 ec_tell_frac() to exceed the size of the frame ("bust the budget").
1366 In this case the return value of ec_tell_frac() will only advance by more than
1367 p 1/8th bits if there was an additional, fractional number of bits remaining,
1368 and it will never advance beyond the next whole-bit boundary, which is safe,
1369 since frames always contain a whole number of bits.
1370 However, when p is not a whole number of bits, an extra 1/8th bit is required
1371 to ensure that decoding the symbol will not bust the budget.
1374 The reference implementation keeps track of the total number of whole bits that
1375 have been processed by the decoder so far in the variable nbits_total,
1376 including the (possibly fractional) number of bits that are currently
1377 buffered, but not consumed, inside the range coder.
1378 nbits_total is initialized to 9 just before the initial range renormalization
1379 process completes (or equivalently, it can be initialized to 33 after the
1380 first renormalization).
1381 The extra two bits over the actual amount buffered by the range coder
1382 guarantee that it is an upper bound and that there is enough room for the
1383 encoder to terminate the stream.
1384 Each iteration through the range coder's renormalization loop increases nbits_total by 8.
1386 Reading raw bits increases nbits_total by the number of raw bits read.
1389 <section anchor="ec_tell" title="ec_tell()">
1391 The whole number of bits buffered in rng may be estimated via l = ilog(rng).
1392 ec_tell() then becomes a simple matter of removing these bits from the total.
1393 It returns (nbits_total - l).
1396 In a newly initialized decoder, before any symbols have been read, this reports
1397 that 1 bit has been used.
1398 This is the bit reserved for termination of the encoder.
1402 <section anchor="ec_tell_frac" title="ec_tell_frac()">
1404 ec_tell_frac() estimates the number of bits buffered in rng to fractional precision.
1406 Since rng must be greater than 2**23 after renormalization, l must be at least 24.
1409 <figure align="center">
1410 <artwork align="center">
1412 r_Q15 = rng >> (l-16) ,
1415 so that 32768 <= r_Q15 < 65536, an unsigned Q15 value representing the
1416 fractional part of rng.
1417 Then the following procedure can be used to add one bit of precision to l.
1419 <figure align="center">
1420 <artwork align="center">
1422 r_Q15 = (r_Q15*r_Q15) >> 15 .
1425 Then add the 16th bit of r_Q15 to l via
1426 <figure align="center">
1427 <artwork align="center">
1429 l = 2*l + (r_Q15 >> 16) .
1432 Finally, if this bit was a 1, reduce r_Q15 by a factor of two via
1433 <figure align="center">
1434 <artwork align="center">
1436 r_Q15 = r_Q15 >> 1 ,
1439 so that it once again lies in the range 32768 <= r_Q15 < 65536.
1442 This procedure is repeated three times to extend l to 1/8th bit precision.
1443 ec_tell_frac() then returns (nbits_total*8 - l).
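Both bit-usage estimates can be sketched in C directly from the steps above.
The names rc_tell() and rc_tell_frac() and the plain-value signatures are assumptions made for this example; the reference functions operate on the decoder state instead.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bits needed to store x: 0 for x == 0, else floor(log2(x)) + 1. */
int ilog(uint32_t x) {
    int n = 0;
    while (x) { n++; x >>= 1; }
    return n;
}

/* Whole-bit estimate: nbits_total - ilog(rng). */
int rc_tell(int nbits_total, uint32_t rng) {
    return nbits_total - ilog(rng);
}

/* Estimate in units of 1/8th bit. */
int rc_tell_frac(int nbits_total, uint32_t rng) {
    assert(rng > (1u << 23));              /* holds after renormalization */
    int l = ilog(rng);
    uint32_t r_q15 = rng >> (l - 16);      /* 32768 <= r_q15 < 65536 */
    for (int i = 0; i < 3; i++) {          /* three extra bits of precision */
        r_q15 = (r_q15 * r_q15) >> 15;
        unsigned b = r_q15 >> 16;          /* next fractional bit of the log */
        l = 2 * l + (int)b;
        r_q15 >>= b;                       /* back into 32768..65535 */
    }
    return nbits_total * 8 - l;
}
```

For a newly initialized decoder (nbits_total = 33, rng = 2**31), rc_tell() returns 1 and rc_tell_frac() returns 8, which agrees with the statement that one bit is reported as used before any symbol is decoded.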
1451 <section anchor="silk_decoder_outline" title="SILK Decoder">
1453 The decoder's LP layer uses a modified version of the SILK codec (herein simply
1454 called "SILK"), which runs a decoded excitation signal through adaptive
1455 long-term and short-term prediction synthesis filters.
1456 It runs at NB, MB, and WB sample rates internally.
1457 When used in a SWB or FB Hybrid frame, the LP layer itself still only runs in
1461 <section title="SILK Decoder Modules">
1463 An overview of the decoder is given in <xref target="silk_decoder_figure"/>.
1465 <figure align="center" anchor="silk_decoder_figure" title="SILK Decoder">
1466 <artwork align="center">
1468 +---------+ +------------+
1469 -->| Range |--->| Decode |---------------------------+
1470 1 | Decoder | 2 | Parameters |----------+ 5 |
1471 +---------+ +------------+ 4 | |
1474 +------------+ +------------+ +------------+
1475 | Generate |-->| LTP |-->| LPC |
1476 | Excitation | | Synthesis | | Synthesis |
1477 +------------+ +------------+ +------------+
1480 +-------------------+----------------+
1482 | +------------+ +-------------+
1483 +-->| Stereo |-->| Sample Rate |-->
1484 | Unmixing | 7 | Conversion | 8
1485 +------------+ +-------------+
1487 1: Range encoded bitstream
1489 3: Pulses, LSBs, and signs
1490 4: Pitch lags, LTP coefficients
1491 5: LPC coefficients and gains
1492 6: Decoded signal (mono or mid-side stereo)
1493 7: Unmixed signal (mono or left-right stereo)
1500 The decoder feeds the bitstream (1) to the range decoder from
1501 <xref target="range-decoder"/>, and then decodes the parameters in it (2)
1502 using the procedures detailed in
1503 Sections <xref format="counter" target="silk_header_bits"/>
1504 through <xref format="counter" target="silk_signs"/>.
1505 These parameters (3, 4, 5) are used to generate an excitation signal (see
1506 <xref target="silk_excitation_reconstruction"/>), which is fed to an optional
1507 long-term prediction (LTP) filter (voiced frames only, see
1508 <xref target="silk_ltp_synthesis"/>) and then a short-term prediction filter
1509 (see <xref target="silk_lpc_synthesis"/>), producing the decoded signal (6).
1510 For stereo streams, the mid-side representation is converted to separate left
1511 and right channels (7).
1512 The result is finally resampled to the desired output sample rate (e.g.,
1513 48 kHz) so that the resampled signal (8) can be mixed with the CELT
1519 <section anchor="silk_layer_organization" title="LP Layer Organization">
1522 Internally, the LP layer of a single Opus frame is composed of either a single
1523 10 ms regular SILK frame or between one and three 20 ms regular SILK frames.
1525 A stereo Opus frame may double the number of regular SILK frames (up to a total
1526 of six), since it includes separate frames for a mid channel and, optionally, a side channel.
1528 Optional Low Bit-Rate Redundancy (LBRR) frames, which are reduced-bitrate
1529 encodings of previous SILK frames, may be included to aid in recovery from packet loss.
1531 If present, these appear before the regular SILK frames.
1532 They are in most respects identical to regular, active SILK frames, except that
1533 they are usually encoded with a lower bitrate.
1534 This draft uses "SILK frame" to refer to either one and "regular SILK frame" if
1535 it needs to draw a distinction between the two.
1538 Logically, each SILK frame is in turn composed of either two or four 5 ms subframes.
1540 Various parameters, such as the quantization gain of the excitation, the pitch
1541 lag, and the filter coefficients, can vary on a subframe-by-subframe basis.
1542 Physically, the parameters for each subframe are interleaved in the bitstream,
1543 as described in the relevant sections for each parameter.
1546 All of these frames and subframes are decoded from the same range coder, with
1547 no padding between them.
1548 Thus packing multiple SILK frames in a single Opus frame saves, on average,
1549 half a byte per SILK frame.
1550 It also allows some parameters to be predicted from prior SILK frames in the
1551 same Opus frame, since this does not degrade packet loss robustness (beyond
1552 any penalty for merely using fewer, larger packets to store multiple frames).
1556 Stereo support in SILK uses a variant of mid-side coding, allowing a mono
1557 decoder to simply decode the mid channel.
1558 However, the data for the two channels is interleaved, so a mono decoder must
1559 still unpack the data for the side channel.
1560 It would be required to do so anyway for Hybrid Opus frames, or to support
1561 decoding individual 20 ms frames.
1565 <xref target="silk_symbols"/> summarizes the overall grouping of the contents of
1567 Figures <xref format="counter" target="silk_mono_60ms_frame"/>
1568 and <xref format="counter" target="silk_stereo_60ms_frame"/> illustrate
1569 the ordering of the various SILK frames for a 60 ms Opus frame, for both
1570 mono and stereo, respectively.
1573 <texttable anchor="silk_symbols"
1574 title="Organization of the SILK layer of an Opus frame">
1575 <ttcol align="center">Symbol(s)</ttcol>
1576 <ttcol align="center">PDF(s)</ttcol>
1577 <ttcol align="center">Condition</ttcol>
1587 <c>Per-frame LBRR flags</c>
1588 <c><xref target="silk_lbrr_flag_pdfs"/></c>
1589 <c><xref target="silk_lbrr_flags"/></c>
1591 <c>LBRR Frame(s)</c>
1592 <c><xref target="silk_frame"/></c>
1593 <c><xref target="silk_lbrr_flags"/></c>
1595 <c>Regular SILK Frame(s)</c>
1596 <c><xref target="silk_frame"/></c>
1601 <figure align="center" anchor="silk_mono_60ms_frame"
1602 title="A 60 ms Mono Frame">
1603 <artwork align="center"><![CDATA[
1604 +---------------------------------+
1606 +---------------------------------+
1608 +---------------------------------+
1609 | Per-Frame LBRR Flags (Optional) |
1610 +---------------------------------+
1611 | LBRR Frame 1 (Optional) |
1612 +---------------------------------+
1613 | LBRR Frame 2 (Optional) |
1614 +---------------------------------+
1615 | LBRR Frame 3 (Optional) |
1616 +---------------------------------+
1617 | Regular SILK Frame 1 |
1618 +---------------------------------+
1619 | Regular SILK Frame 2 |
1620 +---------------------------------+
1621 | Regular SILK Frame 3 |
1622 +---------------------------------+
1626 <figure align="center" anchor="silk_stereo_60ms_frame"
1627 title="A 60 ms Stereo Frame">
1628 <artwork align="center"><![CDATA[
1629 +---------------------------------------+
1631 +---------------------------------------+
1633 +---------------------------------------+
1635 +---------------------------------------+
1637 +---------------------------------------+
1638 | Mid Per-Frame LBRR Flags (Optional) |
1639 +---------------------------------------+
1640 | Side Per-Frame LBRR Flags (Optional) |
1641 +---------------------------------------+
1642 | Mid LBRR Frame 1 (Optional) |
1643 +---------------------------------------+
1644 | Side LBRR Frame 1 (Optional) |
1645 +---------------------------------------+
1646 | Mid LBRR Frame 2 (Optional) |
1647 +---------------------------------------+
1648 | Side LBRR Frame 2 (Optional) |
1649 +---------------------------------------+
1650 | Mid LBRR Frame 3 (Optional) |
1651 +---------------------------------------+
1652 | Side LBRR Frame 3 (Optional) |
1653 +---------------------------------------+
1654 | Mid Regular SILK Frame 1 |
1655 +---------------------------------------+
1656 | Side Regular SILK Frame 1 (Optional) |
1657 +---------------------------------------+
1658 | Mid Regular SILK Frame 2 |
1659 +---------------------------------------+
1660 | Side Regular SILK Frame 2 (Optional) |
1661 +---------------------------------------+
1662 | Mid Regular SILK Frame 3 |
1663 +---------------------------------------+
1664 | Side Regular SILK Frame 3 (Optional) |
1665 +---------------------------------------+
1671 <section anchor="silk_header_bits" title="Header Bits">
1673 The LP layer begins with two to eight header bits, decoded in silk_Decode()
1675 These consist of one Voice Activity Detection (VAD) bit per frame (up to 3),
1676 followed by a single flag indicating the presence of LBRR frames.
1677 For a stereo packet, these first flags correspond to the mid channel, and a
1678 second set of flags is included for the side channel.
1681 Because these are the first symbols decoded by the range coder and because they
1682 are coded as binary values with uniform probability, they can be extracted
1683 directly from the most significant bits of the first byte of compressed data.
1684 Thus, a receiver can determine if an Opus frame contains any active SILK frames
1685 without the overhead of using the range decoder.
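<figure align="left">
<preamble>
As a minimal sketch of the shortcut described above, the following C fragment
reads the header bits directly from the first byte of the compressed SILK
data, most significant bit first.
The function name, the silk_header_flags structure, and the num_frames and
channels parameters are illustrative assumptions and are not taken from the
reference implementation; num_frames is the number of SILK frames (1, 2, or 3)
in the Opus frame.
</preamble>
<artwork align="left"><![CDATA[
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for one channel's header flags. */
typedef struct {
    int vad[3];  /* one VAD bit per SILK frame; unused entries are 0 */
    int lbrr;    /* global LBRR flag for this channel */
} silk_header_flags;

/*
 * Reads the header bits from the first byte, MSB first: num_frames VAD
 * bits and one LBRR flag for the mid (or mono) channel, then the same
 * again for the side channel of a stereo packet.
 */
static void silk_peek_header_bits(uint8_t first_byte, int num_frames,
                                  int channels, silk_header_flags out[2])
{
    int bit = 7;
    assert(num_frames >= 1 && num_frames <= 3);
    assert(channels == 1 || channels == 2);
    for (int c = 0; c < channels; c++) {
        for (int i = 0; i < 3; i++)
            out[c].vad[i] = 0;
        for (int i = 0; i < num_frames; i++)
            out[c].vad[i] = (first_byte >> bit--) & 1;
        out[c].lbrr = (first_byte >> bit--) & 1;
    }
}
```
]]></artwork>
<postamble>
For example, a first byte of 0xB4 in a 60 ms stereo frame gives mid VAD bits
1, 0, 1 with the mid LBRR flag set, and side VAD bits 0, 1, 0 with the side
LBRR flag clear.
</postamble>
</figure>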
1689 <section anchor="silk_lbrr_flags" title="Per-Frame LBRR Flags">
1691 For Opus frames longer than 20 ms, a set of LBRR flags is
1692 decoded for each channel that has its LBRR flag set.
1693 Each set contains one flag per 20 ms SILK frame.
1694 40 ms Opus frames use the 2-frame LBRR flag PDF from
1695 <xref target="silk_lbrr_flag_pdfs"/>, and 60 ms Opus frames use the
1696 3-frame LBRR flag PDF.
1697 For each channel, the resulting 2- or 3-bit integer contains the corresponding
1698 LBRR flag for each frame, packed in order from the LSB to the MSB.
1701 <texttable anchor="silk_lbrr_flag_pdfs" title="LBRR Flag PDFs">
1702 <ttcol>Frame Size</ttcol>
1704 <c>40 ms</c> <c>{0, 53, 53, 150}/256</c>
1705 <c>60 ms</c> <c>{0, 41, 20, 29, 41, 15, 28, 82}/256</c>
1709 A 10 or 20 ms Opus frame does not contain any per-frame LBRR flags,
1710 as there may be at most one LBRR frame per channel.
1711 The global LBRR flag in the header bits (see <xref target="silk_header_bits"/>)
1712 is already sufficient to indicate the presence of that single LBRR frame.
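<figure align="left">
<preamble>
The unpacking of the per-frame LBRR flags can be sketched as follows.
This is an illustration only: the function name and its parameters are
hypothetical, and lbrr_symbol stands for the value already decoded with the
2-frame or 3-frame PDF.
</preamble>
<artwork align="left"><![CDATA[
```c
#include <assert.h>

/*
 * Expands the LBRR information for one channel into per-frame flags.
 * global_lbrr is the LBRR flag from the header bits; lbrr_symbol is
 * ignored unless the Opus frame holds 2 or 3 SILK frames.
 */
static void silk_unpack_lbrr_flags(int global_lbrr, int lbrr_symbol,
                                   int num_frames, int flags[3])
{
    assert(num_frames >= 1 && num_frames <= 3);
    flags[0] = flags[1] = flags[2] = 0;
    if (!global_lbrr)
        return;                 /* no LBRR frames for this channel */
    if (num_frames == 1) {
        flags[0] = 1;           /* 10 or 20 ms: the global flag suffices */
        return;
    }
    /* Both PDFs give symbol 0 a probability of zero. */
    assert(lbrr_symbol >= 1 && lbrr_symbol < (1 << num_frames));
    for (int i = 0; i < num_frames; i++)
        flags[i] = (lbrr_symbol >> i) & 1;   /* LSB is the first frame */
}
```
]]></artwork>
<postamble>
With a 60 ms frame and a decoded value of 5, the first and third 20 ms
intervals carry LBRR frames and the second does not.
</postamble>
</figure>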
1717 <section anchor="silk_lbrr_frames" title="LBRR Frames">
1719 The LBRR frames, if present, contain an encoded representation of the signal
1720 immediately prior to the current Opus frame as if it were encoded with the
1721 current mode, frame size, audio bandwidth, and channel count, even if those
1722 differ from the prior Opus frame.
1723 When one of these parameters changes from one Opus frame to the next, this
1724 implies that the LBRR frames of the current Opus frame may not be simple
1725 drop-in replacements for the contents of the previous Opus frame.
1729 For example, when switching from 20 ms to 60 ms, the 60 ms Opus
1730 frame may contain LBRR frames covering up to three prior 20 ms Opus
1731 frames, even if those frames already contained LBRR frames covering some of
1732 the same time periods.
1733 When switching from 20 ms to 10 ms, the 10 ms Opus frame can
1734 contain an LBRR frame covering at most half the prior 20 ms Opus frame,
1735 potentially leaving a hole that needs to be concealed from even a single
1737 When switching from mono to stereo, the LBRR frames in the first stereo Opus
1738 frame MAY contain a non-trivial side channel.
1742 In order to properly produce LBRR frames under all conditions, an encoder might
1743 need to buffer up to 60 ms of audio and re-encode it during these
1745 However, the reference implementation opts to disable LBRR frames at the
1746 transition point for simplicity.
1750 The LBRR frames immediately follow the LBRR flags, prior to any regular SILK
1751 frames.
1752 <xref target="silk_frame"/> describes their exact contents.
1753 LBRR frames do not include their own separate VAD flags.
1754 LBRR frames are only meant to be transmitted for active speech, thus all LBRR
1755 frames are treated as active.
1759 In a stereo Opus frame longer than 20 ms, although the per-frame LBRR
1760 flags for the mid channel are coded as a unit before the per-frame LBRR flags
1761 for the side channel, the LBRR frames themselves are interleaved.
1762 The decoder parses an LBRR frame for the mid channel of a given 20 ms
1763 interval (if present) and then immediately parses the corresponding LBRR
1764 frame for the side channel (if present), before proceeding to the next
1765 20 ms interval.
1769 <section anchor="silk_regular_frames" title="Regular SILK Frames">
1771 The regular SILK frame(s) follow the LBRR frames (if any).
1772 <xref target="silk_frame"/> describes their contents, as well.
1773 Unlike the LBRR frames, a regular SILK frame is coded for each time interval in
1774 an Opus frame, even if the corresponding VAD flags are unset.
1775 For stereo Opus frames longer than 20 ms, the regular mid and side SILK
1776 frames for each 20 ms interval are interleaved, just as with the LBRR
1777 frames.
1778 The side frame may be skipped by coding an appropriate flag, as detailed in
1779 <xref target="silk_mid_only_flag"/>.
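<figure align="left">
<preamble>
The ordering rules for LBRR and regular SILK frames can be illustrated with
the sketch below, which lists the frames in the order a decoder would parse
them.
All names are hypothetical.
The sketch also assumes the mid-only flags are known in advance, whereas a
real decoder only learns each one while parsing the corresponding mid frame.
</preamble>
<artwork align="left"><![CDATA[
```c
#include <assert.h>

typedef enum { SLOT_LBRR, SLOT_REGULAR } slot_kind;   /* hypothetical */
typedef struct {
    slot_kind kind;
    int channel;    /* 0 = mid (or mono), 1 = side */
    int interval;   /* index of the 20 ms interval (0 for a 10 ms frame) */
} frame_slot;

/* Fills out[] in parse order and returns the number of SILK frames. */
static int silk_frame_order(int channels, int num_frames,
                            int lbrr_flags[2][3], const int mid_only[3],
                            frame_slot out[12])
{
    int n = 0;
    assert(channels == 1 || channels == 2);
    assert(num_frames >= 1 && num_frames <= 3);
    /* LBRR frames first: mid then side for each interval, if flagged. */
    for (int t = 0; t < num_frames; t++) {
        for (int ch = 0; ch < channels; ch++) {
            if (lbrr_flags[ch][t]) {
                frame_slot s = { SLOT_LBRR, ch, t };
                out[n++] = s;
            }
        }
    }
    /* Regular frames: mid always, side unless the mid-only flag is set. */
    for (int t = 0; t < num_frames; t++) {
        for (int ch = 0; ch < channels; ch++) {
            if (ch == 1 && mid_only[t])
                continue;
            frame_slot s = { SLOT_REGULAR, ch, t };
            out[n++] = s;
        }
    }
    return n;
}
```
]]></artwork>
</figure>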
1783 <section anchor="silk_frame" title="SILK Frame Contents">
1785 Each SILK frame includes a set of side information that encodes
1786 <list style="symbols">
1787 <t>The frame type and quantization type (<xref target="silk_frame_type"/>),</t>
1788 <t>Quantization gains (<xref target="silk_gains"/>),</t>
1789 <t>Short-term prediction filter coefficients (<xref target="silk_nlsfs"/>),</t>
1790 <t>An LSF interpolation weight (<xref target="silk_nlsf_interpolation"/>),</t>
1792 Long-term prediction filter lags and gains (<xref target="silk_ltp_params"/>),
1795 <t>A linear congruential generator (LCG) seed (<xref target="silk_seed"/>).</t>
1797 The quantized excitation signal (see <xref target="silk_excitation"/>) follows
1798 these at the end of the frame.
1799 <xref target="silk_frame_symbols"/> details the overall organization of a
1800 SILK frame.
1803 <texttable anchor="silk_frame_symbols"
1804 title="Order of the symbols in an individual SILK frame">
1805 <ttcol align="center">Symbol(s)</ttcol>
1806 <ttcol align="center">PDF(s)</ttcol>
1807 <ttcol align="center">Condition</ttcol>
1809 <c>Stereo Prediction Weights</c>
1810 <c><xref target="silk_stereo_pred_pdfs"/></c>
1811 <c><xref target="silk_stereo_pred"/></c>
1813 <c>Mid-only Flag</c>
1814 <c><xref target="silk_mid_only_pdf"/></c>
1815 <c><xref target="silk_mid_only_flag"/></c>
1818 <c><xref target="silk_frame_type"/></c>
1821 <c>Subframe Gains</c>
1822 <c><xref target="silk_gains"/></c>
1825 <c>Normalized LSF Stage 1 Index</c>
1826 <c><xref target="silk_nlsf_stage1_pdfs"/></c>
1829 <c>Normalized LSF Stage 2 Residual</c>
1830 <c><xref target="silk_nlsf_stage2"/></c>
1833 <c>Normalized LSF Interpolation Weight</c>
1834 <c><xref target="silk_nlsf_interp_pdf"/></c>
1835 <c>20 ms frame</c>
1837 <c>Primary Pitch Lag</c>
1838 <c><xref target="silk_ltp_lags"/></c>
1841 <c>Subframe Pitch Contour</c>
1842 <c><xref target="silk_pitch_contour_pdfs"/></c>
1845 <c>Periodicity Index</c>
1846 <c><xref target="silk_perindex_pdf"/></c>
1850 <c><xref target="silk_ltp_filter_pdfs"/></c>
1854 <c><xref target="silk_ltp_scaling_pdf"/></c>
1855 <c><xref target="silk_ltp_scaling"/></c>
1858 <c><xref target="silk_seed_pdf"/></c>
1861 <c>Excitation Rate Level</c>
1862 <c><xref target="silk_rate_level_pdfs"/></c>
1865 <c>Excitation Pulse Counts</c>
1866 <c><xref target="silk_pulse_count_pdfs"/></c>
1869 <c>Excitation Pulse Locations</c>
1870 <c><xref target="silk_pulse_locations"/></c>
1871 <c>Non-zero pulse count</c>
1873 <c>Excitation LSBs</c>
1874 <c><xref target="silk_shell_lsb_pdf"/></c>
1875 <c><xref target="silk_pulse_counts"/></c>
1877 <c>Excitation Signs</c>
1878 <c><xref target="silk_sign_pdfs"/></c>
1883 <section anchor="silk_stereo_pred" toc="include"
1884 title="Stereo Prediction Weights">
1886 A SILK frame corresponding to the mid channel of a stereo Opus frame begins
1887 with a pair of side channel prediction weights, designed such that zeros
1888 indicate normal mid-side coupling.
1889 Since these weights can change on every frame, the first portion of each frame
1890 linearly interpolates between the previous weights and the current ones, using
1891 zeros for the previous weights if none are available.
1892 These prediction weights are never included in a mono Opus frame, and the
1893 previous weights are reset to zeros on any transition from mono to stereo.
1894 They are also not included in an LBRR frame for the side channel, even if the
1895 LBRR flags indicate the corresponding mid channel was not coded.
1896 In that case, the previous weights are used, again substituting in zeros if no
1897 previous weights are available since the last decoder reset
1898 (see <xref target="decoder-reset"/>).
1902 To summarize, these weights are coded if and only if
1903 <list style="symbols">
1904 <t>This is a stereo Opus frame (<xref target="toc_byte"/>), and</t>
1905 <t>The current SILK frame corresponds to the mid channel.</t>
1910 The prediction weights are coded in three separate pieces, which are decoded
1911 by silk_stereo_decode_pred() (decode_stereo_pred.c).
1912 The first piece jointly codes the high-order part of a table index for both
1913 weights.
1914 The second piece codes the low-order part of each table index.
1915 The third piece codes an offset used to linearly interpolate between table
1916 indices.
1917 The details are as follows.
1921 Let n be an index decoded with the 25-element stage-1 PDF in
1922 <xref target="silk_stereo_pred_pdfs"/>.
1923 Then let i0 and i1 be indices decoded with the stage-2 and stage-3 PDFs in
1924 <xref target="silk_stereo_pred_pdfs"/>, respectively, and let i2 and i3
1925 be two more indices decoded with the stage-2 and stage-3 PDFs, all in that
1926 order.
1929 <texttable anchor="silk_stereo_pred_pdfs" title="Stereo Weight PDFs">
1930 <ttcol align="left">Stage</ttcol>
1931 <ttcol align="left">PDF</ttcol>
1937 1, 1, 1, 2, 7}/256</c>
1940 <c>{85, 86, 85}/256</c>
1943 <c>{51, 51, 52, 51, 51}/256</c>
1947 Then use n, i0, and i2 to form two table indices, wi0 and wi1, according to
1948 <figure align="center">
1949 <artwork align="center"><![CDATA[
1950 wi0 = i0 + 3*(n/5)
1951 wi1 = i2 + 3*(n%5)
1952 ]]></artwork>
1953 </figure>
1954 where the division is exact integer division.
1955 The range of these indices is 0 to 14, inclusive.
1956 Let w_Q13[i] be the i'th weight from <xref target="silk_stereo_weights_table"/>.
1957 Then the two prediction weights, w0_Q13 and w1_Q13, are
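1958 <figure align="center">
1959 <artwork align="center"><![CDATA[
1960 w1_Q13 = w_Q13[wi1]
1961          + (((w_Q13[wi1+1] - w_Q13[wi1])*6554) >> 16)*(2*i3 + 1)
1962
1963 w0_Q13 = w_Q13[wi0]
1964          + (((w_Q13[wi0+1] - w_Q13[wi0])*6554) >> 16)*(2*i1 + 1)
1965          - w1_Q13
1966 ]]></artwork>
1967 </figure>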
1968 N.b., w1_Q13 is computed first here, because w0_Q13 depends on it.
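<figure align="left">
<preamble>
A minimal sketch of the weight reconstruction, assuming the five indices n,
i0, i1, i2, and i3 have already been read from the range decoder.
The function name is hypothetical.
The split of n into n/5 and n%5 follows the behavior of the reference decoder
in decode_stereo_pred.c, and the table is the 16-entry stereo weight table,
which is antisymmetric about its center.
</preamble>
<artwork align="left"><![CDATA[
```c
#include <assert.h>
#include <stdint.h>

/* Stereo weight table in Q13 (16 entries, antisymmetric). */
static const int16_t w_Q13[16] = {
    -13732, -10050, -8266, -7526, -6500, -5000, -2950, -820,
       820,   2950,  5000,  6500,  7526,  8266, 10050, 13732
};

/* Computes both prediction weights; w1 first, since w0 depends on it. */
static void stereo_pred_weights(int n, int i0, int i1, int i2, int i3,
                                int32_t *w0_Q13, int32_t *w1_Q13)
{
    assert(n >= 0 && n < 25);
    assert(i0 >= 0 && i0 < 3 && i2 >= 0 && i2 < 3);
    assert(i1 >= 0 && i1 < 5 && i3 >= 0 && i3 < 5);
    int wi0 = i0 + 3 * (n / 5);   /* table indices, 0 to 14 */
    int wi1 = i2 + 3 * (n % 5);
    /* The table is increasing, so both products are non-negative. */
    int32_t step1 = ((w_Q13[wi1 + 1] - w_Q13[wi1]) * 6554) >> 16;
    int32_t step0 = ((w_Q13[wi0 + 1] - w_Q13[wi0]) * 6554) >> 16;
    *w1_Q13 = w_Q13[wi1] + step1 * (2 * i3 + 1);
    *w0_Q13 = w_Q13[wi0] + step0 * (2 * i1 + 1) - *w1_Q13;
}
```
]]></artwork>
<postamble>
With the central indices (n = 12, i0 = i2 = 1, i1 = i3 = 2), both weights
evaluate to zero, which corresponds to normal mid-side coupling.
</postamble>
</figure>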
1971 <texttable anchor="silk_stereo_weights_table"
1972 title="Stereo Weight Table">
1973 <ttcol align="left">Index</ttcol>
1974 <ttcol align="right">Weight (Q13)</ttcol>
1975 <c>0</c> <c>-13732</c>
1976 <c>1</c> <c>-10050</c>
1977 <c>2</c> <c>-8266</c>
1978 <c>3</c> <c>-7526</c>
1979 <c>4</c> <c>-6500</c>
1980 <c>5</c> <c>-5000</c>
1981 <c>6</c> <c>-2950</c>
1982 <c>7</c> <c>-820</c>
1983 <c>8</c> <c>820</c>
1984 <c>9</c> <c>2950</c>
1985 <c>10</c> <c>5000</c>
1986 <c>11</c> <c>6500</c>
1987 <c>12</c> <c>7526</c>
1988 <c>13</c> <c>8266</c>
1989 <c>14</c> <c>10050</c>
1990 <c>15</c> <c>13732</c>
1995 <section anchor="silk_mid_only_flag" toc="include" title="Mid-only Flag">
1997 A flag appears after the stereo prediction weights that indicates if only the
1998 mid channel is coded for this time interval.
1999 It appears only when
2000 <list style="symbols">
2001 <t>This is a stereo Opus frame (see <xref target="toc_byte"/>),</t>
2002 <t>The current SILK frame corresponds to the mid channel, and</t>
2004 <list style="symbols">
2005 <t>This is a regular SILK frame where the VAD flags
2006 (see <xref target="silk_header_bits"/>) indicate that the corresponding side
2007 channel is not active.</t>
2009 This is an LBRR frame where the LBRR flags
2010 (see <xref target="silk_header_bits"/> and <xref target="silk_lbrr_flags"/>)
2011 indicate that the corresponding side channel is not coded.
2016 It is omitted when there are no stereo weights, for all of the same reasons.
2017 It is also omitted for a regular SILK frame when the VAD flag of the
2018 corresponding side channel frame is set (indicating it is active).
2019 The side channel must be coded in this case, making the mid-only flag
2020 redundant.
2021 It is also omitted for an LBRR frame when the corresponding LBRR flags
2022 indicate the side channel is coded.
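<figure align="left">
<preamble>
The conditions under which the mid-only flag is present can be condensed into
a single predicate.
The sketch below is illustrative; the function and parameter names are
assumptions, and every argument is a 0/1 flag taken from information the
decoder already has (the TOC byte, the header bits, and the per-frame LBRR
flags).
</preamble>
<artwork align="left"><![CDATA[
```c
#include <assert.h>

/*
 * Returns 1 if the current SILK frame carries a mid-only flag.
 * side_vad:  VAD flag of the matching side channel frame
 * side_lbrr: per-frame LBRR flag of the matching side channel frame
 */
static int has_mid_only_flag(int stereo, int is_mid_channel, int is_lbrr,
                             int side_vad, int side_lbrr)
{
    assert((side_vad == 0 || side_vad == 1) &&
           (side_lbrr == 0 || side_lbrr == 1));
    if (!stereo || !is_mid_channel)
        return 0;   /* no stereo weights, hence no mid-only flag */
    /* LBRR frames look at the LBRR flags, regular frames at the VAD. */
    return is_lbrr ? !side_lbrr : !side_vad;
}
```
]]></artwork>
</figure>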
2026 When the flag is present, the decoder reads a single value using the PDF in
2027 <xref target="silk_mid_only_pdf"/>, as implemented in
2028 silk_stereo_decode_mid_only() (decode_stereo_pred.c).
2029 If the flag is set, then there is no corresponding SILK frame for the side
2030 channel, the entire decoding process for the side channel is skipped, and
2031 zeros are fed to the stereo unmixing process (see
2032 <xref target="silk_stereo_unmixing"/>) instead.
2033 As stated above, LBRR frames still include this flag when the LBRR flag
2034 indicates that the side channel is not coded.
2035 In that case, if this flag is zero (indicating that there should be a side
2036 channel), then Packet Loss Concealment (PLC, see
2037 <xref target="Packet Loss Concealment"/>) SHOULD be invoked to recover a
2038 side channel signal.
2041 <texttable anchor="silk_mid_only_pdf" title="Mid-only Flag PDF">
2042 <ttcol align="left">PDF</ttcol>
2043 <c>{192, 64}/256</c>
2048 <section anchor="silk_frame_type" toc="include" title="Frame Type">
2050 Each SILK frame contains a single "frame type" symbol that jointly codes the
2051 signal type and quantization offset type of the corresponding frame.
2052 If the current frame is a regular SILK frame whose VAD bit was not set (an
2053 "inactive" frame), then the frame type symbol takes on a value of either 0 or
2054 1 and is decoded using the first PDF in <xref target="silk_frame_type_pdfs"/>.
2055 If the frame is an LBRR frame or a regular SILK frame whose VAD flag was set
2056 (an "active" frame), then the value of the symbol may range from 2 to 5,
2057 inclusive, and is decoded using the second PDF in
2058 <xref target="silk_frame_type_pdfs"/>.
2059 <xref target="silk_frame_type_table"/> translates between the value of the
2060 frame type symbol and the corresponding signal type and quantization offset
2061 type.
2064 <texttable anchor="silk_frame_type_pdfs" title="Frame Type PDFs">
2065 <ttcol>VAD Flag</ttcol>
2067 <c>Inactive</c> <c>{26, 230, 0, 0, 0, 0}/256</c>
2068 <c>Active</c> <c>{0, 0, 24, 74, 148, 10}/256</c>
2071 <texttable anchor="silk_frame_type_table"
2072 title="Signal Type and Quantization Offset Type from Frame Type">
2073 <ttcol>Frame Type</ttcol>
2074 <ttcol>Signal Type</ttcol>
2075 <ttcol align="right">Quantization Offset Type</ttcol>
2076 <c>0</c> <c>Inactive</c> <c>Low</c>
2077 <c>1</c> <c>Inactive</c> <c>High</c>
2078 <c>2</c> <c>Unvoiced</c> <c>Low</c>
2079 <c>3</c> <c>Unvoiced</c> <c>High</c>
2080 <c>4</c> <c>Voiced</c> <c>Low</c>
2081 <c>5</c> <c>Voiced</c> <c>High</c>
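<figure align="left">
<preamble>
Because the table above pairs consecutive frame type values, the mapping
reduces to a shift and a mask.
The following sketch shows one way to write it; the enum and function names
are hypothetical.
</preamble>
<artwork align="left"><![CDATA[
```c
#include <assert.h>

typedef enum {
    SIGNAL_INACTIVE = 0,
    SIGNAL_UNVOICED = 1,
    SIGNAL_VOICED   = 2
} signal_type;   /* hypothetical */

/*
 * Splits a frame type symbol (0 to 5) into its signal type and
 * quantization offset type (0 = Low, 1 = High).
 */
static void split_frame_type(int frame_type, signal_type *signal,
                             int *quant_offset_high)
{
    assert(frame_type >= 0 && frame_type <= 5);
    *signal = (signal_type)(frame_type >> 1);
    *quant_offset_high = frame_type & 1;
}
```
]]></artwork>
</figure>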
2086 <section anchor="silk_gains" toc="include" title="Subframe Gains">
2088 A separate quantization gain is coded for each 5 ms subframe.
2089 These gains control the step size between quantization levels of the excitation
2090 signal and, therefore, the quality of the reconstruction.
2091 They are independent of the pitch gains coded for voiced frames.
2092 The quantization gains are themselves uniformly quantized to 6 bits on a
2093 log scale, giving them a resolution of approximately 1.369 dB and a range
2094 of approximately 1.94 dB to 88.21 dB.
2097 The subframe gains are either coded independently, or relative to the gain from
2098 the most recent coded subframe in the same channel.
2099 Independent coding is used if and only if
2100 <list style="symbols">
2102 This is the first subframe in the current SILK frame, and
2105 <list style="symbols">
2107 This is the first SILK frame of its type (LBRR or regular) for this channel in
2108 the current Opus frame, or
2111 The previous SILK frame of the same type (LBRR or regular) for this channel in
2112 the same Opus frame was not coded.
2120 In an independently coded subframe gain, the 3 most significant bits of the
2121 quantization gain are decoded using a PDF selected from
2122 <xref target="silk_independent_gain_msb_pdfs"/> based on the decoded signal
2123 type (see <xref target="silk_frame_type"/>).
2126 <texttable anchor="silk_independent_gain_msb_pdfs"
2127 title="PDFs for Independent Quantization Gain MSB Coding">
2128 <ttcol align="left">Signal Type</ttcol>
2129 <ttcol align="left">PDF</ttcol>
2130 <c>Inactive</c> <c>{32, 112, 68, 29, 12, 1, 1, 1}/256</c>
2131 <c>Unvoiced</c> <c>{2, 17, 45, 60, 62, 47, 19, 4}/256</c>
2132 <c>Voiced</c> <c>{1, 3, 26, 71, 94, 50, 9, 2}/256</c>
2136 The 3 least significant bits are decoded using a uniform PDF:
2138 <texttable anchor="silk_independent_gain_lsb_pdf"
2139 title="PDF for Independent Quantization Gain LSB Coding">
2140 <ttcol align="left">PDF</ttcol>
2141 <c>{32, 32, 32, 32, 32, 32, 32, 32}/256</c>
2145 These 6 bits are combined to form a gain index between 0 and 63.
2146 When the gain for the previous subframe is available, then the current gain is
2147 limited as follows:
2148 <figure align="center">
2149 <artwork align="center"><![CDATA[
2150 log_gain = max(gain_index, previous_log_gain - 16) .
2153 This may help some implementations limit the change in precision of their
2154 internal LTP history.
2155 The indices to which this clamp applies cannot simply be removed from the
2156 codebook, because the previous gain index will not be available after packet
2157 loss.
2158 This step is skipped after a decoder reset, and in the side channel if the
2159 previous frame in the side channel was not coded, since there is no previous
2161 It MAY also be skipped after packet loss.
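<figure align="left">
<preamble>
The independent coding rule and the clamp can be sketched as follows.
The function names are hypothetical, msb and lsb stand for the two symbols
already decoded with the PDFs above, and passing a negative
previous_log_gain to mean "not available" is a convention of this sketch
only.
</preamble>
<artwork align="left"><![CDATA[
```c
#include <assert.h>

/* Returns 1 if the subframe gain is coded independently (0/1 inputs). */
static int gain_is_independent(int first_subframe, int first_frame_of_type,
                               int prev_frame_of_type_coded)
{
    return first_subframe &&
           (first_frame_of_type || !prev_frame_of_type_coded);
}

/*
 * Combines the 3 MSBs and 3 LSBs into a gain index and applies the
 * clamp against the previous gain when one is available.
 */
static int independent_log_gain(int msb, int lsb, int previous_log_gain)
{
    assert(msb >= 0 && msb < 8 && lsb >= 0 && lsb < 8);
    int gain_index = (msb << 3) | lsb;        /* 0 to 63 */
    if (previous_log_gain < 0)
        return gain_index;                    /* clamp is skipped */
    int floor_gain = previous_log_gain - 16;
    return gain_index > floor_gain ? gain_index : floor_gain;
}
```
]]></artwork>
<postamble>
For instance, symbols 2 and 3 form a gain index of 19; with a previous gain of
40, the clamp raises the result to 24.
</postamble>
</figure>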
2165 For subframes which do not have an independent gain (including the first
2166 subframe of frames not listed as using independent coding above), the
2167 quantization gain is coded relative to the gain from the previous subframe (in
2169 The PDF in <xref target="silk_delta_gain_pdf"/> yields a delta gain index
2170 between 0 and 40, inclusive.
2172 <texttable anchor="silk_delta_gain_pdf"
2173 title="PDF for Delta Quantization Gain Coding">
2174 <ttcol align="left">PDF</ttcol>
2175 <c>{6, 5, 11, 31, 132, 21, 8, 4,
2176 3, 2, 2, 2, 1, 1, 1, 1,
2177 1, 1, 1, 1, 1, 1, 1, 1,
2178 1, 1, 1, 1, 1, 1, 1, 1,
2179 1, 1, 1, 1, 1, 1, 1, 1, 1}/256</c>
2182 The following formula translates this index into a quantization gain for the
2183 current subframe using the gain from the previous subframe:
2184 <figure align="center">
2185 <artwork align="center"><![CDATA[
2186 log_gain = clamp(0, max(2*gain_index - 16,
2187 previous_log_gain + gain_index - 4), 63) .
2192 silk_gains_dequant() (gain_quant.c) dequantizes log_gain for the k'th subframe
2193 and converts it into a linear Q16 scale factor via
2194 <figure align="center">
2195 <artwork align="center"><![CDATA[
2196 gain_Q16[k] = silk_log2lin((0x1D1C71*log_gain>>16) + 2090)
2201 The function silk_log2lin() (log2lin.c) computes an approximation of
2202 2**(inLog_Q7/128.0), where inLog_Q7 is its Q7 input.
2203 Let i = inLog_Q7&gt;&gt;7 be the integer part of inLog_Q7 and
2204 f = inLog_Q7&amp;127 be the fractional part.
2205 Then
2206 <figure align="center">
2207 <artwork align="center"><![CDATA[
2208 (1<<i) + ((-174*f*(128-f)>>16)+f)*((1<<i)>>7)
2211 yields the approximate exponential.
2212 The final Q16 gain values lie between 81920 and 1686110208, inclusive
2213 (representing scale factors of 1.25 to 25728, respectively).
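<figure align="left">
<preamble>
The delta gain update and the dequantization formulas above can be checked
with a short sketch.
The function names are hypothetical stand-ins and not the reference
functions themselves.
The interpolation term is written so that no negative value is shifted, which
gives the same result as an arithmetic right shift.
</preamble>
<artwork align="left"><![CDATA[
```c
#include <assert.h>
#include <stdint.h>

/* log_gain for a subframe coded relative to the previous one. */
static int delta_log_gain(int delta_gain_index, int previous_log_gain)
{
    assert(delta_gain_index >= 0 && delta_gain_index <= 40);
    int a = 2 * delta_gain_index - 16;
    int b = previous_log_gain + delta_gain_index - 4;
    int g = a > b ? a : b;
    return g < 0 ? 0 : (g > 63 ? 63 : g);
}

/* Approximates 2**(inLog_Q7/128.0) as described in the text. */
static int32_t log2lin_sketch(int32_t inLog_Q7)
{
    int32_t i = inLog_Q7 >> 7;     /* integer part */
    int32_t f = inLog_Q7 & 127;    /* fractional part */
    assert(inLog_Q7 >= 0 && i < 31);
    /* Equals (-174*f*(128-f)) >> 16 with an arithmetic shift. */
    int32_t corr = -((174 * f * (128 - f) + 65535) >> 16);
    return ((int32_t)1 << i) + (corr + f) * (((int32_t)1 << i) >> 7);
}

/* Converts log_gain (0 to 63) into a linear Q16 scale factor. */
static int32_t gain_dequant_sketch(int log_gain)
{
    assert(log_gain >= 0 && log_gain <= 63);
    return log2lin_sketch(((0x1D1C71 * log_gain) >> 16) + 2090);
}
```
]]></artwork>
<postamble>
Evaluating gain_dequant_sketch() at 0 and 63 reproduces the stated limits of
81920 and 1686110208.
</postamble>
</figure>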
2217 <section anchor="silk_nlsfs" toc="include" title="Normalized Line Spectral
2218 Frequency (LSF) and Linear Predictive Coding (LPC) Coefficients">
2220 A set of normalized Line Spectral Frequency (LSF) coefficients follows the
2221 quantization gains in the bitstream and represents the Linear Predictive
2222 Coding (LPC) coefficients for the current SILK frame.
2223 Once decoded, the normalized LSFs form an increasing list of Q15 values between
2224 0 and 1.
2225 These represent the interleaved zeros on the unit circle between 0 and pi
2226 (hence "normalized") in the standard decomposition of the LPC filter into a
2227 symmetric part and an anti-symmetric part (P and Q in
2228 <xref target="silk_nlsf2lpc"/>).
2229 Because of non-linear effects in the decoding process, an implementation SHOULD
2230 match the fixed-point arithmetic described in this section exactly.
2231 An encoder SHOULD also use the same process.
2234 The normalized LSFs are coded using a two-stage vector quantizer (VQ)
2235 (<xref target="silk_nlsf_stage1"/> and <xref target="silk_nlsf_stage2"/>).
2236 NB and MB frames use an order-10 predictor, while WB frames use an order-16
2237 predictor, and thus have different sets of tables.
2238 After reconstructing the normalized LSFs
2239 (<xref target="silk_nlsf_reconstruction"/>), the decoder runs them through a
2240 stabilization process (<xref target="silk_nlsf_stabilization"/>), interpolates
2241 them between frames (<xref target="silk_nlsf_interpolation"/>), converts them
2242 back into LPC coefficients (<xref target="silk_nlsf2lpc"/>), and then runs
2243 them through further processes to limit the range of the coefficients
2244 (<xref target="silk_lpc_range_limit"/>) and the gain of the filter
2245 (<xref target="silk_lpc_gain_limit"/>).
2246 All of this is necessary to ensure the reconstruction process is stable.
2249 <section anchor="silk_nlsf_stage1" title="Stage 1 Normalized LSF Decoding">
2251 The first VQ stage uses a 32-element codebook, coded with one of the PDFs in
2252 <xref target="silk_nlsf_stage1_pdfs"/>, depending on the audio bandwidth and
2253 the signal type of the current SILK frame.
2254 This yields a single index, I1, for the entire frame.
2255 This indexes an element in a coarse codebook, selects the PDFs for the
2256 second stage of the VQ, and selects the prediction weights used to remove
2257 intra-frame redundancy from the second stage.
2258 The actual codebook elements are listed in
2259 <xref target="silk_nlsf_nbmb_codebook"/> and
2260 <xref target="silk_nlsf_wb_codebook"/>, but they are not needed until the last
2261 stages of reconstructing the LSF coefficients.
2264 <texttable anchor="silk_nlsf_stage1_pdfs"
2265 title="PDFs for Normalized LSF Index Stage-1 Decoding">
2266 <ttcol align="left">Audio Bandwidth</ttcol>
2267 <ttcol align="left">Signal Type</ttcol>
2268 <ttcol align="left">PDF</ttcol>
2269 <c>NB or MB</c> <c>Inactive or unvoiced</c>
2271 {44, 34, 30, 19, 21, 12, 11, 3,
2272 3, 2, 16, 2, 2, 1, 5, 2,
2273 1, 3, 3, 1, 1, 2, 2, 2,
2274 3, 1, 9, 9, 2, 7, 2, 1}/256
2276 <c>NB or MB</c> <c>Voiced</c>
2278 {1, 10, 1, 8, 3, 8, 8, 14,
2279 13, 14, 1, 14, 12, 13, 11, 11,
2280 12, 11, 10, 10, 11, 8, 9, 8,
2281 7, 8, 1, 1, 6, 1, 6, 5}/256
2283 <c>WB</c> <c>Inactive or unvoiced</c>
2285 {31, 21, 3, 17, 1, 8, 17, 4,
2286 1, 18, 16, 4, 2, 3, 1, 10,
2287 1, 3, 16, 11, 16, 2, 2, 3,
2288 2, 11, 1, 4, 9, 8, 7, 3}/256
2290 <c>WB</c> <c>Voiced</c>
2292 {1, 4, 16, 5, 18, 11, 5, 14,
2293 15, 1, 3, 12, 13, 14, 14, 6,
2294 14, 12, 2, 6, 1, 12, 12, 11,
2295 10, 3, 10, 5, 1, 1, 1, 3}/256
2301 <section anchor="silk_nlsf_stage2" title="Stage 2 Normalized LSF Decoding">
2303 A total of 16 PDFs are available for the LSF residual in the second stage: the
2304 8 (a...h) for NB and MB frames given in
2305 <xref target="silk_nlsf_stage2_nbmb_pdfs"/>, and the 8 (i...p) for WB frames
2306 given in <xref target="silk_nlsf_stage2_wb_pdfs"/>.
2307 Which PDF is used for which coefficient is driven by the index, I1,
2308 decoded in the first stage.
2309 <xref target="silk_nlsf_nbmb_stage2_cb_sel"/> lists the letter of the
2310 corresponding PDF for each normalized LSF coefficient for NB and MB, and
2311 <xref target="silk_nlsf_wb_stage2_cb_sel"/> lists the same information for WB.
2314 <texttable anchor="silk_nlsf_stage2_nbmb_pdfs"
2315 title="PDFs for NB/MB Normalized LSF Index Stage-2 Decoding">
2316 <ttcol align="left">Codebook</ttcol>
2317 <ttcol align="left">PDF</ttcol>
2318 <c>a</c> <c>{1, 1, 1, 15, 224, 11, 1, 1, 1}/256</c>
2319 <c>b</c> <c>{1, 1, 2, 34, 183, 32, 1, 1, 1}/256</c>
2320 <c>c</c> <c>{1, 1, 4, 42, 149, 55, 2, 1, 1}/256</c>
2321 <c>d</c> <c>{1, 1, 8, 52, 123, 61, 8, 1, 1}/256</c>
2322 <c>e</c> <c>{1, 3, 16, 53, 101, 74, 6, 1, 1}/256</c>
2323 <c>f</c> <c>{1, 3, 17, 55, 90, 73, 15, 1, 1}/256</c>
2324 <c>g</c> <c>{1, 7, 24, 53, 74, 67, 26, 3, 1}/256</c>
2325 <c>h</c> <c>{1, 1, 18, 63, 78, 58, 30, 6, 1}/256</c>
2328 <texttable anchor="silk_nlsf_stage2_wb_pdfs"
2329 title="PDFs for WB Normalized LSF Index Stage-2 Decoding">
2330 <ttcol align="left">Codebook</ttcol>
2331 <ttcol align="left">PDF</ttcol>
2332 <c>i</c> <c>{1, 1, 1, 9, 232, 9, 1, 1, 1}/256</c>
2333 <c>j</c> <c>{1, 1, 2, 28, 186, 35, 1, 1, 1}/256</c>
2334 <c>k</c> <c>{1, 1, 3, 42, 152, 53, 2, 1, 1}/256</c>
2335 <c>l</c> <c>{1, 1, 10, 49, 126, 65, 2, 1, 1}/256</c>
2336 <c>m</c> <c>{1, 4, 19, 48, 100, 77, 5, 1, 1}/256</c>
2337 <c>n</c> <c>{1, 1, 14, 54, 100, 72, 12, 1, 1}/256</c>
2338 <c>o</c> <c>{1, 1, 15, 61, 87, 61, 25, 4, 1}/256</c>
2339 <c>p</c> <c>{1, 7, 21, 50, 77, 81, 17, 1, 1}/256</c>
2342 <texttable anchor="silk_nlsf_nbmb_stage2_cb_sel"
2343 title="Codebook Selection for NB/MB Normalized LSF Index Stage 2 Decoding">
2345 <ttcol>Coefficient</ttcol>
2347 <c><spanx style="vbare">0 1 2 3 4 5 6 7 8 9</spanx></c>
2349 <c><spanx style="vbare">a a a a a a a a a a</spanx></c>
2351 <c><spanx style="vbare">b d b c c b c b b b</spanx></c>
2353 <c><spanx style="vbare">c b b b b b b b b b</spanx></c>
2355 <c><spanx style="vbare">b c c c c b c b b b</spanx></c>
2357 <c><spanx style="vbare">c d d d d c c c c c</spanx></c>
2359 <c><spanx style="vbare">a f d d c c c c b b</spanx></c>
2361 <c><spanx style="vbare">a c c c c c c c c b</spanx></c>
2363 <c><spanx style="vbare">c d g e e e f e f f</spanx></c>
2365 <c><spanx style="vbare">c e f f e f e g e e</spanx></c>
2367 <c><spanx style="vbare">c e e h e f e f f e</spanx></c>
2369 <c><spanx style="vbare">e d d d c d c c c c</spanx></c>
2371 <c><spanx style="vbare">b f f g e f e f f f</spanx></c>
2373 <c><spanx style="vbare">c h e g f f f f f f</spanx></c>
2375 <c><spanx style="vbare">c h f f f f f g f e</spanx></c>
2377 <c><spanx style="vbare">d d f e e f e f e e</spanx></c>
2379 <c><spanx style="vbare">c d d f f e e e e e</spanx></c>
2381 <c><spanx style="vbare">c e e g e f e f f f</spanx></c>
2383 <c><spanx style="vbare">c f e g f f f e f e</spanx></c>
2385 <c><spanx style="vbare">c h e f e f e f f f</spanx></c>
2387 <c><spanx style="vbare">c f e g h g f g f e</spanx></c>
2389 <c><spanx style="vbare">d g h e g f f g e f</spanx></c>
2391 <c><spanx style="vbare">c h g e e e f e f f</spanx></c>
2393 <c><spanx style="vbare">e f f e g g f g f e</spanx></c>
2395 <c><spanx style="vbare">c f f g f g e g e e</spanx></c>
2397 <c><spanx style="vbare">e f f f d h e f f e</spanx></c>
2399 <c><spanx style="vbare">c d e f f g e f f e</spanx></c>
2401 <c><spanx style="vbare">c d c d d e c d d d</spanx></c>
2403 <c><spanx style="vbare">b b c c c c c d c c</spanx></c>
2405 <c><spanx style="vbare">e f f g g g f g e f</spanx></c>
2407 <c><spanx style="vbare">d f f e e e e d d c</spanx></c>
2409 <c><spanx style="vbare">c f d h f f e e f e</spanx></c>
2411 <c><spanx style="vbare">e e f e f g f g f e</spanx></c>
2414 <texttable anchor="silk_nlsf_wb_stage2_cb_sel"
2415 title="Codebook Selection for WB Normalized LSF Index Stage 2 Decoding">
2417 <ttcol>Coefficient</ttcol>
2419 <c><spanx style="vbare">0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15</spanx></c>
2421 <c><spanx style="vbare">i i i i i i i i i i i i i i i i</spanx></c>
2423 <c><spanx style="vbare">k l l l l l k k k k k j j j i l</spanx></c>
2425 <c><spanx style="vbare">k n n l p m m n k n m n n m l l</spanx></c>
2427 <c><spanx style="vbare">i k j k k j j j j j i i i i i j</spanx></c>
2429 <c><spanx style="vbare">i o n m o m p n m m m n n m m l</spanx></c>
2431 <c><spanx style="vbare">i l n n m l l n l l l l l l k m</spanx></c>
2433 <c><spanx style="vbare">i i i i i i i i i i i i i i i i</spanx></c>
2435 <c><spanx style="vbare">i k o l p k n l m n n m l l k l</spanx></c>
2437 <c><spanx style="vbare">i o k o o m n m o n m m n l l l</spanx></c>
2439 <c><spanx style="vbare">k j i i i i i i i i i i i i i i</spanx></c>
2441 <c><spanx style="vbare">i j i i i i i i i i i i i i i j</spanx></c>
2443 <c><spanx style="vbare">k k l m n l l l l l l l k k j l</spanx></c>
2445 <c><spanx style="vbare">k k l l m l l l l l l l l k j l</spanx></c>
2447 <c><spanx style="vbare">l m m m o m m n l n m m n m l m</spanx></c>
2449 <c><spanx style="vbare">i o m n m p n k o n p m m l n l</spanx></c>
2451 <c><spanx style="vbare">i j i j j j j j j j i i i i j i</spanx></c>
2453 <c><spanx style="vbare">j o n p n m n l m n m m m l l m</spanx></c>
2455 <c><spanx style="vbare">j l l m m l l n k l l n n n l m</spanx></c>
2457 <c><spanx style="vbare">k l l k k k l k j k j k j j j m</spanx></c>
2459 <c><spanx style="vbare">i k l n l l k k k j j i i i i i</spanx></c>
2461 <c><spanx style="vbare">l m l n l l k k j j j j j k k m</spanx></c>
2463 <c><spanx style="vbare">k o l p p m n m n l n l l k l l</spanx></c>
2465 <c><spanx style="vbare">k l n o o l n l m m l l l l k m</spanx></c>
2467 <c><spanx style="vbare">j l l m m m m l n n n l j j j j</spanx></c>
2469 <c><spanx style="vbare">k n l o o m p m m n l m m l l l</spanx></c>
2471 <c><spanx style="vbare">i o j j i i i i i i i i i i i i</spanx></c>
2473 <c><spanx style="vbare">i o o l n k n n l m m p p m m m</spanx></c>
2475 <c><spanx style="vbare">l l p l n m l l l k k l l l k l</spanx></c>
2477 <c><spanx style="vbare">i i j i i i k j k j j k k k j j</spanx></c>
2479 <c><spanx style="vbare">i l k n l l k l k j i i j i i j</spanx></c>
2481 <c><spanx style="vbare">l n n m p n l l k l k k j i j i</spanx></c>
2483 <c><spanx style="vbare">k l n l m l l l k j k o m i i i</spanx></c>
2487 Decoding the second stage residual proceeds as follows.
2488 For each coefficient, the decoder reads a symbol using the PDF corresponding to
2489 I1 from either <xref target="silk_nlsf_nbmb_stage2_cb_sel"/> or
2490 <xref target="silk_nlsf_wb_stage2_cb_sel"/>, and subtracts 4 from the result
2491 to give an index in the range -4 to 4, inclusive.
2492 If the index is either -4 or 4, it reads a second symbol using the PDF in
2493 <xref target="silk_nlsf_ext_pdf"/>, and adds the value of this second symbol
2494 to the index, using the same sign.
2495 This gives the index, I2[k], a total range of -10 to 10, inclusive.
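<figure align="left">
<preamble>
The index extension rule can be sketched as below.
The function name is hypothetical; symbol is the value read with the stage-2
PDF (0 to 8), and ext_symbol is the value read with the extension PDF (0 to
6), which matters only when the first index is -4 or 4.
</preamble>
<artwork align="left"><![CDATA[
```c
#include <assert.h>

/* Forms the stage-2 index I2[k], in the range -10 to 10. */
static int nlsf_stage2_index(int symbol, int ext_symbol)
{
    assert(symbol >= 0 && symbol <= 8);
    assert(ext_symbol >= 0 && ext_symbol <= 6);
    int index = symbol - 4;
    if (index == -4)
        index -= ext_symbol;    /* extend with the same (negative) sign */
    else if (index == 4)
        index += ext_symbol;    /* extend with the same (positive) sign */
    return index;
}
```
]]></artwork>
</figure>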
2498 <texttable anchor="silk_nlsf_ext_pdf"
2499 title="PDF for Normalized LSF Index Extension Decoding">
2500 <ttcol align="left">PDF</ttcol>
2501 <c>{156, 60, 24, 9, 4, 2, 1}/256</c>
2505 The decoded indices from both stages are translated back into normalized LSF
2506 coefficients in silk_NLSF_decode() (NLSF_decode.c).
2507 The stage-2 indices represent residuals after both the first stage of the VQ
2508 and a separate backwards-prediction step.
2509 The backwards prediction process in the encoder subtracts a prediction from
2510 each residual formed by a multiple of the coefficient that follows it.
2511 The decoder must undo this process.
2512 <xref target="silk_nlsf_pred_weights"/> contains lists of prediction weights
2513 for each coefficient.
2514 There are two lists for NB and MB, and another two lists for WB, giving two
2515 possible prediction weights for each coefficient.
2518 <texttable anchor="silk_nlsf_pred_weights"
2519 title="Prediction Weights for Normalized LSF Decoding">
2520 <ttcol align="left">Coefficient</ttcol>
2521 <ttcol align="right">A</ttcol>
2522 <ttcol align="right">B</ttcol>
2523 <ttcol align="right">C</ttcol>
2524 <ttcol align="right">D</ttcol>
2525 <c>0</c> <c>179</c> <c>116</c> <c>175</c> <c>68</c>
2526 <c>1</c> <c>138</c> <c>67</c> <c>148</c> <c>62</c>
2527 <c>2</c> <c>140</c> <c>82</c> <c>160</c> <c>66</c>
2528 <c>3</c> <c>148</c> <c>59</c> <c>176</c> <c>60</c>
2529 <c>4</c> <c>151</c> <c>92</c> <c>178</c> <c>72</c>
2530 <c>5</c> <c>149</c> <c>72</c> <c>173</c> <c>117</c>
2531 <c>6</c> <c>153</c> <c>100</c> <c>174</c> <c>85</c>
2532 <c>7</c> <c>151</c> <c>89</c> <c>164</c> <c>90</c>
2533 <c>8</c> <c>163</c> <c>92</c> <c>177</c> <c>118</c>
2534 <c>9</c> <c/> <c/> <c>174</c> <c>136</c>
2535 <c>10</c> <c/> <c/> <c>196</c> <c>151</c>
2536 <c>11</c> <c/> <c/> <c>182</c> <c>142</c>
2537 <c>12</c> <c/> <c/> <c>198</c> <c>160</c>
2538 <c>13</c> <c/> <c/> <c>192</c> <c>142</c>
2539 <c>14</c> <c/> <c/> <c>182</c> <c>155</c>
2543 The prediction is undone using the procedure implemented in
2544 silk_NLSF_residual_dequant() (NLSF_decode.c), which is as follows.
2545 Each coefficient selects its prediction weight from one of the two lists based
2546 on the stage-1 index, I1.
2547 <xref target="silk_nlsf_nbmb_weight_sel"/> gives the selections for each
2548 coefficient for NB and MB, and <xref target="silk_nlsf_wb_weight_sel"/> gives
2549 the selections for WB.
2550 Let d_LPC be the order of the codebook, i.e., 10 for NB and MB, and 16 for WB,
2551 and let pred_Q8[k] be the weight for the k'th coefficient selected by this
process for 0 &lt;= k &lt; d_LPC-1.
2553 Then, the stage-2 residual for each coefficient is computed via
<figure align="center">
<artwork align="center"><![CDATA[
res_Q10[k] = (k+1 < d_LPC ? (res_Q10[k+1]*pred_Q8[k])>>8 : 0)
             + ((((I2[k]<<10) - sign(I2[k])*102)*qstep)>>16) ,
]]></artwork>
</figure>
where qstep is the Q16 quantization step size, which is 11796 for NB and MB
and 9830 for WB (representing step sizes of approximately 0.18 and 0.15,
respectively).
Because res_Q10[k] depends on res_Q10[k+1], the residuals are computed in
order from k = d_LPC-1 down to 0.
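<t>
The residual recurrence above can be sketched in Python as follows.
This is a minimal illustration, not the reference code: the function name
dequantize_residual() and its argument layout are assumptions, and it relies
on Python's right shift of negative integers flooring like an arithmetic
shift.
</t>
<figure>
<artwork><![CDATA[
```python
def sign(x):
    return (x > 0) - (x < 0)

def dequantize_residual(i2, pred_q8, qstep_q16):
    """Return res_Q10[] from the stage-2 indices I2[] (length d_LPC).

    pred_q8 holds the d_LPC-1 selected prediction weights; qstep_q16 is
    11796 for NB/MB or 9830 for WB.
    """
    d_lpc = len(i2)
    res_q10 = [0] * d_lpc
    for k in range(d_lpc - 1, -1, -1):   # each entry depends on res_q10[k+1]
        pred = (res_q10[k + 1] * pred_q8[k]) >> 8 if k + 1 < d_lpc else 0
        step = (((i2[k] << 10) - sign(i2[k]) * 102) * qstep_q16) >> 16
        res_q10[k] = pred + step
    return res_q10
```
]]></artwork>
</figure>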
2565 <texttable anchor="silk_nlsf_nbmb_weight_sel"
2566 title="Prediction Weight Selection for NB/MB Normalized LSF Decoding">
2568 <ttcol>Coefficient</ttcol>
2570 <c><spanx style="vbare">0 1 2 3 4 5 6 7 8</spanx></c>
2572 <c><spanx style="vbare">A B A A A A A A A</spanx></c>
2574 <c><spanx style="vbare">B A A A A A A A A</spanx></c>
2576 <c><spanx style="vbare">A A A A A A A A A</spanx></c>
2578 <c><spanx style="vbare">B B B A A A A B A</spanx></c>
2580 <c><spanx style="vbare">A B A A A A A A A</spanx></c>
2582 <c><spanx style="vbare">A B A A A A A A A</spanx></c>
2584 <c><spanx style="vbare">B A B B A A A B A</spanx></c>
2586 <c><spanx style="vbare">A B B A A B B A A</spanx></c>
2588 <c><spanx style="vbare">A A B B A B A B B</spanx></c>
2590 <c><spanx style="vbare">A A B B A A B B B</spanx></c>
2592 <c><spanx style="vbare">A A A A A A A A A</spanx></c>
2594 <c><spanx style="vbare">A B A B B B B B A</spanx></c>
2596 <c><spanx style="vbare">A B A B B B B B A</spanx></c>
2598 <c><spanx style="vbare">A B B B B B B B A</spanx></c>
2600 <c><spanx style="vbare">B A B B A B B B B</spanx></c>
2602 <c><spanx style="vbare">A B B B B B A B A</spanx></c>
2604 <c><spanx style="vbare">A A B B A B A B A</spanx></c>
2606 <c><spanx style="vbare">A A B B B A B B B</spanx></c>
2608 <c><spanx style="vbare">A B B A A B B B A</spanx></c>
2610 <c><spanx style="vbare">A A A B B B A B A</spanx></c>
2612 <c><spanx style="vbare">A B B A A B A B A</spanx></c>
2614 <c><spanx style="vbare">A B B A A A B B A</spanx></c>
2616 <c><spanx style="vbare">A A A A A B B B B</spanx></c>
2618 <c><spanx style="vbare">A A B B A A A B B</spanx></c>
2620 <c><spanx style="vbare">A A A B A B B B B</spanx></c>
2622 <c><spanx style="vbare">A B B B B B B B A</spanx></c>
2624 <c><spanx style="vbare">A A A A A A A A A</spanx></c>
2626 <c><spanx style="vbare">A A A A A A A A A</spanx></c>
2628 <c><spanx style="vbare">A A B A B B A B A</spanx></c>
2630 <c><spanx style="vbare">B A A B A A A A A</spanx></c>
2632 <c><spanx style="vbare">A A A B B A B A B</spanx></c>
2634 <c><spanx style="vbare">B A B B A B B B B</spanx></c>
2637 <texttable anchor="silk_nlsf_wb_weight_sel"
2638 title="Prediction Weight Selection for WB Normalized LSF Decoding">
2640 <ttcol>Coefficient</ttcol>
2642 <c><spanx style="vbare">0 1 2 3 4 5 6 7 8 9 10 11 12 13 14</spanx></c>
2644 <c><spanx style="vbare">C C C C C C C C C C C C C C D</spanx></c>
2646 <c><spanx style="vbare">C C C C C C C C C C C C C C C</spanx></c>
2648 <c><spanx style="vbare">C C D C C D D D C D D D D C C</spanx></c>
2650 <c><spanx style="vbare">C C C C C C C C C C C C D C C</spanx></c>
2652 <c><spanx style="vbare">C D D C D C D D C D D D D D C</spanx></c>
2654 <c><spanx style="vbare">C C D C C C C C C C C C C C C</spanx></c>
2656 <c><spanx style="vbare">D C C C C C C C C C C D C D C</spanx></c>
2658 <c><spanx style="vbare">C D D C C C D C D D D C D C D</spanx></c>
2660 <c><spanx style="vbare">C D C D D C D C D C D D D D D</spanx></c>
2662 <c><spanx style="vbare">C C C C C C C C C C C C C C D</spanx></c>
2664 <c><spanx style="vbare">C D C C C C C C C C C C C C C</spanx></c>
2666 <c><spanx style="vbare">C C D C D D D D D D D C D C C</spanx></c>
2668 <c><spanx style="vbare">C C D C C D C D C D C C D C C</spanx></c>
2670 <c><spanx style="vbare">C C C C D D C D C D D D D C C</spanx></c>
2672 <c><spanx style="vbare">C D C C C D D C D D D C D D D</spanx></c>
2674 <c><spanx style="vbare">C C D D C C C C C C C C D D C</spanx></c>
2676 <c><spanx style="vbare">C D D C D C D D D D D C D C C</spanx></c>
2678 <c><spanx style="vbare">C C D C C C C D C C D D D C C</spanx></c>
2680 <c><spanx style="vbare">C C C C C C C C C C C C C C D</spanx></c>
2682 <c><spanx style="vbare">C C C C C C C C C C C C D C C</spanx></c>
2684 <c><spanx style="vbare">C C C C C C C C C C C C C C C</spanx></c>
2686 <c><spanx style="vbare">C D C D C D D C D C D C D D C</spanx></c>
2688 <c><spanx style="vbare">C C D D D D C D D C C D D C C</spanx></c>
2690 <c><spanx style="vbare">C D D C D C D C D C C C C D C</spanx></c>
2692 <c><spanx style="vbare">C C C D D C D C D D D D D D D</spanx></c>
2694 <c><spanx style="vbare">C C C C C C C C C C C C C C D</spanx></c>
2696 <c><spanx style="vbare">C D D C C C D D C C D D D D D</spanx></c>
2698 <c><spanx style="vbare">C C C C C D C D D D D C D D D</spanx></c>
2700 <c><spanx style="vbare">C C C C C C C C C C C C C C D</spanx></c>
2702 <c><spanx style="vbare">C C C C C C C C C C C C C C D</spanx></c>
2704 <c><spanx style="vbare">D C C C C C C C C C C D C C C</spanx></c>
2706 <c><spanx style="vbare">C C D C C D D D C C D C C D C</spanx></c>
2711 <section anchor="silk_nlsf_reconstruction"
2712 title="Reconstructing the Normalized LSF Coefficients">
2714 Once the stage-1 index I1 and the stage-2 residual res_Q10[] have been decoded,
2715 the final normalized LSF coefficients can be reconstructed.
2718 The spectral distortion introduced by the quantization of each LSF coefficient
2719 varies, so the stage-2 residual is weighted accordingly, using the
2720 low-complexity Inverse Harmonic Mean Weighting (IHMW) function proposed in
2721 <xref target="laroia-icassp"/>.
2722 The weights are derived directly from the stage-1 codebook vector.
2723 Let cb1_Q8[k] be the k'th entry of the stage-1 codebook vector from
2724 <xref target="silk_nlsf_nbmb_codebook"/> or
2725 <xref target="silk_nlsf_wb_codebook"/>.
Then for 0 &lt;= k &lt; d_LPC the following expression
computes the square of the weight as a Q18 value:
<figure align="center">
<artwork align="center"><![CDATA[
w2_Q18[k] = (1024/(cb1_Q8[k] - cb1_Q8[k-1])
             + 1024/(cb1_Q8[k+1] - cb1_Q8[k])) << 16 ,
]]></artwork>
</figure>
2736 where cb1_Q8[-1] = 0 and cb1_Q8[d_LPC] = 256, and the
2737 division is exact integer division.
This is reduced to an unsquared, Q9 value using the following square-root
approximation:
<figure align="center">
<artwork align="center"><![CDATA[
      i = ilog(w2_Q18[k])
      f = (w2_Q18[k]>>(i-8)) & 127
      y = ((i&1) ? 32768 : 46214) >> ((32-i)>>1)
w_Q9[k] = y + ((213*f*y)>>16)
]]></artwork>
</figure>
2748 The cb1_Q8[] vector completely determines these weights, and they may be
2749 tabulated and stored as 13-bit unsigned values (with a range of 1819 to 5227,
2750 inclusive) to avoid computing them when decoding.
2751 The reference implementation already requires code to compute these weights on
2752 unquantized coefficients in the encoder, in silk_NLSF_VQ_weights_laroia()
2753 (NLSF_VQ_weights_laroia.c) and its callers, so it reuses that code in the
decoder instead of using a pre-computed table to reduce the amount of ROM
required.
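<t>
A minimal Python sketch of the weight computation is given below, applied to
the first NB/MB stage-1 codebook vector.
The function names are illustrative assumptions; ilog() is taken to be the
number of bits needed to store its argument.
The sketch follows the two formulas in this section rather than the code path
of silk_NLSF_VQ_weights_laroia().
</t>
<figure>
<artwork><![CDATA[
```python
def ilog(x):
    """Minimum number of bits needed to store x (0 for x == 0)."""
    return x.bit_length()

def ihmw_weights_q9(cb1_q8):
    """Return w_Q9[k] for one stage-1 codebook vector (values in Q8)."""
    d_lpc = len(cb1_q8)
    ext = [0] + list(cb1_q8) + [256]     # cb1_Q8[-1] = 0, cb1_Q8[d_LPC] = 256
    w_q9 = []
    for k in range(1, d_lpc + 1):
        w2_q18 = (1024 // (ext[k] - ext[k - 1])
                  + 1024 // (ext[k + 1] - ext[k])) << 16
        i = ilog(w2_q18)
        f = (w2_q18 >> (i - 8)) & 127
        y = (32768 if i & 1 else 46214) >> ((32 - i) >> 1)
        w_q9.append(y + ((213 * f * y) >> 16))
    return w_q9

weights = ihmw_weights_q9([12, 35, 60, 83, 108, 132, 157, 180, 206, 228])
```
]]></artwork>
</figure>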
2758 <texttable anchor="silk_nlsf_nbmb_codebook"
2759 title="Codebook Vectors for NB/MB Normalized LSF Stage 1 Decoding">
2761 <ttcol>Codebook (Q8)</ttcol>
2763 <c><spanx style="vbare"> 0 1 2 3 4 5 6 7 8 9</spanx></c>
2765 <c><spanx style="vbare">12 35 60 83 108 132 157 180 206 228</spanx></c>
2767 <c><spanx style="vbare">15 32 55 77 101 125 151 175 201 225</spanx></c>
2769 <c><spanx style="vbare">19 42 66 89 114 137 162 184 209 230</spanx></c>
2771 <c><spanx style="vbare">12 25 50 72 97 120 147 172 200 223</spanx></c>
2773 <c><spanx style="vbare">26 44 69 90 114 135 159 180 205 225</spanx></c>
2775 <c><spanx style="vbare">13 22 53 80 106 130 156 180 205 228</spanx></c>
2777 <c><spanx style="vbare">15 25 44 64 90 115 142 168 196 222</spanx></c>
2779 <c><spanx style="vbare">19 24 62 82 100 120 145 168 190 214</spanx></c>
2781 <c><spanx style="vbare">22 31 50 79 103 120 151 170 203 227</spanx></c>
2783 <c><spanx style="vbare">21 29 45 65 106 124 150 171 196 224</spanx></c>
2785 <c><spanx style="vbare">30 49 75 97 121 142 165 186 209 229</spanx></c>
2787 <c><spanx style="vbare">19 25 52 70 93 116 143 166 192 219</spanx></c>
2789 <c><spanx style="vbare">26 34 62 75 97 118 145 167 194 217</spanx></c>
2791 <c><spanx style="vbare">25 33 56 70 91 113 143 165 196 223</spanx></c>
2793 <c><spanx style="vbare">21 34 51 72 97 117 145 171 196 222</spanx></c>
2795 <c><spanx style="vbare">20 29 50 67 90 117 144 168 197 221</spanx></c>
2797 <c><spanx style="vbare">22 31 48 66 95 117 146 168 196 222</spanx></c>
2799 <c><spanx style="vbare">24 33 51 77 116 134 158 180 200 224</spanx></c>
2801 <c><spanx style="vbare">21 28 70 87 106 124 149 170 194 217</spanx></c>
2803 <c><spanx style="vbare">26 33 53 64 83 117 152 173 204 225</spanx></c>
2805 <c><spanx style="vbare">27 34 65 95 108 129 155 174 210 225</spanx></c>
2807 <c><spanx style="vbare">20 26 72 99 113 131 154 176 200 219</spanx></c>
2809 <c><spanx style="vbare">34 43 61 78 93 114 155 177 205 229</spanx></c>
2811 <c><spanx style="vbare">23 29 54 97 124 138 163 179 209 229</spanx></c>
2813 <c><spanx style="vbare">30 38 56 89 118 129 158 178 200 231</spanx></c>
2815 <c><spanx style="vbare">21 29 49 63 85 111 142 163 193 222</spanx></c>
2817 <c><spanx style="vbare">27 48 77 103 133 158 179 196 215 232</spanx></c>
2819 <c><spanx style="vbare">29 47 74 99 124 151 176 198 220 237</spanx></c>
2821 <c><spanx style="vbare">33 42 61 76 93 121 155 174 207 225</spanx></c>
2823 <c><spanx style="vbare">29 53 87 112 136 154 170 188 208 227</spanx></c>
2825 <c><spanx style="vbare">24 30 52 84 131 150 166 186 203 229</spanx></c>
2827 <c><spanx style="vbare">37 48 64 84 104 118 156 177 201 230</spanx></c>
2830 <texttable anchor="silk_nlsf_wb_codebook"
2831 title="Codebook Vectors for WB Normalized LSF Stage 1 Decoding">
2833 <ttcol>Codebook (Q8)</ttcol>
2835 <c><spanx style="vbare"> 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15</spanx></c>
2837 <c><spanx style="vbare"> 7 23 38 54 69 85 100 116 131 147 162 178 193 208 223 239</spanx></c>
2839 <c><spanx style="vbare">13 25 41 55 69 83 98 112 127 142 157 171 187 203 220 236</spanx></c>
2841 <c><spanx style="vbare">15 21 34 51 61 78 92 106 126 136 152 167 185 205 225 240</spanx></c>
2843 <c><spanx style="vbare">10 21 36 50 63 79 95 110 126 141 157 173 189 205 221 237</spanx></c>
2845 <c><spanx style="vbare">17 20 37 51 59 78 89 107 123 134 150 164 184 205 224 240</spanx></c>
2847 <c><spanx style="vbare">10 15 32 51 67 81 96 112 129 142 158 173 189 204 220 236</spanx></c>
2849 <c><spanx style="vbare"> 8 21 37 51 65 79 98 113 126 138 155 168 179 192 209 218</spanx></c>
2851 <c><spanx style="vbare">12 15 34 55 63 78 87 108 118 131 148 167 185 203 219 236</spanx></c>
2853 <c><spanx style="vbare">16 19 32 36 56 79 91 108 118 136 154 171 186 204 220 237</spanx></c>
2855 <c><spanx style="vbare">11 28 43 58 74 89 105 120 135 150 165 180 196 211 226 241</spanx></c>
2857 <c><spanx style="vbare"> 6 16 33 46 60 75 92 107 123 137 156 169 185 199 214 225</spanx></c>
2859 <c><spanx style="vbare">11 19 30 44 57 74 89 105 121 135 152 169 186 202 218 234</spanx></c>
2861 <c><spanx style="vbare">12 19 29 46 57 71 88 100 120 132 148 165 182 199 216 233</spanx></c>
2863 <c><spanx style="vbare">17 23 35 46 56 77 92 106 123 134 152 167 185 204 222 237</spanx></c>
2865 <c><spanx style="vbare">14 17 45 53 63 75 89 107 115 132 151 171 188 206 221 240</spanx></c>
2867 <c><spanx style="vbare"> 9 16 29 40 56 71 88 103 119 137 154 171 189 205 222 237</spanx></c>
2869 <c><spanx style="vbare">16 19 36 48 57 76 87 105 118 132 150 167 185 202 218 236</spanx></c>
2871 <c><spanx style="vbare">12 17 29 54 71 81 94 104 126 136 149 164 182 201 221 237</spanx></c>
2873 <c><spanx style="vbare">15 28 47 62 79 97 115 129 142 155 168 180 194 208 223 238</spanx></c>
2875 <c><spanx style="vbare"> 8 14 30 45 62 78 94 111 127 143 159 175 192 207 223 239</spanx></c>
2877 <c><spanx style="vbare">17 30 49 62 79 92 107 119 132 145 160 174 190 204 220 235</spanx></c>
2879 <c><spanx style="vbare">14 19 36 45 61 76 91 108 121 138 154 172 189 205 222 238</spanx></c>
2881 <c><spanx style="vbare">12 18 31 45 60 76 91 107 123 138 154 171 187 204 221 236</spanx></c>
2883 <c><spanx style="vbare">13 17 31 43 53 70 83 103 114 131 149 167 185 203 220 237</spanx></c>
2885 <c><spanx style="vbare">17 22 35 42 58 78 93 110 125 139 155 170 188 206 224 240</spanx></c>
2887 <c><spanx style="vbare"> 8 15 34 50 67 83 99 115 131 146 162 178 193 209 224 239</spanx></c>
2889 <c><spanx style="vbare">13 16 41 66 73 86 95 111 128 137 150 163 183 206 225 241</spanx></c>
2891 <c><spanx style="vbare">17 25 37 52 63 75 92 102 119 132 144 160 175 191 212 231</spanx></c>
2893 <c><spanx style="vbare">19 31 49 65 83 100 117 133 147 161 174 187 200 213 227 242</spanx></c>
2895 <c><spanx style="vbare">18 31 52 68 88 103 117 126 138 149 163 177 192 207 223 239</spanx></c>
2897 <c><spanx style="vbare">16 29 47 61 76 90 106 119 133 147 161 176 193 209 224 240</spanx></c>
2899 <c><spanx style="vbare">15 21 35 50 61 73 86 97 110 119 129 141 175 198 218 237</spanx></c>
Given the stage-1 codebook entry cb1_Q8[], the stage-2 residual res_Q10[], and
their corresponding weights, w_Q9[], the reconstructed normalized LSF
coefficients are
<figure align="center">
<artwork align="center"><![CDATA[
NLSF_Q15[k] = clamp(0,
               (cb1_Q8[k]<<7) + (res_Q10[k]<<14)/w_Q9[k], 32767) ,
]]></artwork>
</figure>
where the division is exact integer division.
However, nothing in either the reconstruction process or the
quantization process in the encoder thus far guarantees that the coefficients
are monotonically increasing and separated well enough to ensure a stable
filter.
When using the reference encoder, roughly 2% of frames violate this constraint.
The next section describes a stabilization procedure used to make these
guarantees.
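<t>
The reconstruction formula can be sketched in Python as shown below.
The helper names are hypothetical.
The sketch assumes that "exact integer division" truncates toward zero, as C
integer division does; Python's // operator floors instead, so a small helper
is used.
</t>
<figure>
<artwork><![CDATA[
```python
def div_trunc(a, b):
    """Integer division truncating toward zero (assumes b > 0)."""
    q = abs(a) // b
    return q if a >= 0 else -q

def reconstruct_nlsf_q15(cb1_q8, res_q10, w_q9):
    """Combine the stage-1 vector, stage-2 residual, and weights."""
    out = []
    for cb, res, w in zip(cb1_q8, res_q10, w_q9):
        value = (cb << 7) + div_trunc(res << 14, w)
        out.append(max(0, min(value, 32767)))   # clamp to 0..32767
    return out
```
]]></artwork>
</figure>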
2924 <section anchor="silk_nlsf_stabilization" title="Normalized LSF Stabilization">
2926 The normalized LSF stabilization procedure is implemented in
2927 silk_NLSF_stabilize() (NLSF_stabilize.c).
2928 This process ensures that consecutive values of the normalized LSF
2929 coefficients, NLSF_Q15[], are spaced some minimum distance apart
2930 (predetermined to be the 0.01 percentile of a large training set).
2931 <xref target="silk_nlsf_min_spacing"/> gives the minimum spacings for NB and MB
2932 and those for WB, where row k is the minimum allowed value of
NLSF_Q15[k]-NLSF_Q15[k-1].
2934 For the purposes of computing this spacing for the first and last coefficient,
2935 NLSF_Q15[-1] is taken to be 0, and NLSF_Q15[d_LPC] is taken to be 32768.
2938 <texttable anchor="silk_nlsf_min_spacing"
2939 title="Minimum Spacing for Normalized LSF Coefficients">
2940 <ttcol>Coefficient</ttcol>
2941 <ttcol align="right">NB and MB</ttcol>
2942 <ttcol align="right">WB</ttcol>
2943 <c>0</c> <c>250</c> <c>100</c>
2944 <c>1</c> <c>3</c> <c>3</c>
2945 <c>2</c> <c>6</c> <c>40</c>
2946 <c>3</c> <c>3</c> <c>3</c>
2947 <c>4</c> <c>3</c> <c>3</c>
2948 <c>5</c> <c>3</c> <c>3</c>
2949 <c>6</c> <c>4</c> <c>5</c>
2950 <c>7</c> <c>3</c> <c>14</c>
2951 <c>8</c> <c>3</c> <c>14</c>
2952 <c>9</c> <c>3</c> <c>10</c>
2953 <c>10</c> <c>461</c> <c>11</c>
2954 <c>11</c> <c/> <c>3</c>
2955 <c>12</c> <c/> <c>8</c>
2956 <c>13</c> <c/> <c>9</c>
2957 <c>14</c> <c/> <c>7</c>
2958 <c>15</c> <c/> <c>3</c>
2959 <c>16</c> <c/> <c>347</c>
2963 The procedure starts off by trying to make small adjustments which attempt to
2964 minimize the amount of distortion introduced.
2965 After 20 such adjustments, it falls back to a more direct method which
2966 guarantees the constraints are enforced but may require large adjustments.
2969 Let NDeltaMin_Q15[k] be the minimum required spacing for the current audio
2970 bandwidth from <xref target="silk_nlsf_min_spacing"/>.
2971 First, the procedure finds the index i where
2972 NLSF_Q15[i] - NLSF_Q15[i-1] - NDeltaMin_Q15[i] is the
2973 smallest, breaking ties by using the lower value of i.
2974 If this value is non-negative, then the stabilization stops; the coefficients
2975 satisfy all the constraints.
2976 Otherwise, if i == 0, it sets NLSF_Q15[0] to NDeltaMin_Q15[0], and if
2977 i == d_LPC, it sets NLSF_Q15[d_LPC-1] to
2978 (32768 - NDeltaMin_Q15[d_LPC]).
For all other values of i, both NLSF_Q15[i-1] and NLSF_Q15[i] are updated as
follows:
<figure align="center">
<artwork align="center"><![CDATA[
                                          i-1
                                          __
 min_center_Q15 = (NDeltaMin_Q15[i]>>1) + \  NDeltaMin_Q15[k]
                                          /_
                                          k=0

                                                 d_LPC
                                                  __
 max_center_Q15 = 32768 - (NDeltaMin_Q15[i]>>1) - \  NDeltaMin_Q15[k]
                                                  /_
                                                 k=i+1

center_freq_Q15 = clamp(min_center_Q15,
                        (NLSF_Q15[i-1] + NLSF_Q15[i] + 1)>>1,
                        max_center_Q15)

  NLSF_Q15[i-1] = center_freq_Q15 - (NDeltaMin_Q15[i]>>1)

    NLSF_Q15[i] = NLSF_Q15[i-1] + NDeltaMin_Q15[i] .
]]></artwork>
</figure>
3002 Then the procedure repeats again, until it has either executed 20 times or
3003 has stopped because the coefficients satisfy all the constraints.
3006 After the 20th repetition of the above procedure, the following fallback
3007 procedure executes once.
First, the values of NLSF_Q15[k] for 0 &lt;= k &lt; d_LPC
are sorted in ascending order.
Then for each value of k from 0 to d_LPC-1, NLSF_Q15[k] is set to
<figure align="center">
<artwork align="center"><![CDATA[
max(NLSF_Q15[k], NLSF_Q15[k-1] + NDeltaMin_Q15[k]) .
]]></artwork>
</figure>
Next, for each value of k from d_LPC-1 down to 0, NLSF_Q15[k] is set to
<figure align="center">
<artwork align="center"><![CDATA[
min(NLSF_Q15[k], NLSF_Q15[k+1] - NDeltaMin_Q15[k+1]) .
]]></artwork>
</figure>
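<t>
The complete stabilization procedure, including the fallback, can be sketched
in Python as follows.
This is an illustrative sketch, not silk_NLSF_stabilize() itself: the function
name and list-based interface are assumptions, the sums in the center limits
are taken over k below i and k above i, respectively, and the spacing table is
the NB/MB column given earlier in this section.
</t>
<figure>
<artwork><![CDATA[
```python
NDELTA_MIN_NB_MB_Q15 = [250, 3, 6, 3, 3, 3, 4, 3, 3, 3, 461]

def stabilize_nlsf(nlsf_q15, ndelta_min_q15, max_loops=20):
    """Return a stabilized copy of nlsf_q15 (a list of d_LPC values)."""
    nlsf = list(nlsf_q15)
    d_lpc = len(nlsf)
    assert len(ndelta_min_q15) == d_lpc + 1

    def slack(i):
        # Spacing minus required minimum, with NLSF_Q15[-1] = 0 and
        # NLSF_Q15[d_LPC] = 32768.
        lo = nlsf[i - 1] if i > 0 else 0
        hi = nlsf[i] if i < d_lpc else 32768
        return hi - lo - ndelta_min_q15[i]

    for _ in range(max_loops):
        # min() keeps the first minimum, i.e., the lowest i on ties.
        i = min(range(d_lpc + 1), key=slack)
        if slack(i) >= 0:
            return nlsf                  # all constraints satisfied
        if i == 0:
            nlsf[0] = ndelta_min_q15[0]
        elif i == d_lpc:
            nlsf[d_lpc - 1] = 32768 - ndelta_min_q15[d_lpc]
        else:
            half = ndelta_min_q15[i] >> 1
            min_center = half + sum(ndelta_min_q15[:i])
            max_center = 32768 - half - sum(ndelta_min_q15[i + 1:])
            center = (nlsf[i - 1] + nlsf[i] + 1) >> 1
            center = max(min_center, min(center, max_center))
            nlsf[i - 1] = center - half
            nlsf[i] = nlsf[i - 1] + ndelta_min_q15[i]

    # Fallback: sort, then enforce the spacing upward and downward.
    nlsf.sort()
    for k in range(d_lpc):
        prev = nlsf[k - 1] if k > 0 else 0
        nlsf[k] = max(nlsf[k], prev + ndelta_min_q15[k])
    for k in range(d_lpc - 1, -1, -1):
        nxt = nlsf[k + 1] if k + 1 < d_lpc else 32768
        nlsf[k] = min(nlsf[k], nxt - ndelta_min_q15[k + 1])
    return nlsf
```
]]></artwork>
</figure>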
3026 <section anchor="silk_nlsf_interpolation" title="Normalized LSF Interpolation">
For 20 ms SILK frames, the first half of the frame (i.e., the first two
subframes) may use normalized LSF coefficients that are interpolated between
the decoded LSFs for the most recent coded frame (in the same channel) and the
current frame.
A Q2 interpolation factor follows the LSF coefficient indices in the bitstream,
which is decoded using the PDF in <xref target="silk_nlsf_interp_pdf"/>.
This happens in silk_decode_indices() (decode_indices.c).
After either
<list style="symbols">
<t>An uncoded regular SILK frame in the side channel, or</t>
<t>A decoder reset (see <xref target="decoder-reset"/>),</t>
</list>
the decoder still decodes this factor, but ignores its value and always uses
4 instead.
For 10 ms SILK frames, this factor is not stored at all.
<texttable anchor="silk_nlsf_interp_pdf"
           title="PDF for Normalized LSF Interpolation Index">
<ttcol>PDF</ttcol>
<c>{13, 22, 29, 11, 181}/256</c>
</texttable>
3052 Let n2_Q15[k] be the normalized LSF coefficients decoded by the procedure in
3053 <xref target="silk_nlsfs"/>, n0_Q15[k] be the LSF coefficients
3054 decoded for the prior frame, and w_Q2 be the interpolation factor.
3055 Then the normalized LSF coefficients used for the first half of a 20 ms
3056 frame, n1_Q15[k], are
<figure align="center">
<artwork align="center"><![CDATA[
n1_Q15[k] = n0_Q15[k] + (w_Q2*(n2_Q15[k] - n0_Q15[k]) >> 2) .
]]></artwork>
</figure>
3062 This interpolation is performed in silk_decode_parameters()
3063 (decode_parameters.c).
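<t>
As a small illustration, the interpolation can be written in Python as below.
The function name is a hypothetical choice; a factor of 4 reproduces the
current frame's coefficients and a factor of 0 reproduces the prior frame's.
</t>
<figure>
<artwork><![CDATA[
```python
def interpolate_nlsf_q15(n0_q15, n2_q15, w_q2):
    """Interpolate between prior (n0) and current (n2) NLSFs, w_q2 in 0..4."""
    return [n0 + ((w_q2 * (n2 - n0)) >> 2)
            for n0, n2 in zip(n0_q15, n2_q15)]
```
]]></artwork>
</figure>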
3067 <section anchor="silk_nlsf2lpc"
3068 title="Converting Normalized LSFs to LPC Coefficients">
3070 Any LPC filter A(z) can be split into a symmetric part P(z) and an
3071 anti-symmetric part Q(z) such that
<figure align="center">
<artwork align="center"><![CDATA[
           d_LPC
            __         -k   1
A(z) = 1 -  \  a[k] * z   = - * (P(z) + Q(z))
            /_              2
            k=1
]]></artwork>
</figure>
with
<figure align="center">
<artwork align="center"><![CDATA[
               -d_LPC-1      -1
P(z) = A(z) + z         * A(z  )

               -d_LPC-1      -1
Q(z) = A(z) - z         * A(z  ) .
]]></artwork>
</figure>
3091 The even normalized LSF coefficients correspond to a pair of conjugate roots of
3092 P(z), while the odd coefficients correspond to a pair of conjugate roots of
3093 Q(z), all of which lie on the unit circle.
3094 In addition, P(z) has a root at pi and Q(z) has a root at 0.
3095 Thus, they may be reconstructed mathematically from a set of normalized LSF
3096 coefficients, n[k], as
<figure align="center">
<artwork align="center"><![CDATA[
                d_LPC/2-1
             -1    ___                         -1    -2
P(z) = (1 + z  ) * | |  (1 - 2*cos(pi*n[2*k])*z   + z  )
                   k=0

                d_LPC/2-1
             -1    ___                           -1    -2
Q(z) = (1 - z  ) * | |  (1 - 2*cos(pi*n[2*k+1])*z   + z  )
                   k=0
]]></artwork>
</figure>
However, SILK performs this reconstruction using a fixed-point approximation so
that all decoders can reproduce it in a bit-exact manner to avoid prediction
drift.
The function silk_NLSF2A() (NLSF2A.c) implements this procedure.
To start, it approximates cos(pi*n[k]) using a table lookup with linear
interpolation.
3120 The encoder SHOULD use the inverse of this piecewise linear approximation,
3121 rather than the true inverse of the cosine function, when deriving the
3122 normalized LSF coefficients.
3123 These values are also re-ordered to improve numerical accuracy when
3124 constructing the LPC polynomials.
3127 <texttable anchor="silk_nlsf_orderings"
3128 title="LSF Ordering for Polynomial Evaluation">
3129 <ttcol>Coefficient</ttcol>
3130 <ttcol align="right">NB and MB</ttcol>
3131 <ttcol align="right">WB</ttcol>
3132 <c>0</c> <c>0</c> <c>0</c>
3133 <c>1</c> <c>9</c> <c>15</c>
3134 <c>2</c> <c>6</c> <c>8</c>
3135 <c>3</c> <c>3</c> <c>7</c>
3136 <c>4</c> <c>4</c> <c>4</c>
3137 <c>5</c> <c>5</c> <c>11</c>
3138 <c>6</c> <c>8</c> <c>12</c>
3139 <c>7</c> <c>1</c> <c>3</c>
3140 <c>8</c> <c>2</c> <c>2</c>
3141 <c>9</c> <c>7</c> <c>13</c>
3142 <c>10</c> <c/> <c>10</c>
3143 <c>11</c> <c/> <c>5</c>
3144 <c>12</c> <c/> <c>6</c>
3145 <c>13</c> <c/> <c>9</c>
3146 <c>14</c> <c/> <c>14</c>
3147 <c>15</c> <c/> <c>1</c>
3151 The top 7 bits of each normalized LSF coefficient index a value in the table,
3152 and the next 8 bits interpolate between it and the next value.
Let i = (n[k] >> 8) be the integer index and
f = (n[k] &amp; 255) be the fractional part of a given
coefficient.
Then the re-ordered, approximated cosine, c_Q17[ordering[k]], is
<figure align="center">
<artwork align="center"><![CDATA[
c_Q17[ordering[k]] = (cos_Q12[i]*256
                      + (cos_Q12[i+1]-cos_Q12[i])*f + 4) >> 3 ,
]]></artwork>
</figure>
3163 where ordering[k] is the k'th entry of the column of
3164 <xref target="silk_nlsf_orderings"/> corresponding to the current audio
3165 bandwidth and cos_Q12[i] is the i'th entry of <xref target="silk_cos_table"/>.
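<t>
The table lookup and re-ordering can be sketched in Python as follows.
The function names are assumptions made for illustration, and the short list
COS_Q12_HEAD holds only the first five entries of the cosine table; a real
decoder needs all 129 entries.
</t>
<figure>
<artwork><![CDATA[
```python
# First five entries (i = 0..4) of the Q12 cosine table; illustration only.
COS_Q12_HEAD = [4096, 4095, 4091, 4085, 4076]

def approx_cos_q17(n_q15, cos_q12):
    """Piecewise-linear approximation of cos(pi*n) for a Q15 NLSF value."""
    i = n_q15 >> 8                       # integer table index
    f = n_q15 & 255                      # fractional part
    return (cos_q12[i] * 256 + (cos_q12[i + 1] - cos_q12[i]) * f + 4) >> 3

def reorder_cosines(nlsf_q15, ordering, cos_q12):
    """Compute c_Q17[ordering[k]] for every coefficient k."""
    c_q17 = [0] * len(nlsf_q15)
    for k, n in enumerate(nlsf_q15):
        c_q17[ordering[k]] = approx_cos_q17(n, cos_q12)
    return c_q17
```
]]></artwork>
</figure>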
3168 <texttable anchor="silk_cos_table"
3169 title="Q12 Cosine Table for LSF Conversion">
3170 <ttcol align="right">i</ttcol>
3171 <ttcol align="right">+0</ttcol>
3172 <ttcol align="right">+1</ttcol>
3173 <ttcol align="right">+2</ttcol>
3174 <ttcol align="right">+3</ttcol>
<c>0</c> <c>4096</c> <c>4095</c> <c>4091</c> <c>4085</c>
<c>4</c> <c>4076</c> <c>4065</c> <c>4052</c> <c>4036</c>
<c>8</c> <c>4017</c> <c>3997</c> <c>3973</c> <c>3948</c>
<c>12</c> <c>3920</c> <c>3889</c> <c>3857</c> <c>3822</c>
<c>16</c> <c>3784</c> <c>3745</c> <c>3703</c> <c>3659</c>
<c>20</c> <c>3613</c> <c>3564</c> <c>3513</c> <c>3461</c>
<c>24</c> <c>3406</c> <c>3349</c> <c>3290</c> <c>3229</c>
<c>28</c> <c>3166</c> <c>3102</c> <c>3035</c> <c>2967</c>
<c>32</c> <c>2896</c> <c>2824</c> <c>2751</c> <c>2676</c>
<c>36</c> <c>2599</c> <c>2520</c> <c>2440</c> <c>2359</c>
<c>40</c> <c>2276</c> <c>2191</c> <c>2106</c> <c>2019</c>
<c>44</c> <c>1931</c> <c>1842</c> <c>1751</c> <c>1660</c>
<c>48</c> <c>1568</c> <c>1474</c> <c>1380</c> <c>1285</c>
<c>52</c> <c>1189</c> <c>1093</c> <c>995</c> <c>897</c>
<c>56</c> <c>799</c> <c>700</c> <c>601</c> <c>501</c>
<c>60</c> <c>401</c> <c>301</c> <c>201</c> <c>101</c>
<c>64</c> <c>0</c> <c>-101</c> <c>-201</c> <c>-301</c>
<c>68</c> <c>-401</c> <c>-501</c> <c>-601</c> <c>-700</c>
<c>72</c> <c>-799</c> <c>-897</c> <c>-995</c> <c>-1093</c>
<c>76</c> <c>-1189</c> <c>-1285</c> <c>-1380</c> <c>-1474</c>
<c>80</c> <c>-1568</c> <c>-1660</c> <c>-1751</c> <c>-1842</c>
<c>84</c> <c>-1931</c> <c>-2019</c> <c>-2106</c> <c>-2191</c>
<c>88</c> <c>-2276</c> <c>-2359</c> <c>-2440</c> <c>-2520</c>
<c>92</c> <c>-2599</c> <c>-2676</c> <c>-2751</c> <c>-2824</c>
<c>96</c> <c>-2896</c> <c>-2967</c> <c>-3035</c> <c>-3102</c>
<c>100</c> <c>-3166</c> <c>-3229</c> <c>-3290</c> <c>-3349</c>
<c>104</c> <c>-3406</c> <c>-3461</c> <c>-3513</c> <c>-3564</c>
<c>108</c> <c>-3613</c> <c>-3659</c> <c>-3703</c> <c>-3745</c>
<c>112</c> <c>-3784</c> <c>-3822</c> <c>-3857</c> <c>-3889</c>
<c>116</c> <c>-3920</c> <c>-3948</c> <c>-3973</c> <c>-3997</c>
<c>120</c> <c>-4017</c> <c>-4036</c> <c>-4052</c> <c>-4065</c>
<c>124</c> <c>-4076</c> <c>-4085</c> <c>-4091</c> <c>-4095</c>
<c>128</c> <c>-4096</c> <c/> <c/> <c/>
</texttable>
3244 Given the list of cosine values, silk_NLSF2A_find_poly() (NLSF2A.c)
3245 computes the coefficients of P and Q, described here via a simple recurrence.
3246 Let p_Q16[k][j] and q_Q16[k][j] be the coefficients of the products of the
3247 first (k+1) root pairs for P and Q, with j indexing the coefficient number.
3248 Only the first (k+2) coefficients are needed, as the products are symmetric.
Let p_Q16[0][0] = q_Q16[0][0] = 1&lt;&lt;16,
p_Q16[0][1] = -c_Q17[0], q_Q16[0][1] = -c_Q17[1], and
d2 = d_LPC/2.
As boundary conditions, assume
p_Q16[k][j] = q_Q16[k][j] = 0 for all
j &lt; 0.
Also, assume p_Q16[k][k+2] = p_Q16[k][k] and
q_Q16[k][k+2] = q_Q16[k][k] (because of the symmetry).
Then, for 0 &lt; k &lt; d2 and 0 &lt;= j &lt;= k+1,
<figure align="center">
<artwork align="center"><![CDATA[
p_Q16[k][j] = p_Q16[k-1][j] + p_Q16[k-1][j-2]
              - ((c_Q17[2*k]*p_Q16[k-1][j-1] + 32768)>>16) ,

q_Q16[k][j] = q_Q16[k-1][j] + q_Q16[k-1][j-2]
              - ((c_Q17[2*k+1]*q_Q16[k-1][j-1] + 32768)>>16) .
]]></artwork>
</figure>
3267 The use of Q17 values for the cosine terms in an otherwise Q16 expression
3268 implicitly scales them by a factor of 2.
3269 The multiplications in this recurrence may require up to 48 bits of precision
3270 in the result to avoid overflow.
3271 In practice, each row of the recurrence only depends on the previous row, so an
3272 implementation does not need to store all of them.
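<t>
The recurrence can be sketched in Python as shown below, keeping only one row
at a time.
This is an illustrative sketch with assumed names (find_poly_q16() and its
parity argument), not the code of silk_NLSF2A_find_poly(); Python integers do
not overflow, so the 48-bit intermediate products need no special handling
here.
</t>
<figure>
<artwork><![CDATA[
```python
def find_poly_q16(c_q17, parity, d2):
    """Return the last row of the recurrence (d2+1 coefficients).

    parity = 0 selects the even cosines (P); parity = 1 the odd ones (Q).
    """
    row = [1 << 16, -c_q17[parity]]      # row k = 0, coefficients j = 0, 1
    for k in range(1, d2):
        c = c_q17[2 * k + parity]
        prev = row + [row[k - 1]]        # symmetry: prev[k+1] = prev[k-1]

        def at(j):
            return prev[j] if j >= 0 else 0   # boundary: zero for j below 0

        row = [at(j) + at(j - 2) - ((c * at(j - 1) + 32768) >> 16)
               for j in range(k + 2)]
    return row
```
]]></artwork>
</figure>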
3275 silk_NLSF2A() uses the values from the last row of this recurrence to
3276 reconstruct a 32-bit version of the LPC filter (without the leading 1.0
coefficient), a32_Q17[k], 0 &lt;= k &lt; d2:
<figure align="center">
<artwork align="center"><![CDATA[
a32_Q17[k]         = -(q_Q16[d2-1][k+1] - q_Q16[d2-1][k])
                     - (p_Q16[d2-1][k+1] + p_Q16[d2-1][k]) ,

a32_Q17[d_LPC-k-1] =  (q_Q16[d2-1][k+1] - q_Q16[d2-1][k])
                     - (p_Q16[d2-1][k+1] + p_Q16[d2-1][k]) .
]]></artwork>
</figure>
The sum and difference of two terms from each of the p_Q16 and q_Q16
coefficient lists reflect the (1 + z**-1) and
(1 - z**-1) factors of P and Q, respectively.
The promotion of the expression from Q16 to Q17 implicitly scales the result
by 1/2.
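<t>
A short Python sketch of this final combination step follows.
The function name is hypothetical, and the inputs are assumed to be the last
rows of the P and Q recurrences, each holding d2+1 coefficients.
</t>
<figure>
<artwork><![CDATA[
```python
def lpc_q17_from_polys(p_q16, q_q16):
    """Combine the last rows of P and Q into a32_Q17[0..d_LPC-1]."""
    d2 = len(p_q16) - 1
    d_lpc = 2 * d2
    a32_q17 = [0] * d_lpc
    for k in range(d2):
        q_diff = q_q16[k + 1] - q_q16[k]     # (1 - z**-1) factor of Q
        p_sum = p_q16[k + 1] + p_q16[k]      # (1 + z**-1) factor of P
        a32_q17[k] = -q_diff - p_sum
        a32_q17[d_lpc - k - 1] = q_diff - p_sum
    return a32_q17
```
]]></artwork>
</figure>
<t>
For example, when every cosine is zero the filter is
A(z) = 1 + 2*z**-2 + z**-4, and the sketch returns
[0, -262144, 0, -131072], i.e., [0, -2, 0, -1] in Q17.
</t>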
3295 <section anchor="silk_lpc_range_limit"
3296 title="Limiting the Range of the LPC Coefficients">
3298 The a32_Q17[] coefficients are too large to fit in a 16-bit value, which
significantly increases the cost of applying this filter in fixed-point
decoders.
3301 Reducing them to Q12 precision doesn't incur any significant quality loss,
3302 but still does not guarantee they will fit.
3303 silk_NLSF2A() applies up to 10 rounds of bandwidth expansion to limit
3304 the dynamic range of these coefficients.
3305 Even floating-point decoders SHOULD perform these steps, to avoid mismatch.
For each round, the process first finds the index k such that abs(a32_Q17[k])
is largest, breaking ties by choosing the lowest value of k.
Let maxabs_Q17 be that largest magnitude, abs(a32_Q17[k]).
Then, it computes the corresponding Q12 precision value, maxabs_Q12, subject to
an upper bound to avoid overflow in subsequent computations:
<figure align="center">
<artwork align="center"><![CDATA[
maxabs_Q12 = min((maxabs_Q17 + 16) >> 5, 163838) .
]]></artwork>
</figure>
3317 If this is larger than 32767, the procedure derives the chirp factor,
3318 sc_Q16[0], to use in the bandwidth expansion as
<figure align="center">
<artwork align="center"><![CDATA[
                    (maxabs_Q12 - 32767) << 14
sc_Q16[0] = 65470 - -------------------------- ,
                    (maxabs_Q12 * (k+1)) >> 2
]]></artwork>
</figure>
where the division here is exact integer division.
This is an approximation of the chirp factor needed to reduce the target
coefficient to 32767, though it is both less than 0.999 and, for
k > 0 when maxabs_Q12 is much greater than 32767, still slightly
too large.
3333 silk_bwexpander_32() (bwexpander_32.c) performs the bandwidth expansion (again,
3334 only when maxabs_Q12 is greater than 32767) using the following recurrence:
<figure align="center">
<artwork align="center"><![CDATA[
 a32_Q17[k] = (a32_Q17[k]*sc_Q16[k]) >> 16

sc_Q16[k+1] = (sc_Q16[0]*sc_Q16[k] + 32768) >> 16 .
]]></artwork>
</figure>
The first multiply may require up to 48 bits of precision in the result to
avoid overflow.
The second multiply must be unsigned to avoid overflow with only 32 bits of
precision.
The reference implementation uses a slightly more complex formulation that
avoids the 32-bit overflow using signed multiplication, but is otherwise
equivalent.
After 10 rounds of bandwidth expansion are performed, the coefficients are
simply saturated to 16 bits:
<figure align="center">
<artwork align="center"><![CDATA[
a32_Q17[k] = clamp(-32768, (a32_Q17[k] + 16) >> 5, 32767) << 5 .
]]></artwork>
</figure>
Because this performs the actual saturation in the Q12 domain, but converts the
coefficients back to the Q17 domain for the purposes of prediction gain
limiting, this step must be performed after the 10th round of bandwidth
expansion, regardless of whether or not the Q12 version of any coefficient
still overflows a 16-bit integer.
This saturation is not performed if maxabs_Q12 drops to 32767 or less prior to
the 10th round.
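<t>
The range-limiting loop as a whole can be sketched in Python as follows.
The function names mirror the description but the code is an assumption-laden
illustration, not the reference implementation: it uses Python's unbounded
integers instead of the 48-bit and unsigned 32-bit multiplies discussed above.
</t>
<figure>
<artwork><![CDATA[
```python
def bwexpander_32(a32_q17, sc0_q16):
    """Apply one round of bandwidth expansion with chirp factor sc0_q16."""
    sc = sc0_q16
    out = []
    for coef in a32_q17:
        out.append((coef * sc) >> 16)
        sc = (sc0_q16 * sc + 32768) >> 16
    return out

def limit_lpc_range(a32_q17, max_rounds=10):
    """Limit the coefficients so that their Q12 versions fit in 16 bits."""
    a32 = list(a32_q17)
    for _ in range(max_rounds):
        # max() keeps the first maximum, i.e., the lowest k on ties.
        k = max(range(len(a32)), key=lambda n: abs(a32[n]))
        maxabs_q12 = min((abs(a32[k]) + 16) >> 5, 163838)
        if maxabs_q12 <= 32767:
            return a32                   # fits; no saturation needed
        sc0 = 65470 - (((maxabs_q12 - 32767) << 14)
                       // ((maxabs_q12 * (k + 1)) >> 2))
        a32 = bwexpander_32(a32, sc0)
    # Ten rounds were performed: saturate in Q12, then return to Q17.
    return [max(-32768, min((c + 16) >> 5, 32767)) << 5 for c in a32]
```
]]></artwork>
</figure>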
3368 <section anchor="silk_lpc_gain_limit"
3369 title="Limiting the Prediction Gain of the LPC Filter">
3371 The prediction gain of an LPC synthesis filter is the square-root of the output
3372 energy when the filter is excited by a unit-energy impulse.
3373 Even if the Q12 coefficients would fit, the resulting filter may still have a
3374 significant gain (especially for voiced sounds), making the filter unstable.
3375 silk_NLSF2A() applies up to 18 additional rounds of bandwidth expansion to
3376 limit the prediction gain.
3377 Instead of controlling the amount of bandwidth expansion using the prediction
3378 gain itself (which may diverge to infinity for an unstable filter),
3379 silk_NLSF2A() uses silk_LPC_inverse_pred_gain_QA() (LPC_inv_pred_gain.c) to
3380 compute the reflection coefficients associated with the filter.
3381 The filter is stable if and only if the magnitude of these coefficients is
3382 sufficiently less than one.
3383 The reflection coefficients, rc[k], can be computed using a simple Levinson
3384 recurrence, initialized with the LPC coefficients
3385 a[d_LPC-1][n] = a[n], and then updated via
<figure align="center">
<artwork align="center"><![CDATA[
    rc[k] = -a[k][k] ,

            a[k][n] - a[k][k-n-1]*rc[k]
a[k-1][n] = --------------------------- .
                             2
                    1 - rc[k]
]]></artwork>
</figure>
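<t>
For intuition only, the recurrence can be sketched in floating point as below.
The function name and the choice to return None for an unstable filter are
assumptions of this sketch; it uses the plain test of a reflection coefficient
magnitude reaching one, not the fixed-point thresholds of the reference
decoder.
</t>
<figure>
<artwork><![CDATA[
```python
def reflection_coefficients(a):
    """Return rc[] for A(z) = 1 - sum(a[n] * z**-(n+1)), or None if unstable."""
    row = [float(x) for x in a]          # row d_LPC-1 is the filter itself
    rc = [0.0] * len(row)
    for k in range(len(row) - 1, -1, -1):
        rc[k] = -row[k]
        if abs(rc[k]) >= 1.0:
            return None                  # magnitude not below one: unstable
        denom = 1.0 - rc[k] * rc[k]
        # Row k-1 has k entries, n = 0..k-1.
        row = [(row[n] - row[k - n - 1] * rc[k]) / denom for n in range(k)]
    return rc
```
]]></artwork>
</figure>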
3398 However, silk_LPC_inverse_pred_gain_QA() approximates this using fixed-point
arithmetic to guarantee reproducible results across platforms and
implementations.
Since small changes in the coefficients can make a stable filter unstable, it
takes the real Q12 coefficients that will be used during reconstruction as
input.
Let
<figure align="center">
<artwork align="center"><![CDATA[
a32_Q12[n] = (a32_Q17[n] + 16) >> 5
]]></artwork>
</figure>
3410 be the Q12 version of the LPC coefficients that will eventually be used.
3411 As a simple initial check, the decoder computes the DC response as
<figure align="center">
<artwork align="center"><![CDATA[
          d_LPC-1
            __
DC_resp =   \   a32_Q12[n] ,
            /_
            n=0
]]></artwork>
</figure>
and if DC_resp > 4096, the filter is unstable.
3424 Increasing the precision of these Q12 coefficients to Q24 for intermediate
3425 computations allows more accurate computation of the reflection coefficients,
3426 so the decoder initializes the recurrence via
<figure align="center">
<artwork align="center"><![CDATA[
a32_Q24[d_LPC-1][n] = a32_Q12[n] << 12 .
]]></artwork>
</figure>
Then for each k from d_LPC-1 down to 0, if
abs(a32_Q24[k][k]) > 16773022, the filter is unstable and the
recurrence stops.
3435 Otherwise, row k-1 of a32_Q24 is computed from row k as
3436 <figure align="center">
3437 <artwork align="center"><![CDATA[
3438 rc_Q31[k] = -a32_Q24[k][k] << 7 ,
3440 div_Q30[k] = (1<<30) - (rc_Q31[k]*rc_Q31[k] >> 32) ,
3442 b1[k] = ilog(div_Q30[k]) ,
3444 b2[k] = b1[k] - 16 ,
                   (1<<29) - 1
  inv_Qb2[k] = ----------------------- ,
               div_Q30[k] >> (b2[k]+1)
3450 err_Q29[k] = (1<<29)
3451 - ((div_Q30[k]<<(15-b2[k]))*inv_Qb2[k] >> 16) ,