From 95e2bf7e4c8dae6bd1fd96b73a96afaf5c64f1e5 Mon Sep 17 00:00:00 2001
From: Gregory Maxwell
Date: Thu, 10 Feb 2011 16:04:59 -0500
Subject: [PATCH] Some draft updates.
In particular, this partially corrects the description of CELT to
reflect the current bitstream.

 doc/draft-ietf-codec-opus.xml | 180 ++++++++++++++++++++
 1 file changed, 86 insertions(+), 94 deletions(-)
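
Reviewer note: the balance mechanism described in the PVQ hunk below (one third of the balance applied per band, half for the band before the last, all of it for the last) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the reference implementation: `spend_with_balance`, `quantize_cost` and `round_down_to_step` are hypothetical names, the rounding of the balance share is assumed to be floor division, and the fixed step of 24 eighth-bits merely stands in for the real per-band table of permitted K values.

```python
def spend_with_balance(targets, quantize_cost):
    """Walk the bands in order, carrying unspent eighth-bits forward.

    targets: per-band PVQ allocation in 1/8 bit units.
    quantize_cost(band, budget): hypothetical callback returning the
        number of eighth-bits actually consumed when coding that band.
    Returns (eighth-bits spent per band, final balance).
    """
    balance = 0  # initialised to zero, as the text states
    spent_per_band = []
    band_count = len(targets)
    for band, target in enumerate(targets):
        remaining = band_count - band
        # One third normally, one half for the band before the last,
        # and the whole balance for the last band.
        divisor = min(3, remaining)
        share = balance // divisor  # assumption: floor rounding
        budget = target + share
        spent = quantize_cost(band, budget)
        spent_per_band.append(spent)
        # Accumulate the difference between allocated and used bits.
        balance += target - spent
    return spent_per_band, balance


def round_down_to_step(band, budget, step=24):
    """Toy quantizer cost: only multiples of `step` eighth-bits are usable."""
    return (budget // step) * step


if __name__ == "__main__":
    used, leftover = spend_with_balance([80, 80, 80, 80], round_down_to_step)
    print(used, leftover)
```

In this toy run the early bands underspend and the final band absorbs most of the accumulated balance, which is the behaviour the text is describing.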
diff --git a/doc/draft-ietf-codec-opus.xml b/doc/draft-ietf-codec-opus.xml
index d5615e09..caeca3c4 100644
--- a/doc/draft-ietf-codec-opus.xml
+++ b/doc/draft-ietf-codec-opus.xml
@@ -96,11 +96,13 @@ that is CBR by using all the bits left unused by the SILK layer.
the SILK InternetDraft with the main exception that
SILK was modified to
use the same range coder as CELT. The implementation of the CELTbased MDCT
layer is available from the CELT website and is a more recent version (0.8.1)
+layer is available from the CELT website and is a more recent version
+(0.11.0)
of the CELT InternetDraft.
The main changes
include better support for 20 ms frames as well as the ability to encode
only the higher bands using a range coder partially filled by the SILK layer.
+include better support for 20 ms frames, the ability to encode
+only the higher bands using a range coder partially filled by the SILK
+layer, and a pre/post filter used to aid coding of highly tonal signals.
In addition to their frame size, the SILK and CELT codecs require
@@ -940,7 +942,9 @@ It is derived from a basic (full overlap) window that is the same as the one use
The MDCT output is divided into bands that are designed to match the ear's critical bands,
with the exception that each band has to be at least 3 bins wide. For each band, the encoder
+with the exception that each band has to be at least 3 bins wide for the
+smallest (2.5 ms) frame size; the larger frame sizes use integer
+multiples of the 2.5 ms layout. For each band, the encoder
computes the energy that will later be encoded. Each band is then normalized by the
square root of the nonquantized energy, such that each band now forms a unit vector X.
The energy and the normalization are computed by compute_band_energies()
@@ -960,31 +964,32 @@ as implemented in quant_bands.c
The coarse quantization of the energy uses a fixed resolution of
6 dB and is the only place where entropy coding is used.
+The coarse quantization of the energy uses a fixed resolution of 6 dB.
To minimize the bitrate, prediction is applied both in time (using the previous frame)
and in frequency (using the previous bands). The 2D ztransform of
+and in frequency (using the previous bands). The prediction using the
+previous frame can be disabled, creating an "intra" frame where the energy
+is coded without reference to prior frames. An encoder is able to choose the
+mode used at will based on both loss robustness and efficiency
+considerations.
+The 2-D z-transform of
the prediction filter is: A(z_l, z_b)=(1-a*z_l^-1)*(1-z_b^-1)/(1-b*z_b^-1)
where b is the band index and l is the frame index. The prediction coefficients are
a=0.8 and b=0.7 when not using intra energy and a=b=0 when using intra energy.
+where b is the band index and l is the frame index. The prediction coefficients
+applied depend on the frame size in use when not using intra energy, and are
+a=0 and b=4915/32768 when using intra energy.
The timedomain prediction is based on the final fine quantization of the previous
frame, while the frequency domain (within the current frame) prediction is based
on coarse quantization only (because the fine quantization has not been computed
yet). We approximate the ideal
probability distribution of the prediction error using a Laplace distribution. The
+yet). The prediction is clamped internally so that fixed-point implementations with
+limited dynamic range do not suffer desynchronization. Identical prediction
+clamping must be implemented in all encoders and decoders.
+We approximate the ideal
+probability distribution of the prediction error using a Laplace distribution
+with separate parameters for each frame size in intra and inter-frame modes. The
coarse energy quantization is performed by quant_coarse_energy() and
quant_coarse_energy() (quant_bands.c).



The Laplace distribution for each band is defined by a 16-bit (Q15) decay parameter.
Thus, the value 0 has a frequency count of p[0]=2*(16384*(16384-decay)/(16384+decay)). The
values +/- i each have a frequency count p[i] = (p[i-1]*decay)>>14. The value of p[i] is always
rounded down (to avoid exceeding 32768 as the sum of all frequency counts), so it is possible
for the sum to be less than 32768. In that case additional values with a frequency count of 1 are encoded. The signed values corresponding to symbols 0, 1, 2, 3, 4, ...
are [0, +1, -1, +2, -2, ...]. The encoding of the Laplace-distributed values is
+quant_coarse_energy() (quant_bands.c). The encoding of the Laplacedistributed values is
implemented in ec_laplace_encode() (laplace.c).
+
@@ -1004,7 +1009,9 @@ If any bits are unused at the end of the encoding process, these bits are used t
increase the resolution of the fine energy encoding in some bands. Priority is given
to the bands for which the allocation () was rounded
down. At the same level of priority, lower bands are encoded first. Refinement bits
are added until there are no unused bits. This is implemented in quant_energy_finalise()
+are added until there is no more room for fine energy or until each band
+has gained an additional bit of precision or has the maximum fine
+energy precision. This is implemented in quant_energy_finalise()
(quant_bands.c).
@@ -1017,7 +1024,7 @@ are added until there are no unused bits. This is implemented in quant_energy_fi
Bit allocation is performed based only on information available to both
the encoder and decoder. The same calculations are performed in a bitexact
manner in both the encoder and decoder to ensure that the result is always
exactly the same. Any mismatch would cause an error in the decoded output.
+exactly the same. Any mismatch causes corruption of the decoded output.
The allocation is computed by compute_allocation() (rate.c),
which is used in both the encoder and the decoder.
@@ -1028,7 +1035,12 @@ bands each have a width of one Bark, this is equivalent to modeling the
masking occurring within each critical band, while ignoring interband
masking and tonevsnoise characteristics. While this is not an
optimal bit allocation, it provides good results without requiring the
transmission of any allocation information.
+transmission of any allocation information. Additionally, the encoder
+is able to signal alterations to the implicit allocation via
+two means: an entropy-coded tilt parameter, which can be used to tilt the
+allocation to favor low or high frequencies, and a boost parameter,
+which can be used to shift large amounts of additional precision into
+individual bands.
@@ -1037,48 +1049,38 @@ For every encoded or decoded frame, a target allocation must be computed
using the projected allocation. In the reference implementation this is
performed by compute_allocation() (rate.c).
The target computation begins by calculating the available space as the
number of whole bits which can be fit in the frame after Q1 is stored according
to the range coder (ec_[enc/dec]_tell()) and then multiplying by 8.
+number of eighth-bits which can be fit in the frame after Q1 is stored, according
+to the range coder (ec_tell_frac()), less one reserved eighth-bit.
Then the two projected prototype allocations whose sums multiplied by 8 are nearest
to that value are determined. These two projected prototype allocations are then interpolated
by finding the highest integer interpolation coefficient in the range 08
such that the sum of the higher prototype times the coefficient, plus the
sum of the lower prototype multiplied by
the difference of 16 and the coefficient, is less than or equal to the
available sixteenthbits.
The reference implementation performs this step using a binary search in
interp_bits2pulses() (rate.c). The target
allocation is the interpolation coefficient times the higher prototype, plus
the lower prototype multiplied by the difference of 16 and the coefficient,
for each of the CELT bands.
+by finding the highest integer interpolation coefficient in the range 0-63
+such that the sum of the higher prototype times the coefficient, plus the
+sum of the lower prototype multiplied by the difference of 64 and the
+coefficient, all divided by 64, is less than or equal to the
+available eighth-bits. During the interpolation a maximum allocation
+in each band is imposed along with a hard minimum allocation threshold for
+each band.
+Starting from the last coded band, a binary decision is coded for each
+band over the minimum threshold to determine if that band should instead
+receive only the minimum allocation. This process stops at the first
+non-minimum band, the first band to receive an explicitly coded boost,
+or the first band in the frame, whichever comes first.
+The reference implementation performs this step in interp_bits2pulses()
+(rate.c), using a binary search for the interpolation.
Because the computed target will sometimes be somewhat smaller than the
available space, the excess space is divided by the number of bands, and this amount
is added equally to each band. Any remaining space is added to the target one
sixteenthbit at a time, starting from the first band. The new target now
matches the available space, in sixteenthbits, exactly.
+is added equally to each band which was not forced to the minimum value.
The allocation target is separated into a portion used for fine energy
and a portion used for the Spherical Vector Quantizer (PVQ). The fine energy
quantizer operates in wholebit steps. For each band the number of bits per
channel used for fine energy is calculated by 50 minus the log2_frac(), with
1/16 bit precision, of the number of MDCT bins in the band. That result is multiplied
by the number of bins in the band and again by twice the number of
channels, and then the value is set to zero if it is less than zero. Added
to that result is 16 times the number of MDCT bins times the number of
channels, and it is finally divided by 32 times the number of MDCT bins times the
number of channels. If the result times the number of channels is greater than than the
target divided by 16, the result is set to the target divided by the number of
channels divided by 16. Then if the value is greater than 7 it is reset to 7 because a
larger amount of fine energy resolution was determined not to be make an improvement in
perceived quality. The resulting number of fine energy bits per channel is
then multiplied by the number of channels and then by 16, and subtracted
from the target allocation. This final target allocation is what is used for the
PVQ.
+quantizer operates in wholebit steps and is allocated based on an offset
+fraction of the total usable space. Excess bits above the maximums are
+left unallocated and placed into the rolling balance maintained during
+the quantization process.
@@ -1100,7 +1102,7 @@ all integer codevectors y of N dimensions that satisfy sum(abs(y(j))) = K.
In bands where neither pitch nor folding is used, the PVQ is used to encode
+In bands where sufficient bits are allocated, the PVQ is used to encode
the unit vector that results from the normalization in
directly. Given a PVQ codevector y,
the unit vector X is obtained as X = y/||y||, where ||.|| denotes the
@@ -1109,19 +1111,19 @@ L2 norm.
Although the allocation is performed in 1/16 bit units, the quantization requires
+Although the allocation is performed in 1/8th bit units, the quantization requires
an integer number of pulses K. To do this, the encoder searches for the value
of K that produces the number of bits that is the nearest to the allocated value
(rounding down if exactly halfway between two values), subject to not exceeding
the total number of bits available. The computation is performed in 1/16 of
bits using log2_frac() and ec_enc_tell(). The number of codebooks entries can
be computed as explained in . The difference
+the total number of bits available. For efficiency reasons the search is performed against a
+precomputed allocation table which only permits some K values for each N. The number of
+codebook entries can be computed as explained in . The difference
between the number of bits allocated and the number of bits used is accumulated to a
balance (initialised to zero) that helps adjusting the
allocation for the next bands. One third of the balance is subtracted from the
bit allocation of the next band to help achieving the target allocation. The only
+allocation for the next bands. One third of the balance is applied to the
+bit allocation of each band to help achieve the target allocation. The only
exceptions are the band before the last and the last band, for which half the balance
and the whole balance are subtracted, respectively.
+and the whole balance are applied, respectively.
@@ -1179,12 +1181,13 @@ they are equivalent to the mathematical definition.
The indexing computations are performed using 32bit unsigned integers. For large codebooks,
32bit integers are not sufficient. Instead of using 64bit integers (or more), the encoding
is made slightly suboptimal by splitting each band into two equal (or nearequal) vectors of
size (N+1)/2 and N/2, respectively. The number of pulses in the first half, K1, is first encoded as an
integer in the range [0,K]. Then, two codebooks are encoded with V((N+1)/2, K1) and V(N/2, KK1).
The split operation is performed recursively, in case one (or both) of the split vectors
still requires more than 32 bits. For compatibility reasons, the handling of codebooks of more
than 32 bits MUST be implemented with the splitting method, even if 64bit arithmetic is available.
+for these cases is handled by splitting each band into two equal vectors of
+size N/2 prior to quantization. A quantized gain parameter with precision
+derived from the current allocation is entropy coded to represent the relative gains of each side of
+the split, and the entire quantization process is recursively applied.
+Multiple levels of splitting may be applied up to a frame-size-dependent limit.
+The same recursive mechanism is applied for the joint coding of stereo
+audio.
@@ -1193,7 +1196,8 @@ than 32 bits MUST be implemented with the splitting method, even if 64bit arith
When encoding a stereo stream, some parameters are shared across the left and right channels, while others are transmitted separately for each channel, or jointly encoded. Only one copy of the flags for the features, transients and pitch (pitch period and gains) are transmitted. The coarse and fine energy parameters are transmitted separately for each channel. Both the coarse energy and fine energy (including the remaining fine bits at the end of the stream) have the left and right bands interleaved in the stream, with the left band encoded first.
+When encoding a stereo stream, some parameters are shared across the left and right channels, while others are transmitted separately for each channel, or jointly encoded. Only one copy of the flags for the features, transients and pitch (pitch
+period and filter parameters) are transmitted. The coarse and fine energy parameters are transmitted separately for each channel. Both the coarse energy and fine energy (including the remaining fine bits at the end of the stream) have the left and right bands interleaved in the stream, with the left band encoded first.
@@ -1201,35 +1205,22 @@ The main difference between mono and stereo coding is the PVQ coding of the norm
From M and S, an angular parameter theta=2/pi*atan2(S, M) is computed. The theta parameter is converted to a Q14 fixedpoint parameter itheta, which is quantized on a scale from 0 to 1 with an interval of 2^qb, where qb = (b2*(N1)*(40log2_frac(N,4)))/(32*(N1)), b is the number of bits allocated to the band, and log2_frac() is defined in cwrs.c. From here on, the value of itheta MUST be treated in a bitexact manner since
both the encoder and decoder rely on it to infer the bit allocation.
+From M and S, an angular parameter theta=2/pi*atan2(S, M) is computed. The theta parameter is converted to a Q14 fixed-point parameter itheta, which is quantized on a scale from 0 to 1 with an interval of 2^-qb, where qb is
+based on the number of bits allocated to the band. From here on, the value of itheta MUST be treated in a bit-exact manner since both the encoder and decoder rely on it to infer the bit allocation.
Let m=M/||M|| and s=S/||S||; m and s are separately encoded with the PVQ encoder described in . The number of bits allocated to m and s depends on the value of itheta. The number of bits allocated to coding m is obtained by:




imid = bitexact_cos(itheta);
iside = bitexact_cos(16384-itheta);
delta = (N-1)*(log2_frac(iside,6)-log2_frac(imid,6))>>2;
qalloc = log2_frac((1<<qb)+1,4);
mbits = (b-qalloc/2-delta)/2;

+Let m=M/||M|| and s=S/||S||; m and s are separately encoded with the PVQ encoder described in . The number of bits allocated to m and s depends on the value of itheta.
where bitexact_cos() is a fixedpoint cosine approximation that MUST be bitexact with the reference implementation
in mathops.h. The spectral folding operation is performed independently for the mid and side vectors.
After all the quantization is completed, the quantized energy is used along with the
quantized normalized band data to resynthesize the MDCT spectrum. The inverse MDCT () and the weighted overlapadd are applied and the signal is stored in the synthesis buffer so it can be used for pitch prediction.
The encoder MAY omit this step of the processing if it knows that it will not be using
the pitch predictor for the next few frames. If the deemphasis filter () is applied to this resynthesized
signal, then the output will be the same (within numerical precision) as the decoder's output.
+quantized normalized band data to resynthesize the MDCT spectrum. The inverse MDCT () and the weighted overlapadd are applied and the signal is stored in the synthesis
+buffer.
+The encoder MAY omit this step of the processing if it does not need the decoded output.
@@ -1604,9 +1595,9 @@ the latter shall take precedence.
Compliance with this specification means that a decoder's output MUST be
close enough to the output of the reference
implementation. This is measured using the opus_compare.m tool provided in
Appendix .
+within the specified thresholds of the reference implementation's output,
+as measured by the opus_compare.m tool in Appendix .
@@ -1626,11 +1617,12 @@ allow an attacker to attack transcoding gateways.
The reference implementation contains no known buffer overflow or cases where
a specially crafter packet or audio segment could cause a significant increase
in CPU load. However, on certain CPU architectures where denormalized
floatingpoint operations result and handled through exceptions, it is possible
for some audio content (e.g. silence or nearsilence) to cause such an increase
+floating-point operations are much slower, it is possible for some audio content
+(e.g. silence or near-silence) to cause such an increase
in CPU load. For such architectures, it is RECOMMENDED to add very small
floatingpoint offsets to prevent significant numbers of denormalized
operations. No such issue exists for the fixedpoint reference implementation.
+operations or to configure the hardware to flush denormal numbers to zero.
+No such issue exists for the fixed-point reference implementation.

2.11.0
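
Reviewer note: the interpolation between the two projected prototype allocations described in the target-allocation hunk can be sketched as a binary search over the coefficient range 0-63. This is a rough illustration under assumptions, not the code in interp_bits2pulses(): the function names are hypothetical, the blend is assumed to be (lower*(64-c) + higher*c)/64 per band with floor rounding, each band of the higher prototype is assumed to be at least as large as in the lower one, and the per-band maximum and minimum thresholds are left out.

```python
ALLOC_STEPS = 64  # coefficient range 0-63, as stated in the text


def interpolated_total(lower, higher, coeff):
    """Sum over bands of the blended allocation, in eighth-bits.

    Assumed blend: (lower*(64-coeff) + higher*coeff) / 64, rounded down.
    """
    return sum(
        (lo * (ALLOC_STEPS - coeff) + hi * coeff) // ALLOC_STEPS
        for lo, hi in zip(lower, higher)
    )


def highest_fitting_coefficient(lower, higher, available):
    """Return the largest coefficient in 0-63 whose total fits in `available`.

    Relies on the total being non-decreasing in the coefficient, which
    holds when higher[i] >= lower[i] for every band (an assumption here).
    """
    if interpolated_total(lower, higher, 0) > available:
        raise ValueError("even the lower prototype does not fit")
    fits, too_big = 0, ALLOC_STEPS  # invariant: `fits` always fits
    while too_big - fits > 1:
        mid = (fits + too_big) // 2
        if interpolated_total(lower, higher, mid) <= available:
            fits = mid
        else:
            too_big = mid
    return fits


if __name__ == "__main__":
    # Two bands; prototypes of 8 and 72 eighth-bits per band, 80 available.
    print(highest_fitting_coefficient([8, 8], [72, 72], 80))
```

With these toy prototypes each coefficient step adds exactly one eighth-bit per band, so the search settles on the coefficient at which the two-band total reaches the available space.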