NORTHWESTERN UNIVERSITY

ON THE ENCODING OF THE ANCHOR FRAME IN VIDEO CODING

A THESIS SUBMITTED TO THE GRADUATE SCHOOL IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
for the degree MASTER OF SCIENCE
Field of Electrical Engineering

By Lisimachos Paul Kondi

EVANSTON, ILLINOIS
December 1996
TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
1 Introduction
2 An Overview of the H.263 Standard
  2.1 Introduction
  2.2 The Source Coder
    2.2.1 Source Format
    2.2.2 Video Source Coding Algorithm
    2.2.3 Prediction
    2.2.4 Motion Compensation
    2.2.5 Block Transformation
    2.2.6 Quantization
    2.2.7 The Layering Structure
  2.3 The Source Decoder
    2.3.1 Motion Compensation
    2.3.2 Inverse Quantization
    2.3.3 Inverse Discrete Cosine Transform
    2.3.4 Reconstruction of Blocks
  2.4 Optional Negotiable Coding Options
    2.4.1 Unrestricted Motion Vector Mode
    2.4.2 Syntax-based Arithmetic Coding Mode
    2.4.3 Advanced Prediction Mode
    2.4.4 PB-frames Mode
  2.5 Elements of the Coding Algorithm that are not Part of Recommendation H.263
    2.5.1 Motion Estimation and Mode Selection
      Integer Pixel Motion Estimation
      INTRA/INTER Mode Selection
      Half Pixel Search
      16x16 / 8x8 Vector Decision
    2.5.2 Rate Control and Buffer Regulation
3 An Introduction to Wavelet Theory
  3.1 Orthonormal Wavelets
  3.2 Biorthogonal Wavelets
  3.3 Extension to the Two-Dimensional Case
4 A Progressive Method of Transmitting the Anchor Frame
  4.1 Introduction
  4.2 Encoding of the Wavelet Coefficients Using Zerotrees
  4.3 Progressive Transmission
  4.4 Extension of the Algorithm for Color Images
5 Approaches of Determining the Best Stopping Point in the Coding of the Anchor Frame
  5.1 Formulation of the Problem
  5.2 Past Values of SNR1(n) and SNR2(n) are Known
    5.2.1 A Linear Model of Predicting the Next Value of SNR2(n)
    5.2.2 Experimental Results
  5.3 SNR2(n) is not Known
    5.3.1 Biharmonic Spline Interpolation in One Dimension
    5.3.2 Biharmonic Spline Interpolation in 2 or More Dimensions
    5.3.3 Experimental Results
6 Conclusions
BIBLIOGRAPHY
LIST OF FIGURES

2.1 Outline block diagram of the video codec
2.2 Source Coder
2.3 Half pixel prediction by bilinear interpolation
3.1 Equivalent block diagram for Equations (3-10) and (3-11)
5.1 SNR1(n)
5.2 SNR2(n)
5.3 Actual and predicted SNR2(n) for N=5
5.4 Actual and predicted SNR2(n) for N=1
5.5 SNR1(n)
5.6 SNR2(n)
5.7 Actual and predicted SNR2(n) for N=5
5.8 Actual and predicted SNR2(n) for N=1
5.9 The biharmonic function w(x) is found by applying point forces to a thin elastic beam or spline
5.10 Actual values of SNR2(n)
5.11 Estimated values of SNR2(n)
5.12 Actual values of SNR2(n)
5.13 Estimated values of SNR2(n)
5.14 Actual values of SNR2(n)
5.15 Estimated values of SNR2(n)
LIST OF TABLES

2.1 Number of pixels per line and number of lines for each of the H.263 picture formats
5.1 Biharmonic Green Functions
In the approaches known today for low bit rate video coding, no particular attention is given to the effect of the number of bits allocated to the first intra (anchor) frame of the image sequence on the overall quality of the reconstructed sequence. A fixed Quantization Parameter (QP) or a fixed number of bits is used for the anchor frame. However, this approach is certainly not optimal. Assuming that we are using a constant bit rate channel, the number of bits used for the anchor frame corresponds to a certain time delay. For example, if the bit rate of the channel is R bits per second and we use r bits for the anchor frame, the transmission of that frame will take r/R seconds, and we will be able to code the next frame only after this time has passed. Any frames that become available to the coder while the anchor frame is being transmitted are discarded. The next frame to be coded will be the one that arrived on or just before the time when the transmission of the anchor frame ended. Clearly, there should be an optimal time to be allotted to the transmission of the anchor frame. Of course, this time will not be the same for all image sequences. If we spend many bits on the intra frame, we will get a better quality reconstructed intra frame. In principle, better quality of the reconstructed intra frame will lead to better quality of the next reconstructed inter frame, as long as the time delay is not so great that it causes the correlation between the two frames to become too low. Thus, the time used for the intra frame should be large enough to yield a good quality
reconstructed frame, but not so large that the first frame and the next frame to be coded are uncorrelated.

As mentioned earlier, we are primarily interested in low bit rate video coding. Thus, we assume that only one intra frame, the first frame of the sequence, is sent, and all subsequent frames are coded using prediction (inter frames). To make the decision on when we should stop coding the intra frame and start coding the next inter frame, we need some data on the image sequence. Thus, this decision has to be made in real time, after the coding of the intra frame has started and some more frames are available to the coder. This leads us to modify the H.263 coder and use a progressive method to code the intra frame, so that we are able to stop coding it at any time. The method we use here is the Embedded Zerotree Wavelet method. Here, we assume that our channel is an error-free constant bit rate channel, and our discussion is applicable to this model. There are other channel models used for video coding, such as variable bit rate channel models, which are suitable for Asynchronous Transfer Mode (ATM) networks, as described in [1].

In this thesis, we first give an overview of the H.263 standard, which has been proposed for low bit rate video coding. Then, a review of wavelet theory and its application to image coding is given, since the Embedded Zerotree Wavelet method is based on wavelet theory. In the next chapter, the Embedded Zerotree Wavelet method is described, which is used for the progressive transmission of the anchor frame. In the final chapter, methods of determining the best point at which we should stop the transmission of the anchor frame and start coding the rest of the sequence are introduced.
An Overview of the H.263 Standard
2.1 Introduction

Presently, video coding standardization activities are proceeding at a very fast pace. Both ISO-MPEG (International Standards Organization - Motion Picture Experts Group) and ITU-T (International Telecommunication Union - Telecommunication Sector, formerly CCITT) are working continuously and collaboratively to set standards for video communications. There are different video standards which are intended for use with various video services at different channel bit rates. For example, ITU-T H.261 is a standard for audiovisual services at p x 64 kb/s, ISO MPEG-1 is for digital storage media at 1.5 Mb/s, and ISO MPEG-2 is for video transmission at bit rates greater than 1.5 Mb/s [5]. Currently, the standardization work is focused on the transmission of very low bit rate video. There is a divergence of interest between ITU-T and ISO-MPEG in the standardization of a video coding algorithm for very low bit rates. ISO-MPEG aims at defining a generic video coding standard, known as MPEG-4, to satisfy a set of functionalities which may or may not be operating at very low bit rates. ITU-T, on the other hand, is quite specific about defining a standard for a very low bit rate (VLBR) video coder, and thus has come up with H.263, a draft near-term video coding standard for the transmission of video at less than 64 kb/s [5].
Figure 2.1. Outline block diagram of the video codec
H.263 is related to Recommendation H.261 and to the MPEG-1 and MPEG-2 video coding standards, but the emphasis is on very low bit rate coding. It uses 16x16 macroblocks, 8x8 subblocks, motion estimation and compensation, DCT transformation of prediction errors, run-length coding and variable length codewords. A hybrid of inter-picture prediction to utilize temporal redundancy and transform coding of the remaining signal to reduce spatial redundancy is used. The decoder has motion compensation capability, allowing optional incorporation of this technique in the coder. Half pixel precision is used for the motion compensation, as opposed to Recommendation H.261, where full pixel prediction and a loop filter are used. Finally, variable length coding is used for the symbols to be transmitted. Figure 2.1 is an outline block diagram of the video coder and decoder.
In addition to the core H.263 coding algorithm, there are four negotiable coding options that will be described later. These options, which can be used together or separately, are:

- Unrestricted Motion Vector mode
- Syntax-based Arithmetic Coding mode
- Advanced Prediction mode
- PB-frames mode
2.2 The Source Coder

2.2.1 Source Format
Pictures are coded as one luminance and two color difference components (Y, CB and CR).

There are five standardized picture formats: sub-QCIF, QCIF, CIF, 4CIF and 16CIF. For each of these picture formats, the luminance sampling structure is dx pixels per line and dy lines per picture, orthogonal. The sampling structure of each of the two color difference components is dx/2 pixels per line and dy/2 lines per picture, orthogonal. The values of dx, dy, dx/2 and dy/2 for each of the picture formats are given in Table 2.1.
Picture Format    dx     dy    dx/2   dy/2
sub-QCIF         128     96     64     48
QCIF             176    144     88     72
CIF              352    288    176    144
4CIF             704    576    352    288
16CIF           1408   1152    704    576

Table 2.1. Number of pixels per line and number of lines for each of the H.263 picture formats

2.2.2 Video Source Coding Algorithm
The source coder is shown in generalized form in Figure 2.2. The main elements are:

- Prediction
- Block transformation, and
- Quantization
2.2.3 Prediction

The prediction is inter-picture and may be augmented by motion compensation. The coding mode in which prediction is applied is called INTER. The coding mode in which no prediction is applied is called INTRA. The INTRA coding mode can be signalled at the picture level (INTRA for I-pictures or INTER for P-pictures) or at the macroblock level in P-pictures. In the optional PB-frames mode, B-pictures are always coded in INTER mode. The B-pictures are partly predicted bidirectionally.

Figure 2.2. Source Coder
2.2.4 Motion Compensation
The decoder will accept one vector per macroblock or, if the Advanced Prediction mode is used, one or four vectors per macroblock. If the PB-frames mode is used, one additional delta vector can be transmitted per macroblock for adaptation of the motion vectors for prediction of the B-macroblock. The motion vectors are coded differentially: to obtain each motion vector at the decoder, a vector difference is added to a predictor vector. More about the coding of the motion vectors will be described later. A positive value of the horizontal or vertical component of a motion vector signifies that the prediction is formed from pixels in the referenced picture that are spatially to the right of or below the pixels being predicted. Both the horizontal and vertical components of the motion vectors have integer or half-integer values. In the default prediction mode, these values are restricted to the range [-16, 15.5]. In the Unrestricted Motion Vector mode, the maximum range for vector components is [-31.5, 31.5], with the restriction that only values which are within a range of [-16, 15.5] around the predictor for each motion vector component can be reached if the predictor is in the range [-15.5, 16]. If the predictor is outside [-15.5, 16], all values within the range [-31.5, 31.5] with the same sign as the predictor, plus the zero value, can be reached.
2.2.5 Block Transformation
A separable two-dimensional Discrete Cosine Transform (DCT) of size 8x8 is used for each block:

F(u,v) = (1/4) C(u) C(v) \sum_{x=0}^{7} \sum_{y=0}^{7} f(x,y) cos(\pi(2x+1)u/16) cos(\pi(2y+1)v/16)

with u, v, x, y = 0, 1, 2, ..., 7, where

C(u) = 1/\sqrt{2} if u = 0, 1 otherwise
C(v) = 1/\sqrt{2} if v = 0, 1 otherwise

and x, y are the spatial coordinates in the pixel domain, and u, v are the coordinates in the transform domain.
For INTRA blocks, the Discrete Cosine Transform of the original sampled values of Y, CB and CR is taken. For INTER blocks, the Discrete Cosine Transform of the difference between the original sampled values and the prediction from the motion compensation is taken.
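As an illustrative sketch (not part of the standard text), the 8x8 DCT above can be written directly from the formula. A real encoder would use a fast factorization, but the direct form makes the equation concrete:

```python
import math

def dct_8x8(block):
    """Direct 8x8 DCT: F(u,v) = (1/4) C(u) C(v) sum_x sum_y f(x,y) cos(...) cos(...)."""
    def c(k):
        # C(k) = 1/sqrt(2) for k = 0, and 1 otherwise.
        return 1.0 / math.sqrt(2.0) if k == 0 else 1.0
    out = [[0.0] * 8 for _ in range(8)]
    for u in range(8):
        for v in range(8):
            s = 0.0
            for x in range(8):
                for y in range(8):
                    s += (block[x][y]
                          * math.cos(math.pi * (2 * x + 1) * u / 16)
                          * math.cos(math.pi * (2 * y + 1) * v / 16))
            out[u][v] = 0.25 * c(u) * c(v) * s
    return out

# For a constant block, only the DC coefficient F(0,0) is nonzero:
coeffs = dct_8x8([[10] * 8 for _ in range(8)])
```

For a block of constant value 10, the DC coefficient is (1/4)(1/2)(64 x 10) = 80 and all AC coefficients vanish, which is a quick sanity check of the normalization.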
2.2.6 Quantization

There is one quantizer for the first coefficient of INTRA blocks and 31 quantizers for all other coefficients. Within a macroblock, the same quantizer is used for all coefficients except the first one of INTRA blocks. The first coefficient of INTRA blocks is nominally the transform DC value, uniformly quantized with a step size of 8. Each of the other 31 quantizers uses equally spaced reconstruction levels with a central dead-zone around zero and a step size in the range 2 to 62. The Quantization Parameter (QP) is normally used to specify the quantizer; QP may take integer values from 1 to 31, and the quantization step size is then 2xQP. The following definitions of divisions are made: "/" denotes integer division with truncation towards zero, and "//" denotes integer division with rounding to the nearest integer (examples: 3//2 = 2, -3//2 = -2). Thus, if COF is a transform coefficient to be quantized and LEVEL is the absolute value of the quantized version of the transform coefficient, the quantization is done as follows. For INTRA blocks, except for the DC coefficient:
LEVEL = |COF| / (2xQP)

For INTER blocks (all coefficients, including the DC):

LEVEL = (|COF| - QP/2) / (2xQP)

For the DC coefficient of an INTRA block:

LEVEL = COF // 8
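These quantizer rules can be sketched as follows (the helper names are ours, not from the standard; "/" truncates towards zero and "//" rounds to the nearest integer with halves away from zero, as in the examples above):

```python
import math

def div_trunc(a, b):
    # "/" in the text: integer division with truncation towards zero.
    return int(a / b)

def div_round(a, b):
    # "//" in the text: rounding to the nearest integer, with halves
    # rounded away from zero (3//2 = 2, -3//2 = -2).
    r = math.floor(abs(a / b) + 0.5)
    return r if a * b >= 0 else -r

def quantize(cof, qp, intra, dc):
    """Return LEVEL, the magnitude of the quantized coefficient."""
    if intra and dc:
        return div_round(cof, 8)                      # INTRA DC: step size 8
    if intra:
        return div_trunc(abs(cof), 2 * qp)            # INTRA AC
    return div_trunc(abs(cof) - div_trunc(qp, 2), 2 * qp)  # INTER: dead-zone
```

For example, with QP = 10 a coefficient of magnitude 100 quantizes to level 5 in an INTRA block but to level 4 in an INTER block, because the INTER rule subtracts QP/2 before dividing (this is the dead-zone).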
2.2.7 The Layering Structure
The video multiplex is arranged in a hierarchical structure with four layers. From top to bottom, the layers are: Picture, Group of Blocks, Macroblock, and Block. Each picture is divided into groups of blocks (GOBs). A group of blocks (GOB) comprises kx16 lines, depending on the picture format: k = 1 for sub-QCIF, QCIF and CIF, k = 2 for 4CIF, and k = 4 for 16CIF. Each GOB is divided into macroblocks. A macroblock relates to 16 pixels by 16 lines of Y and the spatially corresponding 8 pixels by 8 lines of CB and CR. A macroblock consists of four luminance blocks and the two spatially corresponding color difference blocks. Each luminance or chrominance block relates to 8 pixels by 8 lines of Y, CB or CR.
A Group of Blocks (GOB) comprises one macroblock row for sub-QCIF, QCIF and CIF, two macroblock rows for 4CIF, and four macroblock rows for 16CIF.
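Putting the dimensions of Table 2.1 together with the layering above, the number of macroblocks and GOBs per picture follows directly (an illustrative sketch; the helper names are ours):

```python
# Luminance dimensions (dx, dy) from Table 2.1 and the GOB height
# factor k described above (each GOB covers k*16 luminance lines).
FORMATS = {
    "sub-QCIF": (128, 96, 1),
    "QCIF": (176, 144, 1),
    "CIF": (352, 288, 1),
    "4CIF": (704, 576, 2),
    "16CIF": (1408, 1152, 4),
}

def macroblocks_per_picture(dx, dy):
    # Each macroblock covers 16 pixels by 16 lines of luminance.
    return (dx // 16) * (dy // 16)

def gobs_per_picture(dy, k):
    # Each GOB covers k*16 luminance lines.
    return dy // (16 * k)
```

For QCIF, for instance, this gives 11 x 9 = 99 macroblocks arranged in 9 GOBs of one macroblock row each, consistent with the description above.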
2.3 The Source Decoder
The following procedures are performed at the source decoder:

- Motion compensation,
- Inverse quantization,
- Inverse Discrete Cosine Transform, and
- Reconstruction of blocks.
2.3.1 Motion Compensation
The motion vector for each macroblock is obtained by adding predictors to the vector differences. In the case of one vector per macroblock (i.e., when the Advanced Prediction mode is not used), the candidate predictors are taken from three surrounding macroblocks. Also, half pixel values are found using bilinear interpolation.
2.3.2 Inverse Quantization
It should be pointed out that the term "inverse quantization" does not imply that the quantization process is invertible; the quantization operation is clearly not invertible. The term simply refers to the process of obtaining the reconstructed transform coefficient from the transmitted quantization level. Thus, if COF' is the reconstructed transform coefficient and LEVEL is the absolute value of the quantized version of the transform coefficient, the inverse quantization is done as follows. For INTER blocks and INTRA blocks, except for the DC coefficient:
|COF'| = 0                       if LEVEL = 0
|COF'| = 2xQPxLEVEL + QP         if LEVEL != 0 and QP is odd
|COF'| = 2xQPxLEVEL + QP - 1     if LEVEL != 0 and QP is even

The sign of COF is then added to obtain COF':

COF' = Sign(COF) x |COF'|
For the DC coefficient of an INTRA block:

COF' = LEVEL x 8
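The reconstruction rule can be sketched as follows (our own helper, following the formulas above), showing how a transmitted level maps back to a coefficient:

```python
def dequantize(level, qp, sign=1, intra_dc=False):
    """Reconstruct COF' from LEVEL per the inverse quantization rules."""
    if intra_dc:
        return level * 8              # INTRA DC: uniform step size 8
    if level == 0:
        return 0
    # Reconstruction magnitude: 2*QP*LEVEL + QP (QP odd) or + QP - 1 (QP even).
    mag = 2 * qp * level + (qp if qp % 2 == 1 else qp - 1)
    return sign * mag
```

For example, LEVEL = 4 with QP = 10 (even) reconstructs to 2x10x4 + 9 = 89; the reconstruction level sits inside the quantization interval rather than at its lower edge, which reduces the average reconstruction error of the dead-zone quantizer.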
2.3.3 Inverse Discrete Cosine Transform
After inverse quantization, the resulting 8x8 blocks are processed by a separable two-dimensional inverse Discrete Cosine Transform (IDCT) of size 8x8. The output from the inverse transform ranges from -256 to 255 after clipping, so that it can be represented with 9 bits.
The inverse DCT is given by the following equation:

f(x,y) = (1/4) \sum_{u=0}^{7} \sum_{v=0}^{7} C(u) C(v) F(u,v) cos(\pi(2x+1)u/16) cos(\pi(2y+1)v/16)

with u, v, x, y = 0, 1, 2, ..., 7, where

C(u) = 1/\sqrt{2} if u = 0, 1 otherwise
C(v) = 1/\sqrt{2} if v = 0, 1 otherwise

and x, y are the spatial coordinates in the pixel domain, and u, v are the coordinates in the transform domain.
2.3.4 Reconstruction of Blocks
After the inverse Discrete Cosine Transform, a reconstruction is formed for each luminance and chrominance block. For INTRA blocks, the reconstruction is equal to the result of the inverse transformation. For INTER blocks, the reconstruction is formed by summing the prediction and the result of the inverse transformation. The summation is performed on a pixel basis.
2.4 Optional Negotiable Coding Options

2.4.1 Unrestricted Motion Vector Mode
In the default prediction mode of H.263, motion vectors are restricted such that all pixels referenced by them are within the coded picture area. In the Unrestricted Motion Vector mode, however, this restriction is lifted and motion vectors are allowed to point outside the picture. When a pixel referenced by a motion vector lies outside the coded picture area, an edge pixel is used instead. The edge pixel is found by limiting the motion vector to the last full pixel position inside the coded picture area. Limitation of the motion vector is performed on a pixel basis and separately for each component of the motion vector.
2.4.2 Syntax-based Arithmetic Coding Mode
In the Syntax-based Arithmetic Coding mode, arithmetic coding is used instead of variable length coding. The SNR and reconstructed pictures will be the same as if variable length coding were used, but significantly fewer bits will be produced.
2.4.3 Advanced Prediction Mode
In H.263, one motion vector per macroblock is used except in the Advanced Prediction mode. In this mode, either four 8x8 vectors or one 16x16 vector is transmitted for each macroblock. The encoder has to decide which type of vectors to use; H.263 does not specify how the decision is made. Four vectors use more bits, but give better prediction.
2.4.4 PB-frames Mode
A PB-frame consists of two pictures coded as one unit. Its name comes from the names of the picture types in Recommendation H.262, where there are P-pictures and B-pictures. Thus, a PB-frame consists of one P-picture and one B-picture. The P-picture is predicted from the previous decoded P-picture, and the B-picture is predicted from both the previous decoded P-picture and the P-picture currently being decoded. The name "B-picture" was chosen because parts of B-pictures may be bidirectionally predicted from both past and future pictures. With this coding option, the picture rate can be increased considerably without increasing the bit rate much.
2.5 Elements of the Coding Algorithm that are not Part of Recommendation H.263
The H.263 standard essentially describes the bit stream and the decoder. Certain elements, such as the mode selections, the way the motion vectors are obtained, and the buffer regulation and rate control, are not described by the H.263 standard. Here, we describe these elements as they are done in our implementation, which is based on an implementation by Telenor Research [6].
2.5.1 Motion Estimation and Mode Selection
Motion estimation is performed on the luminance only. The Sum of Absolute Differences (SAD) is used as the error measure.
Integer Pixel Motion Estimation.
As mentioned earlier, in the core H.263, the block size for vectors is 16x16; 8x8 vectors can be used as an option (Advanced Prediction mode). The search for the best motion vector is made with integer pixel displacement and for the Y (luminance) component. The comparisons are made between the incoming block and the displaced block in the previous original picture. A full search is used, and the search area is up to 15 pixels in the horizontal and vertical directions around the original macroblock position. The error measure is given by the equation:
SAD_N(x,y) = \sum_{i=1}^{N} \sum_{j=1}^{N} |original - previous|

for -15 <= x, y <= 15 and N = 16 or 8, depending on whether the vectors are 16x16 or 8x8.

For the 16x16 vectors, and in order to favor the zero vector, SAD_16(0,0) is reduced by 100. Thus, if there is no significant difference between the SAD for the zero vector and that of some other vector, the zero vector is preferred.
SAD_16(0,0) = SAD_16(0,0) - 100
The (x,y) pair that corresponds to the lowest SAD_16 is chosen as the 16x16 integer pixel motion vector, V0. Likewise, if the Advanced Prediction mode is used, the (x,y) pairs resulting in the lowest SAD_8 are chosen to give the four 8x8 vectors, V1, V2, V3 and V4. Then, the SAD for the macroblock will be the sum of the four SAD_8:

SAD_4x8 = SAD_8(V1) + SAD_8(V2) + SAD_8(V3) + SAD_8(V4)
Then, the total SAD_inter, used in the next section in the mode decision, is as follows. If only 16x16 prediction is used:

SAD_inter = SAD_16(x,y)

If the Advanced Prediction mode is used:

SAD_inter = min(SAD_16(x,y), SAD_4x8)
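The integer pixel full search described above can be sketched as follows (a simplified illustration with our own function names; a real encoder also handles picture borders and may use faster search patterns):

```python
def sad(cur, prev, bx, by, dx, dy, n):
    # Sum of absolute differences between the n x n block of the current
    # picture at (bx, by) and the block displaced by (dx, dy) in the
    # previous picture.
    total = 0
    for i in range(n):
        for j in range(n):
            total += abs(cur[by + i][bx + j] - prev[by + dy + i][bx + dx + j])
    return total

def full_search(cur, prev, bx, by, n=16, search=15):
    # Full search over integer displacements up to +/-15; the zero
    # vector's SAD is reduced by 100 to favor it, as in the text.
    best_vec = (0, 0)
    best_sad = sad(cur, prev, bx, by, 0, 0, n) - 100
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            if (dx, dy) == (0, 0):
                continue
            if not (0 <= by + dy and by + dy + n <= len(prev)
                    and 0 <= bx + dx and bx + dx + n <= len(prev[0])):
                continue  # keep the displaced block inside the picture
            s = sad(cur, prev, bx, by, dx, dy, n)
            if s < best_sad:
                best_vec, best_sad = (dx, dy), s
    return best_vec, best_sad
```

Note that because of the -100 bonus, a nonzero vector must beat the zero vector by more than 100 SAD units to be selected, exactly as the text intends.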
INTRA/INTER Mode Selection.
After the integer pixel motion estimation, the coder decides whether to use INTRA or INTER mode for each macroblock. The following parameters have to be calculated first:
MB_mean = (1/256) \sum_{i=1}^{16} \sum_{j=1}^{16} original

A = \sum_{i=1}^{16} \sum_{j=1}^{16} |original - MB_mean|

The coder chooses INTRA mode for the specific macroblock if:

A < SAD_inter - 500
Figure 2.3. Half pixel prediction by bilinear interpolation
Note that if SAD_16(0,0) is used, this is the value that has already been reduced by 100. If the coder decides to choose INTRA mode, no further operations are necessary. However, if INTER mode is chosen, the motion search continues with a half pixel search around the V0 position.
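The mode decision can be sketched as follows (an illustrative helper of ours, using the quantities defined above):

```python
def choose_mode(block, sad_inter):
    # block: 16x16 list of luminance samples of the current macroblock.
    mb_mean = sum(sum(row) for row in block) / 256.0
    # A measures the activity of the macroblock around its mean.
    a = sum(abs(p - mb_mean) for row in block for p in row)
    # INTRA is chosen only when it is clearly cheaper than prediction.
    return "INTRA" if a < sad_inter - 500 else "INTER"
```

Intuitively, A is small for flat macroblocks, which are cheap to code in INTRA mode, while a small SAD_inter indicates that motion-compensated prediction already works well; the -500 margin biases the choice towards INTER.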
Half Pixel Search.
Half pixel search is performed for 16x16 vectors, as well as for 8x8 vectors if the Advanced Prediction mode is chosen. The search is performed on the luminance component of the macroblock, and the search area is half a pixel around the target pixel pointed to by V0, V1, V2, V3 or V4. As in the integer search, SAD_16(0,0) is reduced by 100. The half pixel values are found using bilinear interpolation, as shown in Figure 2.3.
The way the interpolation is done is defined in the H.263 standard, so that the decoder has a unique way of decoding the picture. The vector that results in the best match (i.e., lowest SAD) during the half pixel search is named MV. MV consists of horizontal and vertical components (MVx, MVy), both measured in half pixel units. As mentioned earlier, in the default prediction mode, MVx and MVy are in the range [-16, 15.5], whereas in the Unrestricted Motion Vector mode, they are in the range [-31.5, 31.5].
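The bilinear rule of Figure 2.3 can be sketched as follows (a simplified illustration of ours: half pixel samples are averages of the surrounding integer samples with upward rounding; consult the standard for the normative rounding text):

```python
def half_pixel(prev, x2, y2):
    # prev: integer-pel luminance samples; (x2, y2) are coordinates in
    # half pixel units, so even values fall on integer positions.
    x, y = x2 // 2, y2 // 2
    hx, hy = x2 % 2, y2 % 2
    if hx == 0 and hy == 0:
        return prev[y][x]                         # integer position
    if hy == 0:                                    # horizontal half position
        return (prev[y][x] + prev[y][x + 1] + 1) // 2
    if hx == 0:                                    # vertical half position
        return (prev[y][x] + prev[y + 1][x] + 1) // 2
    return (prev[y][x] + prev[y][x + 1]            # central half position
            + prev[y + 1][x] + prev[y + 1][x + 1] + 2) // 4
```

The +1 and +2 offsets implement rounding before the truncating division, so that, e.g., the midpoint of samples 0 and 10 reconstructs as 5 rather than 4.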
16x16 / 8x8 Vector Decision.
This section applies only if the Advanced Prediction mode is used. Let SAD_16(x,y) be the error measure for the best half pixel 16x16 vector, including the subtraction of 100 if the vector is (0,0). Also, SAD_4x8 is given by Eq. (2-11), with the exception that the SAD_8(.) are those that result from the new vectors with the half pixel refinement. Then, the criterion is as follows. If

SAD_4x8 < SAD_16 - 100

then the 8x8 vectors are chosen to be transmitted; otherwise, the 16x16 vector is transmitted.
2.5.2 Rate Control and Buffer Regulation
The H.263 standard does not specify a way in which the rate control will be performed in the coder; it just specifies a valid bit stream which can be uniquely interpreted by the decoder. Here, we assume a constant bit rate channel with no errors, and we describe the rate control algorithm used. It should be noted that the rate control is not perfect, since H.263 restricts the change in QP to [-2, 2] between macroblock lines. The fluctuations in QP are generally fairly small, even when there are rapid changes in the scene content, such as fast movement or a lighting change. With the proposed rate controller, the target bit rate will be achieved on the average, but the number of bits spent for each frame will not be constant; thus, a buffer is required. The rate control algorithm is as follows. The first INTRA picture is coded with QP = 16. After the first picture, the buffer content is set to:
buffer_content = R/f_target + 3xR/FR

and B_{i-1} = B.
For the following pictures, the quantizer parameter QP is updated at the beginning of each new macroblock line. The formula for calculating the new QP is:

QP_new = \bar{QP}_{i-1} x (1 + \Delta_1 B/(2B) + 12 x \Delta_2 B/R)
where

\Delta_1 B = B_{i-1} - B

\Delta_2 B = B_{i,mb} - (mb/MB) x B
\bar{QP}_{i-1}: the mean quantizer parameter for the previous picture.
B_{i-1}: the number of bits spent for the previous picture.
B: the target number of bits per picture.
mb: present macroblock number.
MB: number of macroblocks in a picture.
B_{i,mb}: the number of bits spent so far for the current picture.
R: channel bit rate.
FR: frame rate of the source material (typically 25 or 30 Hz).
f_target: target frame rate.
f0: parameter determining the relation between f_target and \bar{QP}_{i-1} (default value = 10).
The first two terms of this formula are constant for all macroblocks within a picture. The third term adjusts the quantizer parameter during the coding of the picture. The calculated QP_new has to be adjusted so that the difference between the old and the new QP is between -2 and 2, as mentioned earlier. To regulate the frame rate, f_target and a new B are calculated at the beginning of each frame. At the start of the second frame:
B_{i-1} = B
At the start of each of the remaining frames:
f_target = f0 - \bar{QP}_{i-1}/4,   with 4 <= f_target <= 10
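The QP update can be sketched as follows (our own function, following the formula above, with the clamping of the change to [-2, 2] and of QP itself to [1, 31] included):

```python
def update_qp(qp_prev_mean, b_prev, b_target, bits_so_far, mb, mb_count,
              rate, qp_old):
    # QP_new = mean_QP_prev * (1 + d1B/(2*B) + 12*d2B/R)
    d1 = b_prev - b_target
    d2 = bits_so_far - (mb / mb_count) * b_target
    qp_new = qp_prev_mean * (1 + d1 / (2 * b_target) + 12 * d2 / rate)
    # H.263 only allows QP to change by -2..2 between macroblock lines,
    # and QP must stay in 1..31.
    qp_new = max(qp_old - 2, min(qp_old + 2, round(qp_new)))
    return max(1, min(31, qp_new))
```

The first term reacts to the bit overshoot or undershoot of the previous picture, while the second term reacts to how far the current picture is ahead of or behind its bit budget; the clamp keeps the quantizer, and hence the picture quality, from changing abruptly.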
An Introduction to Wavelet Theory
3.1 Orthonormal Wavelets
Wavelets are functions derived from a mother wavelet ψ using translations and dilations, as follows:

ψ_{a,b}(t) = |a|^{-1/2} ψ((t - b)/a)
Here, we assume that ψ(.) is a function of one variable. The mother wavelet should satisfy the condition:

\int ψ(x) dx = 0
which implies at least some oscillations. More precisely, the condition that the wavelet must satisfy is:

\int |ψ̂(ω)|^2 |ω|^{-1} dω < ∞

where ψ̂(ω) is the Fourier transform of ψ(t). If ψ(t) decays faster than |t|^{-1}, then Equation (3-3) is equivalent to Equation (3-2).
The definition of wavelets as dilations of a mother function implies that high frequency wavelets correspond to a < 1 (narrow width), whereas low frequency wavelets correspond to a > 1 (wide width). The basic idea of the wavelet transform is the representation of an arbitrary function f(.) as a superposition of wavelet functions. Such a superposition analyzes f(.) into different scale levels, where each level can be further analyzed.
One way of obtaining such a representation is to write f as an integral of ψ_{a,b} with respect to a and b, with appropriate weight coefficients:

f(t) = \int\int c(a,b) ψ_{a,b}(t) da db
In practice, though, we prefer to express f as a sum and not as an integral. Thus, we perform a discretization of the form a = a0^m, b = n b0 a0^m, where m, n ∈ Z, and a0 > 1, b0 > 0 are constants. Then the wavelet analysis becomes:

f = \sum_{m,n} c_{m,n}(f) ψ_{m,n}

where

ψ_{m,n}(t) = ψ_{a0^m, n b0 a0^m}(t) = a0^{-m/2} ψ(a0^{-m} t - n b0)
The f in the notation c_{m,n}(f) simply means that the coefficients depend on the function f. For a0 = 2, b0 = 1, there exist special forms of ψ such that the ψ_{m,n} constitute an orthonormal basis. This means that

c_{m,n}(f) = ⟨ψ_{m,n}, f⟩ = \int ψ_{m,n}(x) f(x) dx
where ⟨., .⟩ denotes the inner product of two functions.
In a multiresolution analysis, we actually deal with two functions: the mother wavelet ψ and the scale function φ. We also define dilated and translated versions of the scale function:

φ_{m,n}(x) = 2^{-m/2} φ(2^{-m} x - n)

For a given m, the φ_{m,n} are orthonormal. We define V_m as the space that the φ_{m,n} span. The spaces V_m are successive approximation spaces: ... ⊂ V_2 ⊂ V_1 ⊂ V_0 ⊂ V_{-1} ⊂ V_{-2} ⊂ ..., each of them with resolution 2^m.
Also, for every m, the basis ψ_{m,n} spans a space W_m, which is the orthogonal complement of V_m with respect to V_{m-1}. Thus, the coefficients ⟨ψ_{m,n}, f⟩ represent the information that is lost when we move from an approximation of f with resolution 2^{m-1} to a coarser approximation with resolution 2^m. Using the above, the following algorithm for the calculation of c_{m,n}(f) = ⟨ψ_{m,n}, f⟩ can be proven:
c_{m,n}(f) = \sum_k g_{2n-k} a_{m-1,k}(f)

a_{m,n}(f) = \sum_k h_{2n-k} a_{m-1,k}(f)
Figure 3.1. Equivalent block diagram for Equations (3-10) and (3-11)
g_l = (-1)^l h_{-l+1}

h_n = 2^{1/2} \int φ(x - n) φ(2x) dx
It can easily be seen that Equations (3-10) and (3-11) are equivalent to filtering a_{m-1,k}(f) with discrete-time filters with impulse responses g_k and h_k, respectively, and downsampling the output by a factor of 2. This is exactly what is done in subband coding.
Figure 3.1 depicts this procedure. We can regard the coefficients a_{m,n}(f) as the projection of the function f onto the space V_m. If the function f is given in a discrete form, then we can assume that the samples of the function correspond to the highest resolution coefficients a_{0,n}. Then,
Equations (3-10) and (3-11) correspond to a subband coding algorithm where the lowpass filter is h and the highpass filter is g. These filters give perfect reconstruction due to their relationship with orthonormal wavelet bases. The reconstruction is as follows:
am,1;l f =
h2n,l am;n f + g2n,l cm;nf
Again, the above equation is equivalent to upsampling by a factor of 2, filtering with the lowpass and highpass filters, and adding the results. This is the same procedure used for reconstruction in subband coding.

In most orthonormal wavelet bases, $\psi$ has infinite support; that is, $\psi$ does not become zero outside some region. This corresponds to filters $h$ and $g$ with infinite length (IIR filters). Daubechies proved in [7] that there exist wavelets with finite support which correspond to filters with finite length (FIR filters). Thus, the wavelets in [7] correspond to a subband coding system in which the same filters are used for the analysis and the synthesis. Such filters are known from the theory of subband coding and filter banks. However, in order for those filters to correspond to a wavelet, they should also satisfy the following regularity condition:

$$\prod_{k=1}^{\infty} H(2^{-k}\xi) \text{ should decay faster than } C\,(1 + |\xi|)^{-\epsilon - 0.5} \text{ as } |\xi| \to \infty \text{ for some } \epsilon > 0,$$

where

$$H(\xi) = 2^{-1/2} \sum_n h_n\, e^{-jn\xi}$$

$H(\xi)$ is similar to the definition of the discrete-time Fourier transform. Most of the filters that are mentioned in the subband coding literature do not satisfy the above condition.
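As a minimal sketch (my own illustration, not from the thesis), the analysis equations (3-10)/(3-11) and the synthesis step can be implemented with the Haar filter pair, which does correspond to an orthonormal wavelet:

```python
import numpy as np

# Haar analysis filters (orthonormal case): h lowpass, g highpass.
SQRT2 = np.sqrt(2.0)

def haar_analyze(a):
    """One analysis step, eqs. (3-10)/(3-11): filter with g and h, downsample by 2."""
    a = np.asarray(a, dtype=float)
    low = (a[0::2] + a[1::2]) / SQRT2    # a_{m,n}(f), lowpass branch
    high = (a[0::2] - a[1::2]) / SQRT2   # c_{m,n}(f), highpass branch
    return low, high

def haar_synthesize(low, high):
    """One synthesis step: upsample by 2, filter with h and g, add the branches."""
    out = np.empty(2 * len(low))
    out[0::2] = (low + high) / SQRT2
    out[1::2] = (low - high) / SQRT2
    return out

x = np.array([4.0, 2.0, 5.0, 5.0, 1.0, 3.0, 7.0, 9.0])   # even length assumed
low, high = haar_analyze(x)
xr = haar_synthesize(low, high)      # perfect reconstruction of x
```

Because the Haar pair is orthonormal, the same filters serve for analysis and synthesis, and the energy of the signal is preserved across the two branches.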
3.2 Biorthogonal Wavelets
Since images are mostly smooth except at the edges, we would like the subband coding filters to come from an orthonormal basis with a smooth mother wavelet. We would also like our filters to be short, in order to keep the computation time small. However, very short filters lead to a mother wavelet that is not very smooth, so there is a trade-off between filter length and wavelet smoothness. We would also like our filters to have linear phase, since such filters can be easily cascaded in pyramidal filter structures without the need for phase compensation [8]. Unfortunately, there exist no nontrivial orthonormal FIR filters with linear phase. The only linear-phase filters associated with an orthonormal wavelet basis are those that correspond to the Haar basis:

$$h_0 = h_1 = 2^{-1/2}, \qquad g_0 = -g_1 = 2^{-1/2}$$

where $h_n = g_n = 0$ for all other $n$.
It is possible to obtain linear-phase filters corresponding to a symmetric wavelet by using biorthogonal wavelets instead of orthonormal ones. We can then construct examples where the mother wavelet has arbitrarily large regularity. In a biorthogonal system, the analysis is done as in an orthonormal system, according to (3-10) and (3-11). However, the reconstruction is done as follows:

$$a_{m-1,l}(f) = \sum_n \left[ \tilde{h}_{2n-l}\, a_{m,n}(f) + \tilde{g}_{2n-l}\, c_{m,n}(f) \right]$$   (3-18)

where the filters $\tilde{h}$ and $\tilde{g}$ are different from $h$ and $g$. In order to have perfect reconstruction, we impose the following conditions:

$$g_n = (-1)^n\, \tilde{h}_{-n+1}$$   (3-19)

$$\tilde{g}_n = (-1)^n\, h_{-n+1}$$   (3-20)

$$\sum_n \tilde{h}_n\, h_{n+2k} = \delta_k$$   (3-21)

where $\delta_k$ is the Kronecker delta:

$$\delta_k = \begin{cases} 1 & \text{if } k = 0 \\ 0 & \text{otherwise} \end{cases}$$
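As an illustration (my own example, not from the thesis), the LeGall 5/3 filters, a commonly used biorthogonal pair, can be checked numerically against the biorthogonality condition (3-21):

```python
import numpy as np

# LeGall 5/3 biorthogonal lowpass filters (a standard example pair),
# normalized so that sum(h) = sum(h_tilde) = 2^{1/2}.
h = np.sqrt(2.0) * np.array([-1.0, 2.0, 6.0, 2.0, -1.0]) / 8.0   # analysis
h_t = np.sqrt(2.0) * np.array([1.0, 2.0, 1.0]) / 4.0             # synthesis

def biorthogonality_check(h, h_t, center_h, center_ht, kmax=3):
    """Evaluate sum_n h_tilde[n] * h[n + 2k] for a range of k (eq. 3-21)."""
    vals = {}
    for k in range(-kmax, kmax + 1):
        s = 0.0
        for i, ht in enumerate(h_t):
            n = i - center_ht               # true index of the h_tilde tap
            j = (n + 2 * k) + center_h      # array position of h[n + 2k]
            if 0 <= j < len(h):
                s += ht * h[j]
        vals[k] = s
    return vals

vals = biorthogonality_check(h, h_t, center_h=2, center_ht=1)
# vals[0] = 1 and vals[k] = 0 for k != 0, i.e. the Kronecker delta of (3-21)
```

The centers mark the $n = 0$ tap of each symmetric filter; the check confirms that the pair yields perfect reconstruction.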
The above equations do not describe anything different from a subband coding system in which the reconstruction filters are different from the analysis filters. If the filters satisfy the supplementary condition that

$$\prod_{k=1}^{\infty} H(2^{-k}\xi) \quad \text{and} \quad \prod_{k=1}^{\infty} \tilde{H}(2^{-k}\xi)$$   (3-23)

decay faster than $C\,(1 + |\xi|)^{-\epsilon - 0.5}$ as $|\xi| \to \infty$ for some $\epsilon > 0$, where

$$H(\xi) = 2^{-1/2} \sum_n h_n\, e^{-jn\xi}, \qquad \tilde{H}(\xi) = 2^{-1/2} \sum_n \tilde{h}_n\, e^{-jn\xi}$$
then we can give the following interpretation to (3-10), (3-11) and (3-18). We define the functions $\phi$ and $\tilde{\phi}$ as

$$\phi(x) = \sum_n h_n\, \phi(2x - n), \qquad \tilde{\phi}(x) = \sum_n \tilde{h}_n\, \tilde{\phi}(2x - n)$$

The Fourier transforms of these functions are the infinite products of (3-23); thus, they are well-defined square-integrable functions. These functions have finite support if the filters $h$ and $\tilde{h}$ are FIR. Additionally, we define the functions $\psi$ and $\tilde{\psi}$ as:

$$\psi(x) = \sum_n g_n\, \phi(2x - n), \qquad \tilde{\psi}(x) = \sum_n \tilde{g}_n\, \tilde{\phi}(2x - n)$$
Then, the coefficients $a_{m,n}$ and $c_{m,n}$ can be rewritten as:

$$a_{m,n}(f) = \langle \phi_{m,n}, f \rangle = 2^{-m/2} \int \phi(2^{-m}x - n)\, f(x)\, dx$$

$$c_{m,n}(f) = \langle \psi_{m,n}, f \rangle = 2^{-m/2} \int \psi(2^{-m}x - n)\, f(x)\, dx$$

Then, the reconstruction is given by the following formula:

$$f = \sum_{m,n} \langle \psi_{m,n}, f \rangle\, \tilde{\psi}_{m,n}$$   (3-32)
If the infinite products in (3-23) decay faster than required by the condition, then $\tilde{\phi}$ and $\tilde{\psi}$, and thus $\phi$ and $\psi$, will be smooth enough. We can see that equation (3-32) is similar to the one we saw for orthonormal wavelets. The difference is that the expansion of $f$ with respect to the basis $\tilde{\psi}_{m,n}$ uses coefficients which are calculated using the dual basis $\psi_{m,n}$, where $\tilde{\psi}$ is different from $\psi$. This interpretation is not possible for all subband coding systems which yield perfect reconstruction. More specifically, the convergence of the infinite products of (3-23) is only possible when

$$\sum_n h_n = 2^{1/2} \quad \text{and} \quad \sum_n \tilde{h}_n = 2^{1/2}$$

Also, in order for (3-32) to be valid, we should have:

$$\sum_n (-1)^n h_n = 0 \quad \text{and} \quad \sum_n (-1)^n \tilde{h}_n = 0$$
Most filters which are used for subband coding do not satisfy the above conditions. It can be proven that we can achieve arbitrarily large regularity in $\psi$ and $\tilde{\psi}$ by using filters with sufficiently large length. More specifically, if $\phi$ and $\tilde{\phi}$ are differentiable $k-1$ and $\tilde{k}-1$ times, respectively, then $H(\xi)$ and $\tilde{H}(\xi)$ should be divisible by $(1 + e^{-j\xi})^k$ and $(1 + e^{-j\xi})^{\tilde{k}}$, respectively. Thus, the lengths of the corresponding filters $h$ and $\tilde{h}$ should be larger than $k$ and $\tilde{k}$, respectively.
From (3-23), divisibility of $H$ implies that $\tilde{\psi}$ will have $k$ consecutive moments equal to zero:

$$\int x^l\, \tilde{\psi}(x)\, dx = 0, \quad \text{for } l = 0, 1, \ldots, k-1$$
Using Taylor series expansions, it is possible to show that if $\tilde{\psi}$ has $k$ consecutive zero moments, then the coefficients $\langle \tilde{\psi}_{m,n}, f \rangle$ of functions $f$ which are $k$ times differentiable have high compression potential, i.e., many coefficients will be negligible. In the application of wavelets to image compression, we are primarily interested in the regularity of $\tilde{\psi}_{m,n}$, which is related to the number of zero moments of $\tilde{\psi}$. Thus, we try to choose $k$ to be as large as possible. The equivalent relationship to (3-21) for perfect reconstruction involving $H$ and $\tilde{H}$ is:

$$H(\xi)\, \overline{\tilde{H}(\xi)} + H(\xi + \pi)\, \overline{\tilde{H}(\xi + \pi)} = 1$$   (3-36)

If we assume that $H(\xi)$ and $\tilde{H}(\xi)$ are divisible by $(1 + e^{-j\xi})^k$ and $(1 + e^{-j\xi})^{\tilde{k}}$, respectively, the above equation reduces to:

$$H(\xi)\, \overline{\tilde{H}(\xi)} = \left( \cos\frac{\xi}{2} \right)^{2l} \left[ \sum_{p=0}^{l-1} \binom{l-1+p}{p} \left( \sin\frac{\xi}{2} \right)^{2p} + \left( \sin\frac{\xi}{2} \right)^{2l} R(\cos\xi) \right]$$

where $R$ is an odd polynomial in $\cos\xi$ and $2l = k + \tilde{k}$; thus the sum of $k$ and $\tilde{k}$ is even. This is due to the symmetry of $h$ and $\tilde{h}$.
3.3 Extension to the Two-Dimensional Case
In two-dimensional wavelet analysis, we introduce a scaling function $\Phi(x, y)$ such that:

$$\Phi(x, y) = \phi(x)\, \phi(y)$$   (3-39)

where $\phi(x)$ is a one-dimensional scaling function. Let $\psi(x)$ be the one-dimensional wavelet associated with the scaling function $\phi(x)$. Then, the three two-dimensional wavelets are defined as:

$$\Psi^H(x, y) = \phi(x)\, \psi(y)$$   (3-40)

$$\Psi^V(x, y) = \psi(x)\, \phi(y)$$   (3-41)

$$\Psi^D(x, y) = \psi(x)\, \psi(y)$$   (3-42)
This is a separable extension to two dimensions. In practice, we use the one-dimensional structure of Fig. 3.1 twice for each scale of the decomposition. We first apply the structure to each row of the image; this gives two images which have one dimension equal to the original image dimension and the other dimension cut in half. Then, we apply the one-dimensional structure to each column of each of the two images. This gives four images, all of them with both dimensions cut in half.
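A minimal sketch of one scale of this row/column procedure (my own illustration, using the Haar filters for concreteness):

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def haar_step(a):
    """1-D analysis along the last axis: lowpass and highpass halves
    (even-length dimension assumed)."""
    return ((a[..., 0::2] + a[..., 1::2]) / SQRT2,
            (a[..., 0::2] - a[..., 1::2]) / SQRT2)

def analyze2d(img):
    """One scale of the separable 2-D decomposition: rows first, then columns."""
    L, H = haar_step(img)                 # each row -> two half-width images
    LL, LH = haar_step(L.T)               # then each column of each image
    HL, HH = haar_step(H.T)
    return LL.T, LH.T, HL.T, HH.T         # four images, both dimensions halved

img = np.arange(16.0).reshape(4, 4)
LL, LH, HL, HH = analyze2d(img)
```

For a typical lowpass image, most of the energy ends up in the LL subband, which is what the quantizer and entropy coder exploit.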
The reconstruction is done in a similar way. Compression is achieved by quantizing and entropy-coding the results of the analysis procedure. Since images are usually lowpass, most of the energy will be concentrated in the low-frequency subbands.
Chapter IV. A Progressive Method of Transmitting the Anchor Frame
As mentioned in the introduction, a progressive method is required for the encoding of the anchor frame, so that we are able to stop the transmission at any time while the decoder can still decode the anchor frame using the already transmitted bits. The method we use here for the transmission of the anchor frame is based on the Embedded Zerotree Wavelet (EZW) algorithm, first proposed in [9] and [10]. We extend the algorithm for use with color images ($Y$, $C_B$, $C_R$), as those used in H.263 video coding. Embedded coding is the same concept as binary finite-precision representations of real numbers. Every real number can be represented by a string of binary digits (bits), and precision is added with each bit we append to the right of the binary number. The encoding can stop at any time and still provide the best available representation of the real number with the specified number of bits. Similarly, the embedded zerotree coder can stop at any time, and the bit stream transmitted so far can still be decoded.
4.2 Encoding of the Wavelet Coefficients Using Zerotrees
As mentioned earlier, in image compression methods based on the wavelet transform (as well as other transforms), compression is achieved by quantizing and entropy-coding the transform coefficients. In low bit rate coding, after the quantization, many of the quantized coefficients will be zero. Thus, we benefit if we can find an efficient way of encoding the locations of the non-zero coefficients along with their values. In this method, encoding of the locations of the non-zero coefficients is done using data structures called zerotrees. A wavelet coefficient $x$ is said to be insignificant with respect to a given threshold $T$ if $|x| < T$. The zerotree is based on the hypothesis that if a wavelet coefficient at a coarse scale is insignificant with respect to a given threshold $T$, then all wavelet coefficients of the same orientation at the same spatial location at finer scales are likely to be insignificant with respect to $T$. In practice, this hypothesis is often true.
In a hierarchical subband system, with the exception of the highest-frequency subbands, every coefficient at a given scale can be related to a set of coefficients of similar orientation at the next finer scale. The coefficient at the coarse scale is called the parent, and all coefficients corresponding to the same spatial location and similar orientation at the next finer scale are called its children. For a given parent, the set of all coefficients of similar orientation at all finer scales corresponding to the same location are called descendants. Similarly, for a given child, the set of coefficients of similar orientation at all coarser scales corresponding to the same spatial location are called ancestors. In the following, we assume a 2-D separable 3-scale wavelet decomposition.
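In the usual single-array pyramid layout of the transform (an assumption of this sketch; the special three-child relation inside the lowest-frequency subband is omitted), the children of a coefficient sit at doubled coordinates:

```python
def children(i, j, rows, cols):
    """Four children of coefficient (i, j) in a pyramid-layout wavelet array
    of size rows x cols; empty for the finest subbands, which have none."""
    if 2 * i >= rows or 2 * j >= cols:
        return []
    return [(2 * i, 2 * j), (2 * i, 2 * j + 1),
            (2 * i + 1, 2 * j), (2 * i + 1, 2 * j + 1)]

def descendants(i, j, rows, cols):
    """All finer-scale coefficients of the same orientation and location."""
    out = []
    stack = children(i, j, rows, cols)
    while stack:
        node = stack.pop()
        out.append(node)
        stack.extend(children(node[0], node[1], rows, cols))
    return out
```

For an 8x8 three-scale array, a coefficient in the middle scale has 4 children and 20 descendants in total (4 children plus their 16 children).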
With the exception of the lowest-frequency subband, all parents have four children. For the lowest-frequency subband, the parent-child relationship is defined such that each parent node has three children. A scanning of the coefficients is performed in such a way that no child node is scanned before its parent. The scan begins at the lowest-frequency subband, and any coefficient within a given subband is scanned before any coefficient in the next subband. Given a threshold $T$ to determine whether or not a coefficient is significant, a coefficient $x$ is said to be an element of a zerotree for threshold $T$ if itself and all of its descendants are insignificant with respect to $T$. An element of a zerotree for threshold $T$ is a zerotree root if it is not the descendant of a previously found zerotree root for threshold $T$, i.e., it is not predictably insignificant from the discovery of a zerotree root at a coarser level at the same threshold. A zerotree root is encoded with a special symbol indicating that the insignificance of the coefficients at finer scales is completely predictable. The significance map can then be efficiently represented as a string of symbols from a 3-symbol alphabet which is entropy-coded. The three symbols used are: (1) zerotree root, (2) isolated zero, (3) significant. Isolated zero means that the coefficient is insignificant but has some significant descendant, i.e., it is not a zerotree root.
As we will see in the next section, it is useful to encode the sign of the significant coefficients along with the significance map. Thus, in practice, four symbols are used: (1) zerotree root, (2) isolated zero, (3) positive significant, (4) negative significant. This minor addition will be useful for embedding.
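A sketch of the resulting four-symbol decision for a single coefficient (my own illustration; the maximum descendant magnitude is assumed to be precomputed):

```python
def classify(x, max_abs_descendant, T):
    """Dominant-pass symbol for coefficient x at threshold T.
    'ZTR' = zerotree root, 'IZ' = isolated zero,
    'POS'/'NEG' = positive/negative significant."""
    if abs(x) >= T:                      # significant: |x| >= T
        return 'POS' if x >= 0 else 'NEG'
    # insignificant: zerotree root iff every descendant is insignificant too
    return 'ZTR' if max_abs_descendant < T else 'IZ'
```

A real encoder would also skip coefficients already covered by a previously found zerotree root, since they are predictably insignificant.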
4.3 Progressive Transmission
To perform the embedded coding, successive-approximation quantization (SAQ) is applied. SAQ sequentially applies a sequence of thresholds $T_0, \ldots, T_{N-1}$ to determine significance. The thresholds are chosen such that $T_i = T_{i-1}/2$. The initial threshold $T_0$ is chosen so that $|x_j| < 2T_0$ for all transform coefficients $x_j$. During the encoding and decoding, two separate lists of wavelet coefficients are maintained. At any point in the process, the dominant list contains the coordinates of those coefficients which have not yet been found to be significant, in the same relative order as the initial scan. This scan is such that the subbands are ordered and, within each subband, the coefficients are ordered. Thus, all coefficients in a given subband appear on the initial dominant list prior to the coefficients of the next subband. The subordinate list contains the magnitudes of those coefficients that have already been found to be significant. This list is scanned once for each threshold.
During a dominant pass, coefficients with coordinates on the dominant list, i.e., those which have not yet been found to be significant, are compared to the threshold $T_i$ to determine their significance and, if found significant, their sign. The significance map is then encoded using the technique described in the previous section. Each time a coefficient is encoded as significant (positive or negative), its magnitude is appended to the subordinate list, and the coefficient in the wavelet transform array is set to zero, so that the significant coefficient does not prevent the occurrence of zerotrees on future dominant passes at smaller thresholds.
The dominant pass is followed by a subordinate pass, in which all coefficients on the subordinate list are scanned and the specifications of the magnitudes available to the decoder are refined by an additional bit of precision. More specifically, during the subordinate pass, the effective quantizer step size, which defines an uncertainty interval for the true magnitude of the coefficient, is cut in half. This refinement can be encoded using a binary alphabet, with the symbol "1" indicating that the true value falls in the upper half of the old uncertainty interval and the symbol "0" indicating the lower half. The string of symbols from this binary alphabet generated during the subordinate pass is then entropy-coded. After the completion of the subordinate pass, the magnitudes on the subordinate list are sorted in decreasing order, to the extent that the decoder has the information to perform the same sort. The process continues to alternate between dominant and subordinate passes, with the threshold halved before each dominant pass. During decoding, each decoded symbol, in both the dominant and the subordinate passes, refines and reduces the width of the uncertainty interval in which the true value of the coefficient (or coefficients, in the case of a zerotree root) may occur. In principle,
the reconstruction value used could be anywhere in the uncertainty interval. In this approach, we use the center of the uncertainty interval as the reconstruction value. The encoding stops when some target stopping condition is met, such as when the bit budget is exhausted. The encoding can cease at any time, and the resulting bit stream contains all lower-rate encodings. Of course, if the bit stream is truncated at an arbitrary point, there may be bits at the end of the code that do not decode to a valid symbol, since a codeword has been truncated; those bits at the end will not actually improve the reconstructed picture. However, terminating the decoding of a bit stream at a specific point produces exactly the same image that would have resulted had that point been the initial target rate.
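A sketch of the subordinate-pass refinement for a single magnitude (my own illustration): each pass emits one bit, halves the uncertainty interval, and the decoder reconstructs at the interval center:

```python
def saq_refine(magnitude, T0, passes):
    """The coefficient magnitude was found significant at threshold T0, so it
    lies in [T0, 2*T0).  Each subordinate pass emits one bit and halves the
    uncertainty interval; the decoder reconstructs at the interval center."""
    lo, hi = T0, 2.0 * T0
    bits = []
    for _ in range(passes):
        mid = (lo + hi) / 2.0
        if magnitude >= mid:
            bits.append(1)      # true value in the upper half
            lo = mid
        else:
            bits.append(0)      # true value in the lower half
            hi = mid
    return bits, (lo + hi) / 2.0   # bit string and decoder reconstruction

bits, rec = saq_refine(23.0, T0=16.0, passes=3)
```

After `passes` refinements the uncertainty interval has width `T0 / 2**passes`, so the reconstruction error is bounded by half of that.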
4.4 Extension of the Algorithm for Color Images
The straightforward way to extend this algorithm for use with color images would be the following: we first take the wavelet transform of each of the luminance and chrominance images ($Y$, $C_B$, $C_R$) independently. Then, we perform a dominant and a subordinate pass for each image component. Thus, one "round" of the process would be: dominant pass for $Y$, subordinate pass for $Y$, dominant pass for $C_B$, subordinate pass for $C_B$, dominant pass for $C_R$, subordinate pass for $C_R$. Unfortunately, this method tends to assign too many bits to the chrominance images: the quality of the $Y$ component suffers in favor of the $C_B$ and $C_R$ components. The method used here is instead to divide the bits available for a frame among $Y$, $C_B$ and $C_R$ using a preset ratio. The ratio used here, which proved to work very well, is the following: if the available bit budget is $B$ bits, we use $B/10$ bits for each of $C_B$ and $C_R$ and the remaining $8B/10$ bits for $Y$.
Chapter V. Approaches to Determining the Best Stopping Point in the Coding of the Anchor Frame
5.1 Formulation of the Problem
As mentioned in the introduction, we want to find the best point in time at which we should stop encoding the anchor frame. In the following, we introduce the notation which we will use in the remainder of this thesis. We assume that we want to code an image sequence at a given frame rate in real time. The sequence is to be transmitted over a constant bit rate channel. Let $r$ be the number of bits we use for the first frame and $SNR_1$ the peak signal-to-noise ratio (PSNR) of the reconstructed first frame. Let $SNR_2$ be the PSNR of the next reconstructed frame. The peak SNR of a reconstructed picture $\hat{x}(i,j)$ for 8-bit images is:

$$PSNR = 10 \log_{10} \frac{255^2}{FD}$$

where

$$FD = \frac{1}{NM} \sum_{i=1}^{M} \sum_{j=1}^{N} \left( x(i,j) - \hat{x}(i,j) \right)^2$$

$M$ and $N$ are the dimensions of the image and $x(i,j)$ is the original image.
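These two formulas can be sketched directly (the image values below are invented for illustration):

```python
import numpy as np

def frame_difference(x, x_hat):
    """FD: mean squared difference between the original and the reconstruction."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    return np.mean((x - x_hat) ** 2)

def psnr(x, x_hat):
    """Peak SNR in dB for 8-bit images: 10 log10(255^2 / FD)."""
    return 10.0 * np.log10(255.0 ** 2 / frame_difference(x, x_hat))

orig = np.full((4, 4), 100.0)
rec = orig + 5.0        # a uniform error of 5 gray levels, so FD = 25
```

With FD = 25 the PSNR is 10 log10(65025/25), roughly 34.15 dB.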
Our goal is to maximize $SNR_2$, provided that the time delay is acceptable and that about the same number of bits is used for the second encoded frame at all times, so as to have a fair comparison. After the second encoded frame, we use our rate controller to achieve a constant frame rate and constant bit rate. We expect the time distance, and subsequently the correlation, between the rest of the frames to be approximately the same. Thus, we expect that good quality of the second encoded frame will carry over to the next frames, at least in the short run. For this reason we chose the maximization of $SNR_2$ as our objective. The solution we propose is to treat $SNR_1$ and $SNR_2$ as time series indexed by the number of source frames that correspond to the bits allotted to the first frame. For example, if the frame rate is 30 frames per second and we allow half a second for the coding of the first frame, then the next frame to be coded will be frame 15. Thus, the PSNR of the first and next frames will be denoted as $SNR_1(15)$ and $SNR_2(15)$. We assume that $SNR_2(n)$ has one maximum, or at least one maximum we are interested in, due to our other restrictions, such as acceptable time delay. In the following, we consider different solutions to the problem, depending on what data we assume to have available to determine the best time to stop encoding the anchor frame.
5.2 Past Values of SNR1(n) and SNR2(n) are Known
Here, we assume that we know the values of $SNR_1(n)$ and $SNR_2(n)$ up to the index $n$ which corresponds to the current point in time. Also, as noted previously, we assume that $SNR_2(n)$ has one maximum. Thus, our problem reduces to finding the maximum of $SNR_2(n)$. Experimental results show that $SNR_2(n)$ is virtually always monotonically increasing up to its maximum. Thus, we can wait until $SNR_2(n)$ starts to decrease and stop at that point. Clearly, we will not get the maximum $SNR_2(n)$, since it will have decreased a little; the index $n$ we get using this method is the "optimal" $n$ plus one. In practice, we do not lose much quality in $SNR_2(n)$, since the decline in its value is small. However, we need to calculate the value of $SNR_2(n)$ for every $n$ in real time. This means that we need to perform the motion compensation, block transform and reconstruction of the second frame for each value of $n$. In the following, we develop a linear model for predicting the next value of $SNR_2(n)$ from past values of $SNR_1(n)$ and $SNR_2(n)$.
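A sketch of this stop-at-first-decrease rule (my own illustration; the PSNR values are invented):

```python
def first_decrease_stop(snr2):
    """Stop at the first n where SNR2(n) < SNR2(n-1); this lands one index
    past the maximum, as described above.  Falls back to the last index if
    no decrease is observed."""
    for n in range(1, len(snr2)):
        if snr2[n] < snr2[n - 1]:
            return n
    return len(snr2) - 1

# Illustrative monotone-then-declining PSNR values (dB); the peak is at index 3.
s = [26.0, 27.5, 28.4, 29.0, 28.8, 28.1]
stop = first_decrease_stop(s)   # stops at index 4, one past the peak
```

In a real encoder each new sample of SNR2(n) requires a full encode/reconstruct of the second frame, which is exactly the cost this chapter tries to avoid.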
5.2.1 A Linear Model for Predicting the Next Value of SNR2(n)
Intuitively, we expect the quality of the second frame to be a function of the quality of the first frame and the correlation between the first and second frames. Thus, two of these three quantities define the third, although in practice this is not exactly true. In this approach, we use $SNR_1(n-i)$, $i = 1, \ldots, N$ and $SNR_2(n-i)$, $i = 1, \ldots, N$ to predict the value of $SNR_2(n)$. To find the optimal linear prediction, we use the orthogonality principle [13], which states that the prediction error must be orthogonal to the data used for the prediction. To avoid confusion with the notation, assume that we have two time series (random processes) $X(n)$ and $Y(n)$ and we want to predict $X(n)$ using previous values of $X(\cdot)$ and $Y(\cdot)$. The prediction $\hat{X}(n)$ will be:

$$\hat{X}(n) = \sum_{i=1}^{N} a_i X(n-i) + \sum_{i=1}^{N} b_i Y(n-i)$$
We need to find $a_i$ and $b_i$, $i = 1, \ldots, N$. The prediction error is:

$$e(n) = X(n) - \sum_{i=1}^{N} a_i X(n-i) - \sum_{i=1}^{N} b_i Y(n-i)$$

and, according to the orthogonality principle,

$$E\{e(n)\, X(n-k)\} = 0, \quad k = 1, \ldots, N$$

$$E\{e(n)\, Y(n-k)\} = 0, \quad k = 1, \ldots, N$$
Assuming that $X(n)$ and $Y(n)$ are jointly wide-sense stationary, we eventually get

$$\sum_{i=1}^{N} a_i\, R_X(i-k) + \sum_{i=1}^{N} b_i\, R_{XY}(i-k) = R_X(k)$$

$$\sum_{i=1}^{N} a_i\, R_{XY}(k-i) + \sum_{i=1}^{N} b_i\, R_Y(k-i) = R_{XY}(k)$$

for $k = 1, \ldots, N$, where:

$$R_X(l) = E\{X(n+l)\, X(n)\}$$   (5-9)

$$R_Y(l) = E\{Y(n+l)\, Y(n)\}$$   (5-10)

$$R_{XY}(l) = E\{X(n+l)\, Y(n)\}$$   (5-11)
This leads to a $2N \times 2N$ linear system of equations with unknowns $a(\cdot)$ and $b(\cdot)$. We solve the system and obtain the prediction coefficients. We use this linear prediction to help us decide whether it is advantageous to continue improving the intra frame, i.e., whether $SNR_2$ will increase. To apply the above derivation to our problem, we substitute $SNR_1$ for $Y(n)$ and $SNR_2$ for $X(n)$.
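A sketch of this procedure on synthetic jointly stationary series (my own illustration; `corr` is a biased time-average estimate, relying on an ergodicity assumption):

```python
import numpy as np

def corr(u, v, l):
    """Biased time-average estimate of E{u(n+l) v(n)}."""
    n = len(u)
    if l < 0:
        return corr(v, u, -l)
    return np.dot(u[l:], v[:n - l]) / n

def predictor_coeffs(X, Y, N):
    """Build and solve the 2N x 2N normal equations for a_i, b_i."""
    A = np.zeros((2 * N, 2 * N))
    rhs = np.zeros(2 * N)
    for k in range(1, N + 1):
        for i in range(1, N + 1):
            A[k - 1, i - 1] = corr(X, X, i - k)          # R_X(i-k)
            A[k - 1, N + i - 1] = corr(X, Y, i - k)      # R_XY(i-k)
            A[N + k - 1, i - 1] = corr(X, Y, k - i)      # R_XY(k-i)
            A[N + k - 1, N + i - 1] = corr(Y, Y, k - i)  # R_Y(k-i)
        rhs[k - 1] = corr(X, X, k)                       # R_X(k)
        rhs[N + k - 1] = corr(X, Y, k)                   # R_XY(k)
    c = np.linalg.solve(A, rhs)
    return c[:N], c[N:]

# Synthetic example where the true relation is X(n) = 0.5 X(n-1) + Y(n-1),
# so the ideal coefficients for N = 1 are a_1 = 0.5 and b_1 = 1.
rng = np.random.default_rng(0)
Y = rng.standard_normal(5000)
X = np.zeros(5000)
for n in range(1, 5000):
    X[n] = 0.5 * X[n - 1] + Y[n - 1]
a, b = predictor_coeffs(X, Y, N=1)
```

With a long enough record, the estimated coefficients approach the ideal values, which is the behavior the thesis relies on when estimating the correlations from time averages.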
5.2.2 Experimental Results
For the experimental results, we used the Mother and Daughter sequence. We assume that $SNR_1$ and $SNR_2$ are jointly wide-sense stationary and ergodic, so that we can estimate the autocorrelation and cross-correlation functions using time averages. For the Embedded Zerotree Wavelet method, we use the filters used by Shapiro in [10], which can be found in [14]. For the $Y$ component, four decomposition scales were used, whereas for the $C_B$ and $C_R$ components, three decomposition scales were used. All time series are computed using the $Y$ components only. For the H.263 coding, we do not use any of the optional coding modes. We assume a constant bit rate channel of 16000 bps and an image sequence with an original frame rate of 30 frames/s. It should be pointed out that the first few samples of $SNR_1(n)$ and $SNR_2(n)$ are not accurate, since the bits used for the coding of the anchor frame using the EZW method are not enough for a decent picture, and consequently predictive coding of the second frame will not work well.
Figure 5.1. SNR1(n): PSNR of the intra frame, starting at frame 0.
Figure 5.1 shows the time series $SNR_1$ starting from frame 0 of the Mother and Daughter sequence. As we would expect, $SNR_1(n)$ is monotonically increasing, since more bits improve the quality of the anchor frame. Figure 5.2 shows the time series $SNR_2$ starting from frame 0 of the Mother and Daughter sequence. As we can see, $SNR_2(n)$ has two peaks. The reason for this is that the mother's hand appears in the picture at some point and later disappears. When the hand appears, the correlation between the anchor frame and the second coded frame drops, resulting in a drop in $SNR_2$. When the hand disappears, the second coded frame becomes more similar to the anchor frame, so the correlation between the two frames becomes high again. Thus, the quality of the second coded frame improves, and this results in the second peak in $SNR_2(n)$. The second peak is higher than the first one because the quality of the anchor frame is higher at the location of the second peak. Figure 5.3 shows the actual and predicted $SNR_2(n)$ when $N = 5$ is used for the prediction; the solid line is the prediction and the crosses are the actual values. Figure 5.4 shows the actual and predicted $SNR_2(n)$ when $N = 1$. In both cases, the prediction follows the actual $SNR_2(n)$ quite well. Figure 5.5 shows the time series $SNR_1$ starting from frame 50 of the Mother and Daughter sequence. Again, this is monotonically increasing, as expected. Figure 5.6 shows the time series $SNR_2$ starting from frame 50 of the Mother and Daughter sequence. This looks more like the result we would intuitively expect, with one major peak. However, $SNR_2(n)$ starts to rise again slowly. This is due to the low motion in this part of the sequence.
Figure 5.2. SNR2(n): PSNR of the second decoded frame, starting at frame 0.
Figure 5.3. Actual and predicted SNR2(n) for N = 5, starting at frame 0.
Figure 5.4. Actual and predicted SNR2(n) for N = 1, starting at frame 0.
Figure 5.5. SNR1(n): PSNR of the intra frame, starting at frame 50.
Figure 5.6. SNR2(n): PSNR of the second frame, starting at frame 50.
Figure 5.7 shows the actual and predicted $SNR_2(n)$ when $N = 5$ is used for the prediction; the solid line is the prediction and the crosses are the actual values. Figure 5.8 shows the actual and predicted $SNR_2(n)$ when $N = 1$. Again, the prediction follows the actual values quite well.
5.3 SNR2(n) is not Known
Let us now suppose that we cannot calculate $SNR_2(n)$ in real time, but do have $SNR_1(n)$. Evaluation of $SNR_1(n)$ requires far less computation, so it is more realistic to assume that we have $SNR_1(n)$ but not $SNR_2(n)$. However, it is clear that we need a second piece of data in order to estimate the index $n$ which results in the maximum $SNR_2(n)$. As mentioned earlier, we expect $SNR_2(n)$ to be a function of $SNR_1(n)$ and the correlation between the first frame and frame $n$. A quantitative measure of the correlation can be the frame difference (FD) or, better, the displaced frame difference (DFD). The frame difference between images $x(i,j)$ and $y(i,j)$ is defined as:
$$FD = \frac{1}{NM} \sum_{i=1}^{M} \sum_{j=1}^{N} \left( x(i,j) - y(i,j) \right)^2$$

The displaced frame difference between images $x(i,j)$ and $y(i,j)$ is defined as:

$$DFD = \frac{1}{NM} \sum_{i=1}^{M} \sum_{j=1}^{N} \left( \tilde{x}(i,j) - y(i,j) \right)^2$$

where $\tilde{x}(i,j)$ is the displaced original image, after motion compensation is performed.
Figure 5.7. Actual and predicted SNR2(n) for N = 5, starting at frame 50.
Figure 5.8. Actual and predicted SNR2(n) for N = 1, starting at frame 50.
In order for FD and DFD to be in the same units (dB) as $SNR_1(n)$ and $SNR_2(n)$, we use the following quantities:

$$FD_{dB} = 10 \log_{10} \frac{255^2}{FD}$$   (5-14)

$$DFD_{dB} = 10 \log_{10} \frac{255^2}{DFD}$$   (5-15)
We can assume that we have a system with $SNR_1(n)$ and the correlation as inputs and $SNR_2(n)$ as output. This system is likely to be nonlinear but, intuitively, we expect it to be memoryless. Thus, the problem reduces to identifying this nonlinear, memoryless system. Assuming that we use the displaced frame difference (DFD) as a measure of correlation, this is equivalent to estimating the function:

$$SNR_2(n) = f(SNR_1(n), DFD(n))$$   (5-16)

The approach we follow is to find some data points of this function using experimental data and then use interpolation to estimate the value of $SNR_2(n)$ given $SNR_1(n)$ and $DFD(n)$. The interpolation method we use is biharmonic spline interpolation [15]. This method allows the $SNR_1(n)$ and $DFD(n)$ values to lie at "random" points rather than regularly spaced points. In the following, we describe biharmonic spline interpolation in one and in two or more dimensions.
5.3.1 Biharmonic Spline Interpolation in One Dimension
We wish to find a biharmonic function which passes through $N$ data points. Draftsmen in the 19th century solved the problem by attaching weights to an elastic beam, or spline, and positioning the weights so that the spline passed through the data points. The forces imposed on the spline by each weight kept it bent. For small displacements, the spline has zero fourth derivative except at the weights (see Figure 5.9). The point-force Green function $\phi(x)$ for the spline satisfies the biharmonic equation:

$$\frac{d^4 \phi}{dx^4} = 6\, \delta(x)$$   (5-17)

The particular solution to (5-17) is $\phi(x) = |x|^3$. When the Green function is used to interpolate $N$ data points $w_i$, located at $x_i$, the problem is:

$$\frac{d^4 w}{dx^4} = \sum_{j=1}^{N} \alpha_j\, 6\, \delta(x - x_j)$$   (5-20)

with the interpolation constraints $w(x_i) = w_i$, $i = 1, \ldots, N$.
The particular solution to (5-20) is a linear combination of point-force Green functions centered at each data point; the homogeneous solution is not used:

$$w(x) = \sum_{j=1}^{N} \alpha_j\, |x - x_j|^3$$   (5-21)

Figure 5.9. The biharmonic function w(x) is found by applying point forces to a thin elastic beam, or spline.

The strength $\alpha_j$ of each point force is found by solving the linear system of equations:

$$w_i = \sum_{j=1}^{N} \alpha_j\, |x_i - x_j|^3, \quad i = 1, \ldots, N$$   (5-22)

Once the $\alpha_j$ are determined, the biharmonic function $w(x)$ can be evaluated at any point using (5-21).
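Equations (5-21) and (5-22) can be sketched directly (the data points below are invented for illustration):

```python
import numpy as np

def spline1d_fit(xi, wi):
    """Solve (5-22): w_i = sum_j alpha_j |x_i - x_j|^3 for the force strengths."""
    G = np.abs(xi[:, None] - xi[None, :]) ** 3
    return np.linalg.solve(G, wi)

def spline1d_eval(x, xi, alpha):
    """Evaluate (5-21): w(x) = sum_j alpha_j |x - x_j|^3."""
    return (np.abs(np.asarray(x)[:, None] - xi[None, :]) ** 3) @ alpha

xi = np.array([0.0, 1.0, 2.5])        # data locations (illustrative)
wi = np.array([1.0, 3.0, 2.0])        # data values
alpha = spline1d_fit(xi, wi)
w = spline1d_eval(xi, xi, alpha)      # the interpolant passes through the data
w_new = spline1d_eval(np.array([1.7]), xi, alpha)
```

By construction, evaluating the interpolant at the data locations reproduces the data values exactly.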
5.3.2 Biharmonic Spline Interpolation in 2 or More Dimensions
The derivation of the technique in two or more dimensions is similar to the derivation in one dimension. For $N$ data points in $m$ dimensions, the problem becomes:

$$\nabla^4 w(\mathbf{x}) = \sum_{j=1}^{N} \alpha_j\, \delta(\mathbf{x} - \mathbf{x}_j), \qquad w(\mathbf{x}_i) = w_i$$   (5-24)

where $\nabla^4$ is the biharmonic operator and $\mathbf{x}$ is a position in the $m$-dimensional space. The general solution is:

$$w(\mathbf{x}) = \sum_{j=1}^{N} \alpha_j\, \phi_m(\mathbf{x} - \mathbf{x}_j)$$   (5-25)

Again, we find the $\alpha_j$ by solving the linear system of equations:

$$w_i = \sum_{j=1}^{N} \alpha_j\, \phi_m(\mathbf{x}_i - \mathbf{x}_j)$$   (5-26)

Once the $\alpha_j$ are determined, the biharmonic function $w(\mathbf{x})$ can be evaluated at any point using (5-25). The biharmonic Green functions $\phi_m(\mathbf{x})$ for each dimension are given in Table 5.1.

Number of Dimensions $m$  |  Green Function $\phi_m(\mathbf{x})$
1  |  $|\mathbf{x}|^3$
2  |  $|\mathbf{x}|^2 (\ln |\mathbf{x}| - 1)$
3  |  $|\mathbf{x}|$
4  |  $\ln |\mathbf{x}|$
5  |  $|\mathbf{x}|^{-1}$
6  |  $|\mathbf{x}|^{-2}$
$m$  |  $|\mathbf{x}|^{4-m}$

Table 5.1. Biharmonic Green functions

As seen from the table, in two dimensions, which is the case of interest here, the Green function is:

$$\phi_2(\mathbf{x}) = |\mathbf{x}|^2 \left( \ln |\mathbf{x}| - 1 \right)$$
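The two-dimensional case used for (5-16) can be sketched similarly (my own illustration; the (SNR1, DFD_dB) sample values below are invented):

```python
import numpy as np

def green2(r):
    """phi_2(x) = |x|^2 (ln|x| - 1), with phi_2(0) = 0 taken as the limit."""
    out = np.zeros_like(r)
    nz = r > 0
    out[nz] = r[nz] ** 2 * (np.log(r[nz]) - 1.0)
    return out

def spline2d_fit(pts, wi):
    """Solve (5-26) for the force strengths alpha_j at the data points."""
    r = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    return np.linalg.solve(green2(r), wi)

def spline2d_eval(q, pts, alpha):
    """Evaluate (5-25) at the query points q (a k x 2 array)."""
    r = np.linalg.norm(q[:, None, :] - pts[None, :, :], axis=2)
    return green2(r) @ alpha

# Hypothetical (SNR1, DFD_dB) inputs and SNR2 outputs, in dB.
pts = np.array([[28.0, 24.0], [30.0, 25.5], [31.5, 23.0], [29.0, 26.5]])
wi = np.array([27.0, 28.5, 28.0, 27.5])
alpha = spline2d_fit(pts, wi)
w = spline2d_eval(pts, pts, alpha)    # reproduces the data points
est = spline2d_eval(np.array([[29.5, 25.0]]), pts, alpha)   # a lookup query
```

This is the lookup-table-plus-interpolation scheme used in the experiments: data points collected off-line, then (5-25) evaluated at the observed (SNR1, DFD_dB) pair.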
5.3.3 Experimental Results
In the following example, we estimate the values of $SNR_2(n)$ using actual values of $SNR_1(n)$ and $DFD_{dB}$, for the Mother and Daughter sequence. It should be emphasized that we are interested only in the local maximum of $SNR_2(n)$ and not in its actual values. In practice, the "DC level" of the estimated $SNR_2(n)$ sometimes differs from the original but, in most cases, the index $n$ which corresponds to the local maximum is estimated with very good accuracy. Also, as noted earlier, the first few samples of $SNR_1(n)$ and $SNR_2(n)$ are inaccurate; thus, in the following simulation, the first five samples of all sequences have been removed. Figure 5.10 shows the actual values of $SNR_2(n)$ and Figure 5.11 shows the estimated values of $SNR_2(n)$. The $SNR_2(n)$ is computed from the Mother and Daughter sequence starting at frame 0 (the first five values are not used, as mentioned earlier). The values used for the lookup table are from different parts of the same sequence. As we can see, the actual maximum is at index 8, whereas the maximum of the estimated $SNR_2(n)$ is at index 5. Figures 5.12 and 5.13 show the actual and predicted $SNR_2(n)$ when we start from frame 50 of the Mother and Daughter sequence. In this case, the actual maximum is at index 13 and the estimated maximum is at index 9. Figures 5.14 and 5.15 show the actual and predicted $SNR_2(n)$ when we start from frame 100 of the Mother and Daughter sequence. This is a case with very little motion, where the trend of $SNR_2(n)$ is increasing, with the exception of less pronounced local maxima. The first local maximum occurs at index 21 in the actual $SNR_2(n)$ and at index 22 in the estimated $SNR_2(n)$. Clearly, from these examples we see that the first maximum of $SNR_2(n)$ can be estimated very accurately in most cases using the proposed method.
Figure 5.10. Actual values of SNR2(n).
Figure 5.11. Estimated values of SNR2(n).
Figure 5.12. Actual values of SNR2(n).
Figure 5.13. Estimated values of SNR2(n).
Figure 5.14. Actual values of SNR2(n).
Figure 5.15. Estimated values of SNR2(n).
The motivation for this work was the intuitive expectation that there should be an "optimal" bit allocation for the anchor frame which gives the best performance. This bit allocation depends on the image sequence and should be determined on-line; thus, a progressive method for the transmission of the anchor frame should be used. We used the maximization of $SNR_2$ as our objective, since we want to achieve a constant frame rate after the second frame and it is reasonable to assume that the correlation between any two subsequent encoded frames will be approximately the same. This will hold unless there is a dramatic change in the picture, but such a change cannot be predicted. Thus, better quality of the second encoded frame will usually mean better quality of the rest of the frames, at least in the short run or until there is a dramatic change in the scene. Experimental results showed that, except in cases with very little motion, there is a well-defined first maximum in $SNR_2$. Thus, it is clear that if we allocate to the anchor frame the number of bits specified by the peak in $SNR_2$, we can expect better overall quality of the image sequence, according to the previous discussion. If computational complexity is no object, it is easy to find the first maximum in $SNR_2$. From experimental results, we know that $SNR_2(n)$ is virtually always
72 monotonically increasing up to the rst maximum. Thus, if we can calculate SNR2 n on-line, we can locate the maximum by pinpointing the location where SNR2 n starts to decrease. However, the computational complexity in this case is very high since we need to encode and reconstruct the second frame every 1 30th of a second in the case of source frame rate of 30 frames per second. Future work in this area can involve development of a more sophisticated algorithm for the location of the maximum which will be able to identify a very small local maximum where SNR2 declines a little and then it immediately starts to increase again up to a well de ned maximum. The new algorithm should be able to disregard such a small local maximum. If we assume that we cannot calculate SNR2 n on-line, we can assume that SNR2 is a function of SNR1 and the correlation between the two frames, as measured by the energy of the Displaced Frame Di erence DFD. In this work, we estimate this function using interpolation from experimental data. The computational complexity in this case is much lower, since we do have to perform motion compensation to nd the DFD, but we do not have to perform all the other steps that are required in order to nd the reconstructed second frame and calculate its PSNR. The experimental results show that, using the above method, we can satisfactorily estimate the maximum in SNR2. In most cases, the estimated location of the maximum is very close to the actual location. Thus, in this thesis, we found that it is worthwhile to send the anchor frame progressively and employ methods for determining the best stopping point. The methods described here proved to work satisfactorily.
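The on-line search described above, including the suggested refinement that disregards very small local maxima, can be sketched as follows. The function name and the dip tolerance `eps` are assumptions introduced for illustration, not parameters from the thesis.

```python
def first_maximum(snr2, eps=0.05):
    """Return the index of the first well-defined local maximum of the
    SNR2(n) sequence, ignoring dips smaller than `eps` dB.

    Scans for the point where SNR2 stops increasing; a decline of at
    most `eps` followed by renewed growth is treated as noise.
    """
    peak = 0                      # index of the best value seen so far
    for n in range(1, len(snr2)):
        if snr2[n] >= snr2[peak]:
            peak = n              # still climbing toward the maximum
        elif snr2[peak] - snr2[n] > eps:
            return peak           # genuine decline: first maximum found
    return peak                   # sequence never clearly declined

# Small dip at index 4 is ignored; the clear drop after index 6 is not.
snr2 = [27.5, 28.1, 28.9, 29.4, 29.38, 29.6, 29.8, 28.9, 28.5]
print(first_maximum(snr2))  # -> 6
```

In the on-line setting, each new value of SNR2(n) would be appended as the corresponding bit budget is tried, and transmission of the anchor frame stops as soon as the function returns.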