We are IntechOpen, the world’s leading publisher of Open Access books
Built by scientists, for scientists

4,300
Open access books available

116,000
International authors and editors

125M
Downloads

154
Countries delivered to

TOP 1%
Our authors are among the most cited scientists

12.2%
Contributors from top 500 universities

WEB OF SCIENCE™
Selection of our books indexed in the Book Citation Index
in Web of Science™ Core Collection (BKCI)

Interested in publishing with us?
Contact book.department@intechopen.com

Numbers displayed above are based on latest data collected.
For more information visit www.intechopen.com
Chapter 8

Fast Algorithm Designs of Multiple-Mode Discrete Integer Transforms with Cost-Effective and Hardware-Sharing Architectures for Multistandard Video Coding Applications

Chih-Peng Fan

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/64985

Abstract

In this chapter, first we give a brief view of transform-based video coding. Second, the basic matrix decomposition scheme for fast algorithm and hardware-sharing-based integer transform design are described. Finally, two case studies for fast algorithm and hardware-sharing-based architecture designs of discrete integer transforms are presented, where one is for the single-standard multiple-mode video transform-coding application, and the other is for the multiple-standard multiple-mode video transform-coding application.

Keywords: video coding, transform coding, fast algorithm, matrix factorization, hardware sharing, multiple modes, multiple standards

1. Introduction

Video-coding system has generally utilized block-based transform-coding skills to shrink the data rates by joining quantization and entropy coding. Among some block-based transforms, the discrete cosine transform (DCT) [1] and integer transforms have extensively been used to still image and video-coding specifications, such as JPEG [2], MPEG-1/2 [3, 4], MPEG-4 [5], H. 264/AVC [6, 7], AVS [8, 9], VC-1 [10], VP8 [11], and HEVC [12]. Because integer transforms perform the low complexity and effective coding performance, the advanced video coding (AVC) in ITU-T H.264 [6, 7, 13, 14], which is also known as MPEG-4 part 10, applies integer
transforms for transform process. The 4 × 4 and 8 × 8 transforms in [13, 14] were calculated exactly to prevent non-adaptation issues of inverse transforms for high-quality moving visual images. The VC-1 specification [10, 15, 16] employed 4 × 4 and 8 × 8 integer transforms, and it was developed by Microsoft Corporation and standardized by the Society of Motion Picture and Television Engineers (SMPTE). The 8 × 8 integer transform is utilized to obtain the high-coding performance in the Audio Video Coding Standard (AVS) for China [8, 9]. In [11], the VP8 video-coding standard was developed for Internet browser applications. The Joint Collaborative Team on Video Coding proposed the high-efficiency video coding (HEVC) specification [12]. By HEVC, the compression efficiency was greatly better than that achieved using the H.264/AVC high-profile-coding specification.

To support the single-standard H.264/AVC video coding, several transform architectures in [17–24] have been developed to approach the multiple transform modes in H.264. To support the single-standard H.265/HEVC video coding, several transform architectures in [25–32] have been developed to approach the multiple transform modes in HEVC. Besides, supporting multiple-standard functions in video coding has been an important issue in multimedia applications recently, such as H.264/AVC, MPEG-1/2/4, VC-1, AVS, and VP8 standards, and several transform architectures in [33–41] have also been developed to complete the multiple transform functions. Owing to the growth of multistandard video-coding applications, how to achieve low-computational complexities and implement by hardware-sharing-based cost-effective architectures simultaneously are interesting research topics for the VLSI design of video codecs.

2. Matrix decomposition preprocessing for fast algorithm and hardware-sharing-based designs

Based on the resemblance property, the 8 × 8 inverse integer transforms [41] in H.264/AVC, AVS, VC-1, VP8, MPEG-1/2/4, and HEVC specifications are revealed in Eq. (1), and Table 1 depicts the coefficient values in the transforms.

\[
C_{8 \times 8} = \begin{bmatrix}
    a & b & f & c & a & d & g & e \\
    a & c & g & -e & -a & -b & -f & -d \\
    a & d & -g & -b & -a & e & f & c \\
    a & e & -f & -d & a & c & -g & -b \\
    a & -e & -f & d & a & -c & -g & b \\
    a & -d & -g & b & -a & -e & f & -c \\
    a & -c & g & e & -a & b & -f & d \\
    a & -b & f & -c & a & -d & g & -e
\end{bmatrix}
\]
<table>
<thead>
<tr>
<th>Transform sizes</th>
<th>VC-1</th>
<th>AVS</th>
<th>VP8</th>
<th>MPEG-1/2/4</th>
<th>H.264/AVC</th>
<th>HEVC</th>
</tr>
</thead>
<tbody>
<tr>
<td>4 × 4</td>
<td>√</td>
<td>√</td>
<td>√</td>
<td>N/A</td>
<td>√</td>
<td>√</td>
</tr>
<tr>
<td>8 × 8</td>
<td>√</td>
<td>√</td>
<td>N/A</td>
<td>√</td>
<td>√</td>
<td>√</td>
</tr>
<tr>
<td>16 × 16</td>
<td>N/A</td>
<td>N/A</td>
<td>N/A</td>
<td>N/A</td>
<td>N/A</td>
<td>√</td>
</tr>
<tr>
<td>32 × 32</td>
<td>N/A</td>
<td>N/A</td>
<td>N/A</td>
<td>N/A</td>
<td>N/A</td>
<td>N/A</td>
</tr>
</tbody>
</table>

Table 1. The transform modes in several video-coding standards [41].

In Eq. (1), it is decomposed by Eq. (2) as

$$C_{8 \times 8} = P_1 \cdot A_0 \cdot P_r.$$  \hspace{1cm} (2)

In Eq. (2), $A_0$ is divided into two modules, $U_{4 \times 4}$ and $D_{4 \times 4}$, where

$$P_1 = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
0 & 1 & 0 & 0 & 0 & -1 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
\end{bmatrix},$$

$$A_0 = \begin{bmatrix}
a & f & a & g & 0 & 0 & 0 & 0 \\
a & g & -a & -f & 0 & 0 & 0 & 0 \\
a & -g & -a & f & 0 & 0 & 0 & 0 \\
a & -f & a & -g & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & -e & d & -c & b \\
0 & 0 & 0 & 0 & -d & b & -e & -c \\
0 & 0 & 0 & 0 & -c & e & b & d \\
0 & 0 & 0 & 0 & -b & -c & -d & -e \\
\end{bmatrix},$$

Thus

$$A_0 = U_{4 \times 4} \oplus D_{4 \times 4}.$$  \hspace{1cm} (3)

and $C_{8 \times 8}$ becomes

$$C_{8 \times 8} = P_1 \cdot (U_{4 \times 4} \oplus D_{4 \times 4}) \cdot P_r.$$  \hspace{1cm} (4)

In (3), “$\oplus$” is the direct sum operator, and the two diagonal blocks $U_{4 \times 4}$ and $D_{4 \times 4}$ are processing in parallel. To cut down the computational operations and achieve effective hardware shares, the upper diagonal matrix $U_{4 \times 4}$ and the down diagonal matrix $D_{4 \times 4}$ are further decomposed into the cascaded multiplication form or the addition form of sparse matrices. After matrix
factorizations, the chosen sparse matrices have the coefficients which are 1, −1, 0, or an integer, and an integer value can equal the combination of powers of two. Besides, zero factors in the chosen sparse matrices could be factorized as many as possible [42].

By Eq. (1), for VC-1 the values of the coefficient set \{a, b, c, d, e, f, g\} are \{12, 16, 15, 9, 4, 16, 6\}, and those for AVS are \{8, 10, 9, 6, 2, 10, 4\}. Next, those for MPEG-1/2/4 are \{362, 502, 426, 284, 100, 473, 196\}, and those for H.264/AVC are \{8, 12, 10, 6, 3, 8, 4\}. Finally, those for HEVC are \{64, 89, 75, 50, 18, 83, 36\}.

The general 4 × 4 inverse integer transform matrices [41] can be presented in Eq. (5) as

\[
M_{4×4} = \begin{bmatrix}
  h & i & h & j \\
  h & j & -h & -i \\
  h & -j & h & i \\
  h & -i & h & -j \\
\end{bmatrix}.
\]

By Eq. (5), for VC-1 the values of the coefficient set \{h, i, j\} are \{17, 22, 10\}, and those for VP8 are \{128, 167, 70\}. Next, those for AVS-M are \{2, 3, 1\}, and those for H.264/AVC are \{1, 1, 0.5\}. Finally, those for HEVC are \{64, 83, 36\}.

3. Case study [32]: single-standard multiple-mode transform design

3.1. Hardware-sharing based 32 × 32 integer core transform for HEVC

The one-dimensional (1D) 32 × 32 inverse core transform for HEVC is described in [30]. By the symmetrical property, the 32 × 32 inverse core transform is presented as

\[
H_{32} = P_{A} \cdot C_{A1},
\]

where

\[
C_{A1} = \begin{bmatrix}
  C_{11} & C_{12} \\
  C_{21} & C_{22} \\
\end{bmatrix},
\]

\[
P_{A} = \begin{bmatrix}
  I_{16} & -I_{16} \\
  I_{16} & I_{16} \\
\end{bmatrix},
\]

\[
I_{16} = \begin{bmatrix}
  0 & 0 & \cdots & 0 & 1 \\
  0 & 0 & 0 & 1 & 0 \\
  \vdots & \vdots & \ddots & \vdots & \vdots \\
  0 & 1 & 0 & \cdots & 0 \\
  1 & 0 & \cdots & 0 & 0 \\
\end{bmatrix}
\]

identity matrix. In Eq. (6), \(P_{A}\) is the butterfly-like postprocessing, and \(C_{A1}\) is the sparse matrix. By swapping each column of \(C_{A1}\), it becomes

\[
C_{A1} = C_{A2} \cdot P_{A}.
\]

By Eqs. (6) and (7), \(H_{32}\) becomes
\[ H_{32} = P_t \cdot C_{42} \cdot P_{Ar}, \]  
(8)

where \( P_{Ar} \) is the permutation matrix. In Eq. (7), \( C_{42} \) is expressed by

\[ C_{42} = \begin{bmatrix} T_{411} & 0_{16 \times 16} \\ 0_{16 \times 16} & T_{422} \end{bmatrix} = T_{411} \oplus T_{422}, \]  
(9)

where \( \oplus \) means the direct sum operation, and then \( T_{411} \) and \( T_{422} \) are 16 \( \times \) 16 matrices, which are revealed in [32]. The matrix \( P_{Ar} \) in Eq. (8) is expressed as

\[ P_{Ar} = P(2,16), \]  
(10)

where the permutation matrix \( P(m, n) \) is defined in [43], and the notation \( \otimes \) means the Kronecker product. In Eq. (9), \( A_{a22} \) is presented as

\[ T_{a22} = T_{M1} + T_{N1}, \]  
(11)

First, the lower half of \( C_{N1} \) is divided into sixteen 8 \( \times \) 1 column vectors \( X_i \) where \( i = 0, 1, 2, \ldots, 15 \), and then \( T_{N1} \) becomes

\[ T_{N1} = \begin{bmatrix} 0_{8 \times 16} \\ X_0 & X_1 & \cdots & X_{15} \end{bmatrix}. \]  
(12)

Second, the coefficients in a single column vector can be shared. The vector coefficient computations are achieved by integrating several base coefficients [32]. After realizing the column vectors of \( T_{N1} \), the lower half of \( T_{N1} \) is factorized as an integration of eight 1 \( \times \) 16 row vectors depicted as \( Y_i \) where \( i = 8, 9, \ldots, 15 \), and \( T_{N1} \) becomes

\[ T_{N1} = \begin{bmatrix} 0_{8 \times 16} \\ Y_4 \\ Y_9 \\ \vdots \\ Y_{15} \end{bmatrix}. \]  
(13)
Adder tree structures are utilized to calculate the aggregate results for the row vectors $Y_0$–$Y_{15}$ [32]. By the duplicate operations for $T_{N1}$, $T_{M1}$ is presented as

$$T_{M1} = \begin{bmatrix} \hat{X}_0 & \cdots & \hat{X}_{15} \\ \vdots & \ddots & \vdots \\ 0_{8\times16} & \cdots & 0_{8\times16} \end{bmatrix},$$

(14)

where $\hat{X}_i$ is an $8 \times 1$ column vector, where $i = 0, 1, 2, \ldots, 15$. Then, $T_{M1}$ becomes

$$T_{M1} = \begin{bmatrix} Y_0 \\ \vdots \\ Y_{16} \\ 0_{8\times16} \end{bmatrix},$$

(15)

where $Y_i$ is a $16 \times 1$ row vector, where $i = 0, 1, \ldots, 7$. The realization of $T_{M1}$ equals that of $T_{N1}$. Finally, the operations of $T_{M1}$ and $T_{N1}$ are merged to $T_{A1}$. The computational operations $T_{A2}$ require 630 additions and 326 shift operations [32]. The matrix $T_{A1}$ in Eq. (9), which is also denoted as $H_{16}$, is the 1D $16 \times 16$ inverse core transform in HEVC [30].

3.2. Hardware-sharing-based 16 × 16 integer core transform for HEVC

The $16 \times 16$ integer core transform in [30] changes into

$$H_{i6} = P_B \cdot C_{B1},$$

(16)

where $P_B = \begin{bmatrix} I_{8\times8} & -I_{8\times8} \\ I_{8\times8} & I_{8\times8} \end{bmatrix}$ and $C_{B1}$ is revealed in [32]. By swapping each column of $C_{B1}$, it will be

$$C_{B1} = C_{B2} \cdot P_{B2},$$

(17)

where $P_{B2} = P(8,2)$. By Eqs. (16) and (17), $H_{i6}$ is expressed by

$$H_{i6} = T_{A1} = P_B \cdot C_{A1} \cdot P_{B2}.$$

(18)
In Eq. (18), \( C_{B_2} \) is presented as

\[
C_{B_2} = \begin{bmatrix}
T_{B_1} & 0_{8,8} \\
0_{8,8} & T_{B_2}
\end{bmatrix} = T_{B_1} \otimes T_{B_2},
\]

(19)

and \( T_{B_2} \) becomes

\[
T_{B_2} = T_{M_2} + T_{N_2},
\]

(20)

where

\[
T_{M_2} = \begin{bmatrix}
-9 & 25 & -43 & 57 & -70 & 80 & -87 & 90 \\
-25 & 70 & -90 & 80 & -43 & -9 & 57 & -87 \\
-43 & 90 & -57 & -25 & 87 & -70 & -9 & 80 \\
-57 & 80 & 25 & -90 & 9 & 87 & -43 & -70 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}
\]

and

\[
T_{N_2} = \begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
-70 & 43 & 87 & -9 & -90 & -25 & 80 & 57 \\
-80 & -9 & 70 & 87 & 25 & -57 & -90 & -43 \\
-87 & -57 & -9 & 43 & 80 & 90 & 70 & 25 \\
\end{bmatrix}
\]

By the duplicate processed of \( T_{N_1} \) in Section 3.1, \( T_{N_2} \) turns into

\[
T_{N_2} = \begin{bmatrix}
0_{4,8} \\
U_0 \ldots U_7
\end{bmatrix},
\]

(21)

where \( U_i \) is an \( 8 \times 1 \) column vector, where \( i = 0, 1, 2, \ldots, \) and 7. Next, \( T_{N_2} \) also is

\[
T_{N_2} = \begin{bmatrix}
0_{4,8} \\
V_4 \\
\vdots \\
V_7
\end{bmatrix},
\]

(22)
where \( V_i \) is a 1 \( \times \) 8 row vector, where \( i = 4, 5, 6, \) and 7. Adder tree schemes are applied to compute the summed outcomes of \( V_i - V_j \) \[32\]. By the same processes of \( T_{M1} \) in Section 3.1, \( T_{M2} \) becomes

\[
T_{M2} = \begin{bmatrix}
\hat{U}_0 & \ldots & \hat{U}_7 \\
0_{4 \times 8}
\end{bmatrix}.
\]

(23)

where \( \hat{U}_i \) is a 4 \( \times \) 1 column vector, where \( i = 0, 1, 2, \ldots, \) and 7. Next, \( T_{M2} \) also is

\[
T_{M2} = \begin{bmatrix}
V_0 \\
\vdots \\
V_7 \\
0_{4 \times 8}
\end{bmatrix}.
\]

(24)

where \( V_i \) is a 1 \( \times \) 8 row vector, where \( i = 0, 1, 2, \) and 3. Then, adder trees are used to treat the row vectors \( V_0 - V_3 \) \[32\]. Finally, the calculations of \( T_{M2} \) and \( T_{N2} \) are merged to \( T_{B2} \). The computational operations of \( T_{B2} \) are 164 additions and 106 shift operations \[32\]. Meantime, the \( T_{B1} \) in Eq. (19), which is also denoted as \( H_{v} \), is the 1D 8 \( \times \) 8 inverse core transform in HEVC \[30\].

### 3.3. Hardware-sharing-based 8 \( \times \) 8 integer core transform for HEVC

The 8 \( \times \) 8 integer transform in \[30\] is described as

\[
H_{x8} = P_C \cdot C_{C1}.
\]

(25)

where \( P_C = \begin{bmatrix} I_{4 \times 4} & -I_{4 \times 4} \end{bmatrix} \) and \( C_{C1} = \begin{bmatrix} 64 & 0 & 83 & 0 & 64 & 0 & 36 & 0 \\
64 & 0 & 36 & 0 & -64 & 0 & -83 & 0 \\
64 & 0 & -36 & 0 & -64 & 0 & 83 & 0 \\
64 & 0 & -83 & 0 & 64 & 0 & -36 & 0 \\
0 & -18 & 0 & 50 & 0 & -75 & 0 & 89 \\
0 & -50 & 0 & 89 & 0 & -18 & 0 & -75 \\
0 & -75 & 0 & 18 & 0 & 89 & 0 & 50 \\
0 & -89 & 0 & -75 & 0 & -50 & 0 & -18 \end{bmatrix} \)

After swapping each column in \( C_{C1} \), it changes into

\[
C_{CS} = C_{C2} \cdot P_C,
\]

(26)
Based on Eqs. (25) and (26), $H_{i8}$ is presented by

$$H_{i8} = T_{811} = P_r \cdot C_{C2} \cdot P_{r^*},$$  (27)

In Eq. (27), $C_{C2}$ becomes

$$C_{C2} = \begin{bmatrix} T_{C11} & 0_{4 \times 4} \\ 0_{4 \times 4} & T_{C22} \end{bmatrix} = T_{C11} \otimes T_{C22},$$  (28)

where $T_{C11} = \begin{bmatrix} 64 & 83 & 64 & 36 \\ 64 & 36 & -64 & -83 \\ 64 & -36 & -64 & 83 \\ 64 & -83 & 64 & -36 \end{bmatrix}$ and $T_{C22} = \begin{bmatrix} -18 & 50 & -75 & 89 \\ -50 & 89 & -18 & -75 \\ -75 & 18 & 89 & 50 \\ -89 & -75 & -50 & -18 \end{bmatrix}$.

In Eq. (28), $T_{C22}$ is factorized as

$$T_{C22} = S_1 + S_2,$$  (29)

where $S_1 = \begin{bmatrix} -18 & 0 & 0 & 89 \\ 0 & 89 & -18 & 0 \\ 0 & 18 & 89 & 0 \\ -89 & 0 & 0 & -18 \end{bmatrix}$. Moreover, $S_1$ is expressed by

$$S_1 = Z_1 + (18 \cdot Z_2),$$  (30)

where $Z_1 = \begin{bmatrix} 0 & 0 & 0 & -1 \\ 0 & -1 & 0 & 0 \\ 0 & 0 & -1 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix}$ and $Z_2 = \begin{bmatrix} -1 & 0 & 0 & 5 \\ 0 & 5 & -1 & 0 \\ 0 & 1 & 5 & 0 \\ -5 & 0 & 0 & -1 \end{bmatrix}$. In Eq. (29), $S_2$ is presented as

$$S_2 = 25 \cdot Z_3,$$  (31)
where \( Z_3 = \begin{bmatrix} 0 & 2 & -3 & 0 \\ -2 & 0 & 0 & -3 \\ -3 & 0 & 0 & 2 \\ 0 & -3 & -2 & 0 \end{bmatrix} \). By Eqs. (29)–(31), \( T_{c22} \) becomes

\[
T_{c22} = Z_1 + (18 \cdot Z_2) + (25 \cdot Z_3).
\] (32)

In Eq. (32), the computations of \( T_{c22} \) require 36 additions and 28 shift operations [32]. The matrix \( T_{c11} \) in Eq. (28) is also the 1D 4 \( \times \) 4 inverse core transform matrix in HEVC.

### 3.4. Hardware-sharing-based 4 \( \times \) 4 integer core transform for HEVC

The 4 \( \times \) 4 integer core transform matrix is indicated as

\[
H_4 = P_D \cdot C_{D1}, \quad (33)
\]

where 

\[
P_D = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & -1 \\ 1 & 0 & -1 & 0 \end{bmatrix}
\]

and

\[
C_{D1} = \begin{bmatrix} 64 & 0 & 64 & 0 \\ 64 & 0 & -64 & 0 \\ 0 & -36 & 0 & 83 \\ 0 & -83 & 0 & -36 \end{bmatrix}.
\]

By swapping each column of \( C_{D1} \), it changes into

\[
C_{D1} = C_{D2} \cdot P_{D2}. \quad (34)
\]

where 

\[
P_{D2} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}.
\]

From Eqs. (33) and (34), \( H_4 \) is described by

\[
H_4 = T_{c11} = P_D \cdot C_{D2} \cdot P_{D2}. \quad (35)
\]

In Eq. (34), \( C_{D2} \) is rewritten as

\[
C_{D2} = T_{D11} \oplus T_{D22}. \quad (36)
\]

In Eq. (36), \( T_{D11} \) becomes

\[
T_{D11} = 64 \cdot Z_4. \quad (37)
\]
where \( Z_4 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \). In Eq. (36), \( T_{r22} \) is indicated by \( Z_5 \) and \( Z_6 \) as

\[
T_{r22} = 36 \cdot Z_5 + 11 \cdot Z_6, \tag{38}
\]

where \( Z_5 = \begin{bmatrix} 2 & 1 \\ 1 & -2 \end{bmatrix} \) and \( Z_6 = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix} \). Thus, the computations of \( T_{r22} \) are 10 additions and 10 shift operations [32]. Based on Eqs. (35)–(38), \( H_{4} \) is changed into

\[
H_{i4} = P_{5} \cdot [(64 \cdot Z_{i}) \oplus (36 \cdot Z_{i} + 11 \cdot Z_{i})] \cdot P_{6}. \tag{39}
\]

By the abovementioned discussions, the hardware modules of \( 4 \times 4, 8 \times 8, \) and \( 16 \times 16 \) inverse core transforms are shared to implement \( H_{4}, H_{8}, \) and \( H_{16} \) respectively [32]. By sharing the hardware of \( H_{4} \) in Eq. (39), the cost-effective design of the \( 8 \times 8, 16 \times 16, \) and \( 32 \times 32 \) inverse core transforms is obtained progressively. First, the hardware-sharing-based eight-point inverse transform is presented as

\[
H_{i8} = P_{8} \cdot [H_{i4} \oplus (18 \cdot Z_{i}) + (25 \cdot Z_{i})] \cdot P_{6}. \tag{40}
\]

Next, the hardware-sharing-based 16-point inverse transform is described as

\[
H_{i16} = P_{8} \cdot [H_{i8} \oplus (T_{r22} + T_{N2})] \cdot P_{6}. \tag{41}
\]

Finally, the hardware-sharing-based 32-point inverse transform is depicted as

\[
H_{i32} = P_{8} \cdot [H_{i16} \oplus (T_{M1} + T_{N1})] \cdot P_{6}. \tag{42}
\]

In this section, the hardware-sharing transform architecture cuts down the hardware cost because the same submodules and coefficients of the transforms are extracted to be shared. \textbf{Figure 1} illustrates the architecture of the hardware-sharing-based inverse core transform design for \( 4 \times 4/8 \times 8/16 \times 16/32 \times 32 \) transforms [32].

### 3.5. Architecture comparison

The proposed 1D inverse core transform in [32] involves four inputs to sustain \( 4 \times 4, 8 \times 8, 16 \times 16, \) and \( 32 \times 32 \) transform modes. Several multiplexers are utilized to acquire the transform outputs of the \( 32 \times 32 \) inverse core transform by the shared design of \( 4 \times 4, 8 \times 8, \) and \( 16 \times 16 \) inverse core transforms [32]. \textbf{Table 2} lists the number of adders and shifters needed to calculate four modes of the 1D inverse core transform for HEVC. The developed architecture in [32] does not require any multiplier, and the fixed-coefficient multiplications are replaced with
simple additions and shift operations. Table 3 shows the comparison of three 16-point inverse transform designs. Compared with the previous works in [29] and [31], the applied architecture contains fewer adders. However, several more shifters are required. Compared with the cost of adders, the shifters need lower hardware expense. Thus, the used architecture decreases the hardware cost more efficiently than previous transform schemes do.

Figure 1. The hardware-sharing-based inverse core transform structure for HEVC.

<table>
<thead>
<tr>
<th>Transform sizes</th>
<th>32 × 32</th>
<th>16 × 16</th>
<th>8 × 8</th>
<th>4 × 4</th>
</tr>
</thead>
<tbody>
<tr>
<td>No. of shifters</td>
<td>256</td>
<td>93</td>
<td>40</td>
<td>11</td>
</tr>
<tr>
<td>No. of adders</td>
<td>461</td>
<td>146</td>
<td>64</td>
<td>10</td>
</tr>
</tbody>
</table>

Table 2. The 1D inverse transform architecture at different transform modes [32].

<table>
<thead>
<tr>
<th>Designs</th>
<th>No. of shifters</th>
<th>No. of adders</th>
</tr>
</thead>
<tbody>
<tr>
<td>Ahmed [29]</td>
<td>132</td>
<td>232</td>
</tr>
<tr>
<td>Haggag [31]</td>
<td>58</td>
<td>242</td>
</tr>
<tr>
<td>Design in Section 3.2</td>
<td>93</td>
<td>146</td>
</tr>
</tbody>
</table>

Table 3. Hardware comparison of three 1D 16-point transform designs [32].

4. Case study [41]: multiple-standard multiple-mode transform design

4.1. Hardware-sharing design for 8 × 8 transforms mode

For H.264/AVC, the transform matrix is employed as a foundation matrix for the multistandard hardware-sharing scheme. Based on Eq. (3), the cost of the upper diagonal matrix in Eq. (43) is eight adders and two shifters.
\[ U_{4x4_{AVC}} = \begin{bmatrix} 8 & 8 & 0 & 4 \\ 8 & 4 & -4 & 8 \\ 8 & -4 & 8 & 8 \\ 8 & -8 & 8 & -4 \end{bmatrix} = 8 \cdot C_1 \cdot C_2, \tag{43} \]

where \( C_1 = \begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & -1 & 1 & 0 \\ 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & -1 \end{bmatrix} \) and \( C_2 = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & -0.5 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & 0.5 \end{bmatrix} \). For AVS, the upper diagonal matrix \( U_{4x4_{AVS}} \) in Eq. (44) costs 10 adders and four shifters.

\[ U_{4x4_{AVS}} = \begin{bmatrix} 8 & 10 & 8 & 4 \\ 8 & 4 & -8 & -10 \\ 8 & -4 & 10 & 8 \\ 8 & -10 & 8 & -4 \end{bmatrix} = 8 \cdot C_1 \cdot (C_2 + C_3), \tag{44} \]

where \( C_3 = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0.25 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} \). In Eq. (45), the upper diagonal matrix \( U_{4x4_{VC1}} \) for VC1 needs 14 adders and eight shifters.

\[ U_{4x4_{VC1}} = \begin{bmatrix} 12 & 16 & 12 & 6 \\ 12 & 6 & 12 & -16 \\ 12 & -6 & 12 & 16 \\ 12 & -16 & 12 & -6 \end{bmatrix} = 8 \cdot C_1 \cdot (C_4 + C_5 \cdot C_2), \tag{45} \]

where and \( C_4 = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0.5 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0.5 & 0 & 0 \end{bmatrix} \) and \( C_5 = \begin{bmatrix} 1.5 & 0 & 0 & 0 \\ 0 & 1.5 & 0 & 0 \\ 0 & 0 & 1.5 & 0 \\ 0 & 0 & 0 & 1.5 \end{bmatrix} \). For HEVC, the \( 8 \times 8 \) transform matrix is acquired by the AVS design in Eq. (44), and the design in Eq. (46) costs 16 adders and 12 shifters.

\[ U_{4x4_{HEVC}} = \begin{bmatrix} 64 & 83 & 64 & 36 \\ 64 & 36 & -64 & -83 \\ 64 & -36 & -64 & 83 \\ 64 & -83 & 64 & -36 \end{bmatrix} = 2 \cdot C_1 \cdot [32 \cdot (C_2 + C_1) - U_{42}], \tag{46} \]
where \( U_1 = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 2 & 0 & -1.5 \\ 0 & 0 & 0 & 0 \\ 0 & -1.5 & 0 & -2 \end{bmatrix} \). For MPEG-1/2/4, the upper diagonal matrix is factorized by

\[
U_{4 \times 4 \text{ MPEG}} = \begin{bmatrix} 362 & 473 & 362 & 196 \\ 362 & 196 & -362 & -473 \\ 362 & -196 & 362 & 473 \\ 362 & -473 & 362 & -196 \end{bmatrix} = C_1 \cdot \begin{bmatrix} 256 \cdot (C_4 + C_5) - (U_2 + U_3) \end{bmatrix}, \tag{47}
\]

where \( U_2 = \begin{bmatrix} 22 & 0 & 22 & 0 \\ 0 & 0 & 0 & 0 \\ 22 & 0 & -22 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} \) and \( U_3 = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 4 & 0 & 39 \\ 0 & 0 & 0 & 0 \\ 0 & 39 & 0 & -4 \end{bmatrix} \). In Eq. (47), the parameter “22” of \( U_2 \) is implemented by \((C_5 \cdot C_3 \ll 4) - (C_1 \ll 1)\), where “\(\ll\) 1” is left shifting one bit, and the cost in Eq. (47) requires 28 adders and 26 shifters.

By Eq. (3), on the other side, the down diagonal matrix \( D_{4 \times 4 \text{ AVC}} \) for H.264/AVC becomes Eq. (48), and it needs 17 adders and eight shifters.

\[
D_{4 \times 4 \text{ AVC}} = \begin{bmatrix} -3 & 6 & -10 & 12 \\ -6 & 12 & -3 & -10 \\ -10 & 3 & 12 & 6 \\ -12 & -10 & -6 & -3 \end{bmatrix} = 8 \cdot U_4 \cdot (D_4 + D_5) \cdot (D_4 + U_3), \tag{48}
\]

where \( U_4 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & -1 \end{bmatrix} \), \( D_4 = \begin{bmatrix} -1 & -1 & 1 & 0 \\ 1 & 0 & 1 & -1 \\ -1 & 1 & 0 & -1 \\ 0 & 1 & 1 & 1 \end{bmatrix} \), and \( D_5 = \begin{bmatrix} -0.5 & 0 & 0 & 0 \\ 0 & 0.5 & 0 & 0 \\ 0 & 0.5 & 0 & 0 \\ 0 & 0 & 0 & 0.5 \end{bmatrix} \).

For AVS, the \( D_{4 \times 4 \text{ AVS}} \) matrix becomes (49), and \( D_4 \) and \( D_5 \) are shared with the design in Eq. (48), and then \( U_2 \) and \( U_3 \) are also partially shared with the scheme in Eq. (48). In Eq. (49), it costs 24 adders and 12 shifters.
where \( u_3 = \begin{bmatrix} 0 & -1 & 0 & 0 \\ 0 & 0 & 0 & -1 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \), \( D_3 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ -1 & 0 & 0 & 0 \\ 0 & -1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \), and \( D_4 = \begin{bmatrix} 1.5 & 0 & 0 & 0 \\ 0 & 1.5 & 0 & 0 \\ 0 & 0 & -1.5 & 0 \\ 0 & 0 & 0 & 1.5 \end{bmatrix} \).

For VC-1, the \( D_{4 \times 4\text{-VC1}} \) matrix is factorized by Eq. (50), and the design requires 21 adders and 12 shifters

\[
D_{4 \times 4\text{-VC1}} = \begin{bmatrix} -4 & 9 & -15 & 16 \\ -16 & 9 & -15 & 16 \\ -15 & 4 & 9 & -16 \\ -16 & -15 & 9 & -4 \end{bmatrix} = 8 \cdot U_4 \cdot (D_4 \cdot D_2 + D_1) \cdot (D_2 + U_3),
\]

where \( D_6 = \begin{bmatrix} 1.5 & 0 & 0 & 0 \\ 0 & 1.5 & 0 & 0 \\ 0 & 0 & 1.5 & 0 \\ 0 & 0 & 0 & 1.5 \end{bmatrix} \).

For HEVC, the \( D_{4 \times 4\text{-HEVC}} \) matrix is expressed by Eq. (51), and it expends 44 adders and 20 shifters

\[
D_{4 \times 4\text{-HEVC}} = \begin{bmatrix} -18 & 50 & -75 & 89 \\ -50 & 89 & -18 & -75 \\ -75 & 18 & 89 & 50 \\ -89 & -75 & -50 & -18 \end{bmatrix} = D_{4 \times 4\text{-AVS}} \cdot 9 + [4 \cdot (U_4 \cdot (D_4 + U_3) - U_5)],
\]

where \( u_5 = \begin{bmatrix} 0 & 0 & -1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} \), \( u_6 = \begin{bmatrix} 0 & -1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & -1 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \), \( u_7 = \begin{bmatrix} 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ -1 & 0 & 0 & 0 \end{bmatrix} \).

For MPEG-1/2/4, based on \( D_{4 \times 4\text{-AVS}} \), the \( D_{4 \times 4\text{-MPEG}} \) matrix is presented by Eq. (52), and the design costs 48 adders and 32 shifters

\[
D_{4 \times 4\text{-MPEG}} = \begin{bmatrix} -2 & 6 & -9 & 10 \\ -6 & 10 & -2 & -9 \\ -9 & 2 & 10 & 6 \\ -10 & -9 & -6 & -2 \end{bmatrix} = 4 \cdot U_4 \cdot (D_4 + D_5) \cdot (D_5 + U_4),
\]
For AVS-M, the matrix $M_{4\times4_{AVS}}$ is presented by (53), and it spends 10 adders and six shifters

\[
M_{4\times4_{AVS}} = \begin{bmatrix} 2 & 3 & 2 & 1 \\ 2 & 1 & -2 & -3 \\ 2 & -1 & -2 & 3 \\ 2 & -3 & 2 & -1 \end{bmatrix} = C_1 \cdot (2 \cdot C_2 + U_B), \tag{53}
\]

where $U_B = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}$. For VC-1, $M_{4\times4_{VC1}}$ is expressed by Eq. (54), and the design requires 14 adders and 12 shifters

\[
M_{4\times4_{VC1}} = \begin{bmatrix} 17 & 22 & 17 & 10 \\ 17 & 10 & -17 & -22 \\ 17 & -10 & -17 & 22 \\ 17 & -22 & 17 & -10 \end{bmatrix} = C_1 \cdot (16 \cdot C_2 + U_B), \tag{54}
\]

where $U_B = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & -2 & 0 & 6 \\ 1 & 0 & -1 & 0 \\ 0 & 6 & 0 & 2 \end{bmatrix}$. For VP8, all coefficients in $4 \times 4$ transform matrix are multiplied by 128 to get integer values, and it costs 18 adders and 14 shifters

\[
M_{4\times4_{VP8}} = \begin{bmatrix} 128 & 167 & 128 & 70 \\ 128 & 70 & -128 & -167 \\ 128 & -70 & -128 & 167 \\ 128 & -167 & 128 & -70 \end{bmatrix} = C_1 \cdot (128 \cdot C_2 + U_B), \tag{55}
\]
where \( U_{10} = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & -6 & 0 & 39 \\ 0 & 0 & 0 & 0 \\ 0 & 39 & 0 & 6 \end{bmatrix} \). The matrix \( U_{4\times4,\text{AVC}}/8 \) equals the 4 × 4 inverse transform matrix in H.264/AVC. In addition, the matrix \( U_{4\times4,\text{HEVC}} \) equals the 4 × 4 inverse transform matrix in HEVC. Thus, several multiplexers are used to share the hardware between the submatrices to decrease hardware cost.

4.3. Architecture comparison

The applied hardware-sharing-based 1D multistandard inverse integer transform scheme has two inputs, which sustain 4 × 4 and 8 × 8 transform modes. The hardware blocks of processing the 4 × 4 inverse transforms are shared with that of the upper diagonal matrix \( U_{8\times8} \). Thus, several multiplexers are utilized for \( U_{8\times8} \) to compute the 4 × 4 inverse transforms without additional operations. For the multistandard applications, the hardware-sharing architecture of the fast 1D 4 × 4 and 8 × 8 inverse integer transforms is illustrated in [41]. The shifters are also realized by wiring. Compared with the individual designs without hardware shares, Table 4 depicts that the used scheme in [41] decreases the number of shifters and adders by 50 and 75%, respectively.

<table>
<thead>
<tr>
<th>Different 1D inverse integer transform modes</th>
<th>No. of adders</th>
<th>No. of shifters</th>
</tr>
</thead>
<tbody>
<tr>
<td>Individual designs without hardware shares</td>
<td>336</td>
<td>180</td>
</tr>
<tr>
<td>Hardware-sharing-based design in Section 4</td>
<td>82</td>
<td>90</td>
</tr>
<tr>
<td>Reduction of cost</td>
<td>75%</td>
<td>50%</td>
</tr>
</tbody>
</table>

Table 4. Hardware comparison between two architectures [41].

To implement the discussed architecture, a cell-based VLSI design flow is utilized to design, simulate, and verify the cost-effective hardware-sharing architecture. For fair comparisons among different transform structures, the normalized mode gain, which is required to normalize the gate counts, is described as follows: By matrix dimensions and without missing generality [40], the normalized mode gains defined for the 32 × 32, 16 × 16, 8 × 8, and 4 × 4 inverse integer transform matrices are 16, 4, 1, and 1/4, respectively.

The hardware-sharing-based design in Section 3 supports 4 × 4, 8 × 8, 16 × 16, and 32 × 32 inverse transform modes for HEVC. Thus, the normalized mode gain of the design is 21.25 (i.e., 16 + 4 + 1 + 0.25). Similarly, five 8 × 8 and five 4 × 4 inverse transform functions are provided by the hardware-shared design in Section 4. Therefore, the normalized mode gain is assigned by 6.25 (i.e., 5 + 1.25) [41]. Afterwards, the normalized gate counts are defined by [40, 41]

\[
\text{Normalized gate counts} = \frac{\text{Gatecounts}}{\text{Normalized mode gain}}
\] (56)
Table 5 shows the hardware cost comparisons among different 1D multiple transform architectures, which includes single-standard multiple-mode [32] and multiple-standard multiple-mode [41] transform designs.

<table>
<thead>
<tr>
<th>Architecture</th>
<th>Ahmed et al. [29]</th>
<th>Hardware-sharing based-design in Section 3</th>
<th>Shen et. al. [26]</th>
<th>Martuza et. al. [28]</th>
<th>Qi et al. [36]</th>
<th>Wang et al. [38]</th>
<th>Hardware-sharing-based design in Section 4</th>
</tr>
</thead>
<tbody>
<tr>
<td>Gate counts</td>
<td>144.8K</td>
<td>115.7 K</td>
<td>134.8 K</td>
<td>39.4 K</td>
<td>18 K</td>
<td>23.06 K</td>
<td>27.4 K</td>
</tr>
<tr>
<td>Normalized</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>mode gain</td>
<td>21.25</td>
<td>21.25</td>
<td>25.75</td>
<td>5</td>
<td>3.5</td>
<td>4.5</td>
<td>6.25</td>
</tr>
<tr>
<td>Normalized</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>gate counts</td>
<td>6.81 K</td>
<td>5.44 K</td>
<td>5.23 K</td>
<td>7.88 K</td>
<td>5.14 K</td>
<td>5.12 K</td>
<td>4.38 K</td>
</tr>
<tr>
<td>Supporting</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>modes</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Supporting</td>
<td>HEVC: 4 × 4, 8 × 8</td>
<td>H.264/AVC, VC-1</td>
<td>HEVC: 4 × 4, 8 × 8</td>
<td>H.264/AVC, VC-1</td>
<td>H.264/AVC, VC-1</td>
<td>H.264/AVC, VC-1</td>
<td>H.264/AVC, VC-1</td>
</tr>
<tr>
<td>standards/</td>
<td>16 × 16, 32 modes</td>
<td>8 × 16, 16, 32 modes</td>
<td>4 × 4, 8 × 8 modes</td>
<td>4 × 4, 8 × 8 modes</td>
<td>4 × 4, 8 × 8 modes</td>
<td>4 × 4, 8 × 8 modes</td>
<td>4 × 4, 8 × 8 modes</td>
</tr>
<tr>
<td>Transforms</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Table 5. Hardware cost comparisons among different 1D multiple transform architectures [32, 41].

5. Conclusion

For the single-standard multiple-mode transform design, this chapter discussed the 4 × 4, 8 × 8, 16 × 16, and 32 × 32 inverse core transforms in HEVC with a cost-effective and hardware-efficient design. By the symmetrical characteristics of the elements, the core transform matrices were factorized into several submatrices. Thus, the hardware of the \((N/2) \times (N/2)\) inverse core transform was shared with that of the \(N \times N\) inverse core transform for \(N = 32, 16,\) and 8.

Compared with the direct design without hardware shares, the applied transform scheme in Section 3 decreased the hardware cost of adders and shifters by 32 and 36%, respectively. Besides, for VLSI implementation, the design in Section 3 requires less normalized gate counts than the design does in [29].

For the multiple-standard multiple-mode transform design, this chapter also discussed the fast algorithm and hardware-sharing-based design of 4 × 4 and/or 8 × 8 inverse transforms among H.264/AVC, VC-1, HEVC, MPEG-1/2/4, AVS, and VP8 for multistandard video...
decoders. By only shifters and adders, the decomposition scheme of matrices was used to develop the hardware-shared scheme. The used structure in Section 4 decreased the number of shifters and adders by 50 and 75% more than the individual fast algorithm-based implementation did. Besides, for VLSI implementation, the design in Section 4 requires less normalized gate counts than the designs do in [26, 28, 36, 38].

Acknowledgements

This work was supported by Ministry of Science and Technology, Taiwan, R.O.C. under Grant MOST 105-2221-E-005-078.

Author details

Chih-Peng Fan
Address all correspondence to: cpfan@dragon.nchu.edu.tw

Department of Electrical Engineering, National Chung Hsing University, Taichung, Taiwan, ROC

References


