Carry Prediction and Selection for Truncated Multiplication

Romain Michard*, Arnaud Tisserand† and Nicolas Veyrat-Charvillon*
* Arénaire project, LIP (CNRS–ENSL–INRIA–UCBL)
Ecole Normale Supérieure de Lyon
46 allée d’Italie. F-69364 Lyon, France
{firstname.lastname}@ens-lyon.fr
† Arith group, LIRMM (CNRS–Univ. Montpellier II)
161 rue Ada. F-34392 Montpellier, France
arnaud.tisserand@lirmm.fr

Abstract—This paper presents an error compensation method for truncated multiplication. From two \( n \)-bit operands, the operator produces an \( n \)-bit product with small error compared to the \( 2n \)-bit exact product. The method is based on a logical computation followed by a simplification process. The filtering parameter used in the simplification process helps to control the trade-off between hardware cost and accuracy. The proposed truncated multiplication scheme has been synthesized on an FPGA platform. It gives a better accuracy over area ratio than previous well-known schemes such as the constant correcting and variable correcting truncation schemes (CCT and VCT).

I. INTRODUCTION

In many digital signal processing applications, fixed-point arithmetic is used. In order to avoid word-size growth, operators with \( n \)-bit input(s) must return an \( n \)-bit result. For multiplication, the \( 2n \)-bit result of an \( (n \times n) \)-bit product has to be set back to \( n \)-bit by dropping the \( n \) least significant bits through a reduction scheme (usually truncation or rounding). This is the purpose of truncated multipliers.

Truncated multiplication is used mainly for applications such as finite-impulse response (FIR) filtering and discrete cosine transform (DCT) operations. It can also be used to reduce the hardware cost of function evaluation [1].

This paper starts with the notation and presents the main methods used for truncated multiplication in Section II. In that section, we introduce a simple classification of truncated multiplication schemes. The proposed method, which is based on carry prediction and selection, is presented in Section III. Section IV presents the error analysis and the implementation results on FPGAs. It also presents a comparison with some existing solutions.

II. BACKGROUND

A. Notations

Figure 1 presents the partial product array (PPA) of an unsigned \((4 \times 4)\)-bit multiplication (see [2] for full-width multiplication algorithms). The partial product \( x_iy_j \) is often represented as a dot for compact notation (see Fig. 2).

We use fixed-point notation with \( n \) fractional bits (i.e., \( 0.x_1x_2x_3\ldots x_n \)) for the operands and the result. As shown in Figure 2, \( MP \) represents the \( n-1 \) most significant columns of the partial product array. \( MP \) corresponds to the \( n \) bits of the final truncated result. \( w_{lsb} \) is the weight of the least significant bit in the truncated result, i.e., \( w_{lsb} = 2^{-n} \). The least significant part of the PPA is denoted \( LP \), and we further distinguish its \( k \) most significant columns as \( LP_{major} \), and the remaining \( n-k \) columns as \( LP_{minor} \). In some schemes, the column in \( LP_{minor} \) with the highest weight (the left-most column in \( LP_{minor} \) in Figure 2) is used. It is referred to in the following as \( LP_{minor}^{(h)} \). We refer to a column in the PPA by its weight: for example, \( MP \) extends from column \( col_1 \) to column \( col_n \), i.e., from the column of weight \( 2^{-1} \) to the column of weight \( 2^{-n} = w_{lsb} \).

Function \( \text{trunc}_{n}(x) \) denotes \( x \) truncated to the \( n \)-th bit, and \( \text{round}_{n}(x) \) stands for \( x \) rounded to the \( n \)-th bit.

A truncated multiplication scheme computes the partial products in \( MP \), and adds an error compensation value (ECV) computed as a function of \( LP \). The result of a truncated multiplication is denoted

\[
P = \text{trunc}_{n}(MP + f(LP)).
\]

The most obvious way of performing a truncated multiplication is first to compute the exact \( 2n \) bits of the result, then round it to \( n \) bits. This full-width result is

\[
P_{FW} = \text{round}_{n}(MP + LP).
\]

While giving the smallest possible error, which is only due to the rounding, this method also requires the highest amount of hardware by computing all the partial products.

Since the sum bits in the 2n-bit full-width product are not all used, one is tempted to remove some low-weight columns in LP in order to diminish the hardware cost of the multiplier. However, by doing this, the carries in the low-weight part of the PPA are lost, thereby introducing an evaluation error.

Two kinds of error occur in truncated multiplication: the evaluation error \( E_{\text{eval}} \), which is due to the columns that are removed in LP, and the truncation error \( E_{\text{trunc}} \), which occurs when the computed value of the PPA is reduced to an \( n \)-bit value.

A direct-truncated multiplier computes only the \( n-1 \) most significant columns of the PPA. While minimizing the required amount of hardware, this approach does not take into account any of the carries propagating from LP, and leads to a maximal evaluation error. The result of a direct-truncated multiplier is

\[
P_{\text{DT}} = \text{trunc}_n (MP).
\]
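As a rough illustration of these two reference points, the sketch below models an unsigned \( n \)-bit multiplier on integers. It assumes that bit index 0 is the least significant bit, that one output lsb corresponds to \( 2^n \) in the \( 2n \)-bit integer product, and that rounding is done half up. The function names (`split_ppa`, `p_full_width`, `p_direct_truncated`, `error_stats`) are hypothetical and chosen for this example only.

```python
def split_ppa(x, y, n):
    """Return (MP, LP) as integers; partial product x_i*y_j has weight 2**(i+j)."""
    mp = lp = 0
    for i in range(n):
        for j in range(n):
            if (x >> i) & 1 and (y >> j) & 1:
                if i + j >= n:        # the n-1 most significant columns
                    mp += 1 << (i + j)
                else:                 # the n least significant columns
                    lp += 1 << (i + j)
    return mp, lp


def p_full_width(x, y, n):
    """round_n(MP + LP), rounding half up (the tie rule is an assumption)."""
    mp, lp = split_ppa(x, y, n)
    return (mp + lp + (1 << (n - 1))) >> n


def p_direct_truncated(x, y, n):
    """trunc_n(MP): every carry coming from LP is ignored."""
    mp, _ = split_ppa(x, y, n)
    return mp >> n


def error_stats(mult, n):
    """Exhaustive (average, maximum) absolute error, in units of the output lsb."""
    errs = [abs((mult(x, y, n) << n) - x * y) / 2 ** n
            for x in range(2 ** n) for y in range(2 ** n)]
    return sum(errs) / len(errs), max(errs)


if __name__ == "__main__":
    print("full-width      :", error_stats(p_full_width, 4))
    print("direct-truncated:", error_stats(p_direct_truncated, 4))
```

The exhaustive loop is only practical for small \( n \); it is meant to make the gap between the two extremes visible, not to reproduce the hardware.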

B. Classification of the Truncated Multiplication Schemes

The truncated multiplication problem is a trade-off between accuracy and hardware cost. At one extreme, there is the full-width multiplier with the best accuracy but the highest cost. At the other extreme, there is the direct-truncated multiplier with the lowest cost but the worst accuracy. Many truncated multiplication schemes have been proposed with intermediate trade-offs.

We propose a simple classification based on the complexity of the ECV computation scheme. We distinguish two main kinds of solutions: static ECV and dynamic ECV. Static ECV means that the correction value does not depend on the actual values of the operands (the value is fixed at design time). Dynamic ECV uses a correction value computed using the actual operands (at run-time). Obviously, dynamic ECV is more accurate but it requires larger circuit area.

In order to refine the classification, we add subgroups depending on the PPA part impacted by the truncated multiplication method. We propose the five groups defined below.

- Static ECV with \( C \): no partial product from LP is computed, the constant \( C \) is based on the LP part.
- Static ECV with \( LP_{\text{major}} + C \): all partial products from \( LP_{\text{major}} \) are computed, the constant \( C \) is based on the \( LP_{\text{minor}} \) part.
- Dynamic ECV with \( f(LP_{\text{major}}) \): all partial products from \( LP_{\text{major}} \) are computed and used to evaluate the correction due to LP.
- Dynamic ECV with \( LP_{\text{major}} + f(LP_{\text{minor}}) \): all partial products from \( LP_{\text{major}} \) are computed, some partial products from \( LP_{\text{minor}} \) (usually \( LP_{\text{minor}}^{(h)} \)) are used to evaluate the ECV.
- Dynamic ECV with LP: all partial products in LP are computed and used to compute the ECV.

Table I presents the classification of the previous methods and of our method according to these groups.

C. Static ECV: \( C \)

In [6], the expected value of \( C = \text{Sum}(LP) \) is estimated by assuming that each bit of the inputs has a probability 1/2 of being one. The probabilities of each carry and sum bit are evaluated using the logic properties of the half-adder and full-adder cells.

The direct-truncated multiplication scheme described previously also fits in this category with \( C = 0 \): LP is neither computed nor approximated.

D. Static ECV: \( LP_{\text{major}} + C \)

Truncated multiplication schemes with a static ECV approximate the error made by leaving out the low-weight columns \( LP_{\text{minor}} \) with a constant, which is computed either by exhaustive search on the input values or as a statistical evaluation of the expected value of \( LP_{\text{minor}} \). In order to improve the accuracy, the \( k \) columns of \( LP_{\text{major}} \) are used as an extension of MP. The resulting \((n+k)\)-bit value is then rounded or truncated to \( n \) bits.

In [3], every input bit is assumed to have a probability \( \frac{1}{2} \) of being one. Each partial product, being the AND of two input bits, therefore has an expected value \( \frac{1}{4} \). By adding these expected values over \( LP_{\text{minor}} \), Lim gets the expected value of the evaluation error:

\[
E_{\text{eval}} = -\frac{w_{\text{lsb}}}{4} \sum_{i=0}^{n-k-1} (i+1) \cdot 2^{i-n}.
\]

The multiplication result is:

\[
P = \text{round}_n (MP + LP_{\text{major}} + \text{round}_{n+k} (-E_{\text{eval}})),
\]

and the parameter \( k \) is chosen so as to give the evaluation error a variance lower than the variance \( w_{\text{lsb}}^2/12 \) of the rounding error, which is treated as random noise.
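The expected evaluation error can be checked numerically for small operand sizes. The following sketch, which is an illustration and not part of the cited method, evaluates the closed form in units of \( w_{\text{lsb}} \) and compares it with the exhaustive mean of \( LP_{\text{minor}} \). It assumes that bit index 0 is the least significant bit, so that \( LP_{\text{minor}} \) gathers the partial products \( x_i y_j \) with \( i + j < n - k \); the function names are hypothetical.

```python
from fractions import Fraction


def e_eval_formula(n, k):
    """Closed form of E_eval in units of w_lsb (negative: LP_minor is dropped)."""
    return -Fraction(1, 4) * sum(Fraction(i + 1, 2 ** (n - i)) for i in range(n - k))


def mean_lp_minor(n, k):
    """Exhaustive mean of LP_minor over all operand pairs, in units of w_lsb."""
    total = 0
    for x in range(2 ** n):
        for y in range(2 ** n):
            for i in range(n):
                for j in range(n):
                    if i + j < n - k and (x >> i) & 1 and (y >> j) & 1:
                        total += 1 << (i + j)
    # 4**n operand pairs; one output lsb is worth 2**n in integer weights.
    return Fraction(total, 4 ** n * 2 ** n)


if __name__ == "__main__":
    print(e_eval_formula(4, 1), mean_lp_minor(4, 1))
```

Exact rational arithmetic is used so that the two quantities can be compared without floating-point tolerance.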

This method is refined in the constant correction truncated (CCT) multiplication [7], where the error made by truncating the \((n+k)\)-bit result to an \( n \)-bit value is computed assuming for each result bit of the multiplication a probability \( \frac{1}{2} \) of being one. This gives:

\[ E_{\text{trunc}} = -\frac{w_{\text{lsb}}}{2} \sum_{i=-k}^{-1} 2^i = -\frac{w_{\text{lsb}}}{2} (1 - 2^{-k}). \]

The multiplication result is:

\[ P = \text{trunc}_n \left( MP + LP_{\text{major}} + \text{round}_{n+k} (-E_{\text{eval}} - E_{\text{trunc}}) \right). \]
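A minimal software model of a constant-correction multiplier can be sketched from the three formulas above. It is not the circuit of [7]: it assumes bit index 0 is the least significant bit, works in units of \( 2^{-(n+k)} \), and rounds the constant half up. The names `cct_constant`, `p_cct` and `average_error` are hypothetical.

```python
from fractions import Fraction
import math


def cct_constant(n, k):
    """round_{n+k}(-E_eval - E_trunc) as an integer multiple of 2**-(n+k)."""
    e_eval = -Fraction(1, 4) * sum(Fraction(i + 1, 2 ** (n - i)) for i in range(n - k))
    e_trunc = -Fraction(1, 2) * (1 - Fraction(1, 2 ** k))
    compensation = -e_eval - e_trunc                    # in units of w_lsb
    return math.floor(compensation * 2 ** k + Fraction(1, 2))  # half up (assumed)


def p_cct(x, y, n, k):
    """trunc_n(MP + LP_major + constant); LP_minor is never computed."""
    kept = 0                                            # in units of 2**-(n+k)
    for i in range(n):
        for j in range(n):
            if i + j >= n - k and (x >> i) & 1 and (y >> j) & 1:
                kept += 1 << (i + j - (n - k))
    return (kept + cct_constant(n, k)) >> k


def average_error(n, k):
    """Exhaustive average absolute error of p_cct, in units of w_lsb."""
    total = sum(abs((p_cct(x, y, n, k) << n) - x * y)
                for x in range(2 ** n) for y in range(2 ** n))
    return Fraction(total, 4 ** n * 2 ** n)


if __name__ == "__main__":
    print(cct_constant(6, 3), float(average_error(6, 3)))
```

Because the constant is fixed at design time, the only run-time work is the summation of the kept columns, which is what makes the static schemes cheap.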

E. Dynamic ECV: \( f(LP_{\text{major}}) \)

In order to further diminish the error, some schemes have been proposed where, instead of approximating the value of the partial products in \( LP_{\text{minor}} \) by a constant, it is expressed as a function of the partial products in \( LP_{\text{major}} \).

[4] gives an ECV for a modified Booth encoded PPA, where each line of partial products in \( LP_{\text{minor}} \) is estimated as a multiple of the corresponding partial product in \( LP_{\text{major}} \) (\( k = 1 \)). This results in a data-dependent ECV.

[8] presents another dynamic ECV for a modified Booth multiplier. For every possible combination of bits in the recoded operand, a corresponding expected value of \( LP_{\text{minor}} \) is computed by statistical analysis, and added to the exact value of \( LP_{\text{major}} \) (\( k = 1 \)). This gives for every possible value of the recoded operand an approximation of the carries propagated from \( LP \). A carry generation circuit is then computed using a Karnaugh map. For sizes larger than 12, the exhaustive simulation is replaced by statistical analysis.

F. Dynamic ECV: \( LP_{\text{major}} + f(LP_{\text{minor}}) \)

In [5], the ECV for a Baugh-Wooley array multiplier is computed in three parts. First \( LP_{\text{major}} \) is computed and summed. Then the partial products in \( LP_{\text{minor}}^{(h)} \) are computed, some of them are inverted, and all of them are summed. The pattern of inversions applied to the partial products in \( LP_{\text{minor}}^{(h)} \) is parametrized by an integer \( Q \). The sum is denoted \( \theta_{Q,k} \). Finally the expected value of \( (LP_{\text{minor}} - \theta_{Q,k}) \) is estimated, and added to \( LP_{\text{major}} + \theta_{Q,k} \). The best value of \( Q \) is obtained by exhaustive search. For \( n \geq 16 \), a statistical analysis can be performed.

The variable correction truncated (VCT) multiplication [9], [10] estimates the carries propagated from \( LP_{\text{minor}} \) by adding the partial products of \( LP_{\text{minor}}^{(h)} \) to the least significant column of \( LP_{\text{major}} \). This is equivalent to multiplying these partial products by two. An immediate consequence is that the ECV is minimal when the multiplication operands are minimal, and vice versa. The truncation error \( E_{\text{trunc}} \) is the same as the one defined in the CCT multiplication.

The multiplication result is:

\[ P = \text{trunc}_n \left( MP + LP_{\text{major}} + 2LP_{\text{minor}}^{(h)} + \text{round}_{n+k} (-E_{\text{eval}} - E_{\text{trunc}}) \right). \]

A hybrid correction truncation (HCT) multiplication [11] realizes a compromise between the CCT and VCT multiplications by only using a percentage \( p \) of the partial products in \( LP_{\text{minor}}^{(h)} \) for the ECV, and adding \( 1 - p \) of the evaluation error \( E_{\text{eval}} \), defined in the CCT multiplication. The truncation error \( E_{\text{trunc}} \) is also the same as in the CCT and VCT multiplications.

The multiplication result is:

\[ P = \text{trunc}_n \left( MP + LP_{\text{major}} + p \cdot 2LP_{\text{minor}}^{(h)} + \text{round}_n ((p - 1) \cdot E_{\text{eval}} - E_{\text{trunc}}) \right). \]

G. Dynamic ECV: \( LP_{\text{major}} + LP_{\text{minor}} \)

Only the full-width multiplier fits into this category: this is the case where all of \( LP \) is computed.

III. PROPOSED METHOD

In this work, a new data-dependent truncated multiplication scheme is introduced. It is named prediction-selection correcting truncated (PSCT) multiplication. It is proposed for direct non-recoded unsigned array multiplication.

In the CCT, VCT and HCT multiplication schemes, the carries propagated from \( LP_{\text{minor}} \) are estimated, either by statistical analysis or with the help of the partial products in column \( LP_{\text{minor}}^{(h)} \). It is then difficult to know what kind of error is made, and what additional terms might be introduced in order to improve accuracy.

Our approach tries to address this issue by first computing the exact values of every carry generated in \( LP_{\text{minor}} \), and then discarding the less probable ones. This scheme simplifies the computation of the ECV and lowers the associated hardware cost, while keeping track of the error made by removing those products.

A. Carry Prediction

Consider a complete PPA such as the one used for the full-width multiplication in Figure 2. Since the \( n - k \) least significant bits of the result are discarded, the corresponding sum bits of \( LP_{\text{minor}} \) do not have to be computed. But if \( LP_{\text{minor}} \) is not implemented at all, all the carries generated there are lost, leading to the evaluation error described in Eq. (1).

**TABLE I**

**Classification of the ECV methods used in the literature.**

<table>
<thead>
<tr>
<th colspan="2">Static ECV</th>
<th colspan="3">Dynamic ECV</th>
</tr>
<tr>
<th>\( C \)</th>
<th>\( LP_{\text{major}} + C \)</th>
<th>\( f(LP_{\text{major}}) \)</th>
<th>\( LP_{\text{major}} + f(LP_{\text{minor}}) \)</th>
<th>\( LP \)</th>
</tr>
</thead>
<tbody>
<tr>
<td>direct-truncated, [6]</td>
<td>[7]</td>
<td>[4], [8]</td>
<td>[9], [10], our method</td>
<td>full-width</td>
</tr>
</tbody>
</table>

In order to keep the evaluation error low while removing unnecessary hardware, only the logic formulas of the carries generated in $LP_{\text{minor}}$ have to actually be implemented. These expressions are obtained by replacing the full-adder and half-adder cells in $LP_{\text{minor}}$ by their respective logic definitions:

\[
\begin{aligned}
\text{carry}_{FA}(a, b, c) &= ab \lor ac \lor bc \\
\text{sum}_{FA}(a, b, c) &= a \oplus b \oplus c \\
\text{carry}_{HA}(a, b) &= ab \\
\text{sum}_{HA}(a, b) &= a \oplus b
\end{aligned}
\]

where $ab$ means “a and b”, $a \lor b$ is “a or b” and $a \oplus b$ stands for “a exclusive-or b”.
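These four definitions translate directly into bitwise operations. The short sketch below, with hypothetical function names, writes them out on single bits and checks the arithmetic identity that makes them adder cells: twice the carry plus the sum equals the number of input bits set.

```python
from itertools import product


def carry_fa(a, b, c):
    """Full-adder carry: ab OR ac OR bc."""
    return (a & b) | (a & c) | (b & c)


def sum_fa(a, b, c):
    """Full-adder sum: a XOR b XOR c."""
    return a ^ b ^ c


def carry_ha(a, b):
    """Half-adder carry: ab."""
    return a & b


def sum_ha(a, b):
    """Half-adder sum: a XOR b."""
    return a ^ b


def cells_are_consistent():
    """Check 2*carry + sum == number of ones, for every input combination."""
    fa_ok = all(2 * carry_fa(a, b, c) + sum_fa(a, b, c) == a + b + c
                for a, b, c in product((0, 1), repeat=3))
    ha_ok = all(2 * carry_ha(a, b) + sum_ha(a, b) == a + b
                for a, b in product((0, 1), repeat=2))
    return fa_ok and ha_ok
```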

It is possible to express the logic formulas of the carries through $LP_{\text{minor}}$. Once implemented, the carries generated in $LP_{\text{minor}}$ are known exactly.

Figure 3 shows the carry prediction steps for a 4-bit multiplication with one column in $LP_{\text{major}}$ (a).

- (b) The first step computes the carries generated in the least significant column of $LP_{\text{minor}}$, $col_8$. Since there is only one partial product there, no carry can be generated.
- (c) The carry generated in $col_7$ is added to $col_6$. This carry is expressed logically as $\text{carry}_{HA}(x_0 y_1, x_1 y_0) = x_0 x_1 y_0 y_1$.
- (d) The first carry in $col_6$, $A(x, y) = \text{carry}_{FA}(x_0 y_2, x_1 y_1, x_2 y_0)$ is similarly added to $col_5$, and the corresponding sum bit $B(x, y) = \text{sum}_{FA}(x_0 y_2, x_1 y_1, x_2 y_0)$ replaces the three partial products in $col_6$.

- (e) Finally $x_0 x_1 y_0 y_1$ and $B(x, y)$ are removed and their carry $C(x, y)$ is added to $col_5$, in $LP_{\text{major}}$.
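Steps (b) to (e) can be replayed in software for the 4-bit example. The sketch below is an illustration under the assumption that $x_0$ and $y_0$ are the least significant bits, so that $col_8$, $col_7$ and $col_6$ carry the integer weights 1, 2 and 4. It computes $A(x, y)$ and $C(x, y)$ from the logic formulas and lets one compare them with the exact carry leaving $LP_{\text{minor}}$. The function names are hypothetical.

```python
def predicted_carries(x, y):
    """Return (A, C), the two carries sent from LP_minor into col_5 (n = 4, k = 1)."""
    xb = [(x >> i) & 1 for i in range(4)]
    yb = [(y >> i) & 1 for i in range(4)]
    c7 = (xb[0] & yb[1]) & (xb[1] & yb[0])               # half-adder carry out of col_7
    p = (xb[0] & yb[2], xb[1] & yb[1], xb[2] & yb[0])    # partial products of col_6
    a = (p[0] & p[1]) | (p[0] & p[2]) | (p[1] & p[2])    # A = carry_FA of col_6
    b = p[0] ^ p[1] ^ p[2]                               # B = sum_FA of col_6
    c = c7 & b                                           # C = carry_HA(col_7 carry, B)
    return a, c


def lp_minor(x, y):
    """Exact integer value of the three LP_minor columns (col_8, col_7, col_6)."""
    total = 0
    for i in range(4):
        for j in range(4):
            if i + j <= 2:
                total += (((x >> i) & 1) & ((y >> j) & 1)) << (i + j)
    return total


def prediction_is_exact():
    """A + C must equal the carry out of LP_minor, i.e. LP_minor // 8."""
    return all(sum(predicted_carries(x, y)) == (lp_minor(x, y) >> 3)
               for x in range(16) for y in range(16))
```

The check covers all 256 operand pairs, so it confirms for this small case that no carry is lost by keeping only the carry formulas.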

We denote by $LP'_{\text{major}}$ and $LP'_{\text{minor}}$ the parts of the PPA corresponding to $LP_{\text{major}}$ and $LP_{\text{minor}}$ obtained at the end of the prediction process.

All the partial products left in $LP'_{\text{minor}}$ are sum bits, which do not need to be computed, since every carry originally generated in $LP_{\text{minor}}$ is now computed in $LP'_{\text{major}}$.

After the carry prediction process, $LP'_{\text{minor}}$ contains only one partial product in each column. Assuming that each result bit of the multiplication has a probability $\frac{1}{2}$ of being one, the expected value of the evaluation error $E_{\text{eval}}$ is lower than $2^{-k-1}w_{\text{lsb}}$, so that round$_{n+k}(-E_{\text{eval}}) = 0$.
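The bound can be made explicit with a short calculation. Assuming, as stated, that $LP'_{\text{minor}}$ holds one sum bit in each of its $n-k$ columns (weights $2^{-j} w_{\text{lsb}}$ for $j = k+1, \ldots, n$) and that each bit is one with probability $\frac{1}{2}$:

\[
|E_{\text{eval}}| = \frac{1}{2} \sum_{j=k+1}^{n} 2^{-j}\, w_{\text{lsb}} = \frac{1}{2} \left( 2^{-k} - 2^{-n} \right) w_{\text{lsb}} < 2^{-k-1}\, w_{\text{lsb}}.
\]

Since $2^{-k-1} w_{\text{lsb}}$ is half the weight of the last bit of an $(n+k)$-bit value, the compensation term rounds to zero.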

We can replace round$_{n+k}(LP')$ by $LP'_{\text{major}}$:

\[
P_{PS} = \text{round}_n \left( MP + LP'_{\text{major}} \right) = \text{round}_n \left( MP + \text{round}_{n+k}( LP_{\text{major}} + LP_{\text{minor}} ) \right) = P_{FW}
\]

At this point, the truncated operator we obtain gives the same result as the full-width multiplication.
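This equivalence can be checked on integers. In the sketch below, $MP + LP'_{\text{major}}$ is modelled as the exact product with its $n-k$ lowest bits cleared, which is what remains once the sum bits of $LP'_{\text{minor}}$ are dropped but all carries are kept. Rounding half up and the function names are assumptions of this example. Under these assumptions the two results agree for every $k \geq 1$, whereas $k = 0$ leaves no column below the output lsb to round from.

```python
def p_full_width(x, y, n):
    """round_n of the exact 2n-bit product (round half up, assumed)."""
    return (x * y + (1 << (n - 1))) >> n


def p_predicted(x, y, n, k):
    """round_n(MP + LP'_major): exact carries kept, LP'_minor sum bits dropped."""
    shift = n - k
    kept = ((x * y) >> shift) << shift
    return (kept + (1 << (n - 1))) >> n


def all_equal(n, k):
    """True when both multipliers agree on every pair of n-bit operands."""
    return all(p_predicted(x, y, n, k) == p_full_width(x, y, n)
               for x in range(2 ** n) for y in range(2 ** n))


if __name__ == "__main__":
    for k in range(0, 4):
        print(k, all_equal(6, k))
```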

B. Carry Selection

So far, the evaluation error is lower than $2^{-k-1}w_{\text{lsb}}$, but the partial products in $LP'_{\text{major}}$ become very complicated as $n$ grows, and a large hardware cost may result. In order to reduce the area requirements of the truncated multiplier, some carries have to be simplified. For that purpose, the logic formulas are written in their simplified disjunctive normal form, and a “filter” is applied on the last column in $LP'_{\text{major}}$.

This consists in removing any conjunction whose number of variables exceeds a given threshold $t$. Assuming that the input bits have a probability $\frac{1}{2}$ of being one, a conjunction of $t+1$ variables will have a probability $2^{-t-1}$ of being one. Applying the filter, one makes an error of about $m \times 2^{-t-1}$ times the weight of the filtered column, i.e. $m \times 2^{-t-1-k}\, w_{\text{lsb}}$, where $m$ is the number of conjunctions that were removed in the process. The evaluation error $E_{\text{eval}}$ is updated in accordance with this value. The multiplication result is:

\[
P_{PS} = \text{round}_n \left( MP + LP'_{\text{major}} + \text{round}_n(-E_{\text{eval}}) \right)
\]

Let us consider the previous example (Figure 3) of an $n$-bit truncated multiplication with $n = 4$ and $k = 1$. The only column in $LP'_{\text{major}}$ contains $x_3 y_0, \cdots, x_0 y_3$, $A(x, y)$ and $C(x, y)$, where $C(x, y) = x_0 x_1 x_2 y_0 y_1 y_2 \lor x_0 x_1 \overline{x_2}\, y_0 y_1 \overline{y_2}$ in its disjunctive normal form.

If no threshold is imposed, the truncated multiplication is equivalent to the ideal rounded multiplication.

For a value of the threshold $t = 4$, both conjunctions in $C(x, y)$ are removed, and the evaluation error is increased by the probability of $C(x, y)$ being one.

If the restriction on $t$ is set down to 2, the three terms constituting $A(x, y)$ will also disappear. The evaluation error is further increased by the probability of $A(x, y)$ being one, and the truncated multiplier is now equivalent to a CCT multiplier.
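The filter on this example can be sketched as follows. Each conjunction is represented as a dictionary mapping a variable name to the bit value it requires (0 for a complemented variable); this data layout and the function names are assumptions made for the illustration. The conjunctions of $A(x, y)$ are the three products of the full-adder carry, and those of $C(x, y)$ follow from $C = x_0 x_1 y_0 y_1 \cdot B(x, y)$.

```python
from fractions import Fraction
from itertools import product

# Two-variable partial products already present in the column of LP_major.
PP_TERMS = [{"x3": 1, "y0": 1}, {"x2": 1, "y1": 1}, {"x1": 1, "y2": 1}, {"x0": 1, "y3": 1}]
# A(x, y) = carry_FA(x0y2, x1y1, x2y0): three conjunctions of four variables.
A_TERMS = [
    {"x0": 1, "y2": 1, "x1": 1, "y1": 1},
    {"x0": 1, "y2": 1, "x2": 1, "y0": 1},
    {"x1": 1, "y1": 1, "x2": 1, "y0": 1},
]
# C(x, y): two conjunctions of six variables (x2 and y2 complemented in the second).
C_TERMS = [
    {"x0": 1, "x1": 1, "x2": 1, "y0": 1, "y1": 1, "y2": 1},
    {"x0": 1, "x1": 1, "x2": 0, "y0": 1, "y1": 1, "y2": 0},
]


def apply_filter(terms, t):
    """Keep only the conjunctions with at most t variables."""
    return [term for term in terms if len(term) <= t]


def probability(terms):
    """Exact probability that the disjunction of `terms` is one (uniform inputs)."""
    names = sorted({v for term in terms for v in term})
    hits = 0
    for bits in product((0, 1), repeat=len(names)):
        env = dict(zip(names, bits))
        hits += any(all(env[v] == val for v, val in term.items()) for term in terms)
    return Fraction(hits, 2 ** len(names))


if __name__ == "__main__":
    column = PP_TERMS + A_TERMS + C_TERMS
    for t in (6, 4, 2):
        print(t, len(apply_filter(column, t)))
    print(probability(A_TERMS), probability(C_TERMS))
```

The exact probabilities computed this way can be compared with the $m \times 2^{-t-1}$ estimate, which treats every removed conjunction as independent and as having exactly $t+1$ variables.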

IV. RESULTS

A. Mathematical Error

For each studied multiplier scheme, the absolute bias $|\beta|$, average absolute error $\varepsilon_{avg}$, standard deviation $\sigma$ and absolute maximum error $\varepsilon_{max}$ are given in units of the output lsb, $w_{lsb} = 2^{-n}$. They are computed exhaustively for $n \leq 8$, and using extensive random sampling for larger values of $n$.
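A measurement loop of this kind might look like the sketch below. It is not the authors' tool: the function names, the sample count and the seed are hypothetical, and the standard deviation is taken over the absolute error, which is an assumption since the text does not say which error it refers to.

```python
import random
import statistics


def error_metrics(mult, n, samples=None, seed=0):
    """Return (|bias|, average |error|, std of |error|, max |error|) in output lsb.

    All operand pairs are visited when `samples` is None; otherwise `samples`
    pseudo-random pairs are drawn.
    """
    if samples is None:
        pairs = ((x, y) for x in range(2 ** n) for y in range(2 ** n))
    else:
        rng = random.Random(seed)
        pairs = ((rng.randrange(2 ** n), rng.randrange(2 ** n)) for _ in range(samples))
    signed = [((mult(x, y) << n) - x * y) / 2 ** n for x, y in pairs]
    absolute = [abs(e) for e in signed]
    return (abs(statistics.fmean(signed)), statistics.fmean(absolute),
            statistics.pstdev(absolute), max(absolute))


def ideal_rounded(n):
    """Reference multiplier: exact product rounded half up to n bits (assumed tie rule)."""
    return lambda x, y: (x * y + (1 << (n - 1))) >> n


if __name__ == "__main__":
    print(error_metrics(ideal_rounded(8), 8))                  # exhaustive
    print(error_metrics(ideal_rounded(12), 12, samples=5000))  # sampled
```

Any multiplier model with the signature `mult(x, y)` returning the $n$-bit result as an integer can be passed in place of `ideal_rounded(n)`.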

The mathematical data for the previous example is given in Table II. We can see that, by acting on the value of the threshold $t$, we realize a compromise between the full-width multiplier and the CCT multiplier.

B. Synthesis

We studied the implementation of truncated multiplication schemes on FPGAs. The CAD tool used was Xilinx ISE 8.1i and the target was an FPGA of the Spartan 3 family (XC3S200) with a medium speed grade (-5). Synthesis and place-and-route were area-oriented with a standard effort. The multipliers were implemented using LUTs (not hardware block multipliers).

The Xilinx devices are optimized for functions of 4 up to 8 input bits. This allows us to perform an efficient implementation of PSCT multipliers with a threshold up to $t = 8$. We implemented the PSCT multipliers for $t$ ranging from 3 to 8. A PSCT multiplier where $t = 2$ is equivalent to a CCT multiplier with the same value of the parameter $k$.

C. Comparisons

The comparisons are made against some well-known truncation schemes for direct multiplication, namely the CCT [7] and VCT [9] multiplications. The full-width multiplier and direct-truncated multiplier are used as references.

The comparisons were done for $n = 8, 12$ and 16. Our method is not yet suitable for higher values of $n$, because of the fast-growing computational cost of the prediction process.

Figure 4 shows how the different schemes behave for $n = 8, 12$ and 16 from top to bottom. The X-axis gives the average absolute error, which is our principal accuracy criterion. The Y-axis gives the hardware cost relative to the full-width multiplier. The aim is to achieve good accuracy while minimizing the hardware cost. This corresponds to the lower left part of each graph.

For $n = 8$, the CCT is outperformed by the PSCT for $t = 3$, 4 and 5. One can compute with the same average accuracy as the CCT with smaller PSCT multipliers. Similarly, for $n = 12$, the PSCT for $t = 3$ requires less hardware to provide the same average accuracy as the CCT. For $n = 16$, the two schemes are equivalent.

Tables III, IV and V show accuracy results for the truncated multiplication schemes. If one wants an average absolute error as small as possible, that is, close to 0.25, the PSCT multiplication has a lower hardware cost than the other truncated multiplication methods.

V. CONCLUSION

We presented a new truncated multiplication scheme. The method first computes the logic expression of the carries propagated from $LP_{\text{minor}}$, then performs simplifications while keeping control over the introduced error. This scheme achieves an improvement both for accuracy and hardware requirements over previous schemes. The proposed method has been implemented on FPGAs; it shows an area reduction for comparable accuracy on 8 and 12-bit multipliers.

In the near future we plan to improve the speed of our method in order to deal with larger multipliers. We also plan to study the effects of different groupings of the partial products during the carry prediction phase, which should lead to accuracy and hardware cost improvements.

**REFERENCES**


**TABLE III**

<table>
<thead>
<tr>
<th>Multiplication scheme</th>
<th>$\varepsilon_{\text{avg}}$</th>
<th>$\sigma$</th>
<th>$\varepsilon_{\text{max}}$</th>
<th>area</th>
<th>delay</th>
</tr>
</thead>
<tbody>
<tr>
<td>Ideal rounded</td>
<td>2.49e-1</td>
<td>1.46e-1</td>
<td>5.00e-1</td>
<td>48</td>
<td>10.3</td>
</tr>
<tr>
<td>Direct-truncated</td>
<td>1.75</td>
<td>9.76e-1</td>
<td>7.00</td>
<td>32</td>
<td>8.9</td>
</tr>
<tr>
<td>CCT $k = 4$</td>
<td>2.52e-1</td>
<td>1.51e-1</td>
<td>5.66e-1</td>
<td>47</td>
<td>10.8</td>
</tr>
<tr>
<td>PSCT $k = 4$, $t = 3$</td>
<td>2.50e-1</td>
<td>1.48e-1</td>
<td>6.29e-1</td>
<td>45</td>
<td>10.2</td>
</tr>
<tr>
<td>PSCT $k = 4$, $t = 5$</td>
<td>2.49e-1</td>
<td>1.47e-1</td>
<td>5.51e-1</td>
<td>46</td>
<td>11.7</td>
</tr>
</tbody>
</table>

**TABLE IV**

<table>
<thead>
<tr>
<th>Multiplication scheme</th>
<th>$\varepsilon_{\text{avg}}$</th>
<th>$\sigma$</th>
<th>$\varepsilon_{\text{max}}$</th>
<th>area</th>
<th>delay</th>
</tr>
</thead>
<tbody>
<tr>
<td>Ideal rounded</td>
<td>2.49e-1</td>
<td>1.45e-1</td>
<td>5.00e-1</td>
<td>93</td>
<td>12.7</td>
</tr>
<tr>
<td>Direct-truncated</td>
<td>2.73</td>
<td>1.24</td>
<td>9.00</td>
<td>53</td>
<td>11.5</td>
</tr>
<tr>
<td>CCT $k = 5$</td>
<td>2.51e-1</td>
<td>1.48e-1</td>
<td>5.71e-1</td>
<td>85</td>
<td>12.3</td>
</tr>
<tr>
<td>VCT $k = 4$</td>
<td>2.52e-1</td>
<td>1.49e-1</td>
<td>6.03e-1</td>
<td>93</td>
<td>14.6</td>
</tr>
<tr>
<td>VCT $k = 5$</td>
<td>2.50e-1</td>
<td>1.46e-1</td>
<td>5.94e-1</td>
<td>88</td>
<td>14.3</td>
</tr>
<tr>
<td>PSCT $k = 5$, $t = 3$</td>
<td>2.51e-1</td>
<td>1.48e-1</td>
<td>6.56e-1</td>
<td>84</td>
<td>12.3</td>
</tr>
<tr>
<td>PSCT $k = 5$, $t = 4$</td>
<td>2.50e-1</td>
<td>1.46e-1</td>
<td>5.94e-1</td>
<td>88</td>
<td>14.3</td>
</tr>
<tr>
<td>PSCT $k = 5$, $t = 5$</td>
<td>2.49e-1</td>
<td>1.46e-1</td>
<td>5.78e-1</td>
<td>91</td>
<td>14.3</td>
</tr>
</tbody>
</table>

**TABLE V**

<table>
<thead>
<tr>
<th>Multiplication scheme</th>
<th>$\varepsilon_{\text{avg}}$</th>
<th>$\sigma$</th>
<th>$\varepsilon_{\text{max}}$</th>
<th>area</th>
<th>delay</th>
</tr>
</thead>
<tbody>
<tr>
<td>Ideal rounded</td>
<td>2.49e-1</td>
<td>1.45e-1</td>
<td>5.00e-1</td>
<td>162</td>
<td>15.2</td>
</tr>
<tr>
<td>Direct-truncated</td>
<td>3.76</td>
<td>1.48</td>
<td>12.5</td>
<td>95</td>
<td>13.2</td>
</tr>
<tr>
<td>CCT $k = 6$</td>
<td>2.51e-1</td>
<td>1.47e-1</td>
<td>5.62e-1</td>
<td>140</td>
<td>15.0</td>
</tr>
<tr>
<td>VCT $k = 4$</td>
<td>2.52e-1</td>
<td>1.50e-1</td>
<td>6.36e-1</td>
<td>149</td>
<td>15.3</td>
</tr>
<tr>
<td>VCT $k = 5$</td>
<td>2.50e-1</td>
<td>1.46e-1</td>
<td>5.59e-1</td>
<td>153</td>
<td>15.3</td>
</tr>
<tr>
<td>PSCT $k = 5$, $t = 4$</td>
<td>2.52e-1</td>
<td>1.50e-1</td>
<td>6.41e-1</td>
<td>140</td>
<td>16.0</td>
</tr>
<tr>
<td>PSCT $k = 5$, $t = 6$</td>
<td>2.51e-1</td>
<td>1.49e-1</td>
<td>6.26e-1</td>
<td>142</td>
<td>18.2</td>
</tr>
</tbody>
</table>
