1. Introduction
Discrete convolution is found in many applications in science and engineering. Above all, it plays a key role in modern digital signal and image processing. In digital signal processing, it is the basis of filtering, multiresolution decomposition, and the optimization of orthogonal transform computations [1,2,3,4,5,6,7,8,9,10]. In digital image processing, convolution is a basic operation in denoising, smoothing, edge detection, blurring, focusing, etc. [11,12,13]. There are two types of discrete convolution: cyclic convolution and linear convolution. General principles for the synthesis of convolution algorithms were described in [1,2,3]. These works focused primarily on the calculation of cyclic convolution, while many digital signal and image processing applications require the calculation of linear convolutions.
In recent years, convolution has found unusually wide application in neural networks and deep learning. Among the various kinds of deep neural networks, convolutional neural networks (CNNs) are the most widely used [14,15,16,17]. In CNNs, linear convolutions are the most computationally intensive operations: in a typical implementation, their repeated computation occupies more than 90% of the CNN execution time [14]. A single convolutional layer in a typical CNN requires more than two thousand multiplications and additions, and a CNN usually contains several such layers. That is why developers of such networks seek and design efficient ways of implementing linear convolution using the smallest possible number of arithmetic operations.
To speed up linear convolution computation, various algorithmic methods have been proposed. The most common approach to calculating a linear convolution efficiently is to embed it into a cyclic convolution of double length and then apply a fast Fourier transform (FFT) algorithm [15,16,17]. The FFT-based linear convolution method is traditionally used for long finite impulse response (FIR) filters; modern CNNs, however, predominantly use short FIR filters. In this situation, the most effective algorithms for computing a small-length linear convolution are the Winograd-like minimal filtering algorithms [18,19,20], which are currently the most widely used. These algorithms compute linear convolution over small tiles with minimal complexity, which makes them effective for small filters and small batch sizes. However, they do not calculate the whole convolution: they compute only two inner products of neighboring vectors formed from the current data stream by a moving time window of length N, and therefore do not produce the true linear convolution.
At the same time, there are a number of CNNs in which it is necessary to calculate full-size small-length linear convolutions. In addition, in many applications of digital signal processing, there is the problem of calculating a one-dimensional convolution using its conversion into a multidimensional convolution. The algorithm thus obtained has a modular structure, and each module calculates a short-length one-dimensional convolution [21].
The most popular lengths of sequences being convolved are 2, 3, 4, 5, 6, 7, and 8. However, in the papers known to the authors, there is no description of resource-efficient algorithms for calculating linear convolutions of lengths greater than four [1,4,6,21,22]. In turn, the solutions given in the literature for N = 2, N = 3, and N = 4 do not give a complete picture of how the linear convolution computation is organized, since their corresponding signal flow graphs are not presented anywhere. In this paper, we describe a complete set of solutions for the linear convolution of small-length sequences, for N from 2 to 8.
2. Preliminaries
Let $\{h_m\}$, $m=0,1,\dots,M-1$, and $\{x_n\}$, $n=0,1,\dots,N-1$, be two finite sequences of lengths M and N, respectively. Their linear convolution is the sequence $\{y_i\}$, $i=0,1,\dots,M+N-2$, defined by [1]:
$$y_i=\sum_{n=0}^{N-1}h_{i-n}\,x_n, \qquad (1)$$
where we take $h_{i-n}=0$ if $i-n<0$ or $i-n>M-1$.
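To make definition (1) concrete, here is a minimal Python sketch (the function name is ours) that evaluates the sum directly, with out-of-range filter taps treated as zero:

```python
def linear_convolution(h, x):
    """Direct evaluation of y_i = sum_n h[i-n] * x[n], i = 0..M+N-2 (Eq. 1)."""
    M, N = len(h), len(x)
    y = []
    for i in range(M + N - 1):
        acc = 0
        for n in range(N):
            if 0 <= i - n < M:          # h is zero outside 0..M-1
                acc += h[i - n] * x[n]
        y.append(acc)
    return y

# e.g., linear_convolution([1, 2], [3, 4, 5]) == [3, 10, 13, 10]
```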
As a rule, the elements of one of the sequences to be convolved are constants. For definiteness, we assume that it is the sequence $\{h_m\}$.
Because the sequences $\{x_n\}$ and $\{h_m\}$ have finite length, their linear convolution (1) can also be implemented as a matrix-vector multiplication:
$$Y_{(N+M-1)\times 1}=H_{(N+M-1)\times N}\,X_{N\times 1}, \qquad (3)$$
where
$$H_{(N+M-1)\times N}=\begin{bmatrix}h_0&&&\\h_1&h_0&&\\\vdots&h_1&\ddots&\\h_{M-1}&\vdots&\ddots&h_0\\&h_{M-1}&\ddots&h_1\\&&\ddots&\vdots\\&&&h_{M-1}\end{bmatrix},$$
$$X_{N\times 1}=[x_0,x_1,\dots,x_{N-1}]^T,\quad Y_{(N+M-1)\times 1}=[y_0,y_1,\dots,y_{N+M-2}]^T,\quad H_{M\times 1}=[h_0,h_1,\dots,h_{M-1}]^T.$$
In what follows, we assume that $X_{N\times1}$ is the vector of input data, $Y_{(N+M-1)\times1}$ is the vector of output data, and $H_{M\times1}$ is the vector containing the constants.
Direct computation of (3) takes MN multiplications and (M − 1)(N − 1) additions. This means that a fully parallel hardware implementation of the linear convolution operation requires MN multipliers and N + M − 3 multi-input adders, whose numbers of inputs depend on the lengths of the sequences being convolved. Traditionally, the convolution with M = N is taken as the basic linear convolution operation. Resource-effective cyclic convolution algorithms for benchmark lengths (N = 2, 3, ..., 16) have long been published [1,2,3,4,5,6,7,8,9]. For linear convolution, optimized algorithms are described only for the cases N = 2, 3, 4 [4,6,21,22]. Below we show how to reduce the implementation complexity of some benchmark-length linear convolutions in the case of their completely parallel hardware implementation. For completeness, we also consider algorithms for sequences of lengths M = N = 2, 3, and 4.
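The matrix-vector form (3) is easy to check numerically; the sketch below (helper name ours) builds $H_{(N+M-1)\times N}$ and compares the product against NumPy's reference convolution:

```python
import numpy as np

def conv_matrix(h, N):
    """(N+M-1) x N linear convolution matrix of Eq. (3): H[i, n] = h[i-n]."""
    M = len(h)
    H = np.zeros((N + M - 1, N))
    for i in range(N + M - 1):
        for n in range(N):
            if 0 <= i - n < M:
                H[i, n] = h[i - n]
    return H

h = np.array([1.0, 2.0, 3.0])
x = np.array([4.0, 5.0, 6.0])
assert np.allclose(conv_matrix(h, len(x)) @ x, np.convolve(h, x))
# Direct computation costs M*N multiplications, here 9.
```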
So, considering the above, the goal of this article is to develop and describe fully parallel resource-efficient algorithms for N = 2, 3, 4, 5, 6, 7, and 8.
3. Algorithms for Short-Length Linear Convolution
The main idea of the presented approach is to decompose the linear convolution matrix into a circulant matrix and two triangular Toeplitz matrices. Then we can rewrite (3) in the following form:
$$Y_{(2N-1)\times 1}=H_{(2N-1)\times N}\,X_{N\times 1}=\left(\begin{bmatrix}H_{K\times N}\\\breve{H}_N\\H_{L\times N}\end{bmatrix}-\begin{bmatrix}0_{K\times N}\\H_{L\times N}\\0_{1\times N}\\H_{K\times N}\\0_{L\times N}\end{bmatrix}\right)X_{N\times 1}, \qquad (4)$$
where $H_{K\times N}=[\,T_K^{(l)}\;\;0_{K\times(N-K)}\,]$ and $H_{L\times N}=[\,0_{L\times(N-L)}\;\;T_L^{(r)}\,]$ are horizontal concatenations of null matrices with a left-triangular or right-triangular Toeplitz matrix, respectively:
$$T_K^{(l)}=\begin{bmatrix}h_0&0&\cdots&0\\h_1&h_0&\cdots&0\\\vdots&\vdots&\ddots&\vdots\\h_{K-1}&h_{K-2}&\cdots&h_0\end{bmatrix},\quad T_L^{(r)}=\begin{bmatrix}h_{N-1}&h_{N-2}&\cdots&h_{N-L}\\0&h_{N-1}&\cdots&h_{N-L+1}\\\vdots&\vdots&\ddots&\vdots\\0&0&\cdots&h_{N-1}\end{bmatrix},$$
which gives
$$H_{K\times N}=\begin{bmatrix}h_0&0&\cdots&0&0&\cdots&0\\h_1&h_0&\cdots&0&0&\cdots&0\\\vdots&\vdots&\ddots&\vdots&\vdots&\ddots&\vdots\\h_{K-1}&h_{K-2}&\cdots&h_0&0&\cdots&0\end{bmatrix},$$
$$H_{L\times N}=\begin{bmatrix}0&\cdots&0&h_{N-1}&h_{N-2}&\cdots&h_{N-L}\\0&\cdots&0&0&h_{N-1}&\cdots&h_{N-L+1}\\\vdots&\ddots&\vdots&\vdots&\vdots&\ddots&\vdots\\0&\cdots&0&0&0&\cdots&h_{N-1}\end{bmatrix}.$$
The circulant matrix $\breve{H}_N$ is the cyclic convolution matrix $H_N$ with its rows cyclically shifted by n positions:
$$\breve{H}_N=I_N^{(\leftarrow n)}H_N=\begin{bmatrix}h_K&h_{K-1}&\cdots&h_{K+1}\\h_{K+1}&h_K&\cdots&h_{K+2}\\\vdots&\vdots&\ddots&\vdots\\h_{N-1}&h_{N-2}&\cdots&h_0\\h_0&h_{N-1}&\cdots&h_1\\\vdots&\vdots&\ddots&\vdots\\h_{K-1}&h_{K-2}&\cdots&h_K\end{bmatrix},$$
where $I_N^{(\leftarrow n)}$ is the permutation matrix obtained from the identity matrix by cyclically shifting its columns by n positions to the left (here n = K, so that the row beginning with $h_K$ comes first), and
$$H_N=\begin{bmatrix}h_0&h_{N-1}&\cdots&h_1\\h_1&h_0&\cdots&h_2\\\vdots&\vdots&\ddots&\vdots\\h_{N-1}&h_{N-2}&\cdots&h_0\end{bmatrix}.$$
The coefficients K and L are arbitrarily chosen natural numbers satisfying K + L = N − 1; suitable values are selected heuristically for each N separately.
The product $\breve{H}_N X_{N\times1}$ is calculated using a well-known fast cyclic convolution algorithm. The products $H_{K\times N}X_{N\times1}$ and $H_{L\times N}X_{N\times1}$ are likewise calculated using fast algorithms for matrix-vector multiplication with Toeplitz matrices. We use all of the above techniques to synthesize the final short-length linear convolution algorithms with reduced multiplicative complexity.
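Decomposition (4) can be verified numerically. The sketch below (variable names ours; it reuses conv_matrix from the sketch in Section 2) assembles the two stacks for a random h and an admissible pair K + L = N − 1, and checks that their difference reproduces the full convolution matrix:

```python
import numpy as np

# assumes conv_matrix() from the sketch in Section 2
N, K = 5, 2                       # any K with K + L = N - 1 works
L = N - 1 - K
rng = np.random.default_rng(0)
h = rng.standard_normal(N)

HK = np.zeros((K, N))             # H_K = [T_K^(l) | 0_{K x (N-K)}]
for i in range(K):
    for n in range(i + 1):
        HK[i, n] = h[i - n]

HL = np.zeros((L, N))             # H_L = [0_{L x (N-L)} | T_L^(r)]
for i in range(L):
    for n in range(N - L + i, N):
        HL[i, n] = h[K + N + i - n]

HN = np.array([[h[(i - j) % N] for j in range(N)] for i in range(N)])
Hb = np.roll(HN, -K, axis=0)      # rows rotated so the h_K row comes first

stack = np.vstack([HK, Hb, HL])
corr = np.vstack([np.zeros((K, N)), HL, np.zeros((1, N)), HK, np.zeros((L, N))])
assert np.allclose(stack - corr, conv_matrix(h, N))   # decomposition (4)
```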
3.1. Algorithm for N = 2
Let $X_{2\times1}=[x_0,x_1]^T$ and $H_{2\times1}=[h_0,h_1]^T$ be two-dimensional vectors being convolved and $Y_{3\times1}=[y_0,y_1,y_2]^T$ be the output vector representing their linear convolution. The problem is to calculate the product
$$Y_{3\times1}=H_{3\times2}\,X_{2\times1}, \qquad (5)$$
where
$$H_{3\times2}=\begin{bmatrix}h_0&0\\h_1&h_0\\0&h_1\end{bmatrix}. \qquad (6)$$
Direct computation of (5) takes four multiplications and one addition. It is easy to see that the matrix $H_{3\times2}$ possesses a special structure. Using the Toom-Cook algorithmic trick, the number of multiplications in the calculation of the 2-point linear convolution can be reduced [1,21].
With this in mind, the rationalized computational procedure for computing the 2-point linear convolution has the following form:
$$Y_{3\times1}=A_3^{(2)}D_3A_{3\times2}^{(2)}X_{2\times1}, \qquad (7)$$
where
$$A_{3\times2}^{(2)}=\begin{bmatrix}1&0\\1&-1\\0&1\end{bmatrix},\quad A_3^{(2)}=\begin{bmatrix}1&0&0\\1&-1&1\\0&0&1\end{bmatrix},\quad D_3=\mathrm{diag}(s_0^{(2)},s_1^{(2)},s_2^{(2)}),$$
$$s_0^{(2)}=h_0,\quad s_1^{(2)}=h_0-h_1,\quad s_2^{(2)}=h_1.$$
Figure 1 shows the signal flow graph of the proposed algorithm, which also provides a simplified schematic view of a fully parallel processing unit for a resource-effective implementation of the 2-point linear convolution. In this paper, all data flow graphs are oriented from left to right. Straight lines in the figures denote data transfer (data path) operations. Circles show multiplication (multipliers, in the case of a hardware implementation) by the number inscribed inside the circle. Points where lines converge denote summation (adders, in the case of a hardware implementation), and dotted lines indicate sign-change data paths (data paths with multiplication by −1). We deliberately use lines without arrows so as not to clutter the figures [23].
As can be seen, the calculation of the 2-point linear convolution requires only three multiplications and three additions. In terms of arithmetic units, a fully parallel hardware implementation of the processing unit for calculating a 2-point convolution requires three multipliers, one two-input adder, and one three-input adder, instead of the four multipliers and one two-input adder needed for a completely parallel implementation of (6). So, we have exchanged one multiplier for one three-input adder.
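In code, procedure (7) reads as follows (a minimal sketch with names of our choosing; in hardware, $s_1=h_0-h_1$ is precomputed, since the filter coefficients are constants):

```python
def linconv2(h, x):
    """2-point linear convolution by (7): 3 multiplications, 3 additions."""
    h0, h1 = h
    x0, x1 = x
    m0 = h0 * x0
    m1 = (h0 - h1) * (x0 - x1)   # the only runtime pre-addition
    m2 = h1 * x1
    return [m0, m0 - m1 + m2, m2]

assert linconv2([1, 2], [3, 4]) == [3, 10, 8]
```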
3.2. Algorithm for N = 3
Let $X_{3\times1}=[x_0,x_1,x_2]^T$ and $H_{3\times1}=[h_0,h_1,h_2]^T$ be three-dimensional vectors being convolved and $Y_{5\times1}=[y_0,y_1,y_2,y_3,y_4]^T$ be the output vector representing the linear convolution for N = 3. The problem is to calculate the product
$$Y_{5\times1}=H_{5\times3}\,X_{3\times1}, \qquad (9)$$
where
$$H_{5\times3}=\begin{bmatrix}h_0&0&0\\h_1&h_0&0\\h_2&h_1&h_0\\0&h_2&h_1\\0&0&h_2\end{bmatrix}. \qquad (10)$$
Direct computation of (9) takes nine multiplications and four additions. Because the matrix $H_{5\times3}$ also possesses a special structure, the number of multiplications in the calculation of the 3-point linear convolution can be reduced too [1,4,21].
An algorithm for computing the 3-point linear convolution with reduced multiplicative complexity can be written with the help of the following matrix-vector procedure:
$$Y_{5\times1}=A_{5\times6}^{(3)}D_6^{(3)}A_{6\times3}^{(3)}X_{3\times1}, \qquad (11)$$
where
$$A_{6\times3}^{(3)}=\begin{bmatrix}1&0&0\\0&1&0\\0&0&1\\1&1&0\\1&0&1\\0&1&1\end{bmatrix},\quad A_{5\times6}^{(3)}=\begin{bmatrix}1&0&0&0&0&0\\-1&-1&0&1&0&0\\-1&1&-1&0&1&0\\0&-1&-1&0&0&1\\0&0&1&0&0&0\end{bmatrix},$$
$$D_6^{(3)}=\mathrm{diag}(h_0,\,h_1,\,h_2,\,h_0+h_1,\,h_0+h_2,\,h_1+h_2).$$
Figure 2 shows a signal flow graph of the proposed algorithm for the implementation of the 3-point linear convolution. As can be seen, the calculation of the 3-point linear convolution requires only 6 multiplications and 10 additions. Thus, the proposed algorithm saves three multiplications at the cost of six extra additions compared with the ordinary matrix-vector multiplication method.
In terms of arithmetic units, a fully parallel hardware implementation of the processor unit for calculating the 3-point linear convolution by (11) will require only six multipliers, three two-input adders, two three-input adders, and one four-input adder instead of nine multipliers, two two-input adders, and one three-input adder in the case of a fully parallel implementation of (9).
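A sketch of procedure (11), in our Python transcription of the matrices above:

```python
def linconv3(h, x):
    """3-point linear convolution by (11): 6 multiplications, 10 additions."""
    h0, h1, h2 = h
    x0, x1, x2 = x
    m0, m1, m2 = h0 * x0, h1 * x1, h2 * x2
    m3 = (h0 + h1) * (x0 + x1)
    m4 = (h0 + h2) * (x0 + x2)
    m5 = (h1 + h2) * (x1 + x2)
    return [m0, m3 - m0 - m1, m4 - m0 + m1 - m2, m5 - m1 - m2, m2]

import numpy as np
h, x = np.arange(1, 4), np.arange(4, 7)
assert np.allclose(linconv3(h, x), np.convolve(h, x))
```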
3.3. Algorithm for N = 4
Let $X_{4\times1}=[x_0,x_1,x_2,x_3]^T$ and $H_{4\times1}=[h_0,h_1,h_2,h_3]^T$ be four-dimensional vectors being convolved and $Y_{7\times1}=[y_0,y_1,\dots,y_6]^T$ be the output vector representing the linear convolution for N = 4.
The problem is to calculate the product:
$$Y_{7\times1}=H_{7\times4}\,X_{4\times1}, \qquad (14)$$
where
$$H_{7\times4}=\begin{bmatrix}h_0&0&0&0\\h_1&h_0&0&0\\h_2&h_1&h_0&0\\h_3&h_2&h_1&h_0\\0&h_3&h_2&h_1\\0&0&h_3&h_2\\0&0&0&h_3\end{bmatrix}.$$
Direct computation of (14) takes 16 multiplications and 9 additions. Due to the specific structure of the matrix $H_{7\times4}$, the number of multiplication operations in the calculation of (14) can be significantly reduced.
An algorithm for computing the 4-point linear convolution with reduced multiplicative complexity can be written using the following matrix-vector procedure:
$$Y_{7\times1}=A_{7\times8}^{(4)}A_{8\times9}^{(4)}D_9^{(4)}A_{9\times8}^{(4)}A_{8\times4}^{(4)}X_{4\times1}, \qquad (15)$$
where
$$A_{8\times4}^{(4)}=\begin{bmatrix}1&0&0&0\\1&0&1&0\\0&1&0&1\\1&0&-1&0\\0&1&0&-1\\0&0&1&0\\0&0&0&1\\0&0&0&1\end{bmatrix},\quad A_{9\times8}^{(4)}=1\oplus H_2\oplus\begin{bmatrix}1&0\\0&1\\1&1\end{bmatrix}\oplus I_3,$$
where $I_N$ is an $N\times N$ identity matrix, $H_2$ is the $2\times2$ Hadamard matrix, and the sign "⊕" denotes the direct sum of two matrices [23,24,25].
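For readers following along in code, the direct sum and the Hadamard matrix are one-liners in NumPy/SciPy; this small illustration (not part of the algorithm itself) materializes $A_{9\times8}^{(4)}$:

```python
import numpy as np
from scipy.linalg import block_diag

H2 = np.array([[1, 1],
               [1, -1]])                  # 2x2 Hadamard matrix
B = np.array([[1, 0], [0, 1], [1, 1]])    # middle block of A_{9x8}
A9x8 = block_diag(1, H2, B, np.eye(3))    # 1 (+) H2 (+) B (+) I3
assert A9x8.shape == (9, 8)
```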
$$D_9^{(4)}=\mathrm{diag}(s_0^{(4)},s_1^{(4)},\dots,s_8^{(4)}),$$
$$s_0^{(4)}=h_0,\quad s_1^{(4)}=(h_0+h_1+h_2+h_3)/4,\quad s_2^{(4)}=(h_0-h_1+h_2-h_3)/4,$$
$$s_3^{(4)}=(h_0-h_1-h_2+h_3)/2,\quad s_4^{(4)}=(h_0+h_1-h_2-h_3)/2,\quad s_5^{(4)}=(h_0-h_2)/2,$$
$$s_6^{(4)}=h_3,\quad s_7^{(4)}=h_2,\quad s_8^{(4)}=h_3,$$
$$A_{8\times9}^{(4)}=1\oplus H_2\oplus\begin{bmatrix}0&-1&1\\-1&0&1\end{bmatrix}\oplus I_3,\quad A_{7\times8}^{(4)}=\begin{bmatrix}1&0&0&0&0&0&0&0\\0&0&1&0&1&-1&-1&0\\0&1&0&-1&0&0&0&-1\\0&0&1&0&-1&0&0&0\\-1&1&0&1&0&0&0&0\\0&0&0&0&0&1&1&0\\0&0&0&0&0&0&0&1\end{bmatrix}.$$
Figure 3 shows a signal flow graph of the proposed algorithm for the implementation of 4-point linear convolution. So, the proposed algorithm saves 7 multiplications at the cost of 11 extra additions compared to the ordinary matrix-vector multiplication method.
In terms of arithmetic units, a fully parallel hardware implementation of the processor unit for calculating a 4-point linear convolution will require only 9 multipliers, 13 two-input adders, 2 three-input adders, and 1 four-input adder instead of 16 multipliers, 2 two-input adders, 2 three-input adders, and 1 four-input adder in the case of fully parallel implementation of (14).
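Procedure (15) unrolls into the following sketch (our Python transcription of the matrices above; the first product serves the $H_K$ part with K = 1, the next five the cyclic core, and the last three the $H_L$ part with L = 2):

```python
def linconv4(h, x):
    """4-point linear convolution by (15): 9 multiplications, 20 additions."""
    h0, h1, h2, h3 = h
    x0, x1, x2, x3 = x
    # pre-additions (A_{8x4}, A_{9x8})
    u0, u1 = x0 + x2, x1 + x3
    u2, u3 = x0 - x2, x1 - x3
    # multiplications by the diagonal of D9 (s-values precomputable)
    m0 = h0 * x0
    m1 = (h0 + h1 + h2 + h3) / 4 * (u0 + u1)
    m2 = (h0 - h1 + h2 - h3) / 4 * (u0 - u1)
    m3 = (h0 - h1 - h2 + h3) / 2 * u2
    m4 = (h0 + h1 - h2 - h3) / 2 * u3
    m5 = (h0 - h2) / 2 * (u2 + u3)
    m6, m7, m8 = h3 * x2, h2 * x3, h3 * x3
    # post-additions (A_{8x9}, A_{7x8})
    t1, t2 = m1 + m2, m1 - m2
    r1, r2 = m5 - m4, m5 - m3
    return [m0, t2 + r2 - m6 - m7, t1 - r1 - m8,
            t2 - r2, t1 + r1 - m0, m6 + m7, m8]

import numpy as np
h, x = np.arange(1.0, 5.0), np.arange(5.0, 9.0)
assert np.allclose(linconv4(h, x), np.convolve(h, x))
```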
3.4. Algorithm for N = 5
Let $X_{5\times1}=[x_0,x_1,x_2,x_3,x_4]^T$ and $H_{5\times1}=[h_0,h_1,h_2,h_3,h_4]^T$ be five-dimensional vectors being convolved and $Y_{9\times1}=[y_0,y_1,\dots,y_8]^T$ be the output vector representing the linear convolution for N = 5.
The problem is to calculate the product:
$$Y_{9\times1}=H_{9\times5}\,X_{5\times1}, \qquad (16)$$
where
$$H_{9\times5}=\begin{bmatrix}h_0&&&\\h_1&h_0&&\\\vdots&h_1&\ddots&\\h_4&\vdots&\ddots&h_0\\&h_4&\ddots&h_1\\&&\ddots&\vdots\\&&&h_4\end{bmatrix}.$$
Direct computation of (16) takes 25 multiplications and 16 additions. Due to the specific structure of the matrix $H_{9\times5}$, the number of multiplication operations in the calculation of (16) can be significantly reduced.
Thus, an algorithm for computing the 5-point linear convolution with reduced multiplicative complexity can be written using the following matrix-vector procedure:
$$Y_{9\times1}=A_{9\times11}^{(5)}A_{11\times13}^{(5)}A_{13\times16}^{(5)}D_{16}^{(5)}A_{16\times15}^{(5)}A_{15\times11}^{(5)}A_{11\times5}^{(5)}X_{5\times1}, \qquad (17)$$
where
$$A_{11\times5}^{(5)}=\left[\,1\;1\;0_3\;1\;1\;{-1}\;1\;{-1}\;1\;{-1}\;1\;{-1}\;1\;1\;1\;1\;1\;1\;0_3\;1\;1\,\right],\quad A_{15\times11}^{(5)}=\left[\,1\;1\;1\;1\;1\;1\;1\;1\;1\;1\;1\;1\;{-1}\;1\;{-1}\;0_{11\times4}\;0_{4\times7}\;I_4\,\right],$$
and $0_{M\times N}$ is a null matrix of order $M\times N$ [23,24,25],
$$D_{16}^{(5)}=\mathrm{diag}(s_0^{(5)},s_1^{(5)},\dots,s_{15}^{(5)}),$$
$$s_0^{(5)}=h_0,\quad s_1^{(5)}=h_1,\quad s_2^{(5)}=h_0,\quad s_3^{(5)}=(h_0-h_2+h_3-h_4)/4,\quad s_4^{(5)}=(h_1-h_2+h_3-h_4)/4,$$
$$s_5^{(5)}=(3h_2-2h_1+2h_0-2h_3+3h_4)/5,\quad s_6^{(5)}=-h_0+h_1-h_2+h_3,\quad s_7^{(5)}=-h_0+h_1-h_2+h_3,$$
$$s_8^{(5)}=(3h_0-2h_1+3h_2-2h_3-2h_4)/5,\quad s_9^{(5)}=-h_2+h_3,\quad s_{10}^{(5)}=h_1-h_2,$$
$$s_{11}^{(5)}=(-h_0-h_1+4h_2-h_3-h_4)/5,\quad s_{12}^{(5)}=(h_0+h_1+h_2+h_3+h_4)/5,\quad s_{13}^{(5)}=h_4,\quad s_{14}^{(5)}=h_3,\quad s_{15}^{(5)}=h_4,$$
$$A_{16\times15}^{(5)}=\left[\,1\;1\;1\;1\;1\;1\;1\;1\;1\;1\;1\;1\;{-1}\;0_{12\times4}\;0_{4\times11}\;I_4\,\right],$$
$$A_{13\times16}^{(5)}=I_3\oplus\left[\,1\;1\;1\;1\;0_3\;1\;1\;1\;1\;0_3\;1\;1\;1\;1\,\right]\oplus I_4,$$
$$A_{11\times13}^{(5)}=I_3\oplus\left[\,1\;{-1}\;1\;{-1}\;{-1}\;{-1}\;{-1}\;1\;1\;1\;1\;1\;1\;1\;1\;{-1}\;1\,\right]\oplus I_3,$$
$$A_{9\times11}^{(5)}=\left[\,1\;0_{2\times6}\;1\;1\;1\;{-1}\;{-1}\;0_{3\times5}\;1\;{-1}\;1\;{-1}\;1\;0_{2\times6}\;{-1}\;{-1}\;1\;0_{2\times5}\;1\;1\;1\,\right].$$
Figure 4 shows a data flow diagram of the proposed algorithm for the implementation of 5-point linear convolution. The algorithm saves 9 multiplications at the cost of 22 extra additions compared to the ordinary matrix-vector multiplication method.
In terms of arithmetic units, a fully parallel hardware implementation of the processor unit for calculating a 5-point linear convolution will require 16 multipliers, 20 two-input adders, 2 three-input adders, 1 four-input adder, and 1 five-input adder instead of 25 multipliers, 2 two-input adders, 2 three-input adders, 2 four-input adders, and 1 five-input adder in the case of fully parallel implementation of expression (16).
3.5. Algorithm for N = 6
Let $X_{6\times1}=[x_0,x_1,x_2,x_3,x_4,x_5]^T$ and $H_{6\times1}=[h_0,h_1,h_2,h_3,h_4,h_5]^T$ be six-dimensional vectors being convolved and $Y_{11\times1}=[y_0,y_1,\dots,y_{10}]^T$ be the output vector representing the linear convolution for N = 6.
The problem is to calculate the product:
$$Y_{11\times1}=H_{11\times6}\,X_{6\times1}, \qquad (18)$$
where
$$H_{11\times6}=\begin{bmatrix}h_0&&&\\h_1&h_0&&\\\vdots&h_1&\ddots&\\h_5&\vdots&\ddots&h_0\\&h_5&\ddots&h_1\\&&\ddots&\vdots\\&&&h_5\end{bmatrix}.$$
Direct computation of (18) takes 36 multiplications and 25 additions. We propose an algorithm that takes only 16 multiplications and 44 additions; it saves 20 multiplications at the cost of 19 extra additions compared with the ordinary matrix-vector multiplication method.
The proposed algorithm for computing the 6-point linear convolution can be written with the help of the following matrix-vector procedure:
$$Y_{11\times1}=A_{11}^{(6)}\breve{A}_{11}^{(6)}A_{11\times14}^{(6)}\hat{A}_{14}^{(6)}\breve{A}_{14}^{(6)}A_{14\times16}^{(6)}D_{16}^{(6)}A_{16\times14}^{(6)}\breve{A}_{14}^{(6)}A_{14}^{(6)}A_{14\times6}^{(6)}X_{6\times1}, \qquad (19)$$
where
$$A_{14\times6}^{(6)}=\left[\,1\;1\;1\;1\;1\;0_{5\times3}\;0_3\;1\;1\;1\;1\;0_3\;1\;1\;1\;0_{2\times3}\;1\;1\;1\,\right],\quad A_{14}^{(6)}=I_3\oplus(H_2\otimes I_3)\oplus I_5,$$
$$\breve{A}_{14}^{(6)}=I_3\oplus\left(I_2\otimes\left[\,1\;1\;1\;1\;{-1}\;1\;{-1}\,\right]\right)\oplus I_5,\quad A_{16\times14}^{(6)}=I_3\oplus\left(I_2\otimes\left[\,1\;1\;1\;1\;1\,\right]\right)\oplus I_5,$$
$$D_{16}^{(6)}=\mathrm{diag}(s_0^{(6)},s_1^{(6)},\dots,s_{15}^{(6)}),$$
$$s_0^{(6)}=6h_0,\quad s_1^{(6)}=6h_1,\quad s_2^{(6)}=6h_0,\quad s_3^{(6)}=h_0+h_3+h_4+h_1+h_2+h_5,$$
$$s_4^{(6)}=3(h_4+h_1-h_0-h_3),\quad s_5^{(6)}=3(h_2+h_5-h_0-h_3),\quad s_6^{(6)}=3(h_0+h_3)-(h_0+h_3+h_4+h_1+h_2+h_5),$$
$$s_7^{(6)}=h_0-h_3+h_4-h_1+h_2-h_5,\quad s_8^{(6)}=3(h_4-h_1-h_0+h_3),\quad s_9^{(6)}=3(h_2-h_5-h_0+h_3),$$
$$s_{10}^{(6)}=3(h_0+h_3)-(h_0-h_3+h_4-h_1+h_2-h_5),\quad s_{11}^{(6)}=6h_5,\quad s_{12}^{(6)}=6(-h_4+h_5),$$
$$s_{13}^{(6)}=6(h_3-h_4),\quad s_{14}^{(6)}=6h_4,\quad s_{15}^{(6)}=6h_5,$$
$$A_{14\times16}^{(6)}=I_4\oplus\left[\,1\;1\;1\;1\;0_{2\times4}\;0_{1\times3}\;1\;0_{1\times3}\;0_{2\times4}\;1\;1\;1\;1\,\right]\oplus I_5,\quad \hat{A}_{14}^{(6)}=I_4\oplus\left[\,1\;1\;1\;1\;1\,\right]\oplus I_5,$$
$$A_{11\times14}^{(6)}=\left[\,1\;1\;1\,\right]\oplus(H_2\otimes I_3)\oplus\left[\,1\;1\;1\;1\;1\;1\,\right],\quad \breve{A}_{11}^{(6)}=\left[\,I_3\;0_{3\times4}\;0_3\;0_{4\times3}\;1\;1\;1\;1\;0_{4\times3}\;0_{4\times3}\;0_{4\times3}\;I_4\,\right],$$
and the sign "⊗" denotes the tensor (Kronecker) product of two matrices [23,24,25].
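As with the direct sum, the Kronecker product is available directly in NumPy, so factors such as $A_{14}^{(6)}=I_3\oplus(H_2\otimes I_3)\oplus I_5$ can be materialized for testing (illustrative only):

```python
import numpy as np
from scipy.linalg import block_diag

H2 = np.array([[1, 1], [1, -1]])
A14 = block_diag(np.eye(3), np.kron(H2, np.eye(3)), np.eye(5))
assert A14.shape == (14, 14)
```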
Figure 5 shows a data flow diagram of the proposed algorithm for the implementation of 6-point linear convolution.
In terms of arithmetic units, a fully parallel hardware implementation of the processor unit for calculating a 6-point linear convolution will require 16 multipliers, 32 two-input adders, and 5 three-input adders, instead of 36 multipliers, 2 two-input adders, 2 three-input adders, 2 four-input adders, 2 five-input adders, and 1 six-input adder in the case of completely parallel implementation of expression (18).
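For comparison with the 16-multiplication count above, the block-nesting idea mentioned in the Introduction (converting a one-dimensional convolution into block convolutions) already reduces the naive 36 multiplications to 18 when a length-6 convolution is split into two length-3 blocks combined Karatsuba-style; the factorization (19) saves two more. A hedged sketch, reusing linconv3 from the N = 3 section (this is a baseline of ours, not the proposed algorithm):

```python
def linconv6_nested(h, x):
    """Length-6 linear convolution with 3 x 6 = 18 multiplications
    (block-nesting baseline; not the factorization (19) itself)."""
    lo = linconv3(h[:3], x[:3])                       # low halves
    hi = linconv3(h[3:], x[3:])                       # high halves
    mid = linconv3([a + b for a, b in zip(h[:3], h[3:])],
                   [a + b for a, b in zip(x[:3], x[3:])])
    y = [0.0] * 11
    for i in range(5):
        y[i] += lo[i]
        y[i + 3] += mid[i] - lo[i] - hi[i]            # blocks carry weight t^3
        y[i + 6] += hi[i]
    return y
```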
3.6. Algorithm for N = 7
Let $X_{7\times1}=[x_0,x_1,x_2,x_3,x_4,x_5,x_6]^T$ and $H_{7\times1}=[h_0,h_1,\dots,h_6]^T$ be seven-dimensional vectors being convolved and $Y_{13\times1}=[y_0,y_1,\dots,y_{12}]^T$ be the output vector representing the linear convolution for N = 7.
The problem is to calculate the product:
$$Y_{13\times1}=H_{13\times7}\,X_{7\times1}, \qquad (20)$$
where
$$H_{13\times7}=\begin{bmatrix}h_0&&&\\h_1&h_0&&\\\vdots&h_1&\ddots&\\h_6&\vdots&\ddots&h_0\\&h_6&\ddots&h_1\\&&\ddots&\vdots\\&&&h_6\end{bmatrix}.$$
Direct computation of (20) takes 49 multiplications and 36 additions. We developed an algorithm that contains only 26 multiplications and 79 additions; it saves 23 multiplications at the cost of 43 extra additions compared with the ordinary matrix-vector multiplication method.
The proposed algorithm for computing the 7-point linear convolution with reduced multiplicative complexity can be written using the following matrix-vector procedure:
$$Y_{13\times1}=A_{13\times26}^{(7)}D_{26}^{(7)}A_{26\times7}^{(7)}X_{7\times1}, \qquad (21)$$
where
$$A_{13\times26}^{(7)}=A_{13\times15}^{(7)}A_{15\times20}^{(7)}A_{20\times21}^{(7)}A_{21\times22}^{(7)}A_{22\times25}^{(7)}A_{25\times21}^{(7)}A_{21}^{(7)}A_{21\times26}^{(7)},$$
$$A_{26\times7}^{(7)}=A_{26}^{(7)}A_{26\times28}^{(7)}A_{28\times21}^{(7)}A_{21\times18}^{(7)}A_{18\times7}^{(7)},$$
and
$$A_{18\times7}^{(7)}=\left[\,1\;1\;1\;1\;1\;1\;1\;1\;1\;0_{6\times3}\;I_4\;0_{4\times2}\;(-1_{4\times1})\;0_{2\times4}\;1\;{-1}\;1\;{-1}\;0_{6\times4}\;1\;1\;1\;1\;1\;1\;1\,\right],\quad A_{21\times18}^{(7)}=I_5\oplus\left[\,1\;1\;1\;1\;{-1}\;1\;1\;1\;1\;{-1}\;1\;1\;1\;{-1}\;1\;1\;1\;1\,\right]\oplus I_5,$$
$$-1_{4\times1}=[-1,-1,-1,-1]^T,$$
$$A_{28\times21}^{(7)}=I_7\oplus\left[\,1\;1\;0_{3\times4}\;0_3\;1\;1\;1\;1\;1\;0_{4\times3}\;1\;1\;1\;1\;1\;1\;0_{6\times3}\;1\;0_{6\times3}\;1\;1\;1\;1\;{-1}\;1\;0_{4\times3}\;0_4\;1\;1\,\right]\oplus I_4,$$
$$A_{26\times28}^{(7)}=I_{10}\oplus\left[\,1\;{-1}\;1\;1\;1\;{-1}\;1\;1\;0_6\;1\;1\;1\;1\;1\;1\;1\;1\,\right]\oplus I_8,$$
$$A_{26}^{(7)}=I_5\oplus\left[\,1\;1\;1\;1\;1\;1\;1\;1\;0_6\;0_7\;1\;1\;1\;1\;1\;{-1}\;1\;1\,\right]\oplus I_8,$$
$$D_{26}^{(7)}=\mathrm{diag}(s_0^{(7)},s_1^{(7)},\dots,s_{25}^{(7)}),$$
$$s_0^{(7)}=h_0,\quad s_1^{(7)}=h_2-h_1,\quad s_2^{(7)}=h_0-h_1,\quad s_3^{(7)}=h_1,\quad s_4^{(7)}=h_0,$$
$$s_5^{(7)}=(h_6+h_5+h_4+h_3+2h_2+h_1+h_0)/7,\quad s_6^{(7)}=(-h_6-2h_5+3h_4-h_3-2h_2+h_1+2h_0)/2,$$
$$s_7^{(7)}=(2h_4-h_3-2h_2+h_1)/2,\quad s_8^{(7)}=(-h_6+h_5+2h_4-h_3-2h_2+3h_1-h_0)/2,$$
$$s_9^{(7)}=(10h_6+3h_5-11h_4+10h_3+3h_2-11h_1-4h_0)/14,\quad s_{10}^{(7)}=(-2h_6-2h_5-2h_4+12h_3+5h_2-9h_1-2h_0)/14,$$
$$s_{11}^{(7)}=(2h_6+3h_5-h_4-2h_3+3h_2-h_1)/6,\quad s_{12}^{(7)}=(3h_6-11h_5-4h_4+10h_3+3h_2-11h_1+10h_0)/14,$$
$$s_{13}^{(7)}=(-2h_3+3h_2-h_1)/6,\quad s_{14}^{(7)}=(3h_6-h_5-2h_3+3h_2-h_1-2h_0)/6,\quad s_{15}^{(7)}=(-h_6+h_4-h_3+h_1)/6,$$
$$s_{16}^{(7)}=(-h_3+h_1)/6,\quad s_{17}^{(7)}=(h_5-h_3+h_1-h_0)/6,\quad s_{18}^{(7)}=2h_6-h_5-2h_4+3h_3-2h_2-2h_1+h_0,$$
$$s_{19}^{(7)}=2h_3-h_2-2h_1+h_0,\quad s_{20}^{(7)}=-h_6-2h_5+h_4+2h_3-h_2-2h_1+3h_0,$$
$$s_{21}^{(7)}=h_6,\quad s_{22}^{(7)}=h_4-h_5,\quad s_{23}^{(7)}=h_6-h_5,\quad s_{24}^{(7)}=h_5,\quad s_{25}^{(7)}=h_6,$$
$$A_{21\times26}^{(7)}=I_6\oplus\left[\,1\;1\;1\;1\;1\;1\;0_{3\times6}\;0_{2\times9}\;1\;1\;1\;1\;{-1}\;1\;{-1}\;1\;{-1}\;1\;0_{3\times6}\;0_{2\times9}\;{-1}\;1\;{-1}\;1\,\right]\oplus I_5,$$
$$A_{21}^{(7)}=I_7\oplus\left[\,1\;1\;1\;1\;1\;0_4\;0_4\;1\;1\;1\;1\;1\,\right]\oplus I_6,\quad A_{25\times21}^{(7)}=I_7\oplus\left[\,1\;1\;1\;1\;1\;1\;1\;1\;0_{6\times4}\;0_{6\times4}\;1\;1\;1\;1\;1\;1\;1\,\right]\oplus I_6,$$
$$A_{22\times25}^{(7)}=I_6\oplus\left[\,1\;1\;1\;0_3\;1\;{-1}\;1\;1\;1\;1\;1\;0_{5\times7}\;0_{6\times7}\;1\;1\;0_3\;1\;1\;{-1}\;1\;1\;1\;1\;1\;1\,\right]\oplus I_5,$$
$$A_{21\times22}^{(7)}=I_6\oplus\left[\,1\;{-1}\;{-1}\;{-1}\;1\;1\;1\;1\;0_5\;0_{5\times6}\;1\;1\;1\;1\;1\;1\,\right]\oplus I_5,\quad A_{20\times21}^{(7)}=I_{10}\oplus\left[\,1\;1\;1\;1\;1\,\right]\oplus I_6,$$
$$A_{15\times20}^{(7)}=\left[\,1\;1\;1\;1\;1\;1\,\right]\oplus I_5\oplus\left[\,1\;{-1}\;{-1}\;{-1}\;1\;1\;0_{4\times5}\;0_{3\times5}\;1\;1\;1\;1\;1\;1\,\right],$$
$$A_{13\times15}^{(7)}=\left[\,I_3\;0_{3\times4}\;0_{3\times5}\;0_3\;0_{4\times3}\;1\;1\;1\;1\;1\;1\;1\;1\;{-1}\;{-1}\;{-1}\;{-1}\;1\;{-1}\;1\;{-1}\;1\;1\;1\;1\;1\;0_3\;0_{3\times4}\;0_3\;0_{3\times5}\;I_3\,\right].$$
Figure 6 shows a data flow diagram of the proposed algorithm for the implementation of 7-point linear convolution.
In terms of arithmetic units, a fully parallel hardware implementation of the processor unit for calculating a 7-point linear convolution will require 26 multipliers, 49 two-input adders, 7 three-input adders, and 5 four-input adders, instead of 49 multipliers, 2 two-input adders, 2 three-input adders, 2 four-input adders, 2 five-input adders, 2 six-input adders, and 1 seven-input adder in the case of a completely parallel implementation of expression (20).
3.7. Algorithm for N = 8
Let $X_{8\times1}=[x_0,x_1,x_2,x_3,x_4,x_5,x_6,x_7]^T$ and $H_{8\times1}=[h_0,h_1,\dots,h_7]^T$ be eight-dimensional vectors being convolved and $Y_{15\times1}=[y_0,y_1,\dots,y_{14}]^T$ be the output vector representing the linear convolution for N = 8.
The problem is to calculate the product:
$$Y_{15\times1}=H_{15\times8}\,X_{8\times1}, \qquad (22)$$
where
$$H_{15\times8}=\begin{bmatrix}h_0&&&\\h_1&h_0&&\\\vdots&h_1&\ddots&\\h_7&\vdots&\ddots&h_0\\&h_7&\ddots&h_1\\&&\ddots&\vdots\\&&&h_7\end{bmatrix}.$$
Direct computation of (22) takes 64 multiplications and 49 additions. We developed an algorithm that contains only 27 multiplications and 67 additions. Thus, the proposed algorithm saves 37 multiplications at the cost of 18 extra additions compared with the ordinary matrix-vector multiplication method.
The proposed algorithm for computing the 8-point linear convolution with reduced multiplicative complexity can be written using the following matrix-vector procedure:
$$Y_{15\times1}=A_{15}^{(8)}\hat{A}_{15}^{(8)}\breve{A}_{15}^{(8)}A_{15\times17}^{(8)}A_{17\times27}^{(8)}D_{27}^{(8)}A_{27\times17}^{(8)}A_{17\times15}^{(8)}A_{15\times8}^{(8)}X_{8\times1}, \qquad (23)$$
where
$$A_{15\times8}^{(8)}=\begin{bmatrix}I_3&0_{3\times5}\\I_4&I_4\\I_4&-I_4\\0_4&I_4\end{bmatrix},\quad A_{17\times15}^{(8)}=I_3\oplus(H_2\otimes I_2)\oplus\begin{bmatrix}I_2&0_2\\0_2&I_2\\I_2&I_2\end{bmatrix}\oplus I_4,$$
$$A_{27\times17}^{(8)}=\left[\,1\;1\;1\;1\;1\;1\,\right]\oplus H_2\oplus\left(I_4\otimes\left[\,1\;1\;1\;1\,\right]\right)\oplus\left[\,1\;1\;1\;1\;1\;{-1}\;1\;1\;1\;1\;1\;1\,\right],$$
$$D_{27}^{(8)}=\mathrm{diag}(s_0^{(8)},s_1^{(8)},\dots,s_{26}^{(8)}),$$
$$s_0^{(8)}=h_0,\quad s_1^{(8)}=h_2-h_1,\quad s_2^{(8)}=h_0-h_1,\quad s_3^{(8)}=h_1,\quad s_4^{(8)}=h_0,$$
$$s_5^{(8)}=(h_0+h_1+h_2+h_3+h_4+h_5+h_6+h_7)/8,\quad s_6^{(8)}=(h_0-h_1+h_2-h_3+h_4-h_5+h_6-h_7)/8,$$
$$s_7^{(8)}=(-h_0+h_1+h_2-h_3-h_4+h_5+h_6-h_7)/4,\quad s_8^{(8)}=(-h_0-h_1+h_2+h_3-h_4-h_5+h_6+h_7)/4,$$
$$s_9^{(8)}=(h_0-h_2+h_4-h_6)/4,\quad s_{10}^{(8)}=(h_0-h_1-h_2+h_3-h_4+h_5+h_6-h_7)/2,$$
$$s_{11}^{(8)}=(h_0+h_1-h_2+h_3-h_4-h_5+h_6-h_7)/2,\quad s_{12}^{(8)}=(-h_0+h_2+h_4-h_6)/2,$$
$$s_{13}^{(8)}=(h_0-h_1+h_2-h_3-h_4+h_5-h_6+h_7)/2,\quad s_{14}^{(8)}=(h_0-h_1+h_2+h_3-h_4+h_5-h_6-h_7)/2,$$
$$s_{15}^{(8)}=(-h_0-h_2+h_4+h_6)/2,\quad s_{16}^{(8)}=(-h_0+h_1+h_4-h_5)/2,\quad s_{17}^{(8)}=(-h_0-h_3+h_4+h_7)/2,$$
$$s_{18}^{(8)}=(h_0-h_4)/2,\quad s_{19}^{(8)}=h_4-h_6,\quad s_{20}^{(8)}=(h_5+h_6)/2,\quad s_{21}^{(8)}=(h_6-h_5)/2,$$
$$s_{22}^{(8)}=h_5-h_7,\quad s_{23}^{(8)}=h_7,\quad s_{24}^{(8)}=h_6,\quad s_{25}^{(8)}=h_7,\quad s_{26}^{(8)}=h_7,$$
$$A_{17\times27}^{(8)}=\left[\,1\;1\;1\;1\;1\;1\,\right]\oplus H_2\oplus\left(I_4\otimes\left[\,1\;1\;1\;1\,\right]\right)\oplus\left[\,1\;1\;1\;1\;1\;{-1}\;{-1}\,\right]\oplus\left[\,1\;1\;1\,\right],$$
$$A_{15\times17}^{(8)}=I_3\oplus(H_2\otimes I_2)\oplus\begin{bmatrix}0_2&I_2&I_2\\I_2&0_2&I_2\end{bmatrix}\oplus I_4,\quad \breve{A}_{15}^{(8)}=I_3\oplus(H_2\otimes I_4)\oplus I_4,$$
$$\hat{A}_{15}^{(8)}=I_3\oplus\begin{bmatrix}0_{5\times3}&I_5\\I_3&0_{3\times5}\end{bmatrix}\oplus I_4,\quad A_{15}^{(8)}=\left[\,I_7\;\;0_{3\times4}\oplus(-I_4)\;\;{-1}\;{-1}\;{-1}\oplus I_4\;\;I_8\,\right].$$
Figure 7 shows a data flow diagram of the proposed algorithm for the implementation of 8-point linear convolution.
In terms of arithmetic units, a fully parallel hardware implementation of the processor unit for calculating an 8-point linear convolution will require 27 multipliers, 57 two-input adders, 4 three-input adders, and 1 four-input adder, instead of 64 multipliers, 2 two-input adders, 2 three-input adders, 2 four-input adders, 2 five-input adders, 2 six-input adders, 2 seven-input adders, and 1 eight-input adder in the case of a completely parallel implementation of expression (22).
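It is worth noting (our observation, reusing the linconv4 sketch from the N = 4 section) that one level of Karatsuba-style block nesting over two length-4 halves also reaches 3 x 9 = 27 multiplications for N = 8, the same multiplicative count as (23), although its addition count and structure differ from the proposed factorization:

```python
def linconv8_nested(h, x):
    """Length-8 linear convolution with 3 x 9 = 27 multiplications
    (block-nesting baseline; not the factorization (23) itself)."""
    lo = linconv4(h[:4], x[:4])
    hi = linconv4(h[4:], x[4:])
    mid = linconv4([a + b for a, b in zip(h[:4], h[4:])],
                   [a + b for a, b in zip(x[:4], x[4:])])
    y = [0.0] * 15
    for i in range(7):
        y[i] += lo[i]
        y[i + 4] += mid[i] - lo[i] - hi[i]     # blocks carry weight t^4
        y[i + 8] += hi[i]
    return y
```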
4. Implementation Complexity
Since the lengths of the input sequences are relatively small, and the data flow graphs representing the organization of the computation process are fairly simple, it is easy to estimate the implementation complexity of the proposed solutions. Table 1 shows estimates of the number of arithmetic blocks for the fully parallel implementation of the short-length linear convolution algorithms. Since a parallel N-input adder consists of N − 1 two-input adders, we give integrated estimates of the implementation costs of the sets of adders for each proposed solution, expressed as sums of two-input adders. The penultimate column of Table 1 shows the percentage reduction in the number of multipliers, while the last column shows the percentage increase in the number of adders. As can be seen, the implementation of the proposed algorithms requires fewer multipliers than implementations based on the naive method of performing the linear convolution operation.
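The two percentage columns of Table 1 follow directly from the operation counts in the table (the adder counts already being expressed in two-input equivalents); a few lines of Python reproduce them:

```python
# (N, naive x, naive +, proposed x, proposed +), taken from Table 1
rows = [(2, 4, 1, 3, 3), (3, 9, 4, 6, 10), (4, 16, 9, 9, 20),
        (5, 25, 16, 16, 38), (6, 36, 25, 16, 44),
        (7, 49, 36, 26, 79), (8, 64, 49, 27, 67)]
for n, m0, a0, m1, a1 in rows:
    print(f"N={n}: multipliers -{(m0 - m1) / m0:.1%}, adders +{(a1 - a0) / a1:.1%}")
# e.g., N=5 yields -36.0% multipliers and +57.9% adders
```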
It should be noted that our solutions are primarily focused on efficient implementation in application-specific integrated circuits (ASICs). When designing low-power digital circuits, optimization must be performed both at the algorithmic level and at the logic level. From the point of view of designing an ASIC chip that implements fast linear convolution, it should be borne in mind that the hardwired multiplier is a very resource-intensive arithmetic unit. The multiplier is also the most energy-intensive arithmetic unit, occupying a large die area [26] and dissipating a lot of energy [27]. Reducing the number of multipliers is especially important in the design of specialized fully parallel ASIC-based processors, because minimizing the number of necessary multipliers reduces power dissipation and lowers the implementation cost of the entire system. It has been shown that the implementation complexity of a hardwired multiplier grows quadratically with operand size, while the hardware complexity of a binary adder increases linearly with operand size [28]. Therefore, a reduction in the number of multipliers, even at the cost of a small increase in the number of adders, plays a significant role in the ASIC-based implementation of an algorithm. Thus, it can be argued that algorithmic solutions requiring fewer hardware multipliers in an ASIC-based implementation are better than those requiring more.
This statement is also true for field-programmable gate array (FPGA)-based implementations. Most modern high-performance FPGAs contain a number of built-in multipliers. This means that instead of implementing the multipliers with a set of conventional logic gates, one can use the hardwired multipliers embedded in the FPGA. Thus, all multiplications contained in a fully parallel algorithm can be efficiently implemented using these embedded multipliers; however, their number may not be enough to meet the requirements of a fully parallel implementation of the algorithm. The developer uses the embedded multipliers to implement the multiplication operations until all of the multipliers built into the chip have been used; if the embedded multipliers run out, the developer is forced to use ordinary logic gates instead, which leads to significant difficulties in the design and implementation of the computing unit. Therefore, the problem of reducing the number of multiplications in fully parallel hardware-oriented algorithms is critical. It is clear that one can go the other way and use a more complex FPGA chip from the same or another family that contains a larger number of embedded multipliers; however, it should be remembered that the hardwired multiplier is a very resource-intensive and energy-consuming unit, occupying a large area of the chip and dissipating a lot of power. Therefore, the use of complex and resource-intensive FPGAs containing a large number of multipliers without special need is impractical.
Table 2 shows FPGA devices of the Spartan-3 family in which the number of hardwired multipliers allows one to implement the linear convolution operation in a single chip. For example, a 4-point convolution implemented using our proposed algorithm fits into a single Spartan XC3S200 device, while a 4-point convolution implemented using the naive method requires a larger Spartan XC3S400 device. A 5-point convolution implemented using our proposed algorithm fits into a single Spartan XC3S200AN chip, while a 5-point convolution implemented using the naive method requires a larger Spartan XC3S1400AN chip, and so on.
Thus, the hardware implementation of our algorithms requires fewer hardware multipliers than the implementation of naive calculation methods, all other things being equal. Taking into account the previously listed arguments, this proves their effectiveness.
5. Conclusions
In this paper, we analyzed possibilities for reducing the multiplicative complexity of calculating linear convolutions for small-length input sequences, and we synthesized new algorithms for implementing these operations for N = 3, 4, 5, 6, 7, and 8. Using these algorithms reduces the computational complexity of linear convolution, thus reducing its hardware implementation complexity too. In addition, as can be seen from Figure 1, Figure 2, Figure 3, Figure 4, Figure 5, Figure 6 and Figure 7, the proposed algorithms have a pronounced parallel modular structure. This simplifies the mapping of the algorithms into an ASIC structure and unifies their implementation in FPGAs. Thus, the acceleration of computations during the implementation of these algorithms can also be achieved through the parallelization of the computation processes.
Figure 1. The signal flow graph of the proposed algorithm for computation of 2-point linear convolution.
Figure 2. The signal flow graph of the proposed algorithm for computation of 3-point linear convolution.
Figure 3. The signal flow graph of the proposed algorithm for computation of 4-point linear convolution.
Figure 4. The signal flow graph of the proposed algorithm for computation of 5-point linear convolution.
Figure 5. The signal flow graph of the proposed algorithm for computation of 6-point linear convolution.
Figure 6. The signal flow graph of the proposed algorithm for computation of 7-point linear convolution.
Figure 7. The signal flow graph of the proposed algorithm for computation of 8-point linear convolution.
Table 1. Number of arithmetic units (multipliers "×" and adders "+") for fully parallel implementation of the short-length linear convolution algorithms.

| Length N | Naïve method "×" | Naïve method "+" | Proposed "×" | Proposed "+" | Reduction in "×" | Increase in "+" |
|---|---|---|---|---|---|---|
| 2 | 4 | 1 | 3 | 3 | 25% | 66.7% |
| 3 | 9 | 4 | 6 | 10 | 33.3% | 60% |
| 4 | 16 | 9 | 9 | 20 | 43.8% | 55% |
| 5 | 25 | 16 | 16 | 38 | 36% | 57.9% |
| 6 | 36 | 25 | 16 | 44 | 55.6% | 43.2% |
| 7 | 49 | 36 | 26 | 79 | 47% | 54.4% |
| 8 | 64 | 49 | 27 | 67 | 58% | 26.9% |
Table 2. Spartan-3 family devices whose hardwired multipliers suffice for a single-chip implementation of the linear convolution.

| Length N | Naïve method: type of device | Proposed solutions: type of device |
|---|---|---|
| 2 | XC3S50 | XC3S50AN |
| 3 | XC3S200 | XC3S200 |
| 4 | XC3S400 | XC3S200 |
| 5 | XC3S1400AN | XC3S200AN |
| 6 | XC3S2000 | XC3S200 |
| 7 | XC3S4000 | XC3S1400AN |
Author Contributions
Conceptualization, A.C.; methodology, A.C.; validation, J.P.P.; formal analysis, A.C. and J.P.P.; writing-original draft preparation, A.C.; writing-review and editing, J.P.P.; visualization, A.C. and J.P.P.; supervision, A.C. and J.P.P. Both authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Conflicts of Interest
The authors declare no conflict of interest.
1. Blahut, R.E. Fast Algorithms for Signal Processing; Cambridge University Press: Cambridge, UK, 2010.
2. Tolimieri, R.; An, M.; Lu, C. Algorithms for Discrete Fourier Transform and Convolution; Springer Science & Business Media: New York, NY, USA, 1989.
3. McClellan, J.H.; Rader, C.M. Number Theory in Digital Signal Processing; Prentice-Hall: Englewood Cliffs, NJ, USA, 1979.
4. Berg, L.; Nussbaumer, H. Fast Fourier Transform and Convolution Algorithms. Z. fur Angew. Math. Und Mech. 1982, 62, 282.
5. Burrus, C.S.; Parks, T. Convolution Algorithms; John Wiley and Sons: New York, NY, USA, 1985.
6. Krishna, H. Digital Signal Processing Algorithms: Number Theory, Convolution, Fast Fourier Transforms, and Applications; Routledge: New York, NY, USA, 2017.
7. Bi, G.; Zeng, Y. Transforms and Fast Algorithms for Signal Analysis and Representations; Springer Science & Business Media: New York, NY, USA, 2004.
8. Mallat, S.G. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Trans. Pattern Anal. Mach. Intell. 1989, 11, 674-693.
9. Parhi, K.K. VLSI Digital Signal Processing Systems: Design and Implementation; John Wiley & Sons: New York, NY, USA, 2007.
10. Chan, Y.H.; Siu, W.C. General approach for the realization of DCT/IDCT using convolutions. Signal Process. 1994, 37, 357-363.
11. Huang, T.S. Two-dimensional digital signal processing II. Transforms and median filters. In Two-Dimensional Digital Signal Processing II. Transforms and Median Filters; Topics in Applied Physics; Springer: Berlin/Heidelberg, Germany, 1981; Volume 43.
12. Pratt, W.K. Digital Image Processing; John Wiley & Sons: Hoboken, NJ, USA, 2007.
13. Wolberg, G.; Massalin, H. A fast algorithm for digital image scaling. In Communicating with Virtual Worlds; Springer: Tokyo, Japan, 1993; pp. 526-539.
14. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84-90.
15. Mathieu, M.; Henaff, M.; LeCun, Y. Fast training of convolutional networks through ffts. arXiv 2013, arXiv:1312.5851.
16. Lin, S.; Liu, N.; Nazemi, M.; Li, H.; Ding, C.; Wang, Y.; Pedram, M. FFT-based deep learning deployment in embedded systems. In Proceedings of the 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), Dresden, Germany, 19-23 March 2018; pp. 1045-1050.
17. Abtahi, T.; Shea, C.; Kulkarni, A.; Mohsenin, T. Accelerating convolutional neural network with fft on embedded hardware. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2018, 26, 1737-1749.
18. Lavin, A.; Gray, S. Fast algorithms for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27-30 June 2016; pp. 4013-4021.
19. Wang, X.; Wang, C.; Zhou, X. Work-in-progress: WinoNN: Optimising FPGA-based neural network accelerators using fast winograd algorithm. In Proceedings of the 2018 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ ISSS), Turin, Italy, 30 September-5 October 2018; pp. 1-2.
20. Cariow, A.; Cariowa, G. Minimal Filtering Algorithms for Convolutional Neural Networks. arXiv 2020, arXiv:2004.05607.
21. Wang, Y.; Parhi, K. Explicit Cook-Toom algorithm for linear convolution. In Proceedings of the 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing, Istanbul, Turkey, 5-9 June 2000; Volume 6, pp. 3279-3282.
22. Ju, C.; Solomonik, E. Derivation and Analysis of Fast Bilinear Algorithms for Convolution. arXiv 2019, arXiv:1910.13367.
23. Cariow, A. Strategies for the Synthesis of Fast Algorithms for the Computation of the Matrix-vector Products. J. Signal Process. Theory Appl. 2014, 3, 1-19.
24. Regalia, P.A.; Sanjit, M.K. Kronecker products, unitary matrices and signal processing applications. SIAM Rev. 1989, 31, 586-613.
25. Granata, J.; Conner, M.; Tolimieri, R. The tensor product: A mathematical programming language for FFTs and other fast DSP operations. IEEE Signal Process. Mag. 1992, 9, 40-48.
26. Wong, H.; Betz, V.; Rose, J. Comparing FPGA vs. custom CMOS and the impact on processor microarchitecture. In Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, Monterey, CA, USA, 27 February-1 March 2011; pp. 5-14.
27. Jhamb, M.; Lohani, H. Design, implementation and performance comparison of multiplier topologies in power-delay space. Eng. Sci. Technol. Int. J. 2016, 19, 355-363.
28. Oudjida, A.K.; Chaillet, N.; Berrandjia, M.L.; Liacha, A. A new high radix-2r (r ≥ 8) multibit recoding algorithm for large operand size (N ≥ 32) multipliers. J. Low Power Electron. 2013, 9, 50-62.
Aleksandr Cariow
* and
Janusz P. Paplinski
Faculty of Computer Science and Information Technology, West Pomeranian University of Technology, Żołnierska 49, 71-210 Szczecin, Poland
*Author to whom correspondence should be addressed.
© 2020. This work is licensed under http://creativecommons.org/licenses/by/3.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
In this article, we propose a set of efficient algorithmic solutions for computing short linear convolutions, focused on hardware implementation in VLSI. We consider convolutions for sequences of length N = 2, 3, 4, 5, 6, 7, and 8. Hardwired units that implement these algorithms can be used as building blocks when designing VLSI-based accelerators for more complex data processing systems. The proposed algorithms are focused on fully parallel hardware implementation, but compared to the naive approach to fully parallel hardware implementation, they require from 25% to about 60% fewer hardware multipliers, depending on the length N. Since the multiplier takes up a much larger area on the chip than the adder and consumes more power, the proposed algorithms are resource-efficient and energy-efficient in their hardware implementation.