A 16-POINT FPGA-ORIENTED PARALLEL PIPELINED FFT PROCESSOR

(draft)

Abstract : An algorithm well suited for implementation on an FPGA is developed. The algorithm distributes computation and memory requirements evenly among the processors. It integrates the RAM onto the processor itself to achieve higher speed. It also reduces the width of the interconnections between the different stages of the FFT.

INTRODUCTION

Parallel pipelined FFT processors are employed to meet the growing demand for high processing rates. Highly parallel implementations obtain high computation rates but require the simultaneous distribution of all data samples. The high rate of data distribution required to keep the processors busy is impossible to achieve, especially in real-time applications involving word-serial data. This problem, coupled with the limited I/O resources of FPGAs, makes the fully parallel algorithm inefficient.

The cascade FFT computes one transform in O(N) processing cycles, producing the output sequentially at the input data rate. The cascade FFT is therefore ideally suited to 1-dimensional real-time signal processing where data arrival is word-serial in nature. However, the cascade FFT requires registers organised as shift registers between the butterfly computation units, with depths varying from 2 to N/2 words. Jones and Sorenson proposed an algorithm for overcoming this in evaluating their multiprocessor architecture for the FFT. Yu Tai Ma proposed an algorithm to reduce the number of twiddle factors in each processor to a minimum. These ideas, with some modifications, are used to design a 16-point FFT processor using multiple FPGAs.

A PARALLEL FFT ALGORITHM

The N-point DFT is defined as

$$X_k = \sum_{m=0}^{N-1} x_m W_N^{mk}, \qquad k = 0, 1, \ldots, N-1 \tag{1}$$

where

$$W_N^{mk} = \exp\left(-j \frac{2\pi}{N} m k\right)$$

Let

$$N = 2^{n}, \qquad N = N_1 N_2, \qquad N_1 = 2^{n_1}, \qquad N_2 = 2^{n_2}$$

$$m = N_1 m_2 + m_1, \qquad k = N_2 k_1 + k_2$$

$$m_1, k_1 = 0, 1, 2, \ldots, N_1 - 1$$

$$m_2, k_2 = 0, 1, 2, \ldots, N_2 - 1$$

Then eq.(1) can be expressed as

$$X_{N_2 k_1 + k_2} = \sum_{m_1=0}^{N_1-1} W_{N_1 N_2}^{m_1 (N_2 k_1 + k_2)} \sum_{m_2=0}^{N_2-1} x_{N_1 m_2 + m_1} W_{N_2}^{m_2 k_2} \tag{2}$$
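The step from eq. (1) to eq. (2) follows from substituting the two index maps into the exponent. As a short worked check, using only the definitions above:

$$mk = (N_1 m_2 + m_1)(N_2 k_1 + k_2) = N_1 N_2 m_2 k_1 + N_1 m_2 k_2 + m_1 (N_2 k_1 + k_2)$$

$$W_N^{mk} = W_N^{N m_2 k_1} \, W_N^{N_1 m_2 k_2} \, W_N^{m_1 (N_2 k_1 + k_2)} = W_{N_2}^{m_2 k_2} \, W_{N_1 N_2}^{m_1 (N_2 k_1 + k_2)}$$

The first factor drops out because $W_N^{N} = 1$, and the second reduces because $W_N^{N_1 e} = W_{N_2}^{e}$ when $N = N_1 N_2$.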

Eq. (2) can be further expressed as the two steps

$$Y_{N_1 k_2 + m_1} = \sum_{m_2=0}^{N_2-1} x_{N_1 m_2 + m_1} W_{N_2}^{m_2 k_2} \tag{3}$$

$$X_{N_2 k_1 + k_2} = \sum_{m_1=0}^{N_1-1} Y_{N_1 k_2 + m_1} W_{N_1 N_2}^{m_1 (N_2 k_1 + k_2)} \tag{4}$$

Eq. (3) can be computed easily through the FFT algorithm, and eq. (4) can be computed in much the same way as the standard FFT algorithm. The flow for a 16-point FFT using this algorithm is shown in Fig. 1.
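The two-step form above (the inner sum producing Y, then the outer sum producing X) can be checked numerically. The following is a minimal sketch in plain Python, not the FPGA implementation: the function names `W`, `direct_dft` and `two_step_dft` are illustrative, and the sums are evaluated directly rather than with butterflies.

```python
import cmath

def W(n, e):
    """Twiddle factor W_n^e = exp(-j*2*pi*e/n)."""
    return cmath.exp(-2j * cmath.pi * e / n)

def direct_dft(x):
    """Reference N-point DFT, evaluated straight from eq. (1)."""
    N = len(x)
    return [sum(x[m] * W(N, m * k) for m in range(N)) for k in range(N)]

def two_step_dft(x, N1, N2):
    """DFT of length N1*N2 via the Y / X decomposition."""
    N = N1 * N2
    assert len(x) == N
    # Step 1: for each m1, a length-N2 DFT over m2 gives Y[N1*k2 + m1].
    Y = [0j] * N
    for m1 in range(N1):
        for k2 in range(N2):
            Y[N1 * k2 + m1] = sum(
                x[N1 * m2 + m1] * W(N2, m2 * k2) for m2 in range(N2)
            )
    # Step 2: combine over m1 with the twiddles W_N^{m1*(N2*k1 + k2)}.
    X = [0j] * N
    for k1 in range(N1):
        for k2 in range(N2):
            X[N2 * k1 + k2] = sum(
                Y[N1 * k2 + m1] * W(N, m1 * (N2 * k1 + k2)) for m1 in range(N1)
            )
    return X

# 16-point case with N1 = N2 = 4, on an arbitrary test signal.
x = [complex(m % 5, -m) for m in range(16)]
max_err = max(abs(a - b) for a, b in zip(two_step_dft(x, 4, 4), direct_dft(x)))
print("max deviation from direct DFT:", max_err)
```

With N1 = N2 = 4, step 1 is four independent length-4 transforms (one per m1) and step 2 is again four independent length-4 combinations (one per k2).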

Fig.1. Flow diagram for 16-point FFT

Fig.2 : Division of computations among the FPGAs

The division of computations among the FPGAs is shown in Fig. 2. The first 2 stages of the radix-2 decimation-in-time algorithm consist of 4 independent length-4 transforms and form the first part of the FFT computation. The last 2 stages, again 4 independent length-4 transforms, form the second part. There are 4 processors at each pipeline stage, and each FFT processor computes a 4-point FFT. The reduced twiddle factor tables stored are shown in Table 1.

Table 1 : Reduced twiddle factor tables stored in the FFT processors at the second pipeline stage.

PROCESSOR NO.    0       1       2       3

0                --      --      --      --
1                W₁₆⁰    W₁₆¹    W₁₆²    W₁₆³
2                W₁₆⁰    W₁₆²    W₁₆⁴    W₁₆⁶
3                W₁₆⁴    W₁₆⁵    W₁₆⁶    W₁₆⁷
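For reference, the numerical values of the 16-point twiddle factors with exponents 0 to 7 (the only exponents that appear in Table 1) can be tabulated as below. This is a small illustrative sketch; the helper name `twiddle` is hypothetical and floating-point values are used, whereas a ROM would hold quantised fixed-point words whose format is not specified here.

```python
import cmath

def twiddle(exponent, n=16):
    """Return W_n^exponent = exp(-j*2*pi*exponent/n)."""
    return cmath.exp(-2j * cmath.pi * exponent / n)

# Exponents 0..7 cover every entry listed in Table 1.
table = {e: twiddle(e) for e in range(8)}
for e, w in table.items():
    print(f"W_16^{e} = {w.real:+.4f} {w.imag:+.4f}j")
```

Note that each factor with exponent e + 4 equals -j times the factor with exponent e, since W_16^4 = -j.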

PROCESSORS

The processors used in the two pipeline stages incorporate both the twiddle factor ROM and the data RAM on chip. The look-up-table-based logic in some commercial FPGAs makes this idea all the more efficient. Each processor is structured internally as shown in Fig. 3. The single input and output buses in each processor are time-shared between the different butterfly computation units for I/O of the samples and transforms.

Fig.3: Processor organisation and internal structure of FFT processors.

There is a single input [x(m)], output [X(k)] and transfer [Y(N₁k₂+m₁)] bus, each time-shared between the processors. To prevent bus contention, data transfers are scheduled as in Fig. 4.

output bus        transfer bus            input bus

X₁ from q2        y₄ from p0 to q2        x₀ to p0
X₉ from q2        y₅ from p1 to q2        x₁ to p1
X₅ from q2        y₆ from p2 to q2        x₂ to p2
X₁₃ from q2       y₇ from p3 to q2        x₃ to p3

X₃ from q3        y₁₂ from p0 to q3       x₄ to p0
X₁₁ from q3       y₁₃ from p1 to q3       x₅ to p1
X₇ from q3        y₁₄ from p2 to q3       x₆ to p2
X₁₅ from q3       y₁₅ from p3 to q3       x₇ to p3

X₀ from q0        y₀ from p0 to q0        x₈ to p0
X₈ from q0        y₁ from p1 to q0        x₉ to p1
X₄ from q0        y₂ from p2 to q0        x₁₀ to p2
X₁₂ from q0       y₃ from p3 to q0        x₁₁ to p3

X₂ from q1        y₈ from p0 to q1        x₁₂ to p0
X₁₀ from q1       y₉ from p1 to q1        x₁₃ to p1
X₆ from q1        y₁₀ from p2 to q1       x₁₄ to p2
X₁₄ from q1       y₁₁ from p3 to q1       x₁₅ to p3

Fig. 4 : Synchronisation of data transfers for time sharing the buses.
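The entries of Fig. 4 follow a regular pattern that can be regenerated from a single slot counter. The sketch below is an illustration under assumptions read off the figure, not the controller design itself: the slot index `t`, the helper `bitrev2`, and the rule q = (block + 2) mod 4 are inferred from the listed transfers rather than stated in the text.

```python
def bitrev2(i):
    """Reverse the two bits of i in 0..3 (0->0, 1->2, 2->1, 3->3)."""
    return ((i & 1) << 1) | (i >> 1)

def schedule():
    """Regenerate the 16 rows of Fig. 4, one dict per time slot."""
    rows = []
    for t in range(16):
        block, i = divmod(t, 4)   # block of four slots, position within it
        q = (block + 2) % 4       # assumption: pattern inferred from Fig. 4
        k2 = bitrev2(q)           # second-stage processor q handles this k2
        k1 = bitrev2(i)           # outputs leave each q in bit-reversed k1 order
        rows.append({
            "output": f"X{4 * k1 + k2} from q{q}",
            "transfer": f"y{4 * k2 + i} from p{i} to q{q}",
            "input": f"x{t} to p{i}",
        })
    return rows

for row in schedule():
    print(f'{row["output"]:<14}{row["transfer"]:<22}{row["input"]}')
```

In every slot the three buses address the same first-stage processor index i and the same second-stage processor q, which is consistent with driving all three transfers from one set of control signals.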

A controller generates the control signals for the data transfers: the write enables to the input registers and the mux select signals at the output of each processor. To reduce the number of control signal lines the controller must generate, all three transfers (input, output and intermediate) are done with the same signals.

(incomplete)