Efficient implementations of the Discrete Fourier Transform (DFT) for GPUs provide good performance with large data sizes, but are not competitive with CPU code for small data sizes. On the other hand, several applications require a large number of DFTs on small data sizes. In fact, even algorithms for large data sizes use a divide-and-conquer approach in which small DFTs eventually need to be performed. This thesis shows that, for small data sizes, the asymptotically slower matrix multiplication approach can outperform asymptotically faster algorithms on GPUs, owing to its regular memory access and computational patterns. Also discussed are the effects of different optimization techniques and the combination of the matrix multiplication algorithm with the mixed-radix algorithm for 2-D and 3-D complex DFTs. The resulting implementation performs up to 21 times faster than cuFFT on an NVIDIA GeForce 9800 GTX GPU and up to 5 times faster than FFTW on a CPU. Furthermore, this GPU implementation can accelerate a Quantum Monte Carlo application for which cuFFT is not effective. The primary contributions of this work lie in providing a GPU implementation that is efficient for small DFTs and in demonstrating the utility of the matrix multiplication approach.
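The matrix multiplication approach mentioned above can be sketched as follows. This is a minimal NumPy illustration, not the thesis's GPU implementation: an N-point DFT is computed as a dense N×N matrix-vector product, the O(N²) formulation whose regular access pattern maps well to GPU hardware for small N. The function names are illustrative only.

```python
import numpy as np

def dft_matrix(n):
    """Dense DFT matrix W with entries W[j, k] = exp(-2*pi*i*j*k / n)."""
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return np.exp(-2j * np.pi * j * k / n)

def dft_matmul(x):
    """Compute the DFT of a small vector as a plain matrix-vector product.

    On a GPU this multiply becomes a single regular, coalesced kernel,
    which is the source of the advantage for small sizes; NumPy stands
    in for that kernel here.
    """
    x = np.asarray(x, dtype=complex)
    return dft_matrix(len(x)) @ x
```

For a batch of many small transforms, the same idea extends to one batched matrix-matrix multiply (stacking the input vectors as columns), which is how the approach amortizes well on GPUs.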