Discussion and Research on VLFFT Demonstration of TMS320C6678 Processor

This white paper analyzes the very large FFT (VLFFT) demonstration run on the TMS320C6678 processor. The demonstration measures a one-dimensional, single-precision floating-point FFT at sizes from 16K to 1024K points, executed on one, two, four, or eight DSP cores of the TMS320C6678. The results highlight the efficiency and scalability of the C66x architecture: execution time falls almost in proportion to the core count at two and four cores, and continues to fall, though less steeply, at eight cores.

The FFT algorithm is widely used in various applications such as medical imaging, communications, radar systems, and electronic warfare. In this demonstration, the TMS320C6678 was tested at a clock speed of 1 GHz, and it successfully completed a 1024K FFT in just 6.4 milliseconds when using all eight cores. This performance underscores the processor’s capability to handle high-throughput signal processing tasks efficiently.
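To put the 6.4 ms figure in perspective, the short sketch below converts it into a nominal floating-point rate. It assumes the conventional 5·N·log2(N) operation count for a complex FFT, which is a common benchmarking convention and not a number taken from the demo; the function name is illustrative only.

```python
import math

def nominal_fft_gflops(n_points: int, seconds: float) -> float:
    """Nominal GFLOPS, assuming the conventional 5*N*log2(N) FFT operation count."""
    flops = 5 * n_points * math.log2(n_points)
    return flops / seconds / 1e9

# 1024K-point FFT completed in 6.4 ms on eight cores (figures from the text).
n = 1024 * 1024
rate = nominal_fft_gflops(n, 6.4e-3)
print(f"{rate:.1f} GFLOPS nominal")
```

Under that assumption the run works out to roughly 16.4 nominal GFLOPS. The figure is only a rough yardstick, since the demo also moves all data to and from external memory.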

The TMS320C6678 is a system-on-chip (SoC) with eight C66x DSP cores running at up to 1.25 GHz. It delivers up to 160 GFLOPS while consuming less than 10 watts. Each core has 512 KB of L2 memory, and the device provides 8 MB of on-chip memory in total, 4 MB of which is shared among the cores; both memories are protected by error correction codes. Its DDR3 interface is 64 bits wide with 8-bit ECC, supports speeds up to 1600 Mbps, and can address up to 8 GB of external memory. The TMS320C6678 also supports high-speed interconnects such as PCIe, Serial RapidIO, Gigabit Ethernet, and TI’s HyperLink interface, which offers up to 50 Gbps of connectivity for multi-core and heterogeneous system integration.
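The headline numbers above can be cross-checked with a little arithmetic. The sketch below derives the per-core, per-cycle floating-point rate implied by the quoted 160 GFLOPS peak and rescales it to the 1 GHz clock used in the demo. It assumes peak throughput scales linearly with clock frequency and that the 512 KB L2 figure is per core; the variable names are illustrative.

```python
CORES = 8
MAX_CLOCK_GHZ = 1.25
PEAK_GFLOPS = 160.0          # quoted peak at the maximum clock

# Floating-point operations per cycle per core implied by the quoted peak.
flops_per_cycle_per_core = PEAK_GFLOPS / (CORES * MAX_CLOCK_GHZ)

def peak_gflops(clock_ghz: float, cores: int = CORES) -> float:
    """Theoretical peak, assuming throughput scales linearly with the clock."""
    return cores * clock_ghz * flops_per_cycle_per_core

# On-chip memory: per-core L2 plus the shared memory, in KB.
L2_PER_CORE_KB = 512
SHARED_KB = 4 * 1024
total_on_chip_mb = (CORES * L2_PER_CORE_KB + SHARED_KB) / 1024

print(flops_per_cycle_per_core)   # operations per cycle per core
print(peak_gflops(1.0))           # theoretical peak at the demo's 1 GHz clock
print(total_on_chip_mb)           # total on-chip memory in MB
```

With these inputs the implied rate is 16 operations per cycle per core, the theoretical peak at 1 GHz is 128 GFLOPS, and the two memory figures add up to the stated 8 MB.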

During the VLFFT demo, the TMS320C6678 operated at 1 GHz, with the DDR3 interface running at 1333 MHz. The demo involved loading data from external memory, performing computations on the DSP cores, and storing the results back to memory. Throughout the process, cycle counts and timing measurements were continuously monitored to evaluate performance.

The VLFFT algorithm expects its input data in external memory. The DSP cores process the data and write the results back to external memory. To optimize performance, the algorithm distributes the workload across multiple cores, taking advantage of the compute capability of the C66x architecture. The one-dimensional VLFFT is implemented as a two-dimensional computation based on decimation in time, in which a large N is factored as N = N1 × N2. The one-dimensional array is treated as a 2D array of N1 rows by N2 columns, and the FFT is computed as a set of column-wise transforms followed by a set of row-wise transforms.
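The decomposition can be illustrated with a small, self-contained sketch. It is not the demo's DSP code: it uses a direct O(n²) DFT for the small transforms, plain Python lists instead of DMA-managed buffers, and one particular index mapping (input index n = N2·r + c, output index k = k1 + N1·k2), all of which are assumptions made here for clarity.

```python
import cmath

def dft(x):
    """Direct O(n^2) DFT, used as the small transform and as a reference."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def fft_2d_decomposed(x, n1, n2):
    """Length n1*n2 DFT computed as column transforms, twiddles, row transforms."""
    n = n1 * n2
    assert len(x) == n
    # View x as n1 rows by n2 columns (row-major): element (r, c) is x[n2*r + c].
    # Step 1: length-n1 transform down each of the n2 columns.
    cols = [dft([x[n2 * r + c] for r in range(n1)]) for c in range(n2)]
    # Step 2: multiply element (k1, c) by the twiddle factor exp(-2*pi*i*c*k1/n).
    rows = [[cols[c][k1] * cmath.exp(-2j * cmath.pi * c * k1 / n)
             for c in range(n2)]
            for k1 in range(n1)]
    # Step 3: length-n2 transform along each of the n1 rows.
    rows = [dft(row) for row in rows]
    # Step 4: write out in transposed order, output index k = k1 + n1*k2.
    out = [0j] * n
    for k1 in range(n1):
        for k2 in range(n2):
            out[k1 + n1 * k2] = rows[k1][k2]
    return out

# Compare against the direct DFT on a small example (N = 32 = 4 x 8).
x = [complex(i % 7, -(i % 3)) for i in range(32)]
result = fft_2d_decomposed(x, 4, 8)
reference = dft(x)
max_err = max(abs(a - b) for a, b in zip(result, reference))
print(max_err < 1e-9)
```

The column transforms in step 1 are independent of one another, as are the row transforms in step 3, which is what makes the scheme easy to split across cores.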

This parallel FFT algorithm, known as the Takahashi algorithm, is well suited to multi-core execution. The first step computes an FFT down each of the N2 columns (N2 transforms of length N1) and multiplies the results by twiddle factors. The results are then stored and reorganized into a new 2D array, and the final step computes an FFT along each of the N1 rows (N1 transforms of length N2). Each core handles a share of the transforms, and the primary core manages synchronization and coordination among the others. Input data is prefetched from external memory into L2 SRAM by DMA, and results are written back to external memory after processing. Each core uses two DMA channels to move data between internal and external memory.
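The work split and the prefetch pattern can be sketched as follows. The text does not say how transforms are assigned to cores or how buffers are scheduled, so both functions are hypothetical: one hands each core a contiguous range of transforms, the other lists a ping-pong (double-buffer) order in which one block is computed while the next is fetched.

```python
def partition(num_ffts, num_cores):
    """Return one (start, stop) range per core, together covering range(num_ffts)."""
    base, extra = divmod(num_ffts, num_cores)
    ranges = []
    start = 0
    for core in range(num_cores):
        # The first `extra` cores take one additional transform each.
        count = base + (1 if core < extra else 0)
        ranges.append((start, start + count))
        start += count
    return ranges

def ping_pong_schedule(num_blocks):
    """Yield (compute_block, prefetch_block, buffer_index) for double buffering.

    While block b is processed in one buffer, block b + 1 (if any) is fetched
    into the other buffer. This scheme is an assumption, not taken from the demo.
    """
    for b in range(num_blocks):
        nxt = b + 1 if b + 1 < num_blocks else None
        yield b, nxt, b % 2

# Hypothetical split of 1024 column transforms over 1, 2, 4, and 8 cores.
for cores in (1, 2, 4, 8):
    print(cores, partition(1024, cores)[0])

print(list(ping_pong_schedule(3)))
```

Contiguous ranges keep each core's memory accesses regular, which suits block transfers by DMA; other assignments, such as interleaving, would serve equally well for illustration.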

The results show that adding cores significantly reduces FFT execution time. With two cores instead of one, execution time fell by an average of 49.3%, close to the ideal 50% reduction. With four cores it fell by 72.5% (ideal: 75%), and with eight cores by 81.6% (ideal: 87.5%). These results confirm that the multi-core architecture delivers high performance and uses its resources efficiently, although the gain per added core diminishes at eight cores.
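The reductions quoted above translate directly into speedup and parallel efficiency. The sketch below does the conversion, taking the single-core run as the baseline; the function name is illustrative.

```python
def speedup_and_efficiency(time_reduction: float, cores: int):
    """Convert a fractional time reduction into (speedup, parallel efficiency)."""
    speedup = 1.0 / (1.0 - time_reduction)
    return speedup, speedup / cores

# Average reductions relative to one core, as quoted in the text.
results = {}
for cores, reduction in ((2, 0.493), (4, 0.725), (8, 0.816)):
    s, e = speedup_and_efficiency(reduction, cores)
    results[cores] = (s, e)
    print(f"{cores} cores: {s:.2f}x speedup, {e:.0%} efficiency")
```

This gives speedups of about 1.97x, 3.64x, and 5.43x, corresponding to parallel efficiencies of roughly 99%, 91%, and 68%.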

Figure 1: Block diagram of TMS320C6678

Table 1: FFT results, in milliseconds, on one, two, four, and eight DSP cores
