Optimization and Implementation of G.729 Speech Codec Algorithm Based on TMS320C5416

With the rapid development of multimedia and network technology, the volume of information to be transmitted has grown quickly, making channel resources increasingly valuable. To transmit as much information as possible over limited channel resources, speech compression has become a necessity. In 1996 the ITU (International Telecommunication Union) adopted Recommendation G.729, which specifies the Conjugate-Structure Algebraic-Code-Excited Linear Prediction (CS-ACELP) coding algorithm. Its coding rate is 8 kb/s, which meets the requirements of network communication while providing good voice quality and strong adaptability to different application environments. It is a high-performance international standard for speech compression and is widely used in personal mobile communications, satellite communications and other fields.
1 Principle of the G.729 codec algorithm
Waveform coding of a speech signal forces the reconstructed waveform to keep the shape of the original speech waveform. Such coders usually treat speech as a general waveform signal, which gives them strong adaptability and good speech quality but requires a high coding rate. Parametric coding reduces the coding rate by extracting and encoding characteristic parameters of the speech signal. It tries to preserve the semantic content of the original speech as far as possible, and the reconstructed waveform may differ considerably from the original one. From the mid-1970s, and especially since the 1980s, speech coding technology made breakthroughs that produced some very effective methods, such as hybrid coding. Hybrid coding overcomes the weaknesses of the original waveform coders and vocoders and combines their respective strengths to obtain high-quality synthesized speech at rates of 4 kb/s to 16 kb/s, and in essence it also retains the advantages of waveform coding. The CELP structure used by the CS-ACELP (Conjugate-Structure Algebraic-Code-Excited Linear Prediction) vocoder specified in G.729 belongs to this type of coder.
CELP coding is based on an analysis-by-synthesis (AbS) search procedure, perceptually weighted vector quantization (VQ) and linear prediction (LP); together these techniques greatly reduce the transmitted bit rate. CS-ACELP derives from conjugate-structure code-excited linear prediction (CS-CELP) and algebraic code-excited linear prediction (ACELP). The encoder mainly performs four steps: quantization of the line spectrum pair (LSP) parameters, pitch analysis, fixed codebook search, and gain quantization. It first preprocesses the input signal (16-bit PCM sampled at 8 kHz), then performs linear prediction analysis on each frame to obtain the LPC coefficients, converts them into LSP parameters, and vector-quantizes the LSP parameters. In the pitch analysis that follows, a candidate for the optimal pitch delay T is first found once per frame, and the optimal pitch delay is then searched in each subframe around that candidate. Finally, the adaptive codebook gain and the fixed codebook gain are quantized. At the decoder, the parameter indices are first extracted from the received bit stream and decoded to obtain the coding parameters of one 10 ms speech frame. The decoder interpolates the LSP coefficients in each subframe and converts them into LP filter coefficients, and then performs excitation generation, speech synthesis, and post-processing.
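The frame arithmetic implied by these figures can be checked with a short sketch. The constant and function names below are illustrative and are not taken from the G.729 reference code; the only inputs are values stated in the text (8 kHz sampling, 16-bit PCM, 10 ms frames, 8 kb/s coded rate).

```c
#include <stdint.h>

/* Figures stated in the text; the names are illustrative only. */
enum {
    SAMPLE_RATE_HZ  = 8000,  /* input sampling rate */
    BITS_PER_SAMPLE = 16,    /* linear PCM word length */
    FRAME_MS        = 10,    /* G.729 frame duration */
    CODED_RATE_BPS  = 8000   /* 8 kb/s coded bit rate */
};

/* Number of PCM samples in one frame. */
static int32_t samples_per_frame(void)
{
    return (int32_t)SAMPLE_RATE_HZ * FRAME_MS / 1000;
}

/* Bits of raw PCM entering the encoder per frame. */
static int32_t pcm_bits_per_frame(void)
{
    return samples_per_frame() * BITS_PER_SAMPLE;
}

/* Bits of coded parameters leaving the encoder per frame. */
static int32_t coded_bits_per_frame(void)
{
    return (int32_t)CODED_RATE_BPS * FRAME_MS / 1000;
}
```

With these values, one frame holds 80 samples, that is 1280 bits of PCM, and is coded into 80 bits, a compression ratio of 16:1.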
2 Algorithm optimization and DSP application improvement
A G.729 speech codec system has strict real-time requirements: the specified processing of the external input signal must be completed within a limited time, that is, the signal processing rate must be at least equal to the rate at which the input signal is updated. The algorithm therefore has to be optimized and improved. The code written in C is optimized, and intrinsics and embedded assembly statements are used in the C program to maximize the signal processing speed.
2.1 Optimization of the algorithm
First, the algorithm itself is improved. As shown in Figure 1, a CS-ACELP speech coding algorithm is used that combines the WD-LSP (Weighted Delta-LSP) function ^[1] with a fast sub-optimal partial codebook search and a perceptual weighting filter based on a psychoacoustic model, so that the computational complexity is reduced without degrading the speech quality. The WD-LSP function is mainly used to detect UV-V (unvoiced-voiced) and SV (silence-voiced) boundaries. The principle is as follows: if the function value is greater than the given limit value η, the open-loop pitch delay Top is re-estimated; otherwise, Top is set to the adaptive codebook delay of the previous frame. The WD-LSP function F_i of the i-th frame and the algorithm used to determine the open-loop pitch delay Top are as follows:

Here LSP_i(k) is the k-th order LSP coefficient of the i-th frame, and w_k is a weighting coefficient that makes the WD-LSP function more sensitive at UV-V/SV boundaries. To obtain w_k, a large database containing 23 014 UV-V boundaries and 9 519 SV boundaries was used to estimate the root-mean-square (RMS) value of the delta-LSP at the UV-V/SV boundaries. As a result, WD-LSP is very sensitive in detecting UV-V/SV boundaries. The limit value η is set to 0.01. Overall, this method saves 21% of the computation. The speech signal before and after applying the algorithm is shown in Figure 2.
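The decision rule can be illustrated with a minimal sketch. It rests on a loud assumption: the WD-LSP value is taken here to be a weighted sum of absolute frame-to-frame LSP differences, which may not match the exact expression in the cited work. The function names, the array layout and the constant names are hypothetical; only the threshold of 0.01 and the re-estimate/reuse rule come from the text.

```c
#define LSP_ORDER  10     /* G.729 uses a 10th-order LP filter */
#define WD_LSP_ETA 0.01   /* limit value quoted in the text */

/* ASSUMPTION: F_i is modelled as a weighted sum of absolute
 * frame-to-frame LSP differences. The original formula may differ. */
static double wd_lsp(const double lsp_cur[LSP_ORDER],
                     const double lsp_prev[LSP_ORDER],
                     const double w[LSP_ORDER])
{
    double f = 0.0;
    for (int k = 0; k < LSP_ORDER; k++) {
        double d = lsp_cur[k] - lsp_prev[k];
        f += w[k] * (d < 0.0 ? -d : d);
    }
    return f;
}

/* Decision rule from the text: the open-loop pitch search is run again
 * only when F_i exceeds the limit value. Returns 1 if Top must be
 * re-estimated, 0 if the previous adaptive-codebook delay is reused. */
static int needs_open_loop_search(double f_i)
{
    return f_i > WD_LSP_ETA;
}
```

A caller would run the open-loop search when needs_open_loop_search() returns 1 and otherwise copy the previous frame's adaptive codebook delay into Top, which is where the computational saving comes from.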

2.2 C language optimization
The vocoder based on the G.729 standard is ultimately implemented in real time on the fixed-point TMS320C5416. On a fixed-point device, fractional numbers are represented by fixing the binary point at a specific bit position, which is one of the limitations of the fixed-point TMS320C5416. Q formats are used to distinguish fractional numbers of different ranges. Different Q formats place the binary point at different positions, so the range of the integer part also differs. When two signed numbers are multiplied, an extra sign bit is generated. For example, when two Q4 numbers are multiplied, the raw product is in Q8 format with two sign bits; a left shift by one bit removes the extra sign bit and puts the product in Q9 format. If the FRCT bit of the DSP is set, this shift is performed automatically. A 16-bit by 16-bit multiplication yields a 32-bit product. When only a 16-bit result is required, the upper 16 bits of the 32-bit product are stored and the lower 16 bits are discarded. To maintain accuracy in chains of multiplications (such as convolution), the full 32-bit intermediate result should be kept throughout, and the lower 16 bits should be discarded only from the final result. Where single precision is not sufficient, a double-precision format is used; it is applied only in those cases, since 32-bit precision is not needed everywhere. Multiplying two 32-bit numbers requires only a 32-bit product rather than a 64-bit one. Because the TMS320C5416 is a 16-bit device, a 32-bit integer in the double-precision format is split into a high-order word and a low-order word. Both words are stored as signed 16-bit values so that they can be used directly in fast signed multiplications. Its format is as follows:
L_32 = (hi_word << 16) + (lo_word << 1)
hi_word = L_32 >> 16
lo_word = (L_32 - (hi_word << 16)) >> 1
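The split and recombination can be sketched in portable C as follows. The function names echo the L_Extract and L_Comp operators of the G.729 reference code, but the bodies are a plain-C illustration rather than the reference implementation, and they assume that right-shifting a negative value is an arithmetic shift (true of common compilers).

```c
#include <stdint.h>

/* Split a 32-bit value into the double-precision format:
 * hi holds the upper 16 bits and lo holds the next 15 bits, so that
 * L_32 == hi * 65536 + lo * 2 whenever bit 0 of L_32 is zero. */
static void l_extract(int32_t L_32, int16_t *hi, int16_t *lo)
{
    *hi = (int16_t)(L_32 >> 16);          /* assumes arithmetic shift */
    /* The remainder lies in 0..65535, so lo lies in 0..32767. */
    *lo = (int16_t)((L_32 - (int32_t)*hi * 65536) >> 1);
}

/* Recombine the two words; multiplication is used instead of a left
 * shift so that negative hi values stay well defined in C. */
static int32_t l_comp(int16_t hi, int16_t lo)
{
    return (int32_t)hi * 65536 + (int32_t)lo * 2;
}
```

Because lo keeps only 15 bits, the least significant bit of the original 32-bit value is dropped on a round trip; this is the price paid for keeping lo non-negative in a signed 16-bit word.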
An overflow occurs when the value in the accumulator exceeds the representable range. In the G.729 standard, the accumulator value is limited to the range 0x80000000 to 0x7FFFFFFF, that is, from the most negative to the largest positive 32-bit number. On the TMS320C5416, if the OVM bit in the ST1 status register is set, overflow is handled automatically by saturation.
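The saturation behaviour described here can be modelled in portable C as below. This is a sketch of what a saturating 32-bit addition computes, not the DSP's own mechanism; the function name is illustrative and a 64-bit intermediate is used only to make the overflow check simple.

```c
#include <stdint.h>

/* 32-bit addition that clamps to the range 0x80000000..0x7FFFFFFF
 * instead of wrapping around on overflow. */
static int32_t l_add_sat(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;

    if (sum > INT32_MAX)
        return INT32_MAX;   /* clamp to largest positive value */
    if (sum < INT32_MIN)
        return INT32_MIN;   /* clamp to most negative value */
    return (int32_t)sum;
}
```

On the DSP the same effect is obtained in hardware, so the explicit comparisons disappear from the inner loops.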
2.3 Application of intrinsics and embedded assembly statements in C programs
Owing to the nature of speech coding, the codec functions are built from a set of basic operations such as addition, subtraction, multiplication and division. These operations are defined in BASIC_OP.C and OPER_32B.C. Replacing these simple functions with intrinsics therefore yields a large gain for little effort. Intrinsics map directly to assembly instructions and are highly efficient. For example:
#define mult_r(var1, var2)     _smpyr(var1, var2)
#define L_add(L_var1, L_var2)  _lsadd(L_var1, L_var2)
#define L_mult(var1, var2)     _lsmpy(var1, var2)
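For reference, the fractional multiplications that such intrinsics replace can be written in portable C. The sketch below models the G.729 basic operators L_mult and mult_r on Q15 inputs; the _ref names are illustrative, the code is not the reference implementation, and mult_r_ref assumes that right-shifting a negative value is an arithmetic shift.

```c
#include <stdint.h>

/* 16 x 16 -> 32 fractional multiply. The raw product of two Q15 values
 * is Q30 with a duplicated sign bit; doubling it removes the extra sign
 * bit and gives a Q31 result. */
static int32_t l_mult_ref(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;

    if (p == 0x40000000)        /* only for (-1.0) * (-1.0) */
        return INT32_MAX;       /* saturate instead of overflowing */
    return p * 2;
}

/* 16 x 16 -> 16 fractional multiply with rounding. */
static int16_t mult_r_ref(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b + 0x4000;  /* add half an LSB */

    p >>= 15;                   /* assumes arithmetic shift */
    if (p > INT16_MAX)
        p = INT16_MAX;          /* saturate (-1.0) * (-1.0) */
    return (int16_t)p;
}
```

For example, multiplying 0.5 by 0.5 (16384 in Q15) gives 0.25, which is 0x20000000 in Q31 from l_mult_ref and 8192 in Q15 from mult_r_ref.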
Embedding assembly statements in a C program is relatively simple: put the assembly statement in quotation marks, enclose it in parentheses, and place the asm keyword before the parentheses, as in asm("assembly statement");. On the one hand, this makes it possible to implement in a C program hardware control functions that cannot be expressed in C, such as modifying the interrupt control registers, enabling or masking interrupts, and reading the status registers and the interrupt flag register. On the other hand, assembly can replace C in the critical parts of the program in order to optimize it. The disadvantage is that the C environment is easily corrupted, because the C compiler does not check or analyze embedded assembly statements when it compiles the program. When taking this approach, pay attention to the following points:
(1) Do not corrupt the C environment, because the C compiler does not check or analyze embedded assembly statements.
(2) Do not change the values of C variables in the assembly statements, and do not insert assembler directives that change the assembly environment.
On top of the simplified algorithm, the C optimizer provided by CCS is used to optimize the C code, together with intrinsics and assembly optimization.
3 G.729 implementation on TMS320C5416
3.1 TMS320C5416 architecture and application
The TMS320C5416 (hereinafter C5416) is a cost-effective general-purpose 16-bit fixed-point DSP recently introduced by TI. Its CPU core has the same basic structure as the rest of the TMS320C54x series. The single-instruction cycle time of the C5416 is 6.25 ns, so it executes up to 160 × 10^6 instructions per second. The instruction set is rich and includes many multi-function instructions, and the six-stage instruction pipeline makes the device well suited to implementing a low-delay G.729 vocoder. The device provides a 40-bit ALU, 128K × 16-bit on-chip RAM (64K words of DARAM and 64K words of SARAM), three independent 16-bit data memory buses, one program memory bus, three McBSPs, a six-channel DMA controller, one 8/16-bit enhanced parallel host-port interface, and two 16-bit timers.
On the TMS320C5416 system board, A/D and D/A conversion of the voice signal is performed by the PCM3002 codec. The PCM3002 uses two serial channels, one for controlling its internal registers and the other for data transfer. The default sampling rate of the speech signal on the board is 48 kHz; it is changed by modifying the internal control registers of the PCM3002. To meet the requirements of G.729 coding, the sampling rate is set to 8 000 Hz. So that the DSP can be used as fully as possible for signal processing, the sampled data is transferred to a DMA buffer through the McBSP and DMA. When the buffer is full, an interrupt is generated and the DSP reads the data from the DMA buffer for processing. The processed data is then sent to the DMA transmit buffer.
3.2 Implementation of G.729 on the TMS320C5416
G.729 processing uses the block processing technique shown in Figure 3. According to the G.729 standard, each block (frame) consists of 80 samples, so the first 80 samples are stored before processing begins. Two operations then proceed simultaneously: while the data in block L is being processed, the data of block L+1 is being stored.
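The overlap of storing and processing is commonly realized with two alternating frame buffers. The following is a minimal sketch of that idea under stated assumptions: the type and function names are hypothetical, and the DMA and interrupt setup of the actual board is not modelled. Only the frame length of 80 samples comes from the text.

```c
#include <stdint.h>
#include <string.h>

#define FRAME_LEN 80   /* samples per 10 ms frame at 8 kHz */

/* Two frame buffers: one is filled with block L+1 while the other,
 * holding block L, is processed. */
typedef struct {
    int16_t buf[2][FRAME_LEN];
    int     fill;               /* index of the buffer being filled */
} pingpong_t;

static void pingpong_init(pingpong_t *pp)
{
    memset(pp, 0, sizeof *pp);
}

/* Buffer that incoming samples should currently be written to. */
static int16_t *pingpong_fill_ptr(pingpong_t *pp)
{
    return pp->buf[pp->fill];
}

/* To be called when a full frame has arrived, for example from a
 * buffer-full interrupt. Swaps the roles of the two buffers and
 * returns the frame that is now ready for processing. */
static int16_t *pingpong_swap(pingpong_t *pp)
{
    int16_t *ready = pp->buf[pp->fill];
    pp->fill ^= 1;
    return ready;
}
```

The encoder works on the pointer returned by pingpong_swap() while new samples go to pingpong_fill_ptr(), so processing of one frame must finish within one frame period (10 ms) to avoid overwriting unprocessed data.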

In the G.729 software simulation, the most computationally intensive parts are the vector quantization of the LSP coefficients and the search of the excitation codebooks (adaptive codebook and fixed codebook). Together these two parts account for more than 60% of the total computation of the codec. The optimization therefore concentrates on the functions that dominate the computation, such as the fixed codebook search Acelp_Code_A(), the fractional pitch analysis pitch_fr3(), the open-loop pitch analysis pitch_ol_fast(), and the gain quantization Qua_gain(). Where optimizing the algorithm alone cannot meet the real-time requirement, the C optimizer provided by CCS is also used, together with intrinsics and assembly statements. After this processing, the output signal satisfies the communication requirements. The speed comparison of the main modules before and after optimization (Table 1) shows that the optimization of each main module is clearly effective. The amplitude-frequency plots of one frame of speech before and after processing (Figure 4) show that the processed speech retains good quality.

System operation is divided into four main processes: voice storage, data encoding (compression), data decoding (decompression), and voice playback. The input voice signal first passes through anti-aliasing filtering and analog-to-digital conversion, and is then collected by the DSP and stored in RAM; this is the voice storage process. The encoding program is then executed to compress the stored data and store the result; this is the encoding process. The data is then decoded and stored back in its original location. Finally, the DSP executes the output instructions and sends the decoded data to the digital-to-analog converter to produce the analog output.
Finally, the G.729 vocoder is implemented in real time on the C5416 and used to play speech-only files and files containing speech with background music in real time. Subjective tests of the reconstructed speech quality show that the restored speech retains the speaker's characteristics well and that the synthesized speech has good clarity and naturalness. The vocoder performance figures are as follows: the average number of clock cycles for encoding and decoding one frame is 1 010 350; at a CPU clock frequency of 160 MHz, the codec therefore requires about 6.31 ms per frame; the program RAM usage is 9.381 KB; the data and constant RAM usage is 7.146 KB. These figures indicate that the real-time implementation of the G.729 codec on the C5416 is well suited to teleconferencing, multimedia communications, and communication systems that use wideband speech coding.
