General Codec Criteria Overview


CODEC FEATURES
To transport real-time voice over packet-switched networks such as the Internet, a codec must provide sampling, compression and packetization of the voice signal (speech coding). The dominant standard for transmitting multimedia in packet-switched networks is International Telecommunication Union (ITU) Recommendation H.323, which uses IP/UDP/RTP encapsulation for audio. The codec should also be able to transmit signaling information (DTMF tones); otherwise an additional transport protocol such as H.225.0 is required.


BIT RATE
The bit rate is a very important parameter in speech coder design. Because of the growing need to conserve bandwidth, coders should produce high quality even at low bit rates. Speech coded at 64 kilobits per second (kbit/s) using logarithmic pulse code modulation (PCM) is considered “non-compressed” and is often used as a reference for quality comparison. The standardized bit rates range from 2.4 kbit/s for secure telephony to 64 kbit/s for the G.711 PCM and G.722 (wideband, 7 kHz) speech coders.

TABLE 1

[Image: VoIP2_table1.gif]
Ref. [1], Figure 7
ITU, Cellular and secure telephony speech coding standards.

The overall bit rate to be transmitted is the sum of the coder-dependent payload bits and the header bits added by the protocols. Real-time Transport Protocol (RTP) headers provide the sequence number and timestamp information needed to reassemble a real-time stream from packets. The Voice over IP (VoIP) standards committee is proposing a subset of H.323 for audio over IP. The H.323 standard addresses video (audiovisual) communication on local area networks. The ratio of header overhead to payload decreases with increasing packet size. The achievable compression gain therefore also depends on this ratio.

[Image: VoIP12_fig1.jpg]
Ref. [12], Figure 1
Relative compression gain (20 ms of speech sampled at 8 kHz per packet, 40 octets of per-packet header overhead)
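
As a rough illustration of this overhead arithmetic, the sketch below computes the bit rate on the wire for a given codec rate. It assumes the 40 octets of per-packet header from the caption above (commonly 20 for IP, 8 for UDP and 12 for RTP) and 20 ms of speech per packet. The function name and parameters are illustrative only and do not come from any standard.

```python
def packet_bit_rate(codec_kbps, speech_ms=20.0, header_octets=40):
    """Return (total_kbps, header_to_payload_ratio) for one voice stream.

    Assumption: every packet carries speech_ms of coded speech plus a
    fixed header of header_octets (IP + UDP + RTP).
    """
    payload_octets = codec_kbps * speech_ms / 8.0   # kbit/s * ms = bits
    packets_per_second = 1000.0 / speech_ms
    total_kbps = (payload_octets + header_octets) * 8.0 * packets_per_second / 1000.0
    return total_kbps, header_octets / payload_octets


for rate in (64.0, 8.0):
    total, ratio = packet_bit_rate(rate)
    print(f"{rate:5.1f} kbit/s codec -> {total:5.1f} kbit/s on the wire, "
          f"header/payload = {ratio:.2f}")
```

Under these assumptions a 64 kbit/s stream grows to 80 kbit/s, while an 8 kbit/s stream grows to 24 kbit/s, so the header dominates at low codec rates unless more speech is packed into each packet.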

The bit rate is also a very important parameter for local users connected to the Internet via modem. Modems offer a bounded bit rate, typically 56 kbit/s.

Fixed vs. Variable Bit Rate
Most coders operate at the same rate regardless of the input. Any fixed-rate speech coder can be combined with a voice activity detector and made into a simple two-state variable-bit-rate system. Investigations show that an individual talker is actively speaking only about 40% of the time in a two-way conversation. More sophisticated systems, such as that used in IS-96, can operate at more than two rates.
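
A minimal sketch of what the two-state scheme means for the average rate, assuming the 40% speech activity quoted above. The function name and the idea of a configurable "inactive" rate (for example for comfort-noise updates) are assumptions made for illustration.

```python
def average_bit_rate(active_kbps, inactive_kbps=0.0, activity=0.4):
    """Average rate of a two-state (speech / silence) variable-rate coder.

    activity is the fraction of time the talker is actively speaking;
    inactive_kbps is whatever is still sent during silence (assumed 0 here).
    """
    if not 0.0 <= activity <= 1.0:
        raise ValueError("activity must be between 0 and 1")
    return activity * active_kbps + (1.0 - activity) * inactive_kbps


print(average_bit_rate(8.0))          # 8 kbit/s coder, nothing sent in silence
print(average_bit_rate(8.0, 1.0))     # same coder, 1 kbit/s during silence
```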



DELAY
Delay has a great impact on a coder's suitability for a particular application. We have to distinguish between real-time conversation and multimedia storage applications. The first case is considered in detail below because it is the most delay-sensitive application, whereas the second is the least sensitive.

Delay causes two problems, known in the literature as echo and talker overlap. Echo is caused by signal reflections at the far end during four-wire to two-wire hybrid conversion and by acoustic feedback. Echo becomes significant if the round-trip delay is greater than 50 ms. Echo cancellation algorithms try to reduce the annoying reflections to a certain level. If the time gap between the direct signal and its reflection is much greater than this threshold, even lower reflection levels are still annoying. Talker overlap arises if the one-way delay becomes greater than 250 ms. The conversation then becomes more like a half-duplex or push-to-talk experience than an ordinary conversation.
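
The two thresholds can be turned into a small check, sketched below. It assumes a symmetric path, so that the round-trip delay is simply twice the one-way delay; the constant and function names are hypothetical.

```python
ECHO_ROUND_TRIP_MS = 50.0          # echo becomes significant above this
TALKER_OVERLAP_ONE_WAY_MS = 250.0  # talker overlap arises above this


def delay_problems(one_way_ms):
    """List the delay problems to expect for a given one-way delay.

    Assumption: the return path has the same delay as the forward path.
    """
    problems = []
    if 2.0 * one_way_ms > ECHO_ROUND_TRIP_MS:
        problems.append("echo")
    if one_way_ms > TALKER_OVERLAP_ONE_WAY_MS:
        problems.append("talker overlap")
    return problems


for delay in (20.0, 100.0, 300.0):
    print(delay, delay_problems(delay))
```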

In the following, the different sources contributing to the overall one-way delay are considered and depicted in Fig. 2; reliable average values for different codecs are given in Table 2.

[Image: VoIP1_p20_fig5.jpg]
Ref. [1], Figure 5

TABLE 2

[Image: VoIP1_table1.gif]
Ref. [1], Table 1

Accumulation Delay
Time needed to collect a frame of voice samples.

Processing Delay
Caused by encoding the current frame and collecting the encoded samples into a packet for transmission. To reduce the packet network overhead, multiple encoded frames are collected into one packet (e.g. G.729). The lookahead delay, which depends on the codec algorithm, is also included.
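
The codec-side part of the delay (accumulation plus processing) can be estimated as sketched below. The 10 ms frame and 5 ms lookahead used in the example are the figures commonly quoted for G.729 and should be treated as assumptions here; the encoding time depends entirely on the implementation and is left as a parameter.

```python
def codec_delay_ms(frame_ms, frames_per_packet=1, lookahead_ms=0.0, encode_ms=0.0):
    """Estimate the sender-side delay before a packet can leave.

    frame_ms * frames_per_packet : time to accumulate the samples of a packet
    lookahead_ms                 : extra samples the algorithm needs to see
    encode_ms                    : implementation-dependent encoding time
    """
    accumulation_ms = frame_ms * frames_per_packet
    return accumulation_ms + lookahead_ms + encode_ms


# Assumed G.729-like parameters, two frames per packet, encoding time ignored.
print(codec_delay_ms(frame_ms=10.0, frames_per_packet=2, lookahead_ms=5.0))
```

Packing more frames into one packet lowers the header overhead but raises this delay in direct proportion.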

Network Delay
Caused by the physical medium and the transmission protocol used. Packet delay variations can be as high as 70 to 160 ms, which makes network delay the most significant contribution.

Buffer Delay (Jitter)
Because the network delay varies (depending on the current capacity and on processing during transmission), an additional buffer at the receiver side is necessary to remove packet arrival-time jitter. To minimize its contribution to the delay, the buffer should be as small as possible. One approach to adapting the jitter buffer size optimally is to count the packets that arrive late and form the ratio to those processed successfully. This ratio is then used to adjust the jitter buffer.
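
The ratio-based adaptation could look roughly like the sketch below. The class name, the step size, the depth limits and the two ratio thresholds are all hypothetical values chosen for illustration; the text does not specify them.

```python
class AdaptiveJitterBuffer:
    """Jitter buffer whose depth follows the late / on-time packet ratio."""

    def __init__(self, depth_ms=40.0, min_ms=20.0, max_ms=200.0,
                 step_ms=10.0, grow_above=0.05, shrink_below=0.01):
        self.depth_ms = depth_ms
        self.min_ms = min_ms
        self.max_ms = max_ms
        self.step_ms = step_ms
        self.grow_above = grow_above      # assumed threshold for growing
        self.shrink_below = shrink_below  # assumed threshold for shrinking
        self.late = 0
        self.on_time = 0

    def record(self, jitter_ms):
        """Count a packet as late if its extra delay exceeds the buffer depth."""
        if jitter_ms > self.depth_ms:
            self.late += 1
        else:
            self.on_time += 1

    def adapt(self):
        """Adjust the depth from the counted ratio, then reset the counters."""
        if self.late + self.on_time == 0:
            return self.depth_ms  # nothing observed, keep the current depth
        ratio = self.late / self.on_time if self.on_time else float("inf")
        if ratio > self.grow_above:
            self.depth_ms = min(self.max_ms, self.depth_ms + self.step_ms)
        elif ratio < self.shrink_below:
            self.depth_ms = max(self.min_ms, self.depth_ms - self.step_ms)
        self.late = self.on_time = 0
        return self.depth_ms


buf = AdaptiveJitterBuffer()
for jitter in (10.0, 20.0, 30.0, 60.0, 70.0):
    buf.record(jitter)
print(buf.adapt())
```

Growing quickly and shrinking cautiously is one possible design choice: a buffer that is too small costs lost packets, while one that is too large only costs delay.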



COMPLEXITY
The measures of complexity for a DSP and a CPU are somewhat different, owing to the nature of these two systems. Because DSP architectures differ, coder implementations differ in efficiency from one DSP to another. Computational complexity is the number of instructions per second required by an implementation and is usually expressed in millions of instructions per second (MIPS). A second measure is the required amount of memory (RAM). The required ROM storage (for program instructions and constants) is the third measure of complexity.


PERFORMANCE & QUALITY
This criterion is largely synonymous with intelligibility. The signal-to-noise ratio (SNR) is one quantity used to measure quality. The coder should also be capable of transmitting music or a combination of speech and other signals. Otherwise the problem of robustness to background noise may arise, and the performance of the speech coder degrades significantly. It is very difficult to verify compliance with a bitstream specification. Therefore it is most often done via subjective testing, comparing the implementation under test with a known version of the coder. The ITU includes the Speech Quality Experts Group, which measures speech quality and determines whether performance is sufficient for a given application.
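
As an illustration of the SNR measure mentioned above, the sketch below compares an original signal with its decoded version, treating the difference between the two as noise. It assumes both signals are time-aligned sample sequences of equal length; the function name is illustrative. SNR is only a coarse indicator for low-rate speech coders, which is one reason subjective testing is used.

```python
import math


def snr_db(original, decoded):
    """Signal-to-noise ratio in dB, with (original - decoded) as the noise."""
    if len(original) != len(decoded):
        raise ValueError("signals must have the same length")
    signal_energy = sum(s * s for s in original)
    noise_energy = sum((s - d) ** 2 for s, d in zip(original, decoded))
    if signal_energy == 0.0:
        raise ValueError("original signal is all zeros")
    if noise_energy == 0.0:
        return math.inf  # decoded signal is identical to the original
    return 10.0 * math.log10(signal_energy / noise_energy)


original = [1.0, -1.0, 1.0, -1.0]
decoded = [0.9, -0.9, 0.9, -0.9]
print(round(snr_db(original, decoded), 2))
```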

PACKET LOSS

The transport protocols of IP networks (RTP/UDP/IP) do not guarantee delivery of packets. There is no reliable Quality of Service (QoS); only a best-effort delivery service is offered. Two kinds of packet loss can be distinguished, namely single frame erasures and burst losses. While single packet losses are of little consequence, loss bursts can cause noticeable dropouts in the received signal. Different kinds of error concealment have been proposed. Forward error correction (FEC) schemes have been proposed to alleviate loss bursts of a small number of packets. For example, lost frames are reconstructed from other successfully received packets by transmitting redundant information: copies of the previous k frames are added to the packet containing frame n. If packet n-1 is lost and k is set to 2, frame n-1 can still be reconstructed from either packet n or packet n+1. Furthermore, there are investigations on the receiver side (e.g. adding silence, adding noise, repeating the last successfully received packet), on the transmitter side (e.g. the redundancy scheme above, transformations based on interleaving, adaptive packet size, adaptive variable bit rate, multiple description coding) and on joint source-channel solutions (heuristic information about packet loss in the transmission channel, source-related information) to overcome packet loss. The details of these approaches are out of scope here and will be presented elsewhere.

[Image: VoIP1_fig1.gif]
Ref. [1], Figure 7
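
The redundancy scheme described above, in which packet n also carries copies of the previous k frames, can be sketched as follows. The packet representation (a dictionary from frame index to frame data, with None marking a lost packet) and the function names are assumptions made for illustration and do not correspond to any particular payload format.

```python
def build_packets(frames, k=2):
    """Packet n carries frame n plus copies of up to k previous frames."""
    packets = []
    for n in range(len(frames)):
        first = max(0, n - k)
        packets.append({i: frames[i] for i in range(first, n + 1)})
    return packets


def recover_frames(received, frame_count):
    """Rebuild the frame sequence; received[n] is None if packet n was lost."""
    recovered = [None] * frame_count
    for packet in received:
        if packet is None:
            continue
        for index, frame in packet.items():
            recovered[index] = frame
    return recovered


frames = ["f0", "f1", "f2", "f3", "f4", "f5"]
packets = build_packets(frames, k=2)

# A burst of two lost packets is fully covered by the copies in packet 4.
received = [p if n not in (2, 3) else None for n, p in enumerate(packets)]
print(recover_frames(received, len(frames)))

# A burst of three lost packets exceeds k = 2, so frame 1 cannot be rebuilt.
received = [p if n not in (1, 2, 3) else None for n, p in enumerate(packets)]
print(recover_frames(received, len(frames)))
```

The price of this protection is bandwidth, since each packet is up to k + 1 times larger, and delay, since the receiver has to wait for later packets before it can fill a gap.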
