University of Manchester
Comp30291: Digital Media Processing 2009-10
Section 8: Processing Speech, Music & Video
8.1. Digitising speech: Traditional telephone channels normally restrict speech to a bandwidth of
300 to 3400 Hz. This band-pass filtering is considered not to cause
serious loss of intelligibility or quality, although it is easily
demonstrated that the loss of signal power below 300 Hz and above 3400 Hz
has a significant effect on the naturalness of the sound. Once band-
limited in this way the speech may be sampled at 8 kHz, in theory without
incurring aliasing distortion. The main "CCITT standard" for digital
speech channels in traditional non-mobile telephony (i.e. in the "plain old
telephone service", POTS) is an allocation of 64000 bits/sec to
accommodate an 8 kHz sampling rate with each sample quantised to 8 bits
(8000 samples/sec x 8 bits = 64000 bits/sec). This standard is now officially
known as the "ITU-T G.711" speech coding standard. Since the bits are
transmitted by suitably shaped voltage pulses, this is called "pulse-code
modulation" (PCM).
Exercise: Why are the lower frequencies, i.e. those below 300 Hz, normally
removed?
8.1.1. International standards for speech coding: The CCITT, which stands for
"Comité Consultatif International Téléphonique et Télégraphique", was, until
1993, the international committee responsible for setting global
telecommunication standards. It was part of the "International
Telecommunication Union" (ITU), which was, and still is, a specialised agency
of the United Nations. Since 1993, the work of the CCITT has been carried out
by what is now referred to as the "ITU Telecommunication Standardization
Sector (ITU-T)". Within ITU-T are various "study groups", which include a
study group responsible for speech digitisation and coding standards.
With the advent of digital cellular radio telephony, a number of national
and international standardisation organisations have emerged for the
definition of all aspects of particular cellular mobile telephone systems
including the method of digitising speech. Among the organisations
defining standards for telecommunications and telephony the three main ones
are the following:
• "TCH-HS": part of the "European Telecommunications Standards
  Institute (ETSI)". This body originated as the "Groupe Spécial
  Mobile (GSM)" and is responsible for standards used by the European
  "GSM" digital cellular mobile telephone system.
• "TIA": the "Telecommunications Industry Association", the USA equivalent
  of ETSI.
• "RCR": the "Research and Development Centre for Radio Systems", the
  Japanese equivalent of ETSI.
Other telecommunications standardising organisations, generally with more
restricted or specialised ranges of responsibility, include the
"International Maritime Satellite Corporation (Inmarsat)" and committees
within NATO.
Standards exist for the digitisation of "wide-band" speech, band-limited
not from 300 Hz to 3.4 kHz but from 50 Hz to 7 kHz. Such speech bandwidths
give greater naturalness than that of normal telephone ("toll") quality
speech and are widely used for teleconferences. An example of such a
standard is the "ITU-T G.722" standard for operating at 64, 56 or 48 kb/s. To
achieve these reduced bit-rates with the wider speech bandwidth, fairly
sophisticated DSP "compression" techniques are required. A later version
of G.722 incorporating 24 kb/s and 16 kb/s requires even more
sophisticated DSP compression algorithms.
8.1.2. Uniform quantisation: Quantisation means that each sample of an
input signal x(t) is approximated by the closest of the available
"quantisation levels", which are the voltages for the binary numbers of
given word-length.
Uniform quantisation means that the difference in voltage between
successive quantisation levels, i.e. the step-size $\Delta$, is constant.
With an 8-bit word-length and input range $-V$ to $+V$, there will be 256 levels
with $\Delta = V/128$. If x(t) is between $\pm V$ and samples are rounded, uniform
quantisation produces error between $\pm\Delta/2$. For each sample with true value
x[n], the quantised value is x[n] + e[n], where e[n] is an error sample satisfying:
$$-\Delta/2 \le e[n] \le \Delta/2$$
If x(t) ever becomes larger than $+V$ or smaller than $-V$, overflow will occur
and the magnitude of the error may become much larger than $\Delta/2$. Overflow
should be avoided. Then the samples e[n] are generally unpredictable or
"random" within the range $-\Delta/2$ to $\Delta/2$. Under these circumstances, when the
quantised signal is converted back to an analogue signal, the effect of
these random samples is to add a random error or "quantisation noise"
signal to the original signal x(t). The quantisation noise would be heard
as sound added to the original signal. The samples e[n] may then be
assumed to have a uniform probability density function (pdf) between $-\Delta/2$
and $\Delta/2$. In this case, the pdf of e[n]
must be equal to $1/\Delta$ in the range $-\Delta/2$ to $\Delta/2$, and zero outside this range.
It may be shown that the mean square value of e[n] is:
$$\overline{e^2} = \int_{-\Delta/2}^{\Delta/2} e^2 \, \frac{1}{\Delta} \, de = \frac{\Delta^2}{12} \ \text{watts.}$$
This becomes the 'power' of the analogue quantisation error (quantisation
noise) in the frequency range 0 to fs/2 Hz, where fs is the sampling
frequency, normally 8 kHz for telephone speech.
8.1.3. Signal-to-quantisation noise ratio (SQNR): This is a measure of how
seriously a signal is degraded by quantisation noise. It is defined as:
$$\text{SQNR} = 10 \log_{10}\left(\frac{\text{signal power}}{\text{quantisation noise power}}\right) \ \text{dB}$$
With uniform quantisation, the quantisation-noise power in the range 0 to
fs/2 Hz is $\Delta^2/12$ and is independent of signal power. Therefore the SQNR
will depend on the power of the signal, and to maximise this, we should try
to amplify the signal to make it as large as possible without risking
overflow. It may be shown that when we do this for sinusoidal waveforms
with an m-bit uniform quantiser the SQNR will be approximately 6m + 1.8 dB.
We may assume this formula to be approximately true for speech.
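The 6m + 1.8 dB figure follows from the noise power above: a full-scale sinusoid has power $V^2/2$ and the step-size is $\Delta = 2V/2^m$, so the ratio of signal power to $\Delta^2/12$ is $1.5 \times 2^{2m}$, i.e. about 6.02m + 1.76 dB. The sketch below checks this numerically. It is only an illustration and not part of the original notes: it assumes a mid-rise quantiser (levels at odd multiples of $\Delta/2$, so the $2^m$ levels sit symmetrically inside $\pm V$), V = 1, and a sinusoid sampled at 1 radian per sample so that the phases never repeat. The function names are hypothetical.

```python
import math

def quantise_midrise(x, m, v=1.0):
    """Map x to the nearest of 2**m uniformly spaced levels inside -v..+v.

    Assumed mid-rise layout: levels at (k + 0.5) * delta, k = -2**(m-1) .. 2**(m-1) - 1.
    """
    delta = 2.0 * v / 2 ** m          # step-size, e.g. V/128 for m = 8
    half = 2 ** (m - 1)
    k = math.floor(x / delta)
    k = max(-half, min(half - 1, k))  # overflow saturates at the end levels
    return (k + 0.5) * delta

def sine_sqnr_db(m, amplitude=1.0, v=1.0, n=100_000):
    """Return (measured SQNR in dB, measured noise power) for a sampled sinusoid."""
    signal_power = 0.0
    noise_power = 0.0
    for i in range(n):
        x = amplitude * math.sin(float(i))   # 1 radian per sample
        e = quantise_midrise(x, m, v) - x    # quantisation error sample e[n]
        signal_power += x * x
        noise_power += e * e
    sqnr_db = 10.0 * math.log10(signal_power / noise_power)
    return sqnr_db, noise_power / n

if __name__ == "__main__":
    m = 8
    sqnr_db, noise = sine_sqnr_db(m)
    delta = 2.0 / 2 ** m
    print("measured SQNR (dB):      ", round(sqnr_db, 2))
    print("predicted 6.02m + 1.76:  ", round(6.02 * m + 1.76, 2))
    print("noise power / (delta^2/12):", round(noise / (delta ** 2 / 12.0), 3))
```

For m = 8 the measured SQNR should land within a fraction of a dB of the prediction, and the measured noise power should be close to $\Delta^2/12$. Calling `sine_sqnr_db(8, amplitude=0.5)` shows the SQNR falling by roughly 6 dB when the signal amplitude is halved, since the noise power stays fixed.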
Difficulties can arise in trying to fix the degree of amplification to
accommodate telephone users with loud voices and also those with quiet
voices when the step-size $\Delta$ is fixed by the ADC. If the amplification
accommodates loud voices without overflow, the SQNR for quieter voices may
be too low. To make the SQNR acceptable for quiet voices we risk overflow for
loud voices. It is useful to know over what dynamic range of input powers
the SQNR will remain acceptable to users.
8.1.4. Dynamic Range:
$$\text{Dynamic range} = 10 \log_{10}\left(\frac{\text{maximum possible signal power}}{\text{minimum signal power giving acceptable SQNR}}\right) \ \text{dB}$$
Assume that for telephone speech to be acceptable, the SQNR must be at
least 30 dB. Assume also that speech waveforms are approximately
sinusoidal and that an 8-bit uniform quantiser is used. What is the dynamic
range of the speech power over which the SQNR will be acceptable?
$$\begin{aligned}
\text{Dynamic range} &= 10 \log_{10}\left(\frac{\text{max possible signal power}}{\Delta^2/12}\right) - 10 \log_{10}\left(\frac{\text{min power with acceptable SQNR}}{\Delta^2/12}\right) \\
&= \text{max possible SQNR (dB)} - \text{min acceptable SQNR (dB)} \\
&= (6m + 1.8) - 30 = 49.8 - 30 = 19.8 \ \text{dB.}
\end{aligned}$$
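The arithmetic above can be reproduced, and repeated for other word-lengths, with a few lines of code. This is a minimal sketch that simply applies the approximate 6m + 1.8 dB rule for a uniform quantiser. The function names are hypothetical, and the 12-bit case is an added illustration rather than a figure from the notes.

```python
def max_sqnr_db(m):
    """Approximate SQNR (dB) of a full-scale sinusoid with an m-bit uniform quantiser."""
    return 6.0 * m + 1.8

def dynamic_range_db(m, min_acceptable_sqnr_db):
    """Dynamic range (dB) = maximum possible SQNR minus minimum acceptable SQNR."""
    return max_sqnr_db(m) - min_acceptable_sqnr_db

if __name__ == "__main__":
    for m in (8, 12):
        # m = 8 reproduces the 19.8 dB worked out in the text
        print(m, "bits:", round(dynamic_range_db(m, 30.0), 1), "dB")
```

Each extra bit adds about 6 dB of dynamic range, so a uniform quantiser would need several more bits per sample to cover a substantially wider spread of speech levels.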
The dynamic-range calculation above is easy, but it only works for uniform
quantisation: just subtract the 'minimum acceptable SQNR' from the 'maximum
possible SQNR', in dB. The result, 19.8 dB, is a rather small dynamic range,
not really enough for telephony.
8.1.5. Instantaneous companding: Eight bits per sample is not sufficient
for good speech encoding (over the range of signal levels encountered in
telephony) if uniform quantisation is used. The problem lies with setting
a suitable quantisation step-size. If it is too large, small signal levels
will have SQNR below the limit of acceptability; if it is too small, large
signal levels will be distorted due to overflow. One solution is to use
instantaneous companding where the step-size between adjacent quantisation
levels is effectively adjusted according to the amplitude of the sample.
For larger amplitudes, larger step-sizes are used as illustrated in Fig
8.1.
[Fig 8.1]
This may be implemented by passing x(t) through a "compressor" to produce a
new signal y(t) which is then quantised uniformly and transmitted or stored
in digital form. At the receiver, the quantised samples of y(t) are passed
through an "expander" which reverses the effect of the compressor to
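produce an output signal close to the original x(t).
[Fig 8.2]
A common compressor uses a function which is linear for $|x(t)|$ close to zero
and logarithmic for larger values. A suitable formula, which accommodates
negative and positive values of x(t) in the range $-V$ to $+V$, is:
$$y(t) = \begin{cases} \dfrac{A}{K}\, x(t), & |x(t)| \le V/A \\[4pt] \operatorname{sign}(x(t))\, \dfrac{V}{K} \left(1 + \log_e\!\left(\dfrac{A\,|x(t)|}{V}\right)\right), & V/A < |x(t)| \le V \end{cases}$$
where sign(x(t)) = 1 when $x(t) \ge 0$ and $-1$ when $x(t) < 0$, $K = 1 + \log_e(A)$
and A is a constant. This is 'A-law companding', which is used in the UK with
A = 87.6 and $K = 1 + \log_e(A) = 5.473$. This value of A is chosen because it
makes $A/K = A/(1 + \log_e(A)) = 16$. If V = 1, which means that x(t) is assumed
to lie in the range $\pm 1$ volt, the 'A-law' formula with A = 87.6 becomes:
$$y(t) = \begin{cases} 16\, x(t), & |x(t)| \le 1/87.6 \\[4pt] \operatorname{sign}(x(t))\; 0.183 \left(1 + \log_e(87.6\,|x(t)|)\right), & 1/87.6 < |x(t)| \le 1 \end{cases}$$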
A graph of y(t) against x(t) would be difficult to draw with A = 87.6, so it
is shown below for the case where $A \approx 10$, making $K \approx 3$.
With A = 10, 10% of the range ($\pm 1$) for x(t), i.e. that between $\pm 1/A$, is
mapped to 33% of the range for y(t). When A = 87.6, approximately 1.14%
(100/87.6) of the domain of x(t) is linearly mapped onto approximately
18.27% of the range of y(t). The effect of the compressor is to amplify
'small' values of x(t), i.e. those between $\pm V/A$, so that they are quantised
more accurately. When A = 87.6, 'small' samples of x(t) are made 16 times
larger. The amplification for larger values of x(t) has to be reduced to
keep y(t) between $\pm 1$. The effect on the shape of a sine-wave and a
triangular wave is illustrated below.
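The compressor described above can be sketched in a few lines. This is an illustration under stated assumptions, not the notes' own implementation: it uses the standard A-law curve (linear below V/A, logarithmic above) with A = 87.6 and V = 1 by default, and the function names are hypothetical. The `a_law_expand` function is obtained here simply by algebraically inverting the compressor so that a round trip can be checked; it is not copied from the source.

```python
import math

A = 87.6                 # A-law constant quoted in the text
K = 1.0 + math.log(A)    # K = 1 + log_e(A), about 5.473

def a_law_compress(x, v=1.0):
    """Compress x (assumed within -v..+v): linear near zero, logarithmic elsewhere."""
    ax = abs(x)
    if ax <= v / A:
        y = (A / K) * ax                            # small samples: gain of A/K (about 16)
    else:
        y = (v / K) * (1.0 + math.log(A * ax / v))  # larger samples: reduced gain
    return math.copysign(y, x)

def a_law_expand(y, v=1.0):
    """Inverse of a_law_compress (derived by inverting the formula; an assumption)."""
    ay = abs(y)
    if ay <= v / K:
        x = (K / A) * ay
    else:
        x = (v / A) * math.exp(K * ay / v - 1.0)
    return math.copysign(x, y)

if __name__ == "__main__":
    print("small-signal gain A/K:", round(A / K, 3))
    print("linear region maps to fraction of output range:", round(1.0 / K, 4))
    for x in (0.005, 0.05, 0.5, 1.0):
        y = a_law_compress(x)
        print(x, "->", round(y, 4), "->", round(a_law_expand(y), 6))
```

Running it shows the small-signal gain of roughly 16 and the linear region ending at about 18.27% of the output range, matching the figures quoted in the text. In a full coder, y would be quantised uniformly between the two functions, which is what gives small samples their finer effective step-size.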
The expander formula, which reverses the effect of the 'A-law' compressor,
is as foll