TECHNICAL SPECIFICATION
Digital cellular telecommunications system (Phase 2+) (GSM);
Full rate speech;
Voice Activity Detector (VAD)
for full rate speech traffic channels
(3GPP TS 46.032 version 15.0.0 Release 15)
GLOBAL SYSTEM FOR
MOBILE COMMUNICATIONS
3GPP TS 46.032 version 15.0.0 Release 15 1 ETSI TS 146 032 V15.0.0 (2018-07)
Reference
RTS/TSGS-0446032vf00
Keywords
GSM
ETSI
650 Route des Lucioles
F-06921 Sophia Antipolis Cedex - FRANCE
Tel.: +33 4 92 94 42 00 Fax: +33 4 93 65 47 16
Siret N° 348 623 562 00017 - NAF 742 C
Association à but non lucratif enregistrée à la
Sous-Préfecture de Grasse (06) N° 7803/88
Important notice
The present document can be downloaded from:
The present document may be made available in electronic versions and/or in print. The content of any electronic and/or
print versions of the present document shall not be modified without the prior written authorization of ETSI. In case of any
existing or perceived difference in contents between such versions and/or in print, the only prevailing document is the
print of the Portable Document Format (PDF) version kept on a specific network drive within ETSI Secretariat.
Users of the present document should be aware that the document may be subject to revision or change of status.
Information on the current status of this and other ETSI documents is available at
If you find errors in the present document, please send your comment to one of the following services:
Copyright Notification
No part may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying
and microfilm except as authorized by written permission of ETSI.
The content of the PDF version shall not be modified without the written authorization of ETSI.
The copyright and the foregoing restriction extend to reproduction in all media.
© ETSI 2018.
All rights reserved.
DECT™, PLUGTESTS™, UMTS™ and the ETSI logo are trademarks of ETSI registered for the benefit of its Members.
3GPP™ and LTE™ are trademarks of ETSI registered for the benefit of its Members and
of the 3GPP Organizational Partners.
oneM2M logo is protected for the benefit of its Members.
GSM and the GSM logo are trademarks registered and owned by the GSM Association.
Intellectual Property Rights
Essential patents
IPRs essential or potentially essential to normative deliverables may have been declared to ETSI. The information
pertaining to these essential IPRs, if any, is publicly available for ETSI members and non-members, and can be found
in ETSI SR 000 314: "Intellectual Property Rights (IPRs); Essential, or potentially Essential, IPRs notified to ETSI in
respect of ETSI standards", which is available from the ETSI Secretariat. Latest updates are available on the ETSI Web
server (https://ipr.etsi.org/).
Pursuant to the ETSI IPR Policy, no investigation, including IPR searches, has been carried out by ETSI. No guarantee
can be given as to the existence of other IPRs not referenced in ETSI SR 000 314 (or the updates on the ETSI Web
server) which are, or may be, or may become, essential to the present document.
Trademarks
The present document may include trademarks and/or tradenames which are asserted and/or registered by their owners.
ETSI claims no ownership of these except for any which are indicated as being the property of ETSI, and conveys no
right to use or reproduce any trademark and/or tradename. Mention of those trademarks in the present document does
not constitute an endorsement by ETSI of products, services or organizations associated with those trademarks.
Foreword
This Technical Specification (TS) has been produced by ETSI 3rd Generation Partnership Project (3GPP).
The present document may refer to technical specifications or reports using their 3GPP identities, UMTS identities or
GSM identities. These should be interpreted as being references to the corresponding ETSI deliverables.
The cross reference between GSM, UMTS, 3GPP and ETSI identities can be found under
.
Modal verbs terminology
In the present document "shall", "shall not", "should", "should not", "may", "need not", "will", "will not", "can" and
"cannot" are to be interpreted as described in clause 3.2 of the ETSI Drafting Rules (Verbal forms for the expression of
provisions).
"must" and "must not" are NOT allowed in ETSI deliverables except when used in direct citation.
Contents
Intellectual Property Rights
Foreword
Modal verbs terminology
Foreword
1 Scope
2 References
3 Abbreviations
4 General
5 Functional description
5.1 Overview and principles of operation
5.2 Algorithm description
5.2.1 Adaptive filtering and energy computation
5.2.2 ACF averaging
5.2.3 Predictor values computation
5.2.4 Spectral comparison
5.2.5 Periodicity detection
5.2.6 Information tone detection
5.2.7 Threshold adaptation
5.2.8 VAD decision
5.2.9 VAD hangover addition
6 Computational details
6.1 Adaptive filtering and energy computation
6.2 ACF averaging
6.3 Predictor values computation
6.3.1 Schur recursion to compute reflection coefficients
6.3.2 Step-up procedure to obtain the aav1[0..8]
6.3.3 Computation of the rav1[0..8]
6.4 Spectral comparison
6.5 Periodicity detection
6.6 Threshold adaptation
6.7 VAD decision
6.8 VAD hangover addition
6.9 Periodicity updating
6.10 Tone detection
6.10.1 Windowing
6.10.2 Auto-correlation
6.10.3 Computation of the reflection coefficients
6.10.4 Filter coefficient calculation
6.10.5 Pole Frequency Test
6.10.6 Prediction gain test
7 Digital test sequences
7.1 Test configuration
7.2 Test sequences
Annex A (informative):
A.1 Simplified block filtering operation
A.2 Description of digital test sequences
A.2.1 Test sequences
A.2.2 File format description
A.3 VAD performance
A.4 Pole frequency calculation
Annex B (normative): Test sequences
Annex C (informative): Change history
History
Foreword
This Technical Specification has been produced by the 3rd Generation Partnership Project (3GPP).
The present document specifies the Voice Activity Detector (VAD) to be used in the Discontinuous Transmission
(DTX) for the digital cellular telecommunications system.
Archive en_300965v080000p0.zip which accompanies the present document, contains test sequences, as described in
clause A.2.
en_300965v080000p0.zip Annex B: Test sequences for the GSM Full Rate speech codec; Test sequences files
*.inp, *.cod, *.vad.
The specification from which the present document has been derived was originally based on CEPT documentation,
hence the presentation of the present document may not be entirely in accordance with the ETSI/PNE Rules.
The contents of the present document are subject to continuing work within the TSG and may change following formal
TSG approval. Should the TSG modify the contents of the present document, it will be re-released by the TSG with an
identifying change of release date and an increase in version number as follows:
Version x.y.z
where:
x the first digit:
1 presented to TSG for information;
2 presented to TSG for approval;
3 or greater indicates TSG approved document under change control.
y the second digit is incremented for all changes of substance, i.e. technical enhancements, corrections,
updates, etc.
z the third digit is incremented when editorial only changes have been incorporated in the document.
1 Scope
The present document specifies the Voice Activity Detector (VAD) to be used in the Discontinuous Transmission
(DTX) as described in GSM 06.31. It also specifies the test methods to be used to verify that a VAD complies with the
technical specification.
The requirements are mandatory on any VAD to be used either in the GSM Mobile Stations (MS)s or Base Station
Systems (BSS)s.
2 References
The following documents contain provisions which, through reference in this text, constitute provisions of the present
document.
• References are either specific (identified by date of publication, edition number, version number, etc.) or
non-specific.
• For a specific reference, subsequent revisions do not apply.
• For a non-specific reference, the latest version applies. In the case of a reference to a 3GPP document (including a
GSM document), a non-specific reference implicitly refers to the latest version of that document in the same
Release as the present document.
[1] GSM 01.04: "Digital cellular telecommunications system (Phase 2+); Abbreviations and
acronyms".
[2] GSM 06.10: "Digital cellular telecommunications system (Phase 2+); Full rate speech;
Transcoding".
[3] GSM 06.12: "Digital cellular telecommunications system (Phase 2+); Full rate speech; Comfort
noise aspect for full rate speech traffic channels".
[4] GSM 06.31: "Digital cellular telecommunications system (Phase 2+); Full rate speech;
Discontinuous Transmission (DTX) for full rate speech traffic channels".
3 Abbreviations
Abbreviations used in the present document are listed in GSM 01.04 [1].
4 General
The function of the VAD is to indicate whether each 20 ms frame produced by the speech encoder contains speech or
not. The output is a binary flag which is used by the TX DTX handler defined in GSM 06.31 [4].
The present document is organized as follows.
Clause 5 describes the principles of operation of the VAD.
In clause 6, the computational details necessary for the fixed point implementation of the VAD algorithm are given.
This clause uses the same notation as used for computational details in GSM 06.10.
The verification of the VAD is based on the use of digital test sequences. Clause 7 defines the input and output signals
and the test configuration, whereas the detailed description of the test sequences is contained in clause A.2.
The performance of the VAD algorithm is characterized by the amount of audible speech clipping it introduces and the
percentage activity it indicates. These characteristics for the VAD defined in the present document have been
established by extensive testing under a wide range of operating conditions. The results are summarized in clause A.3.
5 Functional description
The purpose of this clause is to give the reader an understanding of the principles of operation of the VAD, whereas the
detailed description is given in clause 6. In case of discrepancy between the two descriptions, the detailed description of
clause 6 shall prevail.
In the following subclauses of clause 5, a Pascal programming type of notation has been used to describe the algorithm.
5.1 Overview and principles of operation
The function of the VAD is to distinguish between noise with speech present and noise without speech present. The
main difficulty in detecting speech in a mobile environment is the very low speech/noise ratio which is often
encountered. The accuracy of the VAD is improved by using filtering to increase the speech/noise ratio before the
decision is made.
For a mobile environment, the worst speech/noise ratios are encountered in moving vehicles. It has been found that the
noise is relatively stationary for quite long periods in a mobile environment. It is therefore possible to use an adaptive
filter with coefficients obtained during noise, to remove much of the vehicle noise.
The VAD is basically an energy detector. The energy of the filtered signal is compared with a threshold; speech is
indicated whenever the threshold is exceeded.
The noise encountered in mobile environments may be constantly changing in level. The spectrum of the noise can also
change, and varies greatly over different vehicles. Because of these changes the VAD threshold and adaptive filter
coefficients must be constantly adapted. To give reliable detection the threshold must be sufficiently above the noise
level to avoid noise being identified as speech but not so far above it that low level parts of speech are identified as
noise. The threshold and the adaptive filter coefficients are only updated when speech is not present. It is, of course,
potentially dangerous for a VAD to update these values on the basis of its own decision. This adaptation therefore only
occurs when the signal seems stationary in the frequency domain but does not have the pitch component inherent in
voiced speech. A tone detector is also used to prevent adaptation during information tones.
A further mechanism is used to ensure that low level noise (which is often not stationary over long periods) is not
detected as speech. Here, an additional fixed threshold is used.
A VAD hangover period is used to eliminate mid-burst clipping of low level speech. Hangover is only added to
speech-bursts which exceed a certain duration to avoid extending noise spikes.
5.2 Algorithm description
The block diagram of the VAD algorithm is shown in figure 2.1. The individual blocks are described in the following
subclauses. ACF, N and sof are calculated in the speech encoder.
[Block diagram. Blocks: adaptive filtering and energy computation, VAD decision, VAD hangover addition, periodicity
detection, threshold adaptation, tone detection, predictor values computation, spectral comparison, ACF averaging.
Signals: ACF, N, sof, pvad, vvad, vad, rvad, thvad, ptch, stat, tone, rav1, av1, av0.]
Figure 2.1: Functional block diagram of the VAD
The global variables shown in the block diagram are described as follows:
- ACF are auto-correlation coefficients which are calculated in the speech encoder defined in GSM 06.10
(subclause 3.1.4, see also clause A.1). The inputs to the speech encoder are 16 bit 2's complement numbers, as
described in GSM 06.10, subclause 4.2.0;
- av0 and av1 are averaged ACF vectors;
- rav1 are autocorrelated predictor values obtained from av1;
- rvad are the autocorrelated predictor values of the adaptive filter;
- N is the long term predictor lag value which is obtained every sub-segment in the speech coder defined in
GSM 06.10;
- ptch indicates whether the signal has a steady periodic component;
- sof is the offset compensated signal frame obtained in the speech coder defined in GSM 06.10;
- pvad is the energy in the current frame of the input signal after filtering;
- thvad is an adaptive threshold;
- stat indicates spectral stationarity;
- vvad indicates the VAD decision before hangover is added;
- vad is the final VAD decision with hangover included.
5.2.1 Adaptive filtering and energy computation
The energy pvad is computed as follows:
$$ \mathrm{pvad} = \mathrm{rvad}_0 \, \mathrm{acf}_0 + 2 \sum_{i=1}^{8} \mathrm{rvad}_i \, \mathrm{acf}_i $$
This corresponds to performing an 8th order block filtering on the input samples to the speech encoder, after zero offset
compensation and pre-emphasis. This is explained in clause A.1.
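As an informal illustration, outside the normative fixed point description, the energy computation above can be sketched in floating point Python. The function name filtered_energy and the use of plain lists of nine values (lags 0 to 8) are assumptions made for this sketch only.

```python
def filtered_energy(rvad, acf):
    """Floating point sketch of pvad = rvad[0]*acf[0] + 2*sum(rvad[i]*acf[i], i=1..8).

    rvad: autocorrelated predictor values of the adaptive filter, lags 0..8
    acf:  auto-correlation values of the current frame, lags 0..8
    """
    if len(rvad) != 9 or len(acf) != 9:
        raise ValueError("rvad and acf must both hold lags 0..8")
    # Lag 0 is counted once; lags 1..8 are counted twice because the
    # auto-correlation is symmetric around lag 0.
    return rvad[0] * acf[0] + 2.0 * sum(r * c for r, c in zip(rvad[1:], acf[1:]))
```

With rvad = [1, 0, ..., 0] the filter is transparent and the result reduces to acf[0], the unfiltered frame energy.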
5.2.2 ACF averaging
Spectral characteristics of the input signal have to be obtained using blocks that are larger than one 20 ms frame. This is
done by averaging the auto-correlation values for several consecutive frames. This averaging is given by the following
equations:
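$$ \mathrm{av0}_i\{n\} = \sum_{j=0}^{\mathrm{frames}-1} \mathrm{acf}_i\{n-j\} ; \quad i = 0, \ldots, 8 $$
$$ \mathrm{av1}_i\{n\} = \mathrm{av0}_i\{n-\mathrm{frames}\} ; \quad i = 0, \ldots, 8 $$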
Where n represents the current frame, n-1 represents the previous frame etc. The values of constants are given in
table 2.1.
Table 2.1: Constants and variables for ACF averaging
Constant   Value   Variable          Initial value
frames     4       previous ACF's    All set to 0
                   av0 & av1         All set to 0
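The averaging and its initial conditions can be illustrated with the following floating point sketch. The class name AcfAverager, the method name update and the use of deques for the frame history are assumptions of this sketch; the fixed point procedure of the computational details is not reproduced here.

```python
from collections import deque

ORDER = 8    # ACF vectors hold lags 0..8
FRAMES = 4   # constant "frames" from table 2.1


class AcfAverager:
    """Keeps the history needed to produce av0 and av1 for each frame."""

    def __init__(self):
        zero = [0.0] * (ORDER + 1)
        # ACF vectors of the most recent FRAMES frames, initially all zero.
        self._acf_hist = deque([zero[:] for _ in range(FRAMES)], maxlen=FRAMES)
        # av0 of frames n-FRAMES .. n-1, initially all zero.
        self._av0_hist = deque([zero[:] for _ in range(FRAMES)], maxlen=FRAMES)

    def update(self, acf):
        """Consume the ACF of the current frame n and return (av0, av1)."""
        av1 = self._av0_hist[0]            # av0 of frame n-FRAMES
        self._acf_hist.append(list(acf))   # the oldest ACF drops out
        av0 = [sum(column) for column in zip(*self._acf_hist)]
        self._av0_hist.append(av0)         # the oldest av0 drops out
        return av0, av1
```

Because av1 lags av0 by four frames, av1 only becomes non-zero from the fifth frame after start-up.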
5.2.3 Predictor values computation
The filter predictor values aav1 are obtained from the auto-correlation values av1 according to the equation:
$$ a = R^{-1} p $$
where:
- -
R = | av1[0], av1[1], av1[2], av1[3], av1[4], av1[5], av1[6], av1[7] |
| av1[1], av1[0], av1[1], av1[2], av1[3], av1[4], av1[5], av1[6] |
| av1[2], av1[1], av1[0], av1[1], av1[2], av1[3], av1[4], av1[5] |
| av1[3], av1[2], av1[1], av1[0], av1[1], av1[2], av1[3], av1[4] |
| av1[4], av1[3], av1[2], av1[1], av1[0], av1[1], av1[2], av1[3] |
| av1[5], av1[4], av1[3], av1[2], av1[1], av1[0], av1[1], av1[2] |
| av1[6], av1[5], av1[4], av1[3], av1[2], av1[1], av1[0], av1[1] |
| av1[7], av1[6], av1[5], av1[4], av1[3], av1[2], av1[1], av1[0] |
- -
and:
- - - -
p = |av1[1]| a = |aav1[1]|
|av1[2]| |aav1[2]|
|av1[3]| |aav1[3]|
|av1[4]| |aav1[4]|
|av1[5]| |aav1[5]|
|av1[6]| |aav1[6]|
|av1[7]| |aav1[7]|
|av1[8]| |aav1[8]|
- - - -
aav1[0] = -1
av1 is used in preference to av0 as av0 may contain speech.
The autocorrelated predictor values rav1 are then obtained:
$$ \mathrm{rav1}_i = \sum_{k=0}^{8-i} \mathrm{aav1}_k \, \mathrm{aav1}_{k+i} ; \quad i = 0, \ldots, 8 $$
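The two steps above can be illustrated with a floating point sketch. It solves the equation a = R^-1 p with the Levinson-Durbin recursion, which is a substitution chosen for brevity and is not the Schur recursion and step-up procedure named in the computational details. The function names and the early exit for a degenerate input are assumptions of this sketch.

```python
ORDER = 8


def predictor_values(av1):
    """Return aav1[0..8] with aav1[0] = -1, solving R a = p by Levinson-Durbin."""
    a = [0.0] * (ORDER + 1)
    err = float(av1[0])
    for m in range(1, ORDER + 1):
        if err <= 0.0:
            # Assumption of this sketch: stop on a degenerate (e.g. all-zero) input.
            break
        acc = av1[m] - sum(a[k] * av1[m - k] for k in range(1, m))
        refl = acc / err                  # reflection coefficient of stage m
        prev = a[:]
        a[m] = refl
        for k in range(1, m):
            a[k] = prev[k] - refl * prev[m - k]
        err *= 1.0 - refl * refl          # remaining prediction error energy
    a[0] = -1.0
    return a


def autocorrelate(aav1):
    """rav1[i] = sum over k = 0..8-i of aav1[k] * aav1[k+i], for i = 0..8."""
    return [sum(aav1[k] * aav1[k + i] for k in range(ORDER + 1 - i))
            for i in range(ORDER + 1)]
```

For an input whose auto-correlation decays as 0.5 to the power of the lag, the sketch yields aav1 = [-1, 0.5, 0, ..., 0] and therefore rav1 = [1.25, -0.5, 0, ..., 0].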
5.2.4 Spectral comparison
The spectra represented by the autocorrelated predictor values rav1 and the averaged auto-correlation values av0 are
compared using the distortion measure dm defined below. This measure is used to produce a Boolean value stat every
20 ms, as given by these equations:
$$ dm = \frac{\mathrm{rav1}_0 \, \mathrm{av0}_0 + 2 \sum_{i=1}^{8} \mathrm{rav1}_i \, \mathrm{av0}_i}{\mathrm{av0}_0} $$
difference = |dm - lastdm|
lastdm = dm
stat = difference < thresh
The values of constants and initial values are given in table 2.2.
Table 2.2: Constants and variables for spectral comparison
Constant Value Variable Initial value
thresh 0.05 lastdm 0
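A floating point sketch of the spectral comparison is given below for illustration. The class name SpectralComparator is an assumption, as is the handling of av0[0] = 0, where the sketch simply uses dm = 0 to avoid a division by zero; the present clause does not state that case.

```python
THRESH = 0.05   # constant "thresh" from table 2.2


class SpectralComparator:
    """Produces the Boolean stat once per 20 ms frame."""

    def __init__(self):
        self.lastdm = 0.0   # initial value from table 2.2

    def update(self, rav1, av0):
        """rav1 and av0 hold lags 0..8; returns stat for the current frame."""
        if av0[0] == 0:
            dm = 0.0        # assumption of this sketch, see lead-in
        else:
            num = rav1[0] * av0[0] + 2.0 * sum(
                r * c for r, c in zip(rav1[1:9], av0[1:9]))
            dm = num / av0[0]
        difference = abs(dm - self.lastdm)
        self.lastdm = dm
        return difference < THRESH
```

Two consecutive frames with the same spectrum give the same dm, so the second of them is reported as stationary.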
5.2.5 Periodicity detection
The frequency spectrum of mobile noise is relatively stationary over quite long periods. The Inverse Filter
Autocorrelated Predictor coefficients of the adaptive filter rvad are only updated when this stationarity is detected.
Vowel sounds, however, also have this stationarity, but can be excluded by detecting the periodicity of these sounds
using the long term predictor lag values (Nj) which are obtained every sub-segment from the speech codec defined in
GSM 06.10. Consecutive lag values are compared. Cases in which one lag value is a factor of the other are catered for;
cases in which both lag values merely share a common factor are not. The latter case is not important for speech input, but
this method of periodicity detection may fail for some sine waves. The Boolean variable ptch is updated every 20 ms
and is true when periodicity is detected. It is calculated according to the following equation:
ptch = oldlagcount + veryoldlagcount >= nthresh
The following operations are done after the VAD decision, once the current LTP lag values (N0 .. N3) are
available; this reduces the delay of the VAD decision. (N{-1} = N3 of previous segment.)
lagcount = 0
for j = 0 to 3 do
begin
smallag = maximum(Nj,N{j-1}) mod minimum(Nj,N{j-1})
if minimum(smallag,minimum(Nj,N{j-1})-smallag) < lthresh
then increment(lagcount)
end
veryoldlagcount = oldlagcount
oldlagcount = lagcount
The values of constants and initial values are given in table 2.3.
Table 2.3: Constants and variables for periodicity detection
Constant   Value   Variable          Initial value
lthresh    2       oldlagcount       0
nthresh    4       veryoldlagcount   0
                   N3                40
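The periodicity detection can be illustrated with the following sketch, which mirrors the Pascal type notation above. The class and method names are assumptions of this sketch, and the lag values are assumed to be strictly positive so that the modulo operation is defined.

```python
LTHRESH = 2   # constant "lthresh" from table 2.3
NTHRESH = 4   # constant "nthresh" from table 2.3


class PeriodicityDetector:
    def __init__(self):
        self.oldlagcount = 0
        self.veryoldlagcount = 0
        self.prev_lag = 40        # initial value of N3

    def ptch(self):
        """Periodicity flag used for the VAD decision of the current frame."""
        return self.oldlagcount + self.veryoldlagcount >= NTHRESH

    def update(self, lags):
        """lags = (N0, N1, N2, N3); to be called after the VAD decision."""
        lagcount = 0
        prev = self.prev_lag      # N{-1}: N3 of the previous call
        for lag in lags:
            big, small = max(lag, prev), min(lag, prev)
            smallag = big % small
            # Count lags that are close to a multiple of their predecessor
            # (or vice versa), from either side of the multiple.
            if min(smallag, small - smallag) < LTHRESH:
                lagcount += 1
            prev = lag
        self.veryoldlagcount = self.oldlagcount
        self.oldlagcount = lagcount
        self.prev_lag = lags[-1]
```

Because ptch is derived from the counts of the two preceding frames, a frame with four matching lags sets ptch for the following frame, not for itself.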
5.2.6 Information tone detection
The tone flag is only evaluated in the downlink VAD. In the uplink VAD, tone detection is not performed and tone =
false.
Computation of the tone flag is complex. It is therefore evaluated after the processing of the current speech encoder
frame. In this way transmission of the speech or SID frame is not delayed.
Information tones and environmental noise can be classified by inspecting the short term prediction gain, information
tones resulting in higher prediction gains than environmental noise. Tones can therefore be detected by comparing the
prediction gain to a fixed threshold. By limiting the prediction gain calculation to a fourth order analysis, information
signals consisting of one or two tones can be detected whilst minimizing the prediction gain for environmental noise.
The prediction gain decision is implemented by comparing the normalized prediction error with a threshold. This
measure is used to evaluate the Boolean variable tone every 20 ms. The signal is classified as a tone if the prediction
error is smaller than the threshold predth. This is equivalent to a prediction gain threshold of 13,5 dB.
Mobile noise can contain very strong resonances at low frequencies, resulting in a high prediction gain. A further test is
therefore made to determine the pole frequency of a second order analysis of the signal frame. The signal is classified as
noise if the frequency of the pole is less than 385 Hz. The pole frequency calculation is described in clause A.4.
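The pole frequency test described above can be illustrated as follows. The sketch assumes a second order inverse filter of the form A(z) = 1 + a1*z^-1 + a2*z^-2 and a sampling rate of 8 kHz; under these assumptions a complex pole pair at angle theta satisfies tan^2(theta) = (4*a2 - a1^2) / a1^2, so a 385 Hz limit corresponds to a threshold of about 0.097 on that ratio. The names FREQTH_SKETCH and passes_pole_frequency_test are hypothetical, and the threshold is derived here, not quoted from the present document.

```python
import math

FS_HZ = 8000.0          # assumption: sampling rate of the speech encoder input
POLE_LIMIT_HZ = 385.0   # pole frequency limit stated in the text

# tan^2 of the pole angle at the limit frequency (derived, see lead-in).
FREQTH_SKETCH = math.tan(2.0 * math.pi * POLE_LIMIT_HZ / FS_HZ) ** 2


def passes_pole_frequency_test(a1, a2, freqth=FREQTH_SKETCH):
    """Return False when the frame is classified as noise by the pole test."""
    den = a1 * a1
    num = 4.0 * a2 - a1 * a1
    if num <= 0.0:
        return False    # real poles: no resonance, so not a tone
    if a1 < 0.0 and num / den < freqth:
        return False    # pole angle below the limit frequency
    return True
```

The check a1 < 0 restricts the ratio test to poles in the right half of the z-plane, that is, to frequencies below a quarter of the sampling rate; it also guards the division, since den is only used when a1 is non-zero.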
The algorithm for detecting information tones is as follows:
tone = false
den = a[1]*a[1]
num = 4*a[2] - a[1]*a[1]
if ( num <= 0 )
return
if (( a[1] < 0 ) AND ( num / den < freqth ))
return
4
prederr = MULT (1 - RC[i]*RC[i])
...