SPECIAL REPORT



Pocket Size Voice-Command Translator for Law Enforcement

OBJECTIVES

This proposed program is to meet the need of law enforcement officers to have an effective audio translation capability.  The effort would build upon a successful National Institute of Justice-funded program to develop a belt-mounted Voice Response Translator (VRT).  The VRT uses a unique voice recognition algorithm that is able to recognize an officer's voice with near 100 percent accuracy even in high background noise environments.  This proposal would advance the VRT design and testing to the point where it would be miniaturized to fit in the shirt pocket of an officer.  This design would provide law enforcement officers with a device capable of hands-free, eyes-free command playing of an appropriate number of pre-selected phrases in at least three languages.

BACKGROUND

No formal study of translation/interpreter problems confronting law enforcement is available.  But there is substantial anecdotal evidence that both large and small law enforcement departments are having increasing difficulties dealing with communities of persons who do not speak English.  Also, census data indicate that the problem is widespread:  In 112 American cities, one of every four residents is foreign-born;  nationwide, 31.8 million people over the age of 5 speak a language other than English at home [1].

IWT has identified an application of its sound analysis technology that would meet a need identified by law enforcement officials.  The National Institute of Justice's Technology Assessment Program Advisory Council (TAPAC), at its December 3, 1993, meeting, heard from its Weapons and Protective Systems Committee, which identified instant language translation as one of six "immediate" law enforcement technology priorities [2].

Law enforcement officers often encounter situations in which suspects and other persons do not speak English.  Departments spend considerable resources on developing multilingual resources.  The large number of languages involved -- often more than 10 and sometimes more than 20 [3] -- and the changing mix of languages frustrate attempts to provide officers with the ability to give even simple directions to persons speaking other languages.  Integrated Wave Technologies, Inc., (IWT) worked from June 1, 1996 to March 30, 1998 under a National Institute of Justice Science and Technology grant to develop the VRT.  The VRT, on voice command, produces pre-programmed phrases in various languages.  This allows officers to identify the language spoken by a person, issue emergency commands to the person and make inquiries to which a person could respond with hand signals.  This system is designed for use in both hostile and non-hostile encounters with non-English-speaking persons.  Partners with IWT in this effort were the Oakland Police Department (OPD) and Eagan, McAllister Associates, Inc. (EMA).

The Oakland Police Department has provided extensive support for hardware/software design review, translated phrase selection and translation review, and operational testing.  Both the hardware and software have evolved extensively in response to detailed feedback from OPD personnel.

IWT has designed and built prototypes of the first belt-mounted, voice command translation unit.  IWT worked to develop this design -- based upon the PC104 board format -- after evaluating all available palmtop computers and determining that none was suitable.  The highly compact nature of the IWT voice command algorithm allows for the use of a 386-class processor in the unit, which greatly reduces size and power consumption.  Other voice recognition software requires a Pentium-class microprocessor, consuming 10 times the power and requiring a large amount of supporting hardware.

The basis of this unique device is IWT's work with Soviet-conceived sound analysis technology.  Scientists and engineers at Soviet laboratories, in seeking to develop speech identification and other sound analysis programs, took an approach that is fundamentally different from that used in voice recognition systems developed in the United States and other Western countries.

Prototypes for the first-phase program were approximately five inches by five inches by two inches, much smaller than any previous voice-command platform capable of this task.  Advanced prototypes and production units can be as small as a hand-held calculator when based upon custom-designed computer boards such as those developed for another IWT application.

The VRT has completed initial evaluation by the Oakland Police Department (OPD).  OPD has been working with the National Institute of Justice and IWT to evaluate the phrases being used, the translation of the phrases and the configuration of the hardware of the VRT.  The Department is also working closely with its Advisory Committees on Crime to gain pertinent feedback from citizens on the impact and effectiveness of the VRT in operational use.

The Voice Response Translator grant effort has broken new ground in two areas.  First, it has resulted in the design of a computer optimized for law enforcement use that is both smaller and lighter than any previous unit able to accept voice commands.  Second, it has demonstrated the highly developed capabilities of the Soviet-based voice recognition system by operating in a high noise environment with virtually no externally induced audio-recognition errors.

This work has laid the groundwork for a testing program that will help to provide a better understanding of the effectiveness of this device, in particular, and machine translation in general.  A follow-up report based upon the results of planned testing should provide a wealth of data on community acceptance of the device and its ability to communicate with non-English-speaking persons.

The program changes were primarily driven by the following three factors:  1) the need to design and build a belt-mounted computer rather than using an off-the-shelf unit as planned;  2) the need to conduct extensive community evaluation/relations work prior to field testing;  and 3) the need to include different translated phrases than planned.

IWT has been successful in responding to, and meeting, the refined requirements of this program.  The design work on the belt-mount computer has been completed and demonstrated to top officials of the Oakland Police Department (OPD).  The OPD has devoted considerable resources to making the community evaluation a success, and we believe that this unanticipated part of the program will be of great benefit to this law enforcement technology development effort.  IWT has also completed the expanded translator work, in close coordination with the OPD.

RESULTS OR BENEFITS EXPECTED

The research will lead to a production-ready VRT capable of meeting the requirements determined by the Oakland Police Department's work with the test units.  This new design will allow officers to be equipped with a voice-command, hands-free, eyes-free audio translation capability.

These advancements are critical to the usefulness of the VRT.  Police officers are encumbered already with a large amount of necessary equipment.  The VRT must be small -- 0.75 inch by 3.0 inch by 4 inch in this design -- to avoid reducing the officer's mobility or becoming a distraction.

Law enforcement officers cannot be distracted by having to look at a screen or keypad during interaction with members of the community.  Having to use their hands to manipulate a device also interferes with an officer's ability to perform his duties.

A )  Community-Oriented Policing

The development of this advanced translation capability would support more effective community-oriented policing, based upon the results of the initial use of the VRT in Oakland.  According to representatives of the Asian-American Advisory Committee on Crime there, citizens not able to speak English are comforted considerably when the police officers attempting to communicate with them have a translation capability, however limited.  This device will meet that requirement more fully than any other system yet developed.

B )  Goals

Program goals are to:
  • Improve the effectiveness of the VRT by miniaturizing it and making it easier to "train" using the officer's voice

  • Assess the effectiveness of the improved system using a trained criminologist to direct the evaluation

Miniaturized hardware needed

Even though the VRT developed under the previous NIJ effort was the smallest voice-command computer able to meet this requirement ever developed, further miniaturization is needed for it to become optional equipment for law enforcement officers.  This miniaturization is a central aspect of this proposal.

Self-training needed

A significant drawback of the VRT as developed was the need for it to be partly disassembled and connected to a video card, monitor and keyboard for officer voice pattern training.  A "training" is a session in which a user records the voice-command file that the recognition algorithm uses as a template.  Because an IWT or EMA technician needed to be present for all training sessions, officers were unable to create the voice sample collections at their convenience.  This approach was taken to minimize the initial development needed to field the VRT, but greatly increased the training needed for use of the device.

Effectiveness of the training was also hampered because of the lack of a voice-prompting training system.  While the pattern recognition algorithm will correct for a wide range of variances such as the speed with which a word is said, the best voice prints are those said in the situation in which the device will be used.  A good analogy is fingerprinting, as both are wave representations unique to individuals.  Law enforcement experts can match fingerprints from even a partial sample.  But when taking the sample upon which to base these matches, a clear, complete fingerprint is taken.  Clear voice samples, said naturally with no background noise, provide the most flexible and accurate basis for use of the IWT algorithm.  Using this sample, the algorithm can match the same word or phrase even if it is said slower, faster, under stress or with background noise.

All persons working initially with voice recognition systems experience a high degree of self-consciousness, as talking to a machine is an unusual act.  This disappears quickly, but the awkward training structure of the VRT hardware made repeat trainings difficult.  An initial design decision to make the units as small as possible meant that the video card was not included.  Though this helped to meet the operational requirements, a software architecture redesign needs to be made to allow for audio prompts that will make the VRT a self-training device.  IWT has developed this design architecture for use with another application and can use it in future work with the VRT.

C )  Objectives

Program objectives are:

  • Deploy up to 12 Voice Response Translators with Oakland Police Department personnel

  • Test the functionality, flexibility, and reliability of the unit

  • Conduct user acceptance testing under field conditions

  • Integrate the expertise of a criminologist into the program to develop effectiveness evaluation criteria and collect data to support this evaluation

  • Work with OPD community Advisory Committees on Crime to continue to assess the impact of this device's use in the community

D )  Evaluation Component

Once the VRTs are in place and being used by the officers, the evaluation component will commence.

There are two evaluation components that must be completed as part of this 12-month program.  The first, a technical product evaluation, must determine the product's reliability, durability and functionality.

The second phase will concentrate on the "end user" and their acceptance of the technology.  During this phase, the criminologist will accompany the officers, observe their operation of the technology and present a series of questions to them related to their interaction with the technology.  Officers will be solicited for comments on the design, flexibility and future uses of the voice recognition technology.

E )  Milestones and Deliverables

Milestones:

1-3 Month:  Develop VRT prototypes
1-3 Month:  Identify Criminologist Consultant
1-3 Month:  Begin Officer Training for new units
1-3 Month:  Develop Evaluation Instrument
1-6 Month:  Prepare Interim Report
4-10 Month:  Test and Evaluation
10-11 Month:  Analysis of Evaluation Data
12 Month:  Prepare Final Report

Deliverables:
  • Quarterly Financial Reports

  • Sixth Month Interim Report

  • Final Report

  • Equipment Purchased for Testing Program

APPROACH

IWT is based in Fremont, California.  It has acquired highly advanced sound analysis technology from sources in the former Soviet Union and has adapted and advanced this technology in its own research facilities.  These facilities are capable of software development, prototype development, and custom-designed chip fabrication.

IWT's sound recognition algorithm is uniquely able to take advantage of the low-power board used in this effort.  Other voice recognition software, even though less accurate, requires Pentium-class processors, which consume approximately 10 times as much power as that board.

The IWT algorithm has other unique features that support this application.  Commands spoken by the user are recognized using a template matching technique.  The user first records each command in his own voice.  These sounds are digitized by the A/D converter and then separated into 17 bands of sound frequency.  The energy in each of these bands is measured and converted into a digital number.  These sound energy numbers are grouped together in a memory sector and electrically describe the sound that was captured.  This memory sector is called a template, and a template of each command is stored in memory.  A typical template is 128 eight-bit bytes of information.
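The template-building steps described above can be illustrated with a minimal Python sketch.  It is not IWT's implementation:  the equal-width band split, the naive DFT, the per-frame scaling to 8 bits, and the function names (band_energies, quantize, make_template) are all assumptions made for illustration.  Only the count of 17 bands and the use of 8-bit values come from the text.

```python
import cmath
import math

NUM_BANDS = 17  # number of frequency bands given in the text


def band_energies(frame, num_bands=NUM_BANDS):
    """Return the energy found in each of num_bands equal-width bands.

    Assumption: bands are equal slices of the DFT power spectrum.
    The actual filter bank used by IWT is not published.
    """
    n = len(frame)
    half = n // 2
    # Naive DFT power for bins 1..half (bin 0, the DC offset, is skipped).
    power = []
    for k in range(1, half + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        power.append(abs(acc) ** 2)
    energies = [0.0] * num_bands
    for idx, p in enumerate(power):
        band = min(idx * num_bands // len(power), num_bands - 1)
        energies[band] += p
    return energies


def quantize(energies):
    """Scale energies so the largest maps to 255 and return 8-bit values."""
    peak = max(energies)
    if peak == 0:
        return bytes(len(energies))
    return bytes(round(255 * e / peak) for e in energies)


def make_template(frames):
    """Concatenate the quantized band energies of every frame."""
    return b"".join(quantize(band_energies(f)) for f in frames)


# A pure tone puts almost all of its energy into a single band.
tone = [math.sin(2 * math.pi * 8 * i / 64) for i in range(64)]
template = make_template([tone, tone])
```

Scaling each frame to its own peak discards loudness, which is one plausible way to make a template less sensitive to how loudly a command is spoken;  whether IWT does this is not stated in the source.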

When the system is operated, the user speaks a command that has a larger energy than the background noise.  This sound triggers a threshold detector that allows the command to be recorded by the computer.  The end of the command is detected in a similar way.  The sound is analyzed in the same way as the original recorded template was.  This new template is compared to all the templates in memory one at a time and the degree of similarity is recorded as a number, for example from zero to 100 percent.  If the spoken command has no matching template in memory, all the comparison numbers are small.  If it does, and the comparison number is above a threshold level, the command is recognized.
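A hedged sketch of this capture-and-compare loop follows.  The energy margin, the acceptance level of 90 percent, and the byte-difference similarity measure are illustrative assumptions;  the source gives neither the actual threshold values nor the comparison formula, and the real system also aligns utterances in time, which this simple fixed-length comparison does not.

```python
def find_command(frame_energies, noise_floor, margin=4.0):
    """Return (start, end) frame indices of the first run of frames whose
    energy exceeds margin * noise_floor, or None if there is no such run.
    The end index is exclusive."""
    threshold = margin * noise_floor
    start = None
    for i, energy in enumerate(frame_energies):
        if energy > threshold and start is None:
            start = i                      # threshold detector fires
        elif energy <= threshold and start is not None:
            return start, i                # energy fell back: end of command
    if start is not None:
        return start, len(frame_energies)
    return None


def similarity(a, b):
    """Percent similarity (0 to 100) of two equal-length byte templates.
    Assumed measure: 100 minus the scaled mean absolute difference."""
    if len(a) != len(b) or not a:
        raise ValueError("templates must be non-empty and of equal length")
    diff = sum(abs(x - y) for x, y in zip(a, b))
    return 100.0 * (1.0 - diff / (255.0 * len(a)))


def recognize(unknown, templates, accept=90.0):
    """Compare unknown against each stored template, one at a time.
    Return (name, score) for the best match, with name set to None
    when the best score is below the acceptance threshold."""
    best_name, best_score = None, -1.0
    for name, stored in templates.items():
        score = similarity(unknown, stored)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < accept:
        return None, best_score
    return best_name, best_score


templates = {
    "stop": bytes([200, 10, 10, 10]),
    "hands up": bytes([10, 10, 200, 10]),
}
```

Rejecting every candidate that scores below the threshold is what keeps an out-of-vocabulary sound from triggering the wrong phrase.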

Each of the original command templates has attached to it a stored transaction that is executed by the computer once a command is recognized.
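One simple way to picture the link between a recognized command and its stored transaction is a lookup table of callables.  The sketch below is an assumption about structure only;  the class name CommandTable and the example phrase identifiers are hypothetical and do not come from the VRT software.

```python
class CommandTable:
    """Maps a recognized command name to the action attached to it."""

    def __init__(self):
        self._actions = {}

    def attach(self, command, action):
        """Store a zero-argument callable to run for this command."""
        self._actions[command] = action

    def execute(self, command):
        """Run the attached action; return False if none is attached."""
        action = self._actions.get(command)
        if action is None:
            return False
        action()
        return True


# Hypothetical usage: "playing" a phrase just records what would be played.
played = []
table = CommandTable()
table.attach("stop spanish", lambda: played.append(("stop", "es")))
table.execute("stop spanish")
```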

Because of the high resolution of the recognizer and the limited set of templates that are active at any one time, the accuracy of the system is virtually 100 percent in that it does not substitute an undesired action for the one requested.  One of the important features of the recognition algorithm is its simplicity, which results in speed of operation and a small requirement for memory.  The recognition speed is about 0.25 seconds and the program code is about 12,000 bits of memory.

IWT voice recognition technology performs in a robust manner using novel signal processing methods.  The accuracy of the system exceeds 99% in adverse conditions using different communication channels and in the presence of background noise.  The core technology is very efficient and inexpensive to implement:  an 8-bit analog-digital converter and a 286 or faster central processing unit are required.

A )  Background

The core technology was developed in the former Soviet Union, in an atmosphere where expensive and complicated resources were limited.  Russian scientists were forced to use inferior (by Western standards) computing machinery.  To get results, they had to rely on elegant, yet parsimonious, algorithms to achieve the results being accomplished in the West.

In the 1960s, Vintsyuk first proposed the use of dynamic programming methods for time-aligning a pair of speech utterances [4].  Although the essence of the concepts of dynamic time warping, as well as rudimentary versions of the algorithms for connected-word recognition, were embodied in Vintsyuk's work, it was largely unknown in the West and did not come to light until the early 1980s -- long after more formal methods were proposed and implemented by others.

A significant milestone in voice recognition work was achieved in the 1970s by Velichko and Zagoruyko [5].  They created perhaps the first viable and useful voice recognition system.  These Russian studies helped advance the use of pattern-recognition ideas in speech recognition.  It should be noted that these studies predated those by Sakoe and Chiba in Japan [6] and Itakura in the U.S. [7]

The work in the Soviet Union continued with an emphasis on robust voice recognition and voice identification for use in military and covert operations.  A wealth of research with commercial potential became available after the fall of the Soviet system.  Applicable commercial rights are owned by IWT.  The technical details have not been published so as to protect these rights.

B )  IWT Approach to Voice Recognition

Broadly speaking, there are three approaches to speech recognition:
  • The acoustic-phonetic approach

  • The artificial intelligence approach

  • The pattern recognition approach, which is used by IWT

The acoustic-phonetic approach is straightforward.  The machine attempts to decode the speech signal in a sequential manner based on the observed acoustic features of the signal and the known relations between acoustic features and phonetic symbols.  It is a viable approach and has been studied in great depth for more than 40 years.

However, for a variety of reasons, the acoustic-phonetic approach has not achieved the same success in practical systems as the alternatives.  The central problem is the extreme difficulty of obtaining reliable definitions of phonemes, i.e., segmenting the speech into discrete regions where the acoustic properties of the signal are representative of one (or possibly several) phonetic units (or classes) and then attaching one or more phonetic labels to each segmented region according to acoustic properties.

A second problem is that, once the labels have been defined, a valid word must be determined from the sequence of phonetic labels (usually in the form of a phoneme lattice) that can have many permutations for a given word or phrase [8].

The artificial intelligence (AI) approach attempts to combine the above phonetic approach with the power of an expert system that integrates phonemic, lexical, syntactic, semantic and pragmatic knowledge.  Although some of the limitations of the acoustic-phonetic approach can be overcome using AI, the complexity of the task makes it unsuitable for small, portable applications, or in applications where costs must be kept low.

The pattern-recognition approach is the basis for the IWT speech recognizer.  It has three qualities that lead to superior performance in applications such as the one that is the subject of this grant proposal:
  1. Simplicity of use.  The method is easy to understand, rich in mathematical and communication theory, and is widely used and understood.

  2. Robustness.  The method is robust and invariant to different vocabularies, users, languages, talker populations, background environments, and transmission conditions.

  3. Proven high performance.  The pattern-recognition approach to speech recognition consistently provides high performance on any task that is within its technological parameters and provides a clear path for extending the technology in a wide range of directions.

The pattern-recognition approach is better suited for the conditions to which a hand-held law enforcement device will be subjected for the following reasons:
  1. The signal processing front end provides a set of unique filter bank parameters that are consistent over a wide range of speakers and communication channels.

  2. The Filter Bank parameters are transformed into a set of Principal Features (PF) that are statistically determined to remove redundant data across the vocabulary.

  3. The PF is transformed into frame pairs that model the statistical correlation between nearby speech frames.

  4. The system employs a modified dynamic-time-warping (DTW) process in which all templates are scanned continuously. The system then relaxes end-point constraints of the input utterance and updates allowable paths of the utterance.

  5. The algorithm works for speaker-dependent and speaker-independent recognition.

  6. The system works in a fast and efficient manner.

C )  Details of the IWT speech recognition technology

Acoustic waves are converted with an 8-bit analog-digital converter (ADC) at a sample rate of 12.8 kHz.  The PCM data is placed into a circular buffer that is continuously updated.  The preprocessor converts the input data into a stream of parameters.  This secondary stream of data is converted into 8-dimension (8-D) feature vectors every 20 ms.
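The figures above imply 256 samples per 20 ms frame (12,800 samples per second times 0.020 seconds) and 50 feature vectors per second.  The following Python sketch works through that arithmetic and shows one possible circular buffer for 8-bit samples.  The class name CircularBuffer and its methods are illustrative assumptions, not the VRT's actual data structures.

```python
SAMPLE_RATE_HZ = 12_800   # from the text
FRAME_MS = 20             # one feature vector every 20 ms, from the text

FRAME_LEN = SAMPLE_RATE_HZ * FRAME_MS // 1000   # samples per frame
FRAMES_PER_SECOND = 1000 // FRAME_MS            # feature vectors per second


class CircularBuffer:
    """Fixed-size buffer of 8-bit samples; new data overwrites the oldest."""

    def __init__(self, capacity):
        self._data = bytearray(capacity)
        self._capacity = capacity
        self._written = 0  # total number of samples ever written

    def write(self, samples):
        """Append samples (integers 0..255), wrapping around when full."""
        for sample in samples:
            self._data[self._written % self._capacity] = sample
            self._written += 1

    def latest(self, count):
        """Return the most recent count samples, oldest first."""
        if count > min(self._written, self._capacity):
            raise ValueError("not enough samples buffered")
        start = self._written - count
        return bytes(self._data[(start + i) % self._capacity]
                     for i in range(count))


# A buffer holding one second of audio; the newest frame is always available.
audio = CircularBuffer(SAMPLE_RATE_HZ)
audio.write(bytes(FRAME_LEN))
newest_frame = audio.latest(FRAME_LEN)
```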

A word to be recognized is compared against a "template" that is initially recorded during the "training" process.  Templates are stored in external memory.  The set of templates resident in memory defines the vocabulary.

Consider the input "utterance" as a set of feature parameters that stream in continuously.  To consider this utterance as a candidate for recognition, a front-end processor is needed to "grab" the utterance.

Once the utterance is captured, it is compared against the templates in memory using a comparison technique known as a dynamic time warping (DTW) algorithm. The DTW provides the best time alignment of two utterances (unknown and template) [9].
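For readers unfamiliar with DTW, the textbook form of the algorithm can be sketched in a few lines of Python.  This is the common fixed-endpoint version, not IWT's modified one;  the Euclidean frame distance and the three-way step pattern are standard choices assumed here for illustration.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(unknown, template):
    """Cost of the best time alignment between two sequences of frames.

    Both sequences must be aligned from their first frame to their last;
    in between, either one may be stretched or compressed in time.
    """
    n, m = len(unknown), len(template)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(unknown[i - 1], template[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],       # unknown advances
                                 cost[i][j - 1],       # template advances
                                 cost[i - 1][j - 1])   # both advance
    return cost[n][m]


# The same pattern spoken at half speed still aligns with zero cost.
template = [[0], [1], [2], [1], [0]]
slow = [[0], [0], [1], [1], [2], [2], [1], [1], [0], [0]]
flat = [[2], [2], [2], [2], [2]]
```

Here dtw_distance(slow, template) is zero because every slowed frame can be matched to an identical template frame, while the flat sequence cannot be warped into the pattern and scores worse.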

However, instead of the common DTW algorithm, the comparison is performed continuously.  This means that the match is re-evaluated every time a feature vector comes from the preprocessor, i.e., every 20 ms.  This allows for continuous and accurate determination of the end points of the spoken utterance;  in effect, the DTW "slides" the input over all the templates to align the patterns correctly.
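A continuous comparison of this kind resembles what the literature calls open-begin or subsequence DTW.  The Python sketch below updates one column of costs per incoming frame and lets a match begin at any frame.  It is an illustration of that general idea under stated assumptions (city-block frame distance, class name SlidingMatcher), not a description of IWT's unpublished algorithm.

```python
def frame_distance(a, b):
    """City-block distance between two feature vectors of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b))


class SlidingMatcher:
    """Scores one template against a continuous stream of frames.

    After each push(), the returned value is the cost of the best
    alignment of the whole template that ends at the newest frame,
    with the start point left free.
    """

    def __init__(self, template):
        self.template = template
        self.column = [float("inf")] * len(template)

    def push(self, frame):
        prev = self.column
        t = self.template
        new = [0.0] * len(t)
        # A match may begin at any input frame, so the first template
        # frame never carries cost over from earlier input.
        new[0] = frame_distance(frame, t[0])
        for j in range(1, len(t)):
            new[j] = frame_distance(frame, t[j]) + min(prev[j],
                                                       prev[j - 1],
                                                       new[j - 1])
        self.column = new
        return new[-1]


# The pattern 1, 5, 1 is embedded in surrounding noise frames of value 9.
matcher = SlidingMatcher([[1], [5], [1]])
stream = [[9], [9], [1], [5], [5], [1], [9]]
scores = [matcher.push(frame) for frame in stream]
```

The score dips to its minimum at the frame where the embedded pattern ends, which is how a sliding comparison can locate an end point without a separate detector.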

Accurate end-point detection is crucial for accurate voice recognition.  Tests have shown that small variations in end-point detection, such as +/- 40 ms, can reduce accuracy by 3% [10].  The method of sliding the patterns over each other reduces these end-point errors, quickly sorts out unlikely templates, and shows promise for continuous speech recognition.

D )  Inside the Pre-Processor

The pre-processor part of the speech recognition algorithm converts the input signal waveform into a stream of features.  There are two stages to this:  the primary transformation from the time to the spectral domain;  and the statistically based method to obtain a more compressed and reliable feature vector.  The first is realized by means of a quasi-synchronized (with F0, the fundamental or glottal frequency) 17-band filter bank.  The second is a frame-pair conversion using a Karhunen-Loeve transformation (KLT or principal feature method) [11].
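The second stage can be illustrated with a small, dependency-free Python sketch of a Karhunen-Loeve transformation applied to frame pairs.  The ordering of the steps (pair adjacent frames first, then project onto the leading principal axes), the power-iteration eigen-solver, and all function names are assumptions for illustration;  the source does not give these details.

```python
import math


def frame_pairs(frames):
    """Concatenate each frame with the frame that follows it."""
    return [list(a) + list(b) for a, b in zip(frames, frames[1:])]


def covariance(vectors):
    """Return (mean, covariance matrix) of a list of equal-length vectors."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[i] for v in vectors) / n for i in range(dim)]
    centered = [[v[i] - mean[i] for i in range(dim)] for v in vectors]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(dim)]
           for i in range(dim)]
    return mean, cov


def principal_axes(cov, k, iterations=200):
    """Top-k eigenvectors of cov by power iteration with deflation."""
    dim = len(cov)
    work = [row[:] for row in cov]
    axes = []
    for _axis in range(k):
        v = [1.0 / (i + 1) for i in range(dim)]  # arbitrary fixed start
        for _step in range(iterations):
            w = [sum(work[i][j] * v[j] for j in range(dim))
                 for i in range(dim)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                return axes  # no variance left to explain
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * sum(work[i][j] * v[j] for j in range(dim))
                         for i in range(dim))
        axes.append(v)
        for i in range(dim):  # remove the component just found
            for j in range(dim):
                work[i][j] -= eigenvalue * v[i] * v[j]
    return axes


def project(vector, mean, axes):
    """Compressed feature vector: coordinates along the principal axes."""
    centered = [x - m for x, m in zip(vector, mean)]
    return [sum(a * c for a, c in zip(axis, centered)) for axis in axes]


# Toy data: one-band frames whose neighbours are perfectly correlated,
# so a single axis captures all of the variance in the pairs.
frames = [[0.0], [1.0], [2.0], [3.0], [4.0]]
pairs = frame_pairs(frames)
mean, cov = covariance(pairs)
axes = principal_axes(cov, 1)
compressed = project(pairs[3], mean, axes)
```

In the system described, the same idea would reduce pairs of 17-band frames to the 8-D feature vectors mentioned earlier, with the axes estimated once from training speech.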

E )  Conclusion

The speech recognition algorithm used by IWT is very accurate and fast.  It encompasses many of the "proven" techniques used in commercial speech recognizers, along with many novel techniques that have been added to improve system performance.  It uses low cost hardware (8-bit analog-digital converter) and low computational overhead, typically well under 5% total on a 486-33 PC.

It should be noted that other methods, such as using linear predictive coding (LPC) for the preprocessor, have been investigated thoroughly, but shown to have lower performance due to added complexity.  In addition, the use of "hidden Markov models" (HMM) has also been investigated.  HMMs are widely used for large vocabulary systems and for some speaker-independent systems.  However, HMMs have not proved as reliable or robust as template-based approaches for command-and-control voice recognition.

REFERENCES

  1. Miami Herald, January 2, 1994, "Immigration Overload;  United States has lost control of its borders."

  2. Minutes of the December 3-4, 1993 TAPAC meeting, p.6.

  3. Conversations with police departments and database research.

  4. T.K. Vintsyuk, "Speech Discrimination by Dynamic Programming," Kibernetika, 4(2): 81-88, Jan./Feb. 1968.

  5. V.M. Velichko and N.G. Zagoruyko, "Automatic Recognition of 200 Words," International Journal of Man-Machine Studies, 2:223, June 1970.

  6. H. Sakoe and S. Chiba, "Dynamic Programming Algorithm Optimization for Spoken Word Recognition," IEEE Trans. Acoustics, Speech, Signal Proc., ASSP-26 (1): 43-49, February 1978.

  7. F. Itakura, "Minimum Prediction Residual Applied to Speech Recognition," IEEE Trans.

  8. L. Rabiner and B. Juang, "Fundamentals of Speech Recognition," Prentice Hall Signal Processing Series, 1993.

  9. L. Rabiner and B. Juang, "Fundamentals of Speech Recognition," Prentice Hall Signal Processing Series, 1993.

  10. J.G. Wilpon, L.R. Rabiner, and T.B. Martin, "An improved word-detection algorithm for telephone-quality speech incorporating both syntactic and semantic constraints," AT&T Tech.

  11. E.L. Bocchieri and G.R. Doddington, "Frame-Specific statistical features for speaker independent speech recognition," IEEE Trans. on Acoustics, Speech & Signal Processing, 34(4), August 1986.