White Paper on Integrated Wave Technologies
IWT is the only speech interface company to conduct all software and hardware development internally. Hardware and software designers work together on an Integrated Product Team that includes requirements specialists who have worked closely with users. This allows IWT to advance its designs quickly and effectively as it receives feedback from field users.
IWT's software and hardware development over the past years has produced complementary, integrated hardware and software designs that provide greater accuracy, better noise immunity and lower power consumption than those produced by other companies. IWT has reached the 'Dramamine Phase' of testing: devices are being used on Navy ships and Coast Guard cutters, and equipment is being readied for testing in aircraft. Aggressive development of hardware and software continues, targeting both incremental and generational improvements.
The challenge of speech recognition is to extract the voice signal from background noise and then recognize it. IWT has approached this as a signal processing problem rather than a linguistic problem. Software and hardware are co-developed to provide synergies relating to desired capabilities: Complementary noise immunity; Accuracy enhancement; Power consumption reduction; Language/vocabulary expansion; and Size reduction.
Nature of Technologies
IWT has developed specialized speech recognition technology that will allow it to carry out this ambitious development. Over the past 10 years the company has produced technology uniquely capable of meeting this requirement. IWT voice recognition technology performs robustly using novel signal processing methods. The accuracy of the system exceeds 99% in adverse conditions, across different communication channels and in the presence of background noise. The core technology is very efficient and inexpensive to implement: a standard 8-bit audio/digital converter and a 5 MHz controller chip are sufficient to run the program. High-level hardware and software filters and signal processing complement the recognition algorithm to achieve unequaled results.
Integrated Wave Technologies' original core software technology was developed in the former Soviet Union, in an atmosphere where expensive and complicated resources were limited. Russian scientists were forced to use computing machinery that was inferior by Western standards. To get results, they had to rely on elegant yet parsimonious algorithms to match the results being achieved in the West with more powerful computers. IWT's current generation of software is an entirely new creation, developed by the Company's employees and a generation ahead of the already-impressive work acquired eight years ago.
In the 1960s, Vintsyuk first proposed the use of dynamic programming methods for time-aligning a pair of speech utterances.[1] Although the essence of the concepts of dynamic time warping, as well as rudimentary versions of the algorithms for connected-word recognition, were embodied in Vintsyuk's work, it was largely unknown in the West and did not come to light until the early 1980s, long after more formal methods were proposed and implemented by others.
[1] T.K. Vintsyuk, "Speech Discrimination by Dynamic Programming," Kibernetika, 4(2): 81-88, Jan./Feb. 1968.
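The time-alignment idea described above can be illustrated with a minimal dynamic time warping sketch. This is a textbook formulation, not IWT's unpublished algorithm; the function names `dtw_distance` and `recognize`, the use of one number per frame, and the absolute-difference local distance are all assumptions made for illustration.

```python
from math import inf


def dtw_distance(a, b):
    """Return the minimal accumulated cost of time-aligning sequences a and b."""
    n, m = len(a), len(b)
    # cost[i][j] holds the best cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # hypothetical per-frame distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances while b stays on the same frame
                cost[i][j - 1],      # b advances while a stays on the same frame
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]


def recognize(utterance, templates):
    """Return the label of the stored template closest to the utterance."""
    return min(templates, key=lambda label: dtw_distance(utterance, templates[label]))


# Hypothetical one-dimensional "feature tracks" standing in for stored word templates.
templates = {"one": [1, 3, 5, 3, 1], "two": [5, 5, 1, 1, 5]}
slow_one = [1, 1, 3, 5, 5, 3, 1]  # the same shape as "one", spoken more slowly
print(recognize(slow_one, templates))
```

Because the alignment may stretch or compress either sequence, a word spoken slowly can still match its template at zero cost, which is the property that makes the method useful for whole-word recognition.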
A significant milestone in voice recognition work was achieved in the 1970s by Velichko and Zagoruyko.[2] They created perhaps the first viable and useful voice recognition system. These Russian studies helped advance the use of pattern-recognition ideas in speech recognition. It should be noted that these studies predated those by Sakoe and Chiba in Japan[3] and Itakura in the U.S.[4]
The work in the Soviet Union continued, with an emphasis on robust voice recognition and voice identification for use in military and covert operations. A wealth of research with commercial potential became available after the fall of the Soviet system. IWT secured the commercial rights to the most significant and applicable research. The technical details have not been published so as to protect these rights.
IWT's technological breakthroughs also have come because it has taken an approach fundamentally different from developers such as IBM. These companies, or their speech recognition divisions, were founded by highly talented linguists who attempted to mechanize their knowledge of how humans process speech into computer systems. Their systems try to recognize phonemes - parcels of speech such as consonants and vowels peculiar to each language - and then assemble them into words and words into sentences using contextual analysis, much like humans do.
Phonemes are subtle variations in speech peculiar not only to each language, but to each accent and dialect within that language. Each phoneme is perhaps only a hundred milliseconds long, and recognition software based on phonemes must separate them from background noise and from each other in order to identify them continuously. These recognized phonemes are then assembled into words, and the words into sentences.
This recognition approach depends on a continuous string of tasks being done correctly. If a "v" sound is misrecognized as an "f", then the entire word will be wrong even if the phonemes that follow are recognized correctly.
This problem has driven the complexity of phoneme-based speech recognition software. To compensate for phoneme misrecognition, this software uses a probability analysis to attempt to identify words from phonemes, and contextual analysis to assist in selecting words. For example, if the system recognizes "the dog" as the beginning of a sentence, it will conclude the next word is "barked" rather than "borrowed", though the recognition part of the software might not be able to discriminate between those words. This method of improving accuracy is very limited - many possible choices exist for each word positioned in a sentence - and it is of no use for command/control recognition as there is no context for words such as numbers.
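The contextual analysis described above can be sketched as a rescoring step that multiplies an acoustic score by a word-pair (bigram) probability. This is a simplified illustration of the general technique, not any vendor's implementation; the function name `rescore` and every number below are hypothetical.

```python
def rescore(candidates, prev_word, bigram, floor=1e-6):
    """Combine acoustic scores with a bigram prior and return the best word.

    candidates: word -> acoustic score from the recognizer (hypothetical values)
    bigram:     (previous word, word) -> probability of that word pair
    floor:      small prior used for word pairs missing from the table
    """
    scored = {}
    for word, acoustic in candidates.items():
        prior = bigram.get((prev_word, word), floor)
        scored[word] = acoustic * prior
    best = max(scored, key=scored.get)
    return best, scored


# Sentence context: the acoustics slightly favor "borrowed", the context overrides it.
best_sentence, _ = rescore(
    {"barked": 0.48, "borrowed": 0.52},
    "dog",
    {("dog", "barked"): 0.05, ("dog", "borrowed"): 0.0005},
)

# Command/control context: any digit is equally likely after any other digit,
# so the prior is flat and the acoustic ranking is left unchanged.
best_digit, _ = rescore(
    {"five": 0.48, "nine": 0.52},
    "three",
    {("three", "five"): 0.1, ("three", "nine"): 0.1},
)
print(best_sentence, best_digit)
```

In the second call the context contributes nothing, which mirrors the point made in the text about numbers in command/control recognition.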
[2] V.M. Velichko and N.G. Zagoruyko, "Automatic Recognition of 200 Words," International Journal of Man-Machine Studies, 2:223, June 1970.
[3] H. Sakoe and S. Chiba, "Dynamic Programming Algorithm Optimization for Spoken Word Recognition," IEEE Trans. Acoustics, Speech, Signal Proc., ASSP-26(1): 43-49, February 1978.
[4] F. Itakura, "Minimum Prediction Residual Applied to Speech Recognition," IEEE Trans. Acoustics, Speech, Signal Proc., ASSP-23(1): 67-72, February 1975.
IWT's analysis, reached initially eight years ago, has been that this approach is fundamentally flawed. Other research supports this analysis. Dr. Steven Pinker, a noted language expert and director of the Center for Cognitive Neuroscience at the Massachusetts Institute of Technology, described the fundamental problems with the phoneme/linguistic approach in his book The Language Instinct. He wrote that, "no human-made system can match a human in decoding speech." He elaborated, adding:
Sentences and phrases are built out of words, words are built out of morphemes, and morphemes, in turn, are built out of phonemes. Unlike words and morphemes, though, phonemes do not contribute bits of meaning to the whole. The meaning of dog is not predictable from the meaning of d, the meaning of o, the meaning of g, and their order. Phonemes are a different kind of linguistic object. They connect outward to speech, not inward to mentalese: a phoneme corresponds to an act of making a sound.[5]
IWT identified in its early analysis of speech recognition technology that phoneme-based systems are also highly susceptible to background noise and require large computer processing resources to operate. A key flaw in the phoneme approach is that the processors needed to implement it create system noise that interferes with speech recognition. Each generation of phoneme-based software has required increasingly powerful processors, which in turn interfere with the system's ability to recognize speech, an evolution that is partly self-defeating. Similarly, Pinker describes the futility of trying to guess words from the sentence context because of the "sheer vastness" of language. He wrote:
Go into the Library of Congress and pick a sentence at random from any volume, and chances are you would fail to find an exact repetition no matter how long you continue to search. Estimates of the number of sentences that an ordinary person is capable of producing are breathtaking. If a speaker is interrupted at a random point in a sentence, there are on average about ten different words that could be inserted at that point to continue the sentence in a grammatical and meaningful way. (At some points in a sentence, only one word can be inserted, and at others, there is a choice from among thousands; ten is the average). Let's assume that a person is capable of producing sentences up to twenty words long. Therefore the number of sentences that a speaker can deal with in principle is at least 10^20 (a one with twenty zeros after it, or a hundred million trillion.) At a rate of five seconds a sentence, a person would need a childhood of about a hundred trillion years (with no time for eating or sleeping) to memorize them all.[6]
Rather than emulating human speech recognition, the Company approached this problem as its founder approached the challenges of producing the first electronic watch and the first computerized heart pacemaker. This approach was to analyze precisely the delicate audio signals produced by human speech and develop innovative ways of extracting this sound from background noise and recognizing it with high accuracy. While perhaps insurmountable roadblocks were encountered in pursuing the linguistic approach, the Company was able to achieve the specific results described below.
[5] Pinker, Steven, "The Language Instinct: How the Mind Creates Language," William Morrow and Company, New York, 1994, pp. 162-163.
[6] Pinker, op. cit., p. 87.
IWT's technologies are not merely superior to those of other companies. They cross performance thresholds that will allow them to be the basis of new products and new markets. In addition to securing ownership of the algorithms, IWT has pursued an aggressive strategy of developing essential implementation technologies. The Company is in the process of completing patent application documentation for these technologies and believes that the resulting patents will prevent competitors from developing similarly capable products.
Background noise is simply the everyday noise that surrounds us. The noise level often exceeds 40 decibels even within the perceived quiet of an office because of ventilation systems, equipment and other people. City street noise is generally around 80 decibels, while the noise within moving vehicles rises to about 100 decibels at highway speeds. The military and police applications in which the Company's products are being demonstrated routinely experience background noise over 100 decibels.
Integrated Wave Technologies, Inc., has developed systems based on its unique intellectual property that have unprecedented capabilities in background noise situations over 100 decibels. This capability, described below in detail, is key to IWT's competitive advantage. Other speech recognition systems being marketed generally cease to recognize speech at about 30 to 40 decibels of extraneous noise. These systems "lock up" under the pressure of noise above that level, making them useless.
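Because the decibel scale is logarithmic, the gap between a 40-decibel office and a 100-decibel vehicle is far larger than the numbers suggest. The short sketch below converts a decibel difference into linear ratios. It assumes the figures quoted above are sound pressure levels, which the text does not state explicitly, and the function names are illustrative only.

```python
def pressure_ratio(level_a_db, level_b_db):
    """Ratio of sound pressures for two levels in decibels (20 dB per factor of 10)."""
    return 10 ** ((level_a_db - level_b_db) / 20)


def intensity_ratio(level_a_db, level_b_db):
    """Ratio of sound intensities (power) for two levels in decibels (10 dB per factor of 10)."""
    return 10 ** ((level_a_db - level_b_db) / 10)


# Levels taken from the text: a quiet office versus a vehicle at highway speed.
office_db, vehicle_db = 40, 100
print(pressure_ratio(vehicle_db, office_db))   # pressure ratio for a 60 dB gap
print(intensity_ratio(vehicle_db, office_db))  # intensity ratio for a 60 dB gap
```

Under that assumption, a 60-decibel gap corresponds to roughly a thousandfold increase in sound pressure and a millionfold increase in sound intensity.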
References
Itakura, F., "Minimum Prediction Residual Applied to Speech Recognition," IEEE Trans. Acoustics, Speech, Signal Proc., ASSP-23(1): 67-72, February 1975.
Pinker, Steven, "The Language Instinct: How the Mind Creates Language," William Morrow and Company, New York, 1994, pp. 162-163.
Sakoe, H. and Chiba, S., "Dynamic Programming Algorithm Optimization for Spoken Word Recognition," IEEE Trans. Acoustics, Speech, Signal Proc., ASSP-26(1): 43-49, February 1978.
Velichko, V.M. and Zagoruyko, N.G., "Automatic Recognition of 200 Words," International Journal of Man-Machine Studies, 2:223, June 1970.
Vintsyuk, T.K., "Speech Discrimination by Dynamic Programming," Kibernetika, 4(2): 81-88, Jan./Feb. 1968.