Listen to Me

The Kurzweil Applied Intelligence Alumni Newsletter

InformationWeek July 21, 1997, Issue: 640, Section: InformationWeek Labs

Its Master's Voice -- Speech-recognition software has come a long way, but is it ready for mainstream use?

By Andy Feibus

Everybody talks to his or her computer; most people don't expect it to listen. But maybe it's time to raise those expectations. Desktop speech-recognition software is just about ready for general use.

There are two categories of speech-recognition software today: discrete speech and continuous speech. Discrete speech recognition can interpret only one word at a time, so users must place distinct pauses between words. This type of voice recognition is perfect for controlling PC software but inadequate for dictating a memo, because it's hard to speak with short pauses between each word and keep your train of thought from derailing.

Software for continuous speech recognition can interpret a continuing stream of words. The software must understand the context of a word to determine correct spelling: for example, did you say to, too, or two? Software for continuous speech recognition also must be able to overcome accents and interpret words very quickly to be effective. Because of these requirements, continuous-speech software requires a computer with significantly more speed and memory than does discrete-speech software.

Many companies are deploying speech-recognition software when use of a mouse and keyboard is impractical. For example, speech-recognition software can provide an excellent alternative to a mouse and keyboard for users with disabilities, repetitive strain injuries, or severe arthritis. Speech-recognition products also provide an excellent alternative when the traditional human-to-computer interface is just not possible, such as in an autopsy room or an industrial environment.

For this review, I examined five speech-recognition products in the InformationWeek Labs, four with discrete-speech capability:

- Kurzweil VoicePro 2.5 from Kurzweil Applied Intelligence,

- Listen for Windows 2.5 from Verbex Voice Systems,

- VoiceAssist 1.10 from Creative Labs, and

- VoiceType Simply Speaking Gold 3.5 from IBM;

as well as one with continuous-speech abilities:

- Dragon NaturallySpeaking 1.0 from Dragon Systems.

Of these packages, only Dragon NaturallySpeaking provides continuous speech recognition suitable for dictation.

My test configurations used a Creative Labs SoundBlaster AWE64 Gold card, and a preproduction model of the ANC-600 noise-canceling microphone headset from Andrea Electronics of Long Island City, N.Y. I used two test systems: one running a 100-MHz Pentium and the other running a 200-MHz Pentium with MMX technology.

Both systems ran Windows 95 and had 32 Mbytes of RAM; although this met or exceeded the manufacturers' published RAM requirements for all packages, some of the packages would clearly have consumed more RAM if it had been available. Only the Kurzweil VoicePro includes optimizations for the Pentium MMX instruction set.

All but Creative Labs' VoiceAssist claim a degree of independence from individual accents and speech patterns. But to get the best level of accuracy from these packages, users will have to train the package to understand their specific voice and speech patterns.

This training period generally consists of reading a set of words, phrases, numbers, and sentences; for some of the packages I tested, this exercise also trains you in the way you need to speak to be understood.

Some of the packages will run just fine without this training, which can take anywhere from 30 minutes to well over an hour, but the package's recognition accuracy definitely improves when you spend the time to ensure that it understands your voice.

Kurzweil VoicePro 2.5

Kurzweil VoicePro 2.5 has an active vocabulary of 40,000 words that you can expand to 60,000. To support its vocabulary, VoicePro requires 16 Mbytes of dedicated RAM; VoicePlus, its smaller sibling, has an active vocabulary of 20,000 words that can be expanded to 30,000 and requires only 8 Mbytes of dedicated RAM.

VoicePro has the longest training time of the packages evaluated for this review. The user has to speak 400 individual words into the microphone, a piece of drudgery that took about 30 minutes. VoicePro then compiles these words into a database of the user's speech patterns; on the Pentium 100 system, this compilation took just about an hour.

As with IBM's Simply Speaking Gold, you can speak a button's label text to have VoicePro "click" the button. Unlike the IBM package, VoicePro doesn't need to load additional Word macros to work with Microsoft Word 97, nor does it have its own word processor for dictation. It doesn't need these features; instead, VoicePro lets you dictate directly into any word processor. It includes command and control support for more than 40 common applications, including Microsoft Word, Lotus 1-2-3, and Netscape Navigator, although support for these applications doesn't include the latest releases for many of these products. For that, you'll need to send away for the free VoicePro 2.52 upgrade from Kurzweil.

You can train VoicePro to understand commands for applications not directly supported. However, unlike the other discrete-recognition packages reviewed here, VoicePro can't incorporate mouse events into the commands you can initiate with VoicePro, a huge limitation for controlling applications that don't have text labels on every button.

But on a positive note, VoicePro macros can be put on a network server to be shared by other VoicePro users.

The VoicePro user interface is a bit confusing. Instead of organizing its active vocabulary into a hierarchical structure, VoicePro lists the words that it understands in a long list box, making it difficult at best to find which words are specific to which applications. Two other windows are also displayed while VoicePro is running: One shows VoicePro's current state, and another is used to select alternatives when VoicePro improperly identifies a word. VoicePro does not evaluate similar words for applicability based on the context of the sentence being dictated, so it has a higher inaccuracy rate for dictated documents than products that do, such as IBM's Simply Speaking Gold.
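To make the idea of context evaluation concrete, here is a minimal Python sketch of how a recognizer might choose among sound-alike words such as to, too, and two by looking at the neighboring words. It does not describe how any product in this review works: the word-pair counts, the SOUND_ALIKE tuple, and the pick_homophone function are all invented for illustration.

```python
from collections import Counter

# Hypothetical word-pair counts, invented for this example only.
# A real recognizer would derive statistics like these from a large text corpus.
PAIR_COUNTS = Counter({
    ("going", "to"): 50,
    ("to", "go"): 45,
    ("me", "too"): 30,
    ("too", "late"): 20,
    ("the", "two"): 40,
    ("two", "files"): 25,
})

SOUND_ALIKE = ("to", "too", "two")


def pick_homophone(prev_word, candidates, next_word):
    """Pick the candidate that most often appears between its two neighbors."""
    def score(candidate):
        # Counter returns 0 for word pairs it has never seen.
        return (PAIR_COUNTS[(prev_word, candidate)]
                + PAIR_COUNTS[(candidate, next_word)])
    return max(candidates, key=score)


print(pick_homophone("the", SOUND_ALIKE, "files"))
print(pick_homophone("me", SOUND_ALIKE, "late"))
```

With these made-up counts, the first call settles on two and the second on too. A recognizer that skips this step has to guess, which is the kind of error a user then has to correct by hand.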

Listen For Windows 2.5

Unlike Kurzweil's products, Listen for Windows 2.5 from Verbex Voice Systems doesn't even pretend to support dictation. Listen for Windows has two purposes: command and control of PC applications, and entry of data. Because of its limited range, it demands much less from your system than the packages from IBM, Kurzweil, and Dragon Systems.

Listen for Windows, which costs $99, includes command support for multiple revisions of more than 30 applications, although this support extends only to menu and keystroke commands. A command editor is included with the package that scans the available menus and dialog boxes from an application and builds what Listen calls a speech-interface file, containing the menu commands and dialog buttons for the application.

When Listen is running and a known application is active, the speech interface for that application is automatically loaded. The user can then speak any of the commands that have been defined as part of the speech interface to make them occur. Commands can be added to the speech interface, but these commands can only consist of keystrokes and not mouse events, so application buttons not accessible by keyboard commands cannot be controlled with Listen.

The user interface for Listen is pretty simple. While it is running, a window is displayed containing the unsorted commands you can verbally execute in the application that has the mouse's focus. If the application doesn't have a defined speech interface, Listen shows the default set of global Windows commands that are common to all applications.
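The behavior described above, a per-application command set with a fallback to global Windows commands, can be sketched as a simple lookup. This is only an illustration under stated assumptions: the real format of a Listen speech-interface file isn't covered in this review, and the application names, phrases, keystrokes, and function names below are hypothetical. The sketch also assumes that global commands stay available when an application-specific interface is loaded.

```python
# Hypothetical global commands, assumed to be available in every application.
GLOBAL_COMMANDS = {
    "close window": "Alt+F4",
    "switch application": "Alt+Tab",
}

# Hypothetical per-application speech interfaces: spoken phrase -> keystrokes.
# Only keystrokes appear here, since Listen commands cannot contain mouse events.
SPEECH_INTERFACES = {
    "notepad.exe": {
        "file save": "Alt+F, S",
        "edit paste": "Ctrl+V",
    },
}


def active_commands(app_name):
    """Return the commands that can be spoken while app_name is active."""
    commands = dict(GLOBAL_COMMANDS)
    # An unknown application contributes nothing, leaving only the globals.
    commands.update(SPEECH_INTERFACES.get(app_name, {}))
    return commands


def resolve(app_name, phrase):
    """Return the keystrokes for a spoken phrase, or None if it isn't defined."""
    return active_commands(app_name).get(phrase)


print(resolve("notepad.exe", "file save"))
print(resolve("unknown.exe", "file save"))
```

Keeping only one application's table in play at a time is also what makes a small memory footprint plausible: the recognizer never has to match against more than a short list of phrases.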

Because it keeps only the current speech interface file in active memory, Listen requires only about 2 Mbytes of RAM to operate. Listen's limited scope has another benefit: For the most part, it doesn't really need to be trained to understand a user's voice. Instead, the user selects his or her gender as part of the product's configuration, and can start using the software right away. If you find that recognition accuracy is low for certain phrases or that a particular phrase is not part of its vocabulary, you can train the application to understand how you speak those specific phrases.

VoiceAssist 1.10

Why is Creative Labs' giveaway software application being considered alongside applications that are both more costly and more capable? Because if you invest the time required to train VoiceAssist, you get a command and control speech-recognition product that is as useful for some tasks as some of the other products in this review. VoiceAssist is a good choice to determine if speech recognition is viable for command and control in your environment. However, if you absolutely must have speech recognition for command, control, or dictation, then a number of the other packages in this review are better deals.

Out of the box (the SoundBlaster AWE32 or AWE64 box, that is), VoiceAssist includes commands from only four simple Windows applications; the rest must be configured by the user. The good news is that it can easily record both keyboard and mouse events and associate them with a command name. The user can then train the application to understand how he or she pronounces this command name. Also included in the box is TextAssist, a text-to-speech reader.

VoiceAssist lacks a number of features, including integrated help, the ability to change the generic Windows-environment commands, and speaker independence. Although it takes only 10 minutes to train the software to recognize a user's voice, it might take several hours to configure the application for the different commands in an application. Also, VoiceAssist won't automatically "click" a button in an open dialog box when the user speaks the button's text label.

VoiceType Simply Speaking Gold 3.5

If you have the computing horsepower to handle the demands of VoiceType Simply Speaking Gold 3.5 (SSG), it's definitely a worthwhile choice. IBM plans to offer continuous speech recognition with another product, ViaVoice, in August for $199.

For a discrete-speech-recognition engine, SSG has a large system overhead. First, the recognition engine and the VoiceCenter user interface together consume more than 13 Mbytes of RAM. Then, if you want to dictate into Microsoft Word 97, you'll need to load a set of macros that causes Word 97 to consume 6 Mbytes of RAM more than the 15 Mbytes that Word 97 normally needs to start. If you don't load these macros, the package won't interact with Word 97. Finally, it has the highest CPU requirement of the discrete-speech packages: a 100-MHz Pentium.
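Adding up the figures above shows why the 32-Mbyte test systems were a tight fit for dictating into Word 97. The short Python calculation below is a rough sketch: it treats "more than 13 Mbytes" as exactly 13, so the total is a lower bound, and it ignores whatever Windows 95 itself consumes. The function and constant names are made up for this example.

```python
def ram_shortfall(installed_mb, *loads_mb):
    """Return how far the combined loads exceed installed RAM (0 if they fit)."""
    return max(0, sum(loads_mb) - installed_mb)


INSTALLED_MB = 32                # RAM in each test system
SSG_ENGINE_AND_VOICECENTER = 13  # "more than 13 Mbytes", so a lower bound
WORD97_STARTUP = 15              # what Word 97 normally needs to start
WORD97_DICTATION_MACROS = 6      # extra RAM once the SSG macros are loaded

total_mb = SSG_ENGINE_AND_VOICECENTER + WORD97_STARTUP + WORD97_DICTATION_MACROS
print(total_mb)  # 34
print(ram_shortfall(INSTALLED_MB, SSG_ENGINE_AND_VOICECENTER,
                    WORD97_STARTUP, WORD97_DICTATION_MACROS))  # 2
```

Even on this conservative count, the combination needs at least 2 Mbytes more than the test machines had, so some of the load would have to be paged to disk.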

For just $99, SSG provides dictation support with the ability to correct your edits using your voice. IBM claims that SSG can recognize 70 to 100 words per minute with 97% accuracy, but I found that hard to achieve. The faster the speech is, the worse the recognition engine's accuracy becomes; you can't speak that fast and leave a large-enough gap between words to enable the package to identify each word as a discrete entity.
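The trade-off between speed and pauses is easy to see with a little arithmetic: at 100 words per minute, each word and the pause after it share 60/100 = 0.6 seconds. The toy Python model below illustrates what happens when pauses get too short. It is not IBM's algorithm. It assumes, purely for illustration, that audio has already been reduced to a string of frames ("#" for sound, "." for silence) and that a new word is recognized only after a minimum run of silent frames; count_words and min_gap are hypothetical names.

```python
def count_words(frames, min_gap):
    """Count runs of sound separated by at least min_gap silent frames.

    frames is a string of "#" (sound) and "." (silence); min_gap must be >= 1.
    """
    words = 0
    silent_run = min_gap  # treat the start of input as a full pause
    for frame in frames:
        if frame == "#":
            if silent_run >= min_gap:
                words += 1  # a long-enough pause came first, so a new word starts
            silent_run = 0
        else:
            silent_run += 1
    return words


slow_speech = "###....###....###"  # three words with four-frame pauses
fast_speech = "###.###.###"        # the same three words with one-frame pauses

print(count_words(slow_speech, min_gap=3))  # 3
print(count_words(fast_speech, min_gap=3))  # 1
```

In this model the hurried speaker's three words blur into one, which is the failure a discrete-speech engine then has to resolve as a misrecognition.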

Training SSG took about 20 minutes of reading simple sentences. The SSG vocabulary starts with about 22,000 words and is expandable up to 64,000.

SSG will eventually learn how you speak as you use its correction features to correct what the package thought you said. To use this feature, click on the incorrect word or words. SSG replays the words using your voice, and then you are asked to correct the words. The package then remembers this change and incorporates this information into its vocabulary.

Dictation is supported for Microsoft Word (with the additional RAM requirements mentioned earlier) as well as a simple, voice-enabled word processor called VoicePad. Dictation into many other applications is supported using SSG's Dictate Direct feature.

But like all the packages for discrete speech recognition, SSG is far better at handling command and control than handling dictation. By default, the package is ready to start 14 commonly used Windows accessories and games, as well as Microsoft Word. In addition to being able to start these applications, it includes support for verbally activating menu picks in these programs. Outside of this set of programs, SSG also lets you speak the label on any dialog box button to "click" the button or speak the menu hierarchy to run the menu pick. As long as the label on the button or menu is present in the package's vocabulary, the button is properly recognized and clicked.

Application-specific macros can be created to control features that are not preprogrammed. A macro consists of a sequence of keyboard clicks and mouse clicks that are associated with a word or phrase. Macros are created for specific applications. If the application that the macro was recorded for is the active application within Windows, SSG will run a macro when it hears you speak the trigger words.
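The macro model described above can be sketched as a small data structure: a trigger phrase, the application it belongs to, and an ordered list of keyboard and mouse events. The Python below is a hypothetical illustration, not SSG's actual macro format; every class, field, application name, and example macro is invented.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple, Union


@dataclass(frozen=True)
class KeyPress:
    keys: str  # for example "Alt+I"


@dataclass(frozen=True)
class MouseClick:
    x: int
    y: int


Event = Union[KeyPress, MouseClick]


@dataclass(frozen=True)
class Macro:
    app: str                   # application the macro was recorded for
    trigger: str               # word or phrase that starts the macro
    events: Tuple[Event, ...]  # replayed in order when the macro runs


# Hypothetical recorded macros.
MACROS = [
    Macro("winword.exe", "insert signature",
          (KeyPress("Alt+I"), MouseClick(120, 48))),
    Macro("calc.exe", "clear all", (KeyPress("Esc"),)),
]


def find_macro(macros: Sequence[Macro], active_app: str,
               heard: str) -> Optional[Macro]:
    """Return the macro to run, but only if its application is the active one."""
    phrase = heard.strip().lower()
    for macro in macros:
        if macro.app == active_app and macro.trigger == phrase:
            return macro
    return None


print(find_macro(MACROS, "winword.exe", "Insert Signature"))
print(find_macro(MACROS, "calc.exe", "insert signature"))
```

Scoping each macro to one application means the same trigger phrase can safely do different things in different programs, and a phrase spoken in the wrong window simply does nothing.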

SSG also includes a text-to-speech component that reads text back, and even provides eight "actors" to speak your lines, each with different pitches, reading speeds, and inflections for each "voice." Just click the cursor on the passage of text you want to hear, and say "Begin reading"; the passage will be read aloud.

Dragon NaturallySpeaking 1.0

As the first software-only continuous speech-recognition product suitable for dictation, Dragon Systems' Dragon NaturallySpeaking 1.0 Personal Edition is the product against which all later ones will be measured. For now, like most version 1.0 products of any type, it has limited abilities and sluggish performance on mainstream systems.

Dragon Systems is already shipping the Personal Edition of NaturallySpeaking, which contains an active vocabulary of 30,000 words and sells for $695. Professional Editions are expected later this year with vocabularies tailored toward vertical markets, such as the legal or medical professions.

The first thing I noticed about NaturallySpeaking is that it's primarily a dictation tool. Command and control are not a large part of NaturallySpeaking's repertoire. Only a few global commands are included, and you can't train the program to work with a specific application. The software also can't "click" buttons when the user reads the button's label aloud.

The training features of NaturallySpeaking are the best of the bunch. Training the NaturallySpeaking recognition engine isn't a chore; it's a blast. Instead of reading a dull set of words, numbers, and phrases, users read passages from either Arthur C. Clarke's 3001: The Final Odyssey or Dave Barry's Dave Barry in Cyberspace. Training takes just 18 minutes, plus an additional 10 to 15 minutes for the application to compile the speaker's speech pattern into its database.

On the negative side, NaturallySpeaking places an immense resource drain on the system it runs on. Not only does NaturallySpeaking require a Pentium running at 133 MHz or faster, it also consumes about 24 Mbytes of RAM on its own. I highly recommend having at least 48 Mbytes of RAM if you plan to use NaturallySpeaking while you have another application open, unless you're willing to tolerate sluggish system performance.

Using NaturallySpeaking is as simple as speaking. As you speak (or some time shortly thereafter, depending on your system's performance), your words are placed into the NaturallySpeaking document window, which looks a bit like the WordPad applet included with Windows 95 and NT. When you're done dictating your thoughts, you can verbally correct the words that NaturallySpeaking thought you said (or the words that you've decided you really didn't want to say), then copy and paste your words into another document or print them right from the NaturallySpeaking window.

For those who must have dictation capabilities in an inexpensive package, NaturallySpeaking is an excellent way to go. But it's not the best possible solution for users who need voice control over a PC.

Andy Feibus is president of CustomBytes, an automation software consulting firm in Atlanta. He can be reached at amf@mindspring.com.

Pricing And Platforms

Product: Kurzweil VoicePlus and Kurzweil VoicePro 2.5
Vendor: Kurzweil Applied Intelligence, Boston, Mass., 800-380-1234, www.kurzweil.com
Price: $99 for VoicePlus; $199 for VoicePro
Requirements: Windows 3.1 on 486 DX4/75, Windows 95 on Pentium; VoicePlus: 8 Mbytes RAM; VoicePro: 16 Mbytes RAM

Product: Listen for Windows 2.5
Vendor: Verbex Voice Systems, Edison, N.J., 888-483-7239, www.verbex.com
Price: $99
Requirements: Windows 3.1, Windows 95, 486 SX/25, 2 Mbytes RAM

Product: VoiceAssist 1.10
Vendor: Creative Labs, Milpitas, Calif., 800-998-5227, www.soundblaster.com
Price: Included with SoundBlaster AWE32 or AWE64
Requirements: Windows 3.1, Windows 95, Windows NT, 75-MHz Pentium, 2 Mbytes RAM

Product: VoiceType Simply Speaking Gold 3.5
Vendor: IBM, West Palm Beach, Fla., 800-426-2255, www.software.ibm.com/is/voicetype
Price: $99
Requirements: Windows 95, Windows NT 4.0, 100-MHz Pentium, 8 Mbytes RAM

Product: Dragon NaturallySpeaking 1.0 Personal Edition
Vendor: Dragon Systems, Newton, Mass., 800-437-2466, www.dragonsys.com
Price: $695
Requirements: Windows 95, Windows NT 4.0, 133-MHz Pentium, 24 Mbytes RAM

Data supplied by vendors