Microsoft has developed a piece of software that can transcribe a conversation as accurately as a human, in a significant breakthrough for artificial intelligence systems. The software not only listens to words, but is also able to place them in context to allow for more accurate transcriptions.
The latest program from Microsoft’s research team is capable of transcribing a conversation with a word error rate of 5.9 percent, a figure comparable to the error rate of a professional transcriptionist. Xuedong Huang, Microsoft’s chief speech scientist, said in a statement that this represents a historic achievement.
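For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions and insertions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. The article does not say how Microsoft scored its system, so the sketch below is only an illustration of that conventional definition; the function name `word_error_rate` and the example sentences are made up for this purpose.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Conventional WER: word-level edit distance / reference length.

    Illustrative only; not Microsoft's scoring code.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word ('a' for 'the') in six.
wer = word_error_rate("the cat sat on the mat", "a cat sat on the mat")
print(f"{wer:.3f}")
```

On this scale, a rate of 5.9 percent corresponds to roughly one wrong word in every 17.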
The result doesn’t represent perfect speech transcription, but rather offers something very close to the way humans mishear fragments of conversations. Mistakes are generally quite straightforward, such as confusing ‘have’ with ‘is’ or ‘a’ with ‘the’. Transcription mistakes from both humans and this system come from minor misinterpretations of sentences rather than a physical mishearing.
The breakthrough that led to this achievement is the use of a neural network and the grouping of words that not only sound similar, but also have similar meanings. For example, the words ‘fast’ and ‘quick’ are close together in the virtual dictionary of the neural network, since the use of one increases the likelihood of the other. This lets the system generalise meaning in the same way a human might.
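One common way to picture words being "close together" is to give each word a list of numbers (a vector) and measure the angle between them. The sketch below assumes that picture; the three-number vectors are invented by hand purely for illustration and are not taken from Microsoft's system, which learns its own representations from data.

```python
import math

# Hand-made toy vectors (hypothetical values, not learned by any model).
TOY_VECTORS = {
    "fast": [0.9, 0.1, 0.0],
    "quick": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: near 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


fast_quick = cosine_similarity(TOY_VECTORS["fast"], TOY_VECTORS["quick"])
fast_banana = cosine_similarity(TOY_VECTORS["fast"], TOY_VECTORS["banana"])

# With these toy values, 'fast' sits far closer to 'quick' than to 'banana'.
print(f"fast vs quick:  {fast_quick:.2f}")
print(f"fast vs banana: {fast_banana:.2f}")
```

In a trained network the vectors are adjusted automatically so that words used in similar contexts end up pointing in similar directions.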
What the system cannot do is understand what it is listening to. While it is able to transcribe speech accurately, it does not understand what is being said, so it cannot, for example, answer a question.
The primary use of the software is likely to be in Microsoft products that use speech recognition, like the Xbox and the Cortana virtual assistant. The next stage for the research team is to modify the system so it can still function in places with a large amount of background noise, or listen to multiple voices.