The Ultimate Guide to Speech Recognition with DeepSpeech Using Python


Thinking of how to use DeepSpeech with Python for voice to text? Here's a quick guide that explains the process in simple steps.


Have you ever wondered how to add speech recognition to your Python project, IoT device, or virtual assistant? If so, keep reading! It’s easier than you might think.


Speech Recognition –  Overview

Speech recognition, also known as speech-to-text, is the ability of a machine or program to identify words and phrases in spoken language and convert them to a machine-readable format.

Speech recognition has its roots in research done at Bell Labs in the early 1950s. Early systems were limited to a single speaker and had a limited vocabulary of about a dozen words. Modern speech recognition systems have come a long way since those early counterparts. They can recognize speech from multiple speakers and have enormous vocabularies in numerous languages.

The most frequent applications of speech recognition within the enterprise include call routing, speech-to-text processing, voice dialing, and voice search. A variety of factors can affect computer speech recognition performance, including pronunciation, accent, pitch, volume, and background noise.

Mozilla Deepspeech - Overview

DeepSpeech is an open-source speech-to-text engine that uses a model trained by machine learning techniques, based on Baidu’s Deep Speech research paper. It can handle noisy environments, different accents, and multiple languages, and can beat humans on some benchmarks.

DeepSpeech can run standalone on a device, without a continuous internet connection, to process speech recognition. It also offers consistently low latency and memory utilization, regardless of the length of the audio being transcribed.

DeepSpeech is composed of two main subsystems: an acoustic model and a decoder. The acoustic model is a deep neural network that receives audio features as inputs and outputs character probabilities. The decoder uses a beam search algorithm to transform the character probabilities into textual transcripts that are then returned by the system. We will look into these subsystems in more detail below.
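To make the two-stage idea concrete, here is a toy sketch of the decode step. This is not DeepSpeech's actual decoder (which uses beam search with a language model); it uses simple greedy (argmax) decoding with CTC-style collapsing of repeats and blanks, and the tiny alphabet and probabilities are made up for illustration only.

```python
import numpy as np

# Hypothetical tiny alphabet; '_' plays the role of the CTC blank symbol.
ALPHABET = [' ', 'a', 'b', 'c', '_']

def greedy_decode(probs):
    """Collapse a (timesteps x alphabet) probability matrix into text."""
    best = np.argmax(probs, axis=1)  # most likely symbol per timestep
    out, prev = [], None
    for idx in best:
        # drop repeated symbols and blanks, CTC-style
        if idx != prev and ALPHABET[idx] != '_':
            out.append(ALPHABET[idx])
        prev = idx
    return ''.join(out)

# Fake per-timestep probabilities spelling "cab": c, c, blank, a, b, b
probs = np.eye(5)[[3, 3, 4, 1, 2, 2]]
print(greedy_decode(probs))  # cab
```

DeepSpeech's real decoder explores many candidate transcripts at once (the beam) and re-scores them with the language model, which is why it handles noisy probabilities much better than this greedy version.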


Installing Deepspeech

In this tutorial, we will use DeepSpeech version 0.6.0, though a higher version currently exists. If you don't want to install anything, you can try out the DeepSpeech APIs in the browser using this code lab. Following this tutorial requires an active internet connection, as you will download the DeepSpeech model of about 900 MB. You also need at least Python 3.6.5 installed and some elementary Python programming skills.

# Create virtual environment named ds06
$ python3 -m venv ./some/pyenv/dir/path/ds06
# Switch to virtual environment
$ source ./some/pyenv/dir/path/ds06/bin/activate

# Install DeepSpeech
$ pip3 install deepspeech==0.6.0

# Download and unzip en-US models, this will take a while
$ mkdir -p ./some/workspace/path/ds06
$ cd ./some/workspace/path/ds06
$ curl -LO
$ tar -xvzf deepspeech-0.6.0-models.tar.gz
x deepspeech-0.6.0-models/
x deepspeech-0.6.0-models/lm.binary
x deepspeech-0.6.0-models/output_graph.pbmm
x deepspeech-0.6.0-models/output_graph.pb
x deepspeech-0.6.0-models/trie
x deepspeech-0.6.0-models/output_graph.tflite

$ ls -l ./deepspeech-0.6.0-models/

# Download and unzip some audio samples to test setup
$ curl -LO
$ tar -xvzf audio-0.6.0.tar.gz
x audio/
x audio/2830-3980-0043.wav
x audio/Attribution.txt
x audio/4507-16021-0012.wav
x audio/8455-210777-0068.wav
x audio/License.txt

$ ls -l ./audio/

# Test deepspeech
$ deepspeech --model deepspeech-0.6.0-models/output_graph.pb --lm deepspeech-0.6.0-models/lm.binary --trie ./deepspeech-0.6.0-models/trie --audio ./audio/2830-3980-0043.wav
$ deepspeech --model deepspeech-0.6.0-models/output_graph.pb --lm deepspeech-0.6.0-models/lm.binary --trie ./deepspeech-0.6.0-models/trie --audio ./audio/4507-16021-0012.wav
$ deepspeech --model deepspeech-0.6.0-models/output_graph.pb --lm deepspeech-0.6.0-models/lm.binary --trie ./deepspeech-0.6.0-models/trie --audio ./audio/8455-210777-0068.wav

DeepSpeech API

DeepSpeech provides APIs for many languages, including C, .NET, Java, JavaScript, and Python. Bindings for Go also exist, and I will look into them in another article. In this article, we will focus on the Python API; you can find more information in the DeepSpeech Python API docs.

You first need to create a model object using the model files you downloaded:

 $ python3
import deepspeech
model_file_path = 'deepspeech-0.6.0-models/output_graph.pbmm'
beam_width = 500
model = deepspeech.Model(model_file_path, beam_width)


Then you should add the language model for better accuracy:

lm_file_path = 'deepspeech-0.6.0-models/lm.binary'
trie_file_path = 'deepspeech-0.6.0-models/trie'
lm_alpha = 0.75
lm_beta = 1.85
model.enableDecoderWithLM(lm_file_path, trie_file_path, lm_alpha, lm_beta)

 Once you have the model object, you can use either batch or streaming speech-to-text API.

Batch Transcription API

Batch transcription is a set of API operations that lets you transcribe a large amount of stored audio. It comes in handy if you intend to run the system over the internet as a speech-to-text REST API server, or if you record audio by other means and only use the system to transcribe it.

To use the batch API, the first step is to read the audio file:

import wave
filename = 'audio/8455-210777-0068.wav'
w = wave.open(filename, 'r')
rate = w.getframerate()
frames = w.getnframes()
buffer = w.readframes(frames)

The sample rate of the WAV file is 16000 Hz, the same as the model’s sample rate. But the buffer is a byte array, whereas the DeepSpeech model expects a 16-bit int array.
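You can sanity-check the sample-rate assumption with the standard-library wave module. The snippet below uses a synthetic one-second file so it runs anywhere; the tutorial's sample files are assumed only in the code above.

```python
import wave

# Write a synthetic one-second, mono, 16-bit, 16 kHz WAV file...
path = '/tmp/ds_check.wav'
with wave.open(path, 'w') as w:
    w.setnchannels(1)      # mono
    w.setsampwidth(2)      # 16-bit samples
    w.setframerate(16000)  # the rate the DeepSpeech model expects
    w.writeframes(b'\x00\x00' * 16000)  # one second of silence

# ...then verify its parameters, as you would for a real recording.
with wave.open(path, 'r') as w:
    print(w.getframerate())                    # 16000
    print(w.getnframes() // w.getframerate())  # duration in seconds: 1
```

If a real recording is not 16 kHz mono, it must be converted before being fed to the model.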

To convert the array, we will use NumPy, a well-known Python library.

Let’s convert it:

import numpy as np
data16 = np.frombuffer(buffer, dtype=np.int16)

 Now that we have converted our byte array to a 16-bit int array, we can now run speech-to-text in batch mode to get the text:

text = model.stt(data16)
print(text)
your power is sufficient i said


Streaming transcription API

Streaming transcription takes a stream of your audio data through the connected microphone and transcribes it in real-time. The transcription is returned to your application in a stream of transcription events. 

Now let’s accomplish the same using Deepspeech streaming API. It consists of 3 steps: open session, feed data, close session.

Open a streaming session:

context = model.createStream() 

 Repeatedly feed chunks of speech buffer, and get interim results if desired:

buffer_len = len(buffer)
offset = 0
batch_size = 16384
text = ''
while offset < buffer_len:
    end_offset = offset + batch_size
    chunk = buffer[offset:end_offset]
    data16 = np.frombuffer(chunk, dtype=np.int16)
    model.feedAudioContent(context, data16)
    text = model.intermediateDecode(context)
    print(text)
    offset = end_offset

your power is
your power is suffi
your power is sufficient i said
your power is sufficient i said

 Close stream and get the final result:

text = model.finishStream(context)
print(text)
your power is sufficient i said


Building a Real-Time Transcriber

A transcriber is a tool for the transcription and annotation of speech signals for linguistic research; it lets you send an audio stream and receive a stream of text in real time.

A transcriber consists of two parts: a producer that captures voice from microphone, and a consumer that converts this speech stream to text. These two execute in parallel. The audio recorder keeps producing chunks of the speech stream. The speech recognizer listens to this stream, consumes these chunks upon arrival, and updates the transcribed text.
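The producer/consumer structure can be sketched with the standard library alone. In this sketch the chunk data and the "transcribe" step are stand-ins, not real audio or a real model; the real recorder and recognizer appear in the PyAudio code below.

```python
import queue
import threading

chunks = queue.Queue()  # thread-safe buffer between the two parts

def producer():
    # Stand-in for the audio recorder: push fake speech chunks.
    for chunk in [b'chunk1', b'chunk2', b'chunk3']:
        chunks.put(chunk)
    chunks.put(None)  # sentinel: recording finished

def consumer(results):
    # Stand-in for the recognizer: consume chunks as they arrive.
    while True:
        chunk = chunks.get()
        if chunk is None:
            break
        results.append(chunk.decode())  # "transcribe" the chunk

results = []
t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer, args=(results,))
t1.start(); t2.start()
t1.join(); t2.join()
print(results)  # ['chunk1', 'chunk2', 'chunk3']
```

The queue decouples the two sides: the producer never waits for recognition to finish, which is exactly what keeps a live transcriber responsive.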

To capture audio, we will use PortAudio, a free, cross-platform, open-source, audio I/O library. You have to download and install it.

PyAudio is Python bindings for PortAudio, and you can install it with pip:

$ pip3 install pyaudio  

PyAudio has two modes: blocking, where data has to be read (pulled) from the stream; and non-blocking, where a callback function is passed to PyAudio for feeding (pushing) the audio data stream. The non-blocking mechanism suits the transcriber. The data buffer processing code using the DeepSpeech streaming API has to be wrapped in a callback:

 text_so_far = ''
def process_audio(in_data, frame_count, time_info, status):
  global text_so_far
  data16 = np.frombuffer(in_data, dtype=np.int16)
  model.feedAudioContent(context, data16)
  text = model.intermediateDecode(context)
  if text != text_so_far:
    print('Interim text = {}'.format(text))
    text_so_far = text
  return (in_data, pyaudio.paContinue)


Now you have to create a PyAudio input stream with this callback:

import pyaudio

audio = pyaudio.PyAudio()
stream = audio.open(
    format=pyaudio.paInt16,        # 16-bit samples, as the model expects
    channels=1,                    # mono
    rate=16000,                    # the model's sample rate
    input=True,
    frames_per_buffer=1024,        # chunk size; any reasonable value works
    stream_callback=process_audio,
)
print('Please start speaking, when done press Ctrl-C ...')
stream.start_stream()

Finally, you need to print the final result and clean up when a user ends recording by pressing Ctrl-C:

import time

try:
    while stream.is_active():
        time.sleep(0.1)
except KeyboardInterrupt:
    # PyAudio: stop and release the audio stream
    stream.stop_stream()
    stream.close()
    audio.terminate()
    print('Finished recording.')
    # DeepSpeech: close the stream and get the final transcript
    text = model.finishStream(context)
    print('Final text = {}'.format(text))

That’s all it takes: just a few lines of Python code put it all together.

Speech Recognition Use Case

  • Streaming transcriptions can generate real-time subtitles for live broadcast media.
  • Transcripts can feed machine translation for easier speech-language translation.
  • Streaming transcriptions can provide assistance to the hearing impaired.
  • Lawyers can make real-time annotations on top of streaming transcriptions during courtroom depositions.
  • Speech biometrics, which can enable speaker identification.
  • Development of smarter IoT products.

Accuracy of the Included Model

The included English model was trained on 3816 hours of transcribed audio from Common Voice English, LibriSpeech, Fisher, and Switchboard. The model also includes around 1700 hours of transcribed WAMU (NPR) radio shows. It achieves a 7.5% word error rate on the LibriSpeech test-clean benchmark and is faster than real time on a single core of a Raspberry Pi 4.

Most of the data used to train it is American English, so it doesn’t perform as well as it could on other English dialects and accents. Still, it was Mozilla's best English model as of the start of 2020.

Improving the Deepspeech Performance

Speech recognition performance is measured by accuracy and speed: accuracy with the word error rate (WER), and speed with the real-time factor. As noted earlier, a variety of factors can affect performance, including pronunciation, accent, pitch, volume, and background noise.
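The word error rate can be computed as the word-level edit distance (substitutions, deletions, and insertions) between a reference and a hypothesis transcript, divided by the number of reference words. Here is a minimal sketch; the example strings are made up.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer('your power is sufficient i said',
          'your power is sufficient he said'))  # 1 error / 6 words
```

A WER of 7.5% therefore means roughly one word error per thirteen reference words.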

TensorFlow Lite can make a significant difference to DeepSpeech inference. It is a version of TensorFlow optimized for mobile and embedded devices. It reduced the DeepSpeech package size from 98 MB to 3.7 MB, and the English model size from 188 MB to 47 MB.

Although TensorFlow Lite is designed for mobile and embedded devices, it turned out that for DeepSpeech it is even faster on desktop platforms. So Mozilla made it available on Windows, macOS, and Linux, as well as Raspberry Pi and Android. DeepSpeech v0.6 with TensorFlow Lite runs faster than real time on a single core of a Raspberry Pi 4.

This brings about the possibility of quality speech recognition on a low-end computational device, such as the development of IoT devices that need voice recognition.


It is important to note the terms speech recognition and voice recognition are sometimes used interchangeably. However, the two terms mean different things. Speech recognition is used to identify words in spoken language. Voice recognition is a biometric technology used to identify a particular individual's voice or for speaker identification.

In this article, you had a quick introduction to the batch and streaming APIs of DeepSpeech 0.6 and learned how to combine them with PyAudio to create a speech transcriber. The ASR model used here is for US English speakers, so accuracy will vary for other accents. By swapping in a model for another language or accent, the same code will work for that language or accent.

We also looked at some of the use cases, and the advantages of running DeepSpeech on low-end computational devices.

How did you find this article? Did you encounter any errors? Do you have any suggestions, improvements, or feedback, or a topic I should cover next time? Please let me know in the comment box below.