Skip to main content

Automatic Speech Recognition

LINTO-PLATFORM-STT

LinTO-platform-stt can either be used as a standalone transcription service or deployed within a micro-services infrastructure using a message broker connector.

Pre-requisites

Hardware

To run the transcription models you'll need:

  • At least 7Go of disk space to build the docker image.
  • Up to 7GB of RAM depending on the model used.
  • One CPU per worker. Inference time scales on CPU performances.

Model

LinTO-Platform-STT accepts two kinds of models:

  • LinTO Acoustic and Languages models.
  • Vosk models.

ASR Models

Docker

The transcription service requires docker up and running.

(micro-service) Service broker and shared folder

The STT only entry point in task mode are tasks posted on a message broker. Supported message broker are RabbitMQ, Redis, Amazon SQS. On addition, as to prevent large audio from transiting through the message broker, STT-Worker use a shared storage folder (SHARED_FOLDER).

Deploy linto-platform-stt

1- First step is to build or pull the image:

git clone https://github.com/linto-ai/linto-platform-stt.git
cd linto-platform-stt
docker build . -t linto-platform-stt:latest

or

docker pull lintoai/linto-platform-stt

2- Download the models

Have the acoustic and language model ready at AM_PATH and LM_PATH if you are using LinTO models. If you are using a Vosk model, have it ready at MODEL.

3- Fill the .env

cp .envdefault .env
PARAMETERDESCRIPTIONEXEMPLE
SERVING_MODESTT serving mode see Serving modehttp|task|websocket
MODEL_TYPEType of STT model usedlin|vosk
ENABLE_STREAMINGUsing http serving mode, enable the /streaming websocket routetrue|false
SERVICE_NAMEUsing the task mode, set the queue's name for task processingmy-stt
SERVICE_BROKERUsing the task mode, URL of the message brokerredis://my-broker:6379
BROKER_PASSUsing the task mode, broker passwordmy-password
STREAMING_PORTUsing the websocket mode, the listening port for ingoing WS connexions.80
CONCURRENCYMaximum number of parallel requests>1

Serving mode

Serving Modes

STT can be used three different ways:

Mode is specified using the .env value or environment variable SERVING_MODE.

SERVING_MODE=http

HTTP Server

The HTTP serving mode deploys an HTTP server and a swagger-ui to allow transcription request on a dedicated route.

The SERVING_MODE value in the .env should be set to http.

docker run --rm \
-p HOST_SERVING_PORT:80 \
-v AM_PATH:/opt/AM \
-v LM_PATH:/opt/LM \
--env-file .env \
linto-platform-stt:latest

This will run a container providing an HTTP API binded on the host port HOST_SERVING_PORT.

Parameters:

PARAMETERDESCRIPTIONEXEMPLE
HOST_SERVING_PORTHost serving port80
AM_PATHPath to the acoustic model on the host machine mounted to /opt/AM/my/path/to/models/AM_fr-FR_v2.2.0
LM_PATHPath to the language model on the host machine mounted to /opt/LM/my/path/to/models/fr-FR_big-v2.2.0
MODEL_PATHPath to the model (using MODEL_TYPE=vosk) mounted to /opt/model/my/path/to/models/vosk-model

Micro-service & task broker

The HTTP serving mode connect a celery worker to a message broker.

The SERVING_MODE value in the .env should be set to task.

tip

LinTO-platform-stt in task mode is not intended to be launch manually. However, if you intent to connect it to your custom message's broker here are the parameters:

You need a message broker up and running at MY_SERVICE_BROKER.

docker run --rm \
-v AM_PATH:/opt/models/AM \
-v LM_PATH:/opt/models/LM \
-v SHARED_AUDIO_FOLDER:/opt/audio \
--env-file .env \
linto-platform-stt:latest

Parameters:

PARAMETERDESCRIPTIONEXEMPLE
AM_PATHPath to the acoustic model on the host machine mounted to /opt/AM/my/path/to/models/AM_fr-FR_v2.2.0
LM_PATHPath to the language model on the host machine mounted to /opt/LM/my/path/to/models/fr-FR_big-v2.2.0
MODEL_PATHPath to the model (using MODEL_TYPE=vosk) mounted to /opt/model/my/path/to/models/vosk-model
SHARED_AUDIO_FOLDERShared audio folder mounted to /opt/audio/my/path/to/models/vosk-model

Websocket Server

Websocket server's mode deploy a streaming transcription service only.

The SERVING_MODE value in the .env should be set to websocket.

Usage is the same as the http streaming API

Usages

HTTP API

/healthcheck

Returns the state of the API

Method: GET

Returns "1" if healthcheck passes.

/transcribe

Transcription API

  • Method: POST
  • Response content: text/plain or application/json
  • File: Wave file 16b 16Khz

Return the transcripted text using "text/plain" or a json object when using "application/json" structure as followed:

{
"text" : "This is the transcription",
"words" : [
{"word":"This", "start": 0.123, "end": 0.453, "conf": 0.9},
...
]
"confidence-score": 0.879
}

/streaming

The /streaming route is accessible if the ENABLE_STREAMING environment variable is set to true.

The route accepts websocket connexions. Exchanges are structured as followed:

  1. Client send a json {"config": {"sample_rate":16000}}.
  2. Client send audio chunk (go to 3- ) or {"eof" : 1} (go to 5-).
  3. Server send either a partial result {"partial" : "this is a "} or a final result {"text": "this is a transcription"}.
  4. Back to 2-
  5. Server send a final result and close the connexion.

Connexion will be closed and the worker will be freed if no chunk are received for 10s.

/docs

The /docs route offers a OpenAPI/swagger interface.

Through the message broker

STT-Worker accepts requests with the following arguments: file_path: str, with_metadata: bool

  • file_path: Is the location of the file within the shared_folder. /.../SHARED_FOLDER/{file_path}
  • with_metadata: If True, words timestamps and confidence will be computed and returned. If false, the fields will be empty.

Return format

On a successfull transcription the returned object is a json object structured as follow:

{
"text" : "this is the transcription as text",
"words": [
{
"word" : "this",
"start": 0.0,
"end": 0.124,
"conf": 1.0
},
...
],
"confidence-score": ""
}
  • The text field contains the raw transcription.
  • The word field contains each word with their timestamp and individual confidence. (Empty if with_metadata=False)
  • The confidence field contains the overall confidence for the transcription. (0.0 if with_metadata=False)

Test

Curl

You can test your http API using curl:

curl -X POST "http://YOUR_SERVICE:YOUR_PORT/transcribe" -H  "accept: application/json" -H  "Content-Type: multipart/form-data" -F "file=@YOUR_FILE;type=audio/x-wav"

License

This project is developped under the AGPLv3 License (see LICENSE).

Acknowlegment.