Finite State Methods with Modern Neural Architectures for Speech Applications and Beyond

Full proposal

Objectives

To design a speech interface module which converts speech signal into a speech representation with the following properties:
1. Accelerator friendly: small memory footprint; low computational load; predictable memory access pattern and easy-to-port onto different hardware architectures: can be used for training with GPU/TPU or deployed on devices with limited memory and computing power
2. Downstream invariant: replacing a speech representation module with another one does not require retraining of the back-end modules.
3. Differentiable: given task specific pairwise data, one can jointly train the back-end system and the speech representation module
4. Streaming: in real world applications the ASR systems are used in the streaming mode (at each moment of time, the model only has access to what has been said so far). Any practical speech representation should also be built under streaming assumption.
5. Less supervised data dependent: there should be semi-supervised algorithms for improving the quality of speech representation in low or zero supervised resource languages.
6. Adaptive and extendable: a speech representation should be modifiable or adaptive, and can be extended with other sources of information like language models.
To develop libraries for extracting speech representation and ways to modify and extend them for example by integrating other source of information like language models
To validate the merit of our proposed speech representations, we propose to apply them on two different tasks:
1. Long-form speech modeling and segmentation
2. Edge-cloud computing

Finite State Speech Representations

[R1] has proposed a finite state lattice based interpretation of all the common training criteria in speech recognition. The weights of this lattice are computed by a neural network directly built on the top of the speech signal. An example of such lattice is shown in Fig1 (copied from Figure 1 of [R1]). We will use this lattice as a dense representation of the speech signal. Such a lattice already possesses most of the properties listed for a speech representation. In this workshop, we would like to extend this representation with the focus on the following problems:

Improve computation complexity and memory footprint: the computation and memory footprint of the lattice based representation exponentially scales by the number of states. We would like to explore ways to limit the computational complexity and memory footprint while increasing the number of states.
Semi-supervised training of the lattice: the current way of training lattice based speech representation is using supervised training data. We would like to explore ways to train this lattice with supervised and unpaired speech or text data.

Furthermore, we would like to collaborate with other sub-teams in the following problems:

Improving speech representation by sequence training
Biasing speech representation
Exploring ways to use speech representation in long-form ASR and edge-cloud computing

Figure 1: Dense trellis representing the Q x T output of an ASR-based neural network. Q is the neural network output dimension and T is the input sequence length.

Scale and Extend the Speech Representation

The lattice-based speech representation can be further modified, enriched or compressed through classical Finite State Transducer (FST) operations such as composition, intersection, pruning, ... and finally passed to downstream systems. To guarantee differentiability and hardware portability (GPU / CPU) – essential requirements for any machine learning pipeline - we propose to revisit the FST framework through the lens of sparse linear algebra. Indeed, expressing FST operations in terms of matrix-vector multiplications (or similar linear operator) greatly facilitates the computation of gradients for a broad range of semiring and makes GPU-based implementation trivial.

Long-form ASR

Most ASR research rely on the simplifying assumption that speech is pre-segmented in short duration utterances usually lasting a few seconds. This assumption is however rarely met in practice. Moreover, the task of segmenting long audio recordings is costly and challenging and remains an open issue. We propose to use our FST-based speech representations to efficiently tackle this problem. The idea is to rely on a sparse linear algebra interpretation of the FST framework to implement a fast, memory efficient and streaming version of the forward-backward and Viterbi algorithm for aligning and segmenting long recordings. The crux of the method is to :

Efficiently representing the alignment / decoding trellis via sparse matrices (representing FSTs).
Approximate the backward pass by looking at steps into the future.

We plan to evaluate this work by measuring the WER of downstream ASR systems trained on the provided segments.

Edge-cloud distributed inference

Speech-based technologies find applications in a wide variety of environments : computing cluster, personal computer, smartphone, embedded devices, ... However, the ASR pipeline is hard to adapt to all these targets especially the ones with memory and computing limited resources. More precisely, the inference part of the ASR pipeline is often the bottleneck : in many situations it can take up to 75% of the processing time. We want to demonstrate that our FST-based representations - since they are based on sparse-matrices – are highly efficient and can adapt well in a wide range of computing environments. We propose to benchmark inference with our FST-based representation against other competitive baselines such as Kalid or k2 on different devices : raspberry pi (i.e. the edge), personal computer and GPU-based environment (i.e. the cloud). To provide a fine-grain analysis we propose to measure the processing time and the memory bandwidth and the efficiency (in FLOPS) of different implementations / representations.

Finally, we propose to go one step further and show how our FST-based speech representations can help in a distributed inference environment. Indeed, one can guess the amount of work needed for inference based on different characteristics of the FSTs (effective beam size, sparsity level, ...) allowing therefore to decide on a per-case basis how and on which node of the distributed system to run the inference. For this project we propose to demonstrate this ability by making a small scale prototype of an edge-cloud distributed system with 2 agents (1 small, 1 big) that chooses the inference target based on the nature of the FST-based neural network output.

References:

[R1] E. Variani, K. Wu, M. Riley, D. Rybach, M. Shannon, C. Allauzen, “Global normalization for streaming speech recognition in a modular framework”, Neurips 2022

[R2] K. Wu, E. Variani, T. Bagby, M. Riley, “LAST: Scalable lattice-based speech modeling in jax”, Submitted to ICASSP 2023

Published on March 2, 2023