JSALT 2023

Finite State Methods with Modern Neural Architectures for Speech Applications and Beyond

Abstract

Many advanced technologies, such as voice search, assistant devices (e.g. Alexa, Cortana, Google Home, ...), or spoken machine translation systems, use speech signals as input. These systems are built in one of two ways:

  • End-to-end: a single system (usually a deep neural network) is built with the speech signal as input and the target signal as final output (for example, spoken English as input and French text as output). While this approach greatly simplifies the overall design of the system, it comes with two significant drawbacks:
    • lack of modularity: no sub-component can be modified or reused in another system
    • large data requirements: one must collect hard-to-obtain supervised task-specific data (input-output pairs)
  • Cascade: a separately built ASR system converts the speech signal into text, and the output text is then passed to a back-end system. This approach greatly improves the modularity of the individual components of the pipeline and drastically reduces the need for task-specific data. The main disadvantages are:
    • ASR output is noisy: the downstream network is usually fed the 1-best hypothesis of the ASR system, which is prone to errors (no account of uncertainty)
    • Separate optimization: each module is optimized separately, and joint training of the whole pipeline is almost impossible because we cannot differentiate through the ASR best path

In this project we seek a speech representation interface that combines the advantages of both end-to-end and cascade systems while avoiding their drawbacks.

Full proposal


  1. To design a speech interface module that converts the speech signal into a speech representation with the following properties:
    1. Accelerator friendly: small memory footprint, low computational load, predictable memory access pattern, and easy to port onto different hardware architectures; it can be used for training with GPUs/TPUs or deployed on devices with limited memory and computing power
    2. Downstream invariant: replacing one speech representation module with another does not require retraining of the back-end modules.
    3. Differentiable: given task-specific paired data, one can jointly train the back-end system and the speech representation module
    4. Streaming: in real-world applications, ASR systems are used in streaming mode (at each moment in time, the model only has access to what has been said so far). Any practical speech representation should also be built under the streaming assumption.
    5. Less dependent on supervised data: there should be semi-supervised algorithms for improving the quality of the speech representation in languages with low or zero supervised resources.
    6. Adaptive and extendable: the speech representation should be modifiable or adaptable, and extendable with other sources of information such as language models.
  2. To develop libraries for extracting speech representations, along with ways to modify and extend them, for example by integrating other sources of information such as language models
  3. To validate the merit of our proposed speech representations, we propose to apply them to two different tasks:
    1. Long-form speech modeling and segmentation
    2. Edge-cloud computing

Finite State Speech Representations

[R1] proposed a finite-state, lattice-based interpretation of all the common training criteria in speech recognition. The weights of this lattice are computed by a neural network built directly on top of the speech signal. An example of such a lattice is shown in Fig. 1 (copied from Figure 1 of [R1]). We will use this lattice as a dense representation of the speech signal. Such a lattice already possesses most of the properties listed above for a speech representation. In this workshop, we would like to extend this representation with a focus on the following problems:

  1. Improve computational complexity and memory footprint: the computation and memory footprint of the lattice-based representation scale exponentially with the number of states. We would like to explore ways to limit the computational complexity and memory footprint while increasing the number of states.
  2. Semi-supervised training of the lattice: the lattice-based speech representation is currently trained with supervised data. We would like to explore ways to train this lattice with both supervised data and unpaired speech or text data.

Furthermore, we would like to collaborate with other sub-teams on the following problems:

  1. Improving speech representation by sequence training
  2. Biasing speech representation
  3. Exploring ways to use speech representation in long-form ASR and edge-cloud computing

Figure 1: Dense trellis representing the Q x T output of an ASR neural network. Q is the neural network output dimension and T is the input sequence length.
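As a concrete illustration of this dense trellis, the sketch below (plain NumPy; the function and argument names are our own, not from [R1]) runs the forward algorithm in the log semiring over a Q x T grid of neural-network scores to obtain the total log-weight of all paths, the quantity that the training criteria discussed in [R1] are built from:

```python
import numpy as np
from scipy.special import logsumexp

def trellis_forward(scores, log_trans):
    """Total log-weight of all paths through a dense Q x T trellis.

    scores[q, t]   : log-weight of being in state q at frame t
                     (the Q x T neural-network output of Figure 1).
    log_trans[p, q]: log-weight of moving from state p to state q.
    """
    Q, T = scores.shape
    alpha = scores[:, 0].copy()          # log-weights after the first frame
    for t in range(1, T):
        # log-semiring "matrix-vector product": sum over predecessors
        alpha = logsumexp(alpha[:, None] + log_trans, axis=0) + scores[:, t]
    return logsumexp(alpha)              # sum over final states
```

With all scores and transitions set to log(1) = 0, the result is the log of the number of paths (Q^T), which is a quick sanity check of the recursion.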


Scale and Extend the Speech Representation

The lattice-based speech representation can be further modified, enriched, or compressed through classical Finite State Transducer (FST) operations such as composition, intersection, and pruning, and finally passed to downstream systems. To guarantee differentiability and hardware portability (GPU/CPU), essential requirements for any machine learning pipeline, we propose to revisit the FST framework through the lens of sparse linear algebra. Indeed, expressing FST operations in terms of matrix-vector multiplications (or similar linear operators) greatly facilitates the computation of gradients for a broad range of semirings and makes GPU-based implementations straightforward.
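To make the sparse linear algebra view concrete, here is a minimal sketch (hypothetical encoding and names, using SciPy sparse matrices) of one basic FST computation, the total weight of all accepting paths of an acyclic weighted automaton, expressed purely as matrix-vector products; every step is differentiable and maps directly onto GPU sparse kernels:

```python
import numpy as np
from scipy.sparse import csr_matrix

def total_weight(trans, init, final, max_steps):
    """Total weight of all accepting paths of an acyclic WFSA.

    The automaton is encoded with sparse linear algebra:
      trans[p, q] : sum of arc weights from state p to state q
      init[p]     : initial weight of state p
      final[q]    : final weight of state q
    In the (+, x) semiring, the sum over all paths with at most
    max_steps arcs is  init @ (I + M + M^2 + ...) @ final,
    accumulated iteratively with sparse matrix-vector products.
    """
    v = init.copy()
    acc = float(v @ final)          # weight of zero-arc paths
    for _ in range(max_steps):
        v = trans.T @ v             # v[q] = sum_p v[p] * trans[p, q]
        acc += float(v @ final)     # weight of paths ending here
    return acc
```

Swapping the (+, x) operations for (min, +) or log-semiring counterparts yields shortest-distance and forward computations with the same access pattern, which is what makes a single sparse kernel reusable across semirings.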


Long-form ASR

Most ASR research relies on the simplifying assumption that speech is pre-segmented into short utterances, usually lasting a few seconds. This assumption is rarely met in practice, however. Moreover, segmenting long audio recordings is costly and challenging and remains an open issue. We propose to use our FST-based speech representations to tackle this problem efficiently. The idea is to rely on a sparse linear algebra interpretation of the FST framework to implement fast, memory-efficient, and streaming versions of the forward-backward and Viterbi algorithms for aligning and segmenting long recordings. The crux of the method is to:

  • Efficiently represent the alignment/decoding trellis via sparse matrices (representing FSTs).
  • Approximate the backward pass by looking only a limited number of steps into the future.
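The second point can be sketched as follows (a minimal illustration in the probability semiring with dense NumPy arrays for readability; names and the lookahead parameter are our own). The exact backward variable at frame t needs the whole future; truncating the backward recursion to k frames ahead lets each posterior be emitted with only a k-frame delay:

```python
import numpy as np

def streaming_posteriors(scores, trans, k):
    """Approximate state posteriors with a k-frame lookahead backward pass.

    scores[q, t]: per-frame state likelihoods from the neural network
    trans[p, q] : transition probabilities
    k           : how many future frames the backward pass may look at
    """
    Q, T = scores.shape
    # Forward pass: exact and purely streaming.
    alpha = np.zeros((Q, T))
    alpha[:, 0] = scores[:, 0]
    for t in range(1, T):
        alpha[:, t] = (trans.T @ alpha[:, t - 1]) * scores[:, t]
    # Truncated backward pass: restart from all-ones k frames ahead,
    # so beta_t only depends on frames t..t+k instead of t..T-1.
    post = np.zeros((Q, T))
    for t in range(T):
        beta = np.ones(Q)
        for s in range(min(t + k, T - 1), t, -1):
            beta = trans @ (scores[:, s] * beta)
        g = alpha[:, t] * beta
        post[:, t] = g / g.sum()
    return post
```

With k at least the utterance length, this reduces to the exact forward-backward algorithm; smaller k trades accuracy for latency and memory, which is the knob a streaming segmenter would tune.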

We plan to evaluate this work by measuring the WER of downstream ASR systems trained on the provided segments.

Edge-cloud distributed inference

Speech-based technologies find applications in a wide variety of environments: computing clusters, personal computers, smartphones, embedded devices, and more. However, the ASR pipeline is hard to adapt to all these targets, especially those with limited memory and computing resources. More precisely, the inference part of the ASR pipeline is often the bottleneck: in many situations it can take up to 75% of the processing time. We want to demonstrate that our FST-based representations, since they are based on sparse matrices, are highly efficient and can adapt well to a wide range of computing environments. We propose to benchmark inference with our FST-based representation against other competitive baselines such as Kaldi or k2 on different devices: a Raspberry Pi (i.e. the edge), a personal computer, and a GPU-based environment (i.e. the cloud). To provide a fine-grained analysis, we propose to measure the processing time, memory bandwidth, and efficiency (in FLOPS) of the different implementations/representations.

Finally, we propose to go one step further and show how our FST-based speech representations can help in a distributed inference environment. Indeed, one can estimate the amount of work needed for inference from different characteristics of the FSTs (effective beam size, sparsity level, ...), therefore allowing the system to decide on a per-case basis how, and on which node of the distributed system, to run the inference. For this project we propose to demonstrate this ability by building a small-scale prototype of an edge-cloud distributed system with two agents (one small, one big) that chooses the inference target based on the nature of the FST-based neural network output.
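A per-utterance routing rule of this kind might look like the following sketch (the thresholds and names are hypothetical placeholders; in practice they would be calibrated on the measured capacity of each device):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Illustrative limits for the small agent (hypothetical values).
EDGE_MAX_NNZ = 50_000     # total arcs the edge device can decode in time
EDGE_MAX_ACTIVE = 64      # rough effective beam: active arcs per frame

def choose_target(lattice, n_frames):
    """Pick the inference node from cheap properties of the FST output.

    lattice  : sparse (states x states) arc-weight matrix produced by
               the speech-representation module
    n_frames : number of frames in the utterance

    Returns "edge" when the lattice is sparse enough for the small
    agent, otherwise "cloud".
    """
    nnz = lattice.nnz                    # sparsity level: number of arcs
    active = nnz / max(n_frames, 1)      # proxy for per-frame beam size
    if nnz <= EDGE_MAX_NNZ and active <= EDGE_MAX_ACTIVE:
        return "edge"
    return "cloud"
```

The point of the sketch is that both quantities are available in O(1) from the sparse-matrix encoding, so the routing decision itself adds essentially no overhead.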



[R1] E. Variani, K. Wu, M. Riley, D. Rybach, M. Shannon, C. Allauzen, "Global normalization for streaming speech recognition in a modular framework", NeurIPS 2022

[R2] K. Wu, E. Variani, T. Bagby, M. Riley, "LAST: Scalable lattice-based speech modeling in JAX", submitted to ICASSP 2023

Group Members

Full time members

Senior Researchers

  • Lucas Ondel Yang (LISN, CNRS), team leader
  • Ehsan Variani (Google)
  • Ke Wu (Google)
  • Corey Miller (Rev)
  • Pablo Riera (CONICET)
  • Jan "Yenda" Trmal (Johns Hopkins University)

Junior Researchers

  • Martin Kocour (Brno University of Technology)
  • Desh Raj (Johns Hopkins University)
  • Umberto Capellazo (FBK)
  • Mohamed Salah Zaiem (Telecom Paris)
  • Tina Raissi (RWTH Aachen)
  • Petr Polak (Charles University)

Undergraduate students

  • Imani Finkley (Cornell University)
  • Meona Khetrapal (University of Southern California)

Part-time / Other potential members

  • Michael Riley (Google)
  • Lukas Burget (Brno University of Technology)
  • Cyril Allauzen (Google)
  • Daniele Falavigna (FBK)
  • Alessio Brutti (FBK)
  • Dan Povey (Xiaomi)
  • Georg Heigold (Google)
  • Jennifer Drexler Fox (Rev)