
General information

Call for Proposals: 2023 Jelinek Summer Workshop on Speech and Language Technology

Deadline: Monday, November 14, 2022.

We are pleased to invite one-page research proposals for an

8-Week Residential Research Workshop on Speech and Language Technology

at Le Mans University in France

from June 12 to August 4, 2023


We invite one-page research proposals for the annual Frederick Jelinek Memorial Workshop in Speech and Language Technology. Proposals should aim to advance human language technologies (HLT) and related areas of Artificial Intelligence (AI) such as computer vision, and may address application areas such as healthcare and education. Proposals may address emerging challenges or long-standing problems. Areas of interest in 2023 include but are not limited to:

  • SPEECH PROCESSING: Dialectal and code-mixed speech recognition, far-field speech recognition, emotion and sentiment recognition, voice conversion and speech synthesis, spoken dialog systems; zero-shot and few-shot learning for speech processing; speech technologies for under-served languages and atypical speech.
  • TEXT PROCESSING: Neural architectures for natural language understanding; knowledge discovery from text in multimodal context; novel approaches to semantic parsing, question-answering, and dialog; translation of informal text and dialectal speech, and simultaneous translation; cross-language information retrieval.
  • MULTIMODAL PROCESSING: Interactive multimodal AI/HLT for emerging applications in education, healthcare delivery and public health; AI/HLT for natural and social sciences; AI/HLT for medical science.
  • EXPLAINABILITY, PRIVACY, EQUITY: Human-interpretable insights into AI/HLT systems; privacy preserving AI/HLT; bringing benefits of AI/HLT systems equitably to all subpopulations and subdomains.

These workshops are a continuation of the Johns Hopkins University CLSP summer workshop series, and will be hosted by various partner universities on a rotating basis.  The research topics selected for investigation by teams in past workshops should serve as good examples for prospective proposers: https://www.clsp.jhu.edu/workshops/

All received proposals will be given a brief peer review for suitability. Results of this screening will be communicated by November 21, 2022. Travel conditions permitting, authors of feasible proposals will be invited to an interactive peer-review meeting in Baltimore on December 16-18, 2022. At this meeting, proposals will be revised to incorporate new ideas and address any concerns. Two or three research topics, and the core teams to tackle them, will then be selected.

We attempt to bring the best researchers to the workshop to collaboratively pursue research on the selected topics. Each topic brings together a diverse team of researchers and students. Authors of successful proposals typically lead these teams. Other senior participants are drawn from academia, industry and government. PhD students familiar with the field are then selected in accordance with their demonstrated performance. Undergraduate participants, selected through a competitive search, are star rising-seniors: new to the field and showing outstanding academic promise.

If you are interested in participating in the 2023 Summer Workshop, we ask that you submit a one-page research proposal for consideration, detailing the problem to be addressed. If a topic in your area of interest is chosen as one of the topics to be pursued next summer, we expect you to be available to participate in the workshop for 6+ weeks. We are not asking for an ironclad commitment at this juncture, just a good faith commitment that if a project in your area of interest is chosen, you will actively pursue it. We in turn will make a good faith effort to accommodate any personal or logistical needs relating to your travel, residence, and participation in the workshop, including airfare, housing, meals and incidentals.

We intend to hold an in-person planning meeting, because we feel strongly that meeting in person and holding a residential workshop have been crucial to past successes. However, we recognize the uncertainty arising from the lingering COVID pandemic, and will revert to a hybrid or fully virtual event if necessary.


Please submit proposals to jsalt2023@lists.johnshopkins.edu by Monday, November 14, 2022.

AI Research Internships for Undergraduates

The Johns Hopkins University Center for Language and Speech Processing is organizing the Ninth Frederick Jelinek Memorial Summer Workshop from June 12 to August 5, 2023, hosted this year at the University of Le Mans, France, and is seeking outstanding members of the current junior class at US universities to join this residential research experience in human language technologies. Please complete this application no later than April 13, 2023.

The internship includes a comprehensive 2-week summer school on human language technology (HLT), followed by intensive research projects on select topics for 6 weeks.

The 8-week workshop provides an intense, dynamic intellectual environment.  Undergraduates work closely alongside senior researchers as part of a multi-university research team, which has been assembled for the summer to attack HLT problems of current interest.

Teams and Topics

The teams and topics for 2023 are:

  1. Better Together: Text + Context 
  2. Finite State Methods with Modern Neural Architectures for Speech Applications
  3. Automatic Design of Conversational Models from Human-to-Human Conversation
  4. Interpretability for Spoken Interactions: Embeddings to Explain Diarization Decisions

We hope that this highly selective and stimulating experience will encourage students to pursue graduate study in HLT and AI, as it has been doing for many years.

The summer workshop provides:

  • An opportunity to explore an exciting new area of research
  • A two-week tutorial on current speech and language technology
  • Mentoring by experienced researchers
  • Participation in project planning activities
  • A $6,000 stipend and $2,800 towards meals and incidental expenses
  • Private furnished accommodation for the duration of the workshop
  • Travel expenses to and from the workshop venue

Applications should be received by Thursday, April 13, 2023. The applicant must provide the name and contact information of a faculty nominator, who will be asked to upload a recommendation by Tuesday, April 18, 2023.

Questions may be directed to jsalt2023@lists.johnshopkins.edu.

Applicants are evaluated only on relevant skills, employment experience, past academic record, and the strength of letters of recommendation.  No limitation is placed on the undergraduate major.  Women and underrepresented minorities are encouraged to apply.


The Application Process

The application process has three stages.

  1. Completion and submission of the application form by April 13, 2023.
  2. Submission of the applicant’s CV to jsalt2023@lists.johnshopkins.edu by April 13, 2023.
  3. The applicant’s Faculty Nominator, whose contact information was provided in stage 1, will be asked to provide a recommendation letter in support of the applicant’s admission to the program. The letter is to be submitted electronically to jsalt2023@lists.johnshopkins.edu by April 18, 2023. Please note that the application will not be considered complete until it includes both the CV and the letter.

Feel free to contact the JSALT 2023 committee at jsalt2023@lists.johnshopkins.edu with any questions or concerns you may have.


Team Descriptions:

Better Together: Text + Context

It is standard practice to represent documents, (a), as embeddings, (d). We will do this in multiple ways. Embeddings based on deep nets (BERT) capture text, while other embeddings based on node2vec and GNNs (graph neural nets), (c), capture citation graphs, (b). Embeddings encode each of N ≈ 200M documents as a vector of K ≈ 768 hidden dimensions. The cosine of two vectors measures the similarity of the two corresponding documents. We will evaluate these embeddings and show that combinations of text and citations are better than either by itself on standard benchmarks of downstream tasks.
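As a rough illustration of the cosine comparison described above, the following sketch scores two documents using text embeddings, citation embeddings, and a combination of the two. The toy 3-dimensional vectors, the function names, and the choice to combine by concatenating unit-normalised vectors are all assumptions made for illustration; the team description does not say how text and citation embeddings will be combined.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors (an assumption)
    return dot / (norm_u * norm_v)


def l2_normalize(v):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm else list(v)


def combine(text_vec, cite_vec):
    """Hypothetical combination: concatenate unit-normalised embeddings."""
    return l2_normalize(text_vec) + l2_normalize(cite_vec)


# Toy vectors standing in for K ≈ 768 dimensions (made-up values).
doc_a_text, doc_a_cite = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
doc_b_text, doc_b_cite = [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]

text_sim = cosine(doc_a_text, doc_b_text)
cite_sim = cosine(doc_a_cite, doc_b_cite)
combined_sim = cosine(combine(doc_a_text, doc_a_cite),
                      combine(doc_b_text, doc_b_cite))
```

In this toy case the two documents have identical text embeddings (similarity 1.0) but orthogonal citation embeddings (similarity 0.0). With this particular concatenation scheme the combined similarity is the average of the two, approximately 0.5, so a disagreement between the two views is visible in the combined score.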

As deliverables, we will make embeddings available to the community so they can use them in a range of applications: ranked retrieval, recommender systems and routing papers to reviewers. Our interdisciplinary team will have expertise in machine learning, artificial intelligence, information retrieval, bibliometrics, NLP and systems. Standard embeddings are time invariant. The representation of a document does not change after it is published. But citation graphs evolve over time. The representation of a document should combine time invariant contributions from the authors with constantly evolving responses from the audience, like social media.


Finite State Methods with Modern Neural Architectures for Speech Applications

Many advanced technologies such as Voice Search, Assistant Devices (e.g. Alexa, Cortana, Google Home, …) or Spoken Machine Translation systems use speech signals as input. These systems are typically built in one of two ways:

  • End-to-end: a single system (usually a deep neural network) is built with the speech signal as input and the target signal as final output (for example, spoken English as input and French text as output). While this approach greatly simplifies the overall design of the system, it comes with two significant drawbacks:
    • lack of modularity: no sub-components can be modified or used in another system
    • large data requirements: the necessity to find hard-to-collect supervised task-specific data (input-output pairs)
  • Cascade: a separately built ASR system is used to convert the speech signal into text, and the output text is then passed to another back-end system. This approach greatly improves the modularity of the individual components of the pipeline and drastically reduces the need for task-specific data. The main disadvantages are:
    • ASR output is noisy: the downstream network is usually fed with the 1-best hypothesis of the ASR system which is prone to error (no account for uncertainty)
    • Separate optimization: each module is separately optimized and the joint-training of the whole pipeline is almost impossible as we cannot differentiate through the ASR best path

In this project we are seeking a speech representation interface that has the advantages of both the end-to-end and cascade systems without suffering from the drawbacks of either.
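The "ASR output is noisy" drawback of the cascade can be made concrete with a small sketch. It contrasts passing only the 1-best hypothesis downstream with passing a posterior-weighted expectation over several hypotheses. The n-best list, its scores, and the toy `feature` function are hypothetical, and an n-best list is used here only as a simple stand-in; this is not the interface the project proposes.

```python
import math


def softmax(scores):
    """Turn log-scores into a normalised posterior distribution."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def feature(text):
    """Toy downstream feature (hypothetical): number of words."""
    return float(len(text.split()))


# Hypothetical ASR n-best list: (hypothesis text, log-score).
nbest = [
    ("recognize speech", -1.0),
    ("wreck a nice beach", -1.2),
    ("recognise peach", -3.0),
]
posteriors = softmax([score for _, score in nbest])

# Cascade interface: a hard choice of the single best hypothesis.
one_best = max(nbest, key=lambda hyp: hyp[1])[0]
one_best_feature = feature(one_best)

# Uncertainty-aware alternative: expectation of the feature under the
# posterior, which varies smoothly with the ASR scores.
expected_feature = sum(p * feature(text)
                       for p, (text, _) in zip(posteriors, nbest))
```

Here the 1-best path yields a feature of 2.0 and discards the close runner-up entirely, while the expectation (roughly 2.84) reflects that a four-word hypothesis carries substantial probability. Because the expectation is a smooth function of the scores, it also avoids the hard selection step that blocks joint training in the cascade.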


Automatic Design of Conversational Models from Human-to-Human Conversation

Conversation models (or dialog models) currently in use are mostly hand-designed by data analysts as a conversation graph consisting of the system’s prompts and the user’s answers. More advanced conversation models [1, 2] are based on large language models fine-tuned on the dialog task, and still require significant amounts of training data. These models produce surprisingly fluent outputs but are not trustworthy because of hallucination (which can produce unexpected and wrong answers), so their adoption in commerce is limited.

Our goal is to explore ways to design conversation models in the form of finite state graphs [1] semi-automatically or fully automatically from an unlabeled set of audio or textual training dialogs. Words, phrases, or user turns can be converted to embeddings using (large) language models trained specifically on conversational data [3, 4]. These embeddings represent points in a vector space and carry semantic information. The conversations are trajectories in the vector space. By merging, pruning, and modeling the trajectories, we can obtain skeleton dialog models. These models could be used for fast data content exploration, content visualization, topic detection, topic-based clustering, and speech analysis, and above all for much faster and cheaper design of fully trustworthy conversation models for commercial dialog agents. The models can also target specific dialog strategies, such as the fastest way to reach a conversation goal (to provide useful information, to sell a good, or to entertain users for the longest time). One promising approach to building a conversational model from data is presented in [4]. Variational Recurrent Neural Networks are trained to produce discrete embeddings with a categorical distribution. The categories are conversation states. A transition probability matrix among the states is then calculated, and low-probability transitions are pruned to obtain a graph.
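The last step described above, estimating transition probabilities among conversation states and pruning the unlikely ones, can be sketched as follows. The state labels, the example dialogs, the function name `transition_graph`, and the threshold are hypothetical; in [4] the states come from a trained Variational Recurrent Neural Network, whereas here they are simply given as labelled sequences.

```python
from collections import Counter, defaultdict


def transition_graph(state_sequences, min_prob=0.1):
    """Estimate state-to-state transition probabilities from sequences of
    discrete states and drop edges whose probability is below min_prob.

    Probabilities of surviving edges are NOT renormalised after pruning.
    """
    counts = defaultdict(Counter)
    for seq in state_sequences:
        for src, dst in zip(seq, seq[1:]):
            counts[src][dst] += 1

    graph = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        graph[src] = {dst: n / total
                      for dst, n in dsts.items()
                      if n / total >= min_prob}
    return graph


# Hypothetical dialogs, each already mapped to a sequence of state labels.
dialogs = [
    ["greet", "ask", "answer", "bye"],
    ["greet", "ask", "answer", "ask", "answer", "bye"],
    ["greet", "ask", "bye"],
    ["greet", "bye"],
]
graph = transition_graph(dialogs, min_prob=0.3)
```

With a threshold of 0.3, the rare shortcuts greet → bye and ask → bye (each with probability 0.25) are removed, leaving a skeleton in which greet leads to ask, ask leads to answer, and answer leads either to bye or back to ask. Whether to renormalise the remaining edges is a design choice that the description leaves open.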


Interpretability for Spoken Interactions: Embeddings to Explain Diarization Decisions

Speaker diarization aims to answer the question of “who speaks when” in a recording. It is a key task for many speech technologies such as automatic speech recognition (ASR), speaker identification and dialog monitoring in different multi-speaker scenarios, including TV/radio, meetings, and medical conversations. In many domains, such as health or human-machine interactions, the prediction of speaker segments is not enough, and it is necessary to include additional para-linguistic information (age, gender, emotional state, speech pathology, etc.). However, most existing real-world applications are based on mono-modal modules trained separately, thus resulting in sub-optimal solutions. In addition, explainable AI is increasingly vital for transparent decision-making with machine learning: the user (a doctor, a judge, or a human scientist) has to justify the choice made on the basis of the system output.

This project aims to convert these outputs into interpretable clues (mispronounced phonemes, low speech rate, etc.) that explain the automatic diarization. While the question of simultaneously performed speech recognition and speaker diarization was addressed at JSALT 2020, this proposal intends to develop a multi-task diarization system based on a joint latent representation of speaker and para-linguistic information. The latent representation embeds multiple modalities, such as acoustic, linguistic, or visual. This joint embedding space will be projected into a sparse and non-negative space in which all dimensions are interpretable by design. In the end, the diarization output will be a rich segmentation in which speech segments are characterized by multiple labels and by interpretable attributes derived from the latent space.
