Deadline: Monday, November 14, 2022.
We are pleased to invite one-page research proposals for an
8-Week Residential Research Workshop on Speech and Language Technology
at Le Mans University in France
from June 12 to August 4, 2023
We invite one-page research proposals for the annual Frederick Jelinek Memorial Workshop in Speech and Language Technology. Proposals should aim to advance human language technologies (HLT) and related areas of Artificial Intelligence (AI) such as computer vision, and may address application areas such as healthcare and education. Proposals may address emerging challenges or long-standing problems. Areas of interest in 2023 include but are not limited to:
These workshops are a continuation of the Johns Hopkins University CLSP summer workshop series, and will be hosted by various partner universities on a rotating basis. The research topics selected for investigation by teams in past workshops should serve as good examples for prospective proposers: https://www.clsp.jhu.edu/workshops/
All received proposals will undergo a cursory peer review for suitability. Results of this screening will be communicated by November 21, 2022. Travel conditions permitting, authors of feasible proposals will be invited to an interactive peer-review meeting in Baltimore on December 16-18, 2022. At this meeting, proposals will be revised to incorporate new ideas and address any concerns, and two or three research topics, along with the core teams to tackle them, will be selected.
We aim to bring the best researchers to the workshop to collaboratively pursue research on the selected topics. Each topic brings together a diverse team of researchers and students. Authors of successful proposals typically lead these teams. Other senior participants are drawn from academia, industry, and government. PhD students familiar with the field are then selected based on their demonstrated performance. Undergraduate participants, selected through a competitive search, are star rising seniors: new to the field and showing outstanding academic promise.
If you are interested in participating in the 2023 Summer Workshop, we ask that you submit a one-page research proposal for consideration, detailing the problem to be addressed. If a topic in your area of interest is chosen as one of the topics to be pursued next summer, we expect you to be available to participate in the workshop for 6+ weeks. We are not asking for an ironclad commitment at this juncture, just a good faith commitment that if a project in your area of interest is chosen, you will actively pursue it. We in turn will make a good faith effort to accommodate any personal/logistical needs and to cover your travel, residence, and participation in the workshop, including airfare, housing, meals, and incidentals.
We intend to hold an in-person planning meeting, as we feel strongly that meeting in person and holding a residential workshop have been crucial to past successes. But we recognize the uncertainty arising from the lingering COVID pandemic, and will switch to a hybrid or fully virtual event if necessary.
Please submit proposals to jsalt2023@lists.johnshopkins.edu by Monday, November 14, 2022.
The Johns Hopkins University Center for Language and Speech Processing is organizing the Ninth Frederick Jelinek Memorial Summer Workshop from June 12 to August 5, 2023, hosted this year at the University of Le Mans, France, and is seeking outstanding members of the current junior class at US universities to join this residential research experience in human language technologies. Please complete this application no later than April 13, 2023.
The internship includes a comprehensive 2-week summer school on human language technology (HLT), followed by 6 weeks of intensive research on selected project topics.
The 8-week workshop provides an intense, dynamic intellectual environment. Undergraduates work closely alongside senior researchers as part of a multi-university research team, which has been assembled for the summer to attack HLT problems of current interest.
Teams and Topics
The teams and topics for 2023 are:
We hope that this highly selective and stimulating experience will encourage students to pursue graduate study in HLT and AI, as it has been doing for many years.
The summer workshop provides:
Applications should be received by Thursday, April 13, 2023. The applicant must provide the name and contact information of a faculty nominator, who will be asked to upload a recommendation by Tuesday, April 18, 2023.
Questions may be directed to jsalt2023@lists.johnshopkins.edu
Applicants are evaluated only on relevant skills, employment experience, past academic record, and the strength of letters of recommendation. No limitation is placed on the undergraduate major. Women and underrepresented minorities are encouraged to apply.
The Application Process
The application process has three stages.
Feel free to contact the JSALT 2023 committee at jsalt2023@lists.johnshopkins.edu with any questions or concerns you may have.
Team Descriptions:
Better Together: Text + Context
It is standard practice to represent documents as embeddings, and we will do this in multiple ways: embeddings based on deep nets (e.g., BERT) capture the text, while embeddings based on node2vec and GNNs (graph neural networks) capture the citation graph. Embeddings encode each of N ≈ 200M documents as a vector of K ≈ 768 hidden dimensions, and the cosine of two vectors measures the similarity of the corresponding documents. We will evaluate these embeddings and show that combinations of text and citations are better than either by itself on standard benchmarks of downstream tasks.
As deliverables, we will make embeddings available to the community so they can be used in a range of applications: ranked retrieval, recommender systems, and routing papers to reviewers. Our interdisciplinary team will have expertise in machine learning, artificial intelligence, information retrieval, bibliometrics, NLP, and systems. Standard embeddings are time-invariant: the representation of a document does not change after it is published. But citation graphs evolve over time. The representation of a document should combine time-invariant contributions from the authors with constantly evolving responses from the audience, much as on social media.
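As a minimal sketch of how text and citation embeddings might be combined, the Python snippet below assumes precomputed matrices text_emb and graph_emb (hypothetical names); the project may well learn the combination rather than simply normalizing and concatenating, as done here.

```python
import numpy as np

def l2_normalize(X):
    """Scale each row to unit length so that dot products equal cosines."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def combine_embeddings(text_emb, graph_emb):
    """Concatenate text (e.g., BERT) and citation-graph (e.g., node2vec/GNN)
    embeddings after normalizing each view separately."""
    return np.hstack([l2_normalize(text_emb), l2_normalize(graph_emb)])

def cosine_similarity(emb, i, j):
    """Cosine similarity between documents i and j."""
    e = l2_normalize(emb)
    return float(e[i] @ e[j])

# Toy example: 3 documents, 768-dim text vectors, 128-dim graph vectors.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(3, 768))   # hypothetical BERT document embeddings
graph_emb = rng.normal(size=(3, 128))  # hypothetical node2vec/GNN embeddings
combined = combine_embeddings(text_emb, graph_emb)
print(cosine_similarity(combined, 0, 1))
```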
Finite State Methods with Modern Neural Architectures for Speech Applications
Many advanced technologies such as Voice Search, Assistant Devices (e.g., Alexa, Cortana, Google Home) or Spoken Machine Translation systems take speech signals as input. These systems are built in one of two ways: as cascades of separately trained components, or as end-to-end models trained directly on the speech input.
In this project we seek a speech representation interface that combines the advantages of end-to-end and cascade systems while avoiding the drawbacks of both.
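As a toy illustration of why the choice of interface matters, the sketch below (with hypothetical names, not the project's actual design) contrasts a cascade that forwards only the 1-best transcript with a weighted n-best interface, a crude stand-in for the finite-state lattices the project title alludes to.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Hypothesis:
    tokens: List[str]    # word sequence proposed by the recognizer
    log_prob: float      # combined acoustic + language model score

# A cascade typically passes only the 1-best transcript downstream,
# discarding the recognizer's uncertainty:
def one_best(hyps: List[Hypothesis]) -> List[str]:
    return max(hyps, key=lambda h: h.log_prob).tokens

# A richer interface keeps the full weighted hypothesis set, so downstream
# modules (translation, understanding, ...) can reason over alternatives:
def weighted_interface(hyps: List[Hypothesis]) -> List[Hypothesis]:
    return sorted(hyps, key=lambda h: -h.log_prob)

hyps = [
    Hypothesis(["wreck", "a", "nice", "beach"], -4.1),
    Hypothesis(["recognize", "speech"], -3.9),
]
print(one_best(hyps))            # only the top transcript survives
print(weighted_interface(hyps))  # alternatives and their scores are preserved
```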
Automatic Design of Conversational Models from Human-to-Human Conversation
Currently used conversation models (or dialog models) are mostly hand-designed by data analysts as a conversation graph consisting of the system's prompts and the user's answers. Advanced conversation models [1, 2] are based on large language models fine-tuned on the dialog task, and still require significant amounts of training data. These models produce surprisingly fluent outputs, but they are not trustworthy because of hallucination (which can produce unexpected and wrong answers), and their commercial adoption is therefore limited.
Our goal is to explore ways to design conversation models in the form of finite-state graphs [1], semi-automatically or fully automatically, from an unlabeled set of audio or textual training dialogs. Words, phrases, or user turns can be converted to embeddings using (large) language models trained specifically on conversational data [3, 4]. These embeddings represent points in a vector space and carry semantic information, and the conversations are trajectories in that space. By merging, pruning, and modeling the trajectories, we can obtain skeleton dialog models. These models could be used for fast exploration of data content, content visualization, topic detection and topic-based clustering, speech analysis, and, above all, for much faster and cheaper design of fully trustworthy conversation models for commercial dialog agents. The models can also target specific dialog strategies, such as the fastest way to reach a conversation goal (to provide useful information, sell a product, or entertain users for the longest time).

One promising approach to building a conversational model from data is presented in [4]: Variational Recurrent Neural Networks are trained to produce discrete embeddings with a categorical distribution, whose categories are conversation states. A transition probability matrix among states is then calculated, and low-probability transitions are pruned to obtain a graph (see the sketch below).
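As a minimal sketch of this graph-building step, the following Python snippet assumes conversations have already been mapped to sequences of discrete state IDs (e.g., by a VRNN with categorical latent variables); the function names and pruning threshold are illustrative, not the project's actual implementation.

```python
import numpy as np

def transition_graph(state_sequences, num_states, min_prob=0.05):
    """Estimate a state-transition matrix from discrete dialog-state
    sequences and prune low-probability edges to obtain a graph."""
    counts = np.zeros((num_states, num_states))
    for seq in state_sequences:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1
    # Row-normalize counts into transition probabilities.
    probs = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    # Keep only edges above the pruning threshold.
    edges = [(a, b, probs[a, b])
             for a in range(num_states)
             for b in range(num_states)
             if probs[a, b] >= min_prob]
    return probs, edges

# Toy example: three dialogs already mapped to discrete state IDs.
dialogs = [[0, 1, 2, 3], [0, 1, 3], [0, 2, 3]]
probs, edges = transition_graph(dialogs, num_states=4, min_prob=0.1)
print(edges)
```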
Interpretability for Spoken Interactions: Embeddings to Explain Diarization Decisions
Speaker diarization aims at answering the question "who speaks when" in a recording. It is a key task for many speech technologies such as automatic speech recognition (ASR), speaker identification, and dialog monitoring in multi-speaker scenarios, including TV/radio, meetings, and medical conversations. In many domains, such as health or human-machine interaction, predicting speaker segments is not enough: additional para-linguistic information (age, gender, emotional state, speech pathology, etc.) must also be included. However, most existing real-world applications are based on mono-modal modules trained separately, resulting in sub-optimal solutions. In addition, explainable AI is vital for transparent decision-making with machine learning: the user (a doctor, a judge, or a human scientist) must be able to justify choices made on the basis of the system output.
This project aims at converting these outputs into interpretable clues (mispronounced phonemes, low speech rate, etc.) that explain the automatic diarization decisions. While simultaneous speech recognition and speaker diarization was addressed at JSALT 2020, this proposal intends to develop a multi-task diarization system based on a joint latent representation of speaker and para-linguistic information. The latent representation embeds multiple modalities, such as acoustics, language, or vision. This joint embedding space will be projected into a sparse, non-negative space in which all dimensions are interpretable by design. In the end, the diarization output will be a rich segmentation in which speech segments are characterized by multiple labels and interpretable attributes derived from the latent space.
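A minimal sketch of what projecting into a sparse, non-negative, interpretable space could look like is given below; the attribute names are hypothetical and a random projection stands in for the learned one, so the project's actual method may differ.

```python
import numpy as np

# Hypothetical interpretable attribute names, one per target dimension;
# in practice these would be defined and learned by the team.
ATTRIBUTES = ["speech_rate_low", "pitch_high", "phoneme_distortion", "child_voice"]

def interpretable_projection(latent, W, threshold=0.1):
    """Project a dense latent embedding into a sparse, non-negative space:
    linear map, ReLU to enforce non-negativity, then small values zeroed
    out to encourage sparsity."""
    activations = np.maximum(latent @ W, 0.0)    # non-negative
    activations[activations < threshold] = 0.0   # sparse
    return dict(zip(ATTRIBUTES, activations.round(2)))

rng = np.random.default_rng(0)
latent = rng.normal(size=256)                   # joint speaker/para-linguistic embedding
W = rng.normal(size=(256, len(ATTRIBUTES)))     # projection matrix (learned in practice)
print(interpretable_projection(latent, W))
```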