Humans have been able to record their voices and play them back for more than a century. But even now, after decades of technological and computational advancements, using computers to decipher what is being said, and who is saying it, remains a monumental challenge.
Dr. John H. L. Hansen, director of the Center for Robust Speech Systems (CRSS) at the University of Texas at Dallas, provided an update on the state of speaker diarization and recognition during a presentation on November 4, 2022, at the University of Miami’s Richter Library.
Systems that transcribe audio have improved dramatically in recent years, but Hansen said computer programs still struggle to identify other critical components of what they’re hearing—the identity of each speaker, the language they’re speaking, and the meaning behind their words.
“Alexa or Siri, or any of these recognition engines, they’re getting that text information, but they have to do something with that,” Hansen said. “The knowledge of how you take that and move it upstream is actually more critical.”
The presentation was part of UM’s “Data Citizens: A Distinguished Lecture Series,” co-sponsored by UM’s Institute for Data Science and Computing (IDSC) and the Miami Clinical and Translational Science Institute (CTSI). The series has been running since 2020 and has featured international experts discussing the next waves in Big Data and artificial intelligence.
Hansen provided two key examples to show the advancements that have been made in the world of voice recognition and diarization, and the challenges that remain.
In one case, Hansen led a team of researchers that digitized, and made sense of, audio recordings captured during NASA’s Apollo 11 mission, which culminated in Neil Armstrong becoming the first man to set foot on the moon. NASA had recorded every moment of conversation exchanged between the astronauts and mission control for all the Apollo missions, but the analog tapes sat in storage for decades.
Hansen’s team developed a process to digitize the recordings for the Apollo 11 mission, a daunting task considering the recording and playback equipment was 50 years old and held together by at least one bungee cord. “I literally rewired this thing myself,” he said. “This is a scary thing. Not once did they ask me, ‘Do you know what you’re doing?’”
Using a National Science Foundation (NSF) Community Resource project grant, his team at UT Dallas then fine-tuned their software and added 4.2 billion words specifically related to space travel to help with language modeling. That allowed team members to identify each speaker throughout the 19,000 hours of audio captured during the mission. The goal of the project is to build a community-based audio resource that could help researchers study team-based communications, but the historical payoff has been just as significant. The files were downloaded by researchers at over 160 institutions around the world, and CNN synced the audio with existing video clips to produce the 50th-anniversary “Apollo 11” movie that the network aired as a documentary special in 2019.
Despite the success of that project, Hansen said over 150,000 hours of audio from the Apollo missions remain on 1-inch analog tapes in a NASA storage facility. The NSF has awarded his team $1.2 million to continue the work for an additional four years, but Hansen said more will be needed to complete the task.
Hansen’s second example showed the difficulty of capturing and making sense of current-day audio recorded in crowded rooms. In the Apollo recordings, NASA only allowed one person to speak with one astronaut at a time, creating a relatively easy-to-understand back and forth in a controlled setting. That’s not the case in the elementary classrooms that Hansen has been using as a test for his software.
“The problem with children is that they’re not speaking like adults, so there’s pauses, repetitions, non-speech vocalizations like laughter, crying, shouting, running, coughing, sneezing,” he said. “All these factors cause recognition systems not to do so well.”
The dual goals of that long-running project are to improve systems that identify different speakers in different settings, and to help teachers better understand how each child is communicating at different stages of development. He has been working on that project with the LENA Foundation, a Colorado-based nonprofit that works to improve early childhood verbal development.
Hansen said they’ve been able to improve their systems and provide data that teachers and parents use to gauge each student’s progress. But there, too, Hansen said, funding is hard to find. While he applauds the work of large technology companies to improve speech recognition and processing (and many of the 99 Ph.D. and M.S. students he’s advised have gone on to work for Google, Apple, Amazon, and others), Hansen said there just isn’t much funding available for early childhood research. “Most of these companies are not going to turn this into a product because it’s not commercially viable for them,” Hansen said.
In the absence of corporate funding, Hansen said he and others studying voice diarization will just have to keep inventing new ways to push the field ahead to improve verbal communications in classroom settings between teachers and their students.
STORY by: Alan R. Gomez