by Xilin Jiang, Cong Han, Yinghao Aaron Li and Nima Mesgarani
from Columbia University, New York, USA
Abstract
In daily life, we encounter a variety of sounds, both desirable and undesirable, with limited control over their presence and volume. Our work introduces “Listen, Chat, and Remix” (LCR), a novel multimodal sound remixer that controls each sound source in a mixture based on user-provided text instructions. LCR distinguishes itself with a user-friendly text interface and its unique ability to remix multiple sound sources simultaneously within a mixture, without needing to separate them. Users input open-vocabulary text prompts, which are interpreted by a large language model to create a semantic filter for remixing the sound mixture. The system then decomposes the mixture into its components, applies the semantic filter, and reassembles the filtered components into the desired output. We developed a 160-hour dataset with over 100k mixtures, including speech and various audio sources, along with text prompts for diverse remixing tasks including extraction, removal, and volume control of single or multiple sources. Our experiments demonstrate significant improvements in signal quality across all remixing tasks and robust performance in zero-shot scenarios with varying numbers and types of sound sources.
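As a rough illustration of what a remixing instruction asks for, the sketch below builds a target mixture as a gain-weighted sum of sources that are already known. It is not the LCR model: LCR applies a learned semantic filter to the mixture and does not require the separated sources. The function names, the action labels, and the gain values (2.0 for volume up, 0.5 for volume down) are assumptions made for this example only.

```python
# Hypothetical per-source actions. The gain values are illustrative
# assumptions, not the values used by LCR.
ACTION_GAIN = {
    "keep": 1.0,    # leave the source unchanged
    "remove": 0.0,  # silence the source
    "up": 2.0,      # volume up (assumed factor)
    "down": 0.5,    # volume down (assumed factor)
}


def remix(sources, actions):
    """Return the gain-weighted sum of the sources.

    sources: dict mapping a source name to a list of samples (equal lengths).
    actions: dict mapping a source name to a key of ACTION_GAIN;
             sources without an entry are kept unchanged.
    """
    length = len(next(iter(sources.values())))
    out = [0.0] * length
    for name, samples in sources.items():
        if len(samples) != length:
            raise ValueError("all sources must have the same length")
        gain = ACTION_GAIN[actions.get(name, "keep")]
        for i, sample in enumerate(samples):
            out[i] += gain * sample
    return out


def extract(sources, targets):
    """Extraction keeps the named sources and removes every other one."""
    actions = {name: "keep" if name in targets else "remove" for name in sources}
    return remix(sources, actions)


# Toy two-sample "signals" named after the sources of Input Mixture #1.
sources = {
    "female speaker": [1.0, 0.0],
    "male speaker": [0.0, 1.0],
    "helicopter": [0.5, 0.5],
    "turkey": [0.25, -0.25],
}

# "Let's pull out the sound of the fast-talking man and the turkey."
pulled_out = extract(sources, {"male speaker", "turkey"})

# "Why not get rid of the man's voice and the turkey's noise,
#  and reduce the helicopter's volume?"
reduced = remix(
    sources,
    {"male speaker": "remove", "turkey": "remove", "helicopter": "down"},
)
```

In this simplified view, the "Target Mixture" clips on this page play the role of the weighted sum, and the "Remixed Mixture" clips are the system's estimate of it from the input mixture and the text prompt alone.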
This page contains a set of audio samples. We write 4 to 6 text prompts for each sample and present the mixture remixed according to each prompt. For the best audio experience, we recommend opening this website in Chrome and wearing headphones.
Sound mixtures grouped by composition of sources
Only the “2 Speech + 2 Audio (VGGSound)” composition is used for training. All other compositions are zero-shot.
2 Speech + 2 Audio (VGGSound)
🔉Input Mixture #1 consists of
female speaker with high pitch, normal tempo, high energy, and neutral emotion
male speaker with low pitch, high tempo, normal energy, and neutral emotion
helicopter
turkey gobbling
✏️"Increase the volume of the speeches and decrease the volume of the background sounds."
✏️"Let's pull out the sound of the fast-talking man and the turkey."
✨Remixed Mixture A
Target Mixture A
✨Remixed Mixture B
Target Mixture B
✏️"Remove all people talking."
✏️"Why not get rid of the man's voice and the turkey's noise, and reduce the helicopter's volume?"
✨Remixed Mixture C
Target Mixture C
✨Remixed Mixture D
Target Mixture D
✏️"Could you kindly eliminate the sound of the helicopter? I appreciate it."
✏️"Please extract the person with an elevated tone."
✨Remixed Mixture E
Target Mixture E
✨Remixed Mixture F
Target Mixture F
🔉Input Mixture #2 consists of
male speaker with low pitch, low tempo, low energy, and sad emotion
female speaker with high pitch, normal tempo, normal energy, and neutral emotion
playing accordion
playing drum kit
✏️"Is it possible to single out the accordion's performance?"
✏️"Lower the volume of the live accordion music that is currently being played, please."
✨Remixed Mixture A
Target Mixture A
✨Remixed Mixture B
Target Mixture B
✏️"Could you raise the decibel level of the gloomy speaker that has a subdued tone?"
✏️"Please raise the sound for the female speaker with a standard tempo, amplify the playing accordion, reduce the playing drum kit, and decrease the volume for the male speaker with a sluggish pace."
✨Remixed Mixture C
Target Mixture C
✨Remixed Mixture D
Target Mixture D
✏️"I'd like you to exclude the speaker with a high-frequency voice and average vitality, conveying a neutral tone."
✏️"Make everything quieter."
✨Remixed Mixture E
Target Mixture E
✨Remixed Mixture F
Target Mixture F
🔉Input Mixture #3 consists of
male speaker with low pitch, normal tempo, normal energy, and neutral emotion
male speaker with low pitch, high tempo, normal energy, and neutral emotion
underwater bubbling
train horning
✏️"Enhance this recording by removing all the noises."
✏️"Could you raise the audio level of the underwater bubbling sound exclusively?"
✨Remixed Mixture A
Target Mixture A
✨Remixed Mixture B
Target Mixture B
✏️"Can you adjust the sound so that both speakers are louder, the train horn is quieter, and the underwater bubbling is completely removed from the recording?"
✏️"Is it possible to turn down the speakers' volume and crank up the background ambiance?"
✨Remixed Mixture C
Target Mixture C
✨Remixed Mixture D
Target Mixture D
✏️"I'd like you to edit out the speaker characterized by a faster tempo and the train horn sound altogether."
✏️"Can you remove the speaker with a rapid rhythm?"
✨Remixed Mixture E
Target Mixture E
✨Remixed Mixture F
Target Mixture F
🔉Input Mixture #4 consists of
female speaker with normal pitch, normal tempo, normal energy, and neutral emotion
male speaker with low pitch, normal tempo, high energy, and neutral emotion
playing hammond organ
rain
✏️"Can you edit the recording to extract the sound of the organ and rainfall?"
✏️"Can you modify the sound so that the rain and Hammond organ are quieter, the female speaker with normal pitch and energy is louder, and the male speaker with low pitch and high energy is entirely eliminated from the recording?"
✨Remixed Mixture A
Target Mixture A
✨Remixed Mixture B
Target Mixture B
✏️"I'd like you to turn down the volume for the lady with the average pitch."
✏️"Please remove the organ music and both the female and male speakers in the audio track."
✨Remixed Mixture C
Target Mixture C
✨Remixed Mixture D
Target Mixture D
✏️"Please make the organ music louder."
✏️"Let's extract the part featuring the speaker characterized by typical tone?"
✨Remixed Mixture E
Target Mixture E
✨Remixed Mixture F
Target Mixture F
🔉Input Mixture #5 consists of
female speaker with normal pitch, normal tempo, low energy, and neutral emotion
female speaker with normal pitch, low tempo, normal energy, and neutral emotion
church bell ringing
playing theremin
✏️"Extract the bell ringing from the rest of the audio."
✏️"Could you amplify the audio level of the speaker with normal energy and slow tempo, and also raise the church bell ringing sound, but lower the volume of the speaker with low vitality?"
✨Remixed Mixture A
Target Mixture A
✨Remixed Mixture B
Target Mixture B
✏️"Let's extract all human voices from the recording."
✏️"Reduce the background audio and turn up the volume on the talking parts."
✨Remixed Mixture C
Target Mixture C
✨Remixed Mixture D
Target Mixture D
✏️"Is it feasible to erase the theremin's playing sound?"
✏️"I'd like you to amplify the playing theremin sound and reduce the church bell ringing sound."
✨Remixed Mixture E
Target Mixture E
✨Remixed Mixture F
Target Mixture F
2 Speech
🔉Input Mixture #6 consists of
female speaker with normal pitch, normal tempo, low energy, and contempt emotion
female speaker with normal pitch, normal tempo, normal energy, and neutral emotion
✏️"Can you extract the speaker characterized by their contemptuous manner?"
✏️"Is it possible to turn up the volume of the speaker exhibiting typical enthusiasm and reduce the volume of the speaker showing low energy?"
✨Remixed Mixture A
Target Mixture A
✨Remixed Mixture B
Target Mixture B
✏️"Could you isolate the speaker without emotion and speaking in a normal volume?"
✏️"Why not turn up the sound of the contemptuous speaker while removing the speaker maintaining a neutral emotional state?"
✨Remixed Mixture C
Target Mixture C
✨Remixed Mixture D
Target Mixture D
🔉Input Mixture #7 consists of
female speaker with normal pitch, high tempo, normal energy, and neutral emotion
male speaker with low pitch, low tempo, high energy, and neutral emotion
✏️"I'd appreciate it if you could eliminate the speaker who is speaking at a rapid pace."
✏️"The female speaker talks is loud. Could you turn down the volume?"
✨Remixed Mixture A
Target Mixture A
✨Remixed Mixture B
Target Mixture B
✏️"Remove the gentleman with a deep tone."
✏️"Begin by decreasing the volume of the female speaker with a fast tempo, and then increase the volume of the male speaker with a low pitch."
✨Remixed Mixture C
Target Mixture C
✨Remixed Mixture D
Target Mixture D
2 Audio (VGGSound)
🔉Input Mixture #8 consists of
playing tabla
missile launch
✏️"Please turn up the volume of the playing tabla sound and remove the missile launch sound."
✏️"Kindly turn down the sound of the rocket being launched."
✨Remixed Mixture A
Target Mixture A
✨Remixed Mixture B
Target Mixture B
✏️"Extract the tabla music for me."
✏️"Could you decrease the volume for both the missile launch and the playing tabla sounds?"
✨Remixed Mixture C
Target Mixture C
✨Remixed Mixture D
Target Mixture D
🔉Input Mixture #9 consists of
fireworks banging
vacuum cleaner cleaning floors
✏️"Is it possible to remove the noise from the vacuum and increase the volume of the fireworks?"
✏️"Please take out the sound of the fireworks banging and enhance the volume of the vacuum cleaner cleaning the floors."
✨Remixed Mixture A
Target Mixture A
✨Remixed Mixture B
Target Mixture B
✏️"Could you eliminate the noise from the fireworks explosions, please?"
✏️"Just extract the firework for me."
✨Remixed Mixture C
Target Mixture C
✨Remixed Mixture D
Target Mixture D
2 Speech + 1 Audio (VGGSound)
🔉Input Mixture #10 consists of
male speaker with normal pitch, low tempo, normal energy, and surprised emotion
female speaker with high pitch, high tempo, low energy, and neutral emotion
playing banjo
✏️"Lower the sound level of the banjo playing, remove the woman with a high-paced, low-energy delivery, and increase the volume of the surprised man who speaks slowly with normal enthusiasm."
✏️"Try to delete the surprised male speaker, if you can."
✨Remixed Mixture A
Target Mixture A
✨Remixed Mixture B
Target Mixture B
✏️"Could you decrease the audio level of the female speaker with a fast tempo and low vitality who maintains a neutral emotion?"
✏️"Extract the banjo music from the audio."
✨Remixed Mixture C
Target Mixture C
✨Remixed Mixture D
Target Mixture D
🔉Input Mixture #11 consists of
male speaker with low pitch, high tempo, low energy, and neutral emotion
female speaker with high pitch, normal tempo, normal energy, and neutral emotion
wind chime
✏️"Extract the background sound but make it quieter."
✏️"Can you extract the audio of the speaker with a low-pitched voice and a brisk tempo?"
✨Remixed Mixture A
Target Mixture A
✨Remixed Mixture B
Target Mixture B
✏️"Increase the audio of the wind chime, decrease the volume of the male speaker with a fast pace and low enthusiasm, and remove the female speaker with a regular pace and average enthusiasm."
✏️"Please take out the individual with a low pitch."
✨Remixed Mixture C
Target Mixture C
✨Remixed Mixture D
Target Mixture D
1 Speech + 2 Audio (VGGSound)
🔉Input Mixture #12 consists of
male speaker with low pitch, high tempo, normal energy, and neutral emotion
cuckoo bird calling
playing oboe
✏️"I'd like to extract the audio of an oboe being played, please."
✏️"Please remove all non-human sounds."
✨Remixed Mixture A
Target Mixture A
✨Remixed Mixture B
Target Mixture B
✏️"Enhance the sound level of the cuckoo bird's call, please."
✏️"First, volume up the man with a high tempo and regular enthusiasm. Second, volume down the cuckoo bird's calling. Third, remove the oboe music."
✨Remixed Mixture C
Target Mixture C
✨Remixed Mixture D
Target Mixture D
🔉Input Mixture #13 consists of
female speaker with high pitch, high tempo, low energy, and neutral emotion
dog growling
chicken crowing
✏️"Kindly remove the menacing growl produced by the canine."
✏️"Could you extract the sound of the woman speaking and the chicken, and then decrease the chicken's sound?"
✨Remixed Mixture A
Target Mixture A
✨Remixed Mixture B
Target Mixture B
✏️"Please quieten down the woman's voice, but make the dog and chicken's voices louder."
✏️"I only want to keep the animal voices of the dog and chicken in the mix."
✨Remixed Mixture C
Target Mixture C
✨Remixed Mixture D
Target Mixture D
2 Speech + 2 Audio (FSD50K, seen audio labels)
🔉Input Mixture #14 consists of
male speaker with low pitch, normal tempo, low energy, and neutral emotion
female speaker with high pitch, normal tempo, normal energy, and neutral emotion
acoustic guitar
(dog) bark
✏️"Make the conversation as clean as possible."
✏️"Boost the volume of the conversation, and also quieten down those distracting background sounds."
✨Remixed Mixture A
Target Mixture A
✨Remixed Mixture B
Target Mixture B
✏️"Could you pull out the dog barking sound for me? Thanks."
✏️"Can you isolate the speaker with a deep tone and low enthusiasm?"
✨Remixed Mixture C
Target Mixture C
✨Remixed Mixture D
Target Mixture D
✏️"I'd like you to turn down the volume of the low-pitch male speaker and remove the dog barking noise."
✏️"Please extract the sound of the guitar and the dog's barking."
✨Remixed Mixture E
Target Mixture E
✨Remixed Mixture F
Target Mixture F
🔉Input Mixture #15 consists of
female speaker with high pitch, normal tempo, normal energy, and neutral emotion
female speaker with normal pitch, normal tempo, normal energy, and neutral emotion
toilet flush
siren
✏️"Let's remove the annoying siren sound."
✏️"Can you edit the audio to extract the speaker characterized by standard pitch?"
✨Remixed Mixture A
Target Mixture A
✨Remixed Mixture B
Target Mixture B
✏️"Could you extract the high-pitched speaker and the wailing siren?"
✏️"I want you to single out the siren sound."
✨Remixed Mixture C
Target Mixture C
✨Remixed Mixture D
Target Mixture D
✏️"I'd like you to decrease the siren volume, lower the sound of the toilet flushing, reduce the volume of the speaker with normal pitch, and boost the volume of the speaker with high pitch."
✏️"Please raise the sound level of the high-pitched speaker, remove the speaker with typical pitch, and erase the siren sound."
✨Remixed Mixture E
Target Mixture E
✨Remixed Mixture F
Target Mixture F
2 Speech + 2 Audio (FSD50K, unseen audio labels)
🔉Input Mixture #16 consists of
male speaker with low pitch, low tempo, high energy, and neutral emotion
female speaker with high pitch, normal tempo, high energy, and neutral emotion
scissors
bowed string instrument
✏️"Please get rid of the sound of scissors."
✏️"Pump up the volume on the talks and reduce other noises."
✨Remixed Mixture A
Target Mixture A
✨Remixed Mixture B
Target Mixture B
✏️"Please separate the part featuring someone playing a string instrument."
✏️"I'd like you to isolate both the female speaker and the bowed string instrument sound from the mixture."
✨Remixed Mixture C
Target Mixture C
✨Remixed Mixture D
Target Mixture D
✏️"Remove all speakers from the audio, can you?"
✏️"Is it possible to turn up the volume of the string instrument and reduce the volume of the scissors?"
✨Remixed Mixture E
Target Mixture E
✨Remixed Mixture F
Target Mixture F
🔉Input Mixture #17 consists of
male speaker with low pitch, normal tempo, normal energy, and neutral emotion
female speaker with normal pitch, normal tempo, high energy, and neutral emotion
musical keyboard
seagull
✏️"Is it possible to extract the song of the seagull?"
✏️"Add more volume to the surrounding, and decrease the speech volume."