Listen, Chat, and Remix: Text-Guided Soundscape Remixing for Enhanced Auditory Experience

by Xilin Jiang, Cong Han, Yinghao Aaron Li and Nima Mesgarani
from Columbia University, New York, USA

Abstract

In daily life, we encounter a variety of sounds, both desirable and undesirable, with limited control over their presence and volume. Our work introduces “Listen, Chat, and Remix” (LCR), a novel multimodal sound remixer that controls each sound source in a mixture based on user-provided text instructions. LCR distinguishes itself with a user-friendly text interface and its unique ability to remix multiple sound sources simultaneously within a mixture, without needing to separate them. Users input open-vocabulary text prompts, which are interpreted by a large language model to create a semantic filter for remixing the sound mixture. The system then decomposes the mixture into its components, applies the semantic filter, and reassembles the filtered components into the desired output. We developed a 160-hour dataset with over 100k mixtures, including speech and various audio sources, along with text prompts for diverse remixing tasks including extraction, removal, and volume control of single or multiple sources. Our experiments demonstrate significant improvements in signal quality across all remixing tasks and robust performance in zero-shot scenarios with varying numbers and types of sound sources.
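To make the three remixing tasks concrete, the sketch below treats a target mixture as a gain-weighted sum of its sources: extraction keeps only the named sources, removal zeroes them, and volume control scales them. This is an illustration only, not the LCR model, which works on the mixture directly. The function names, the action labels, and the gain values (2.0 for louder, 0.5 for quieter) are assumptions made for this example and are not taken from the paper.

```python
# Hypothetical sketch: a target mixture as a gain-weighted sum of sources.
# Action names and gain values are illustrative assumptions, not values
# taken from the LCR paper.

ACTION_GAINS = {
    "keep": 1.0,
    "remove": 0.0,
    "volume_up": 2.0,    # assumed boost factor
    "volume_down": 0.5,  # assumed attenuation factor
}


def remix(sources, actions):
    """Return the sample-wise sum of sources scaled by per-source gains.

    sources: dict mapping a source name to a list of float samples
             (all lists must have the same length).
    actions: dict mapping a source name to a key of ACTION_GAINS;
             sources that are not mentioned default to "keep".
    """
    lengths = {len(samples) for samples in sources.values()}
    if len(lengths) != 1:
        raise ValueError("all sources must have the same number of samples")
    out = [0.0] * lengths.pop()
    for name, samples in sources.items():
        gain = ACTION_GAINS[actions.get(name, "keep")]
        for i, x in enumerate(samples):
            out[i] += gain * x
    return out


def extract(sources, wanted):
    """Extraction keeps the wanted sources and removes all others."""
    actions = {name: ("keep" if name in wanted else "remove") for name in sources}
    return remix(sources, actions)


# Toy two-sample "waveforms" standing in for real audio.
toy_sources = {
    "man": [1.0, 2.0],
    "turkey": [0.5, 0.5],
    "helicopter": [4.0, 4.0],
}

# "Get rid of the man's voice and the turkey's noise, and reduce the
# helicopter's volume."
quieter_helicopter = remix(
    toy_sources,
    {"man": "remove", "turkey": "remove", "helicopter": "volume_down"},
)

# "Pull out the sound of the fast-talking man and the turkey."
man_and_turkey = extract(toy_sources, {"man", "turkey"})
```

In this framing a single prompt can assign a different action to every source at once, which is what the multi-source prompts in the samples below ask for.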

This page contains a set of audio samples. We write 4 to 6 text prompts for each sample and show the sound mixture remixed according to each prompt. We recommend opening this website with Chrome and wearing headphones for the best audio experience.

Sound mixtures grouped by composition of sources

Only the “2 Speech + 2 Audio (VGGSound)” composition is used for training. All other compositions are zero-shot.

2 Speech + 2 Audio (VGGSound)

🔉Input Mixture #1 consists of	female speaker with high pitch, normal tempo, high energy, and neutral emotion	male speaker with low pitch, high tempo, normal energy, and neutral emotion	helicopter	turkey gobbling

✨Remixed Mixture A	Target Mixture A	✨Remixed Mixture B	Target Mixture B
✏️"Increase the volume of the speeches and decrease the volume of the background sounds."		✏️"Let's pull out the sound of the fast-talking man and the turkey."

✏️"Remove all people talking."		✏️"Why not get rid of the man's voice and the turkey's noise, and reduce the helicopter's volume?"
✨Remixed Mixture C	Target Mixture C	✨Remixed Mixture D	Target Mixture D

✏️"Could you kindly eliminate the sound of the helicopter? I appreciate it."		✏️"Please extract the person with an elevated tone."
✨Remixed Mixture E	Target Mixture E	✨Remixed Mixture F	Target Mixture F

🔉Input Mixture #2 consists of	male speaker with low pitch, low tempo, low energy, and sad emotion	female speaker with high pitch, normal tempo, normal energy, and neutral emotion	playing accordion	playing drum kit

✨Remixed Mixture A	Target Mixture A	✨Remixed Mixture B	Target Mixture B
✏️"Is it possible to single out the accordion's performance?"		✏️"Lower the volume of the live accordion music that is currently being played, please."

✏️"Could you raise the decibel level of the gloomy speaker that has a subdued tone?"		✏️"Please raise the sound for the female speaker with a standard tempo, amplify the playing accordion, reduce the playing drum kit, and decrease the volume for the male speaker with a sluggish pace."
✨Remixed Mixture C	Target Mixture C	✨Remixed Mixture D	Target Mixture D

✏️"I'd like you to exclude the speaker with a high-frequency voice and average vitality, conveying a neutral tone."		✏️"Make everything quieter."
✨Remixed Mixture E	Target Mixture E	✨Remixed Mixture F	Target Mixture F

🔉Input Mixture #3 consists of	male speaker with low pitch, normal tempo, normal energy, and neutral emotion	male speaker with low pitch, high tempo, normal energy, and neutral emotion	underwater bubbling	train horning

✨Remixed Mixture A	Target Mixture A	✨Remixed Mixture B	Target Mixture B
✏️"Enhance this recording by removing all the noises."		✏️"Could you raise the audio level of the underwater bubbling sound exclusively?"

✏️"Can you adjust the sound so that both speakers are louder, the train horn is quieter, and the underwater bubbling is completely removed from the recording?"		✏️"Is it possible to turn down the speakers' volume and crank up the background ambiance?"
✨Remixed Mixture C	Target Mixture C	✨Remixed Mixture D	Target Mixture D

✏️"I'd like you to edit out the speaker characterized by a faster tempo and the train horn sound altogether."		✏️"Can you remove the speaker with a rapid rhythm?"
✨Remixed Mixture E	Target Mixture E	✨Remixed Mixture F	Target Mixture F

🔉Input Mixture #4 consists of	female speaker with normal pitch, normal tempo, normal energy, and neutral emotion	male speaker with low pitch, normal tempo, high energy, and neutral emotion	playing hammond organ	rain

✨Remixed Mixture A	Target Mixture A	✨Remixed Mixture B	Target Mixture B
✏️"Can you edit the recording to extract the sound of the organ and rainfall?"		✏️"Can you modify the sound so that the rain and Hammond organ are quieter, the female speaker with normal pitch and energy is louder, and the male speaker with low pitch and high energy is entirely eliminated from the recording?"

✏️"I'd like you to turn down the volume for the lady with the average pitch."		✏️"Please remove the the organ music and both the female and male speakers in the audio track."
✨Remixed Mixture C	Target Mixture C	✨Remixed Mixture D	Target Mixture D

✏️"Please make the organ music louder."		✏️"Let's extract the part featuring the speaker characterized by typical tone?"
✨Remixed Mixture E	Target Mixture E	✨Remixed Mixture F	Target Mixture F

🔉Input Mixture #5 consists of	female speaker with normal pitch, normal tempo, low energy, and neutral emotion	female speaker with normal pitch, low tempo, normal energy, and neutral emotion	church bell ringing	playing theremin

✨Remixed Mixture A	Target Mixture A	✨Remixed Mixture B	Target Mixture B
✏️"Extract the bell ringing from the rest of the audio."		✏️"Could you amplify the audio level of the speaker with normal energy and slow tempo, and also raise the church bell ringing sound, but lower the volume of the speaker with low vitality?"

✏️"Let's extract all human voices from the recording."		✏️"Reduce the background audio and turn up the volume on the talking parts."
✨Remixed Mixture C	Target Mixture C	✨Remixed Mixture D	Target Mixture D

✏️"Is it feasible to erase the theremin's playing sound?"		✏️"I'd like you to amplify the playing theremin sound and reduce the church bell ringing sound."
✨Remixed Mixture E	Target Mixture E	✨Remixed Mixture F	Target Mixture F

2 Speech

🔉Input Mixture #6 consists of	female speaker with normal pitch, normal tempo, low energy, and contempt emotion	female speaker with normal pitch, normal tempo, normal energy, and neutral emotion

✨Remixed Mixture A	Target Mixture A	✨Remixed Mixture B	Target Mixture B
✏️"Can you extract the speaker characterized by their contemptuous manner?"		✏️"Is it possible to turn up the volume of the speaker exhibiting typical enthusiasm and reduce the volume of the speaker showing low energy?"

✏️"Could you isolate the speaker without emotion and speaking in a normal volume?"		✏️"Why not turn up the sound of the contemptuous speaker while removing the speaker maintaining a neutral emotional state?"
✨Remixed Mixture C	Target Mixture C	✨Remixed Mixture D	Target Mixture D

🔉Input Mixture #7 consists of	female speaker with normal pitch, high tempo, normal energy, and neutral emotion	male speaker with low pitch, low tempo, high energy, and neutral emotion

✨Remixed Mixture A	Target Mixture A	✨Remixed Mixture B	Target Mixture B
✏️"I'd appreciate it if you could eliminate the speaker who is speaking at a rapid pace."		✏️"The female speaker talks is loud. Could you turn down the volume?"

✏️"Remove the gentleman with a deep tone."		✏️"Begin by decreasing the volume of the female speaker with a fast tempo, and then increase the volume of the male speaker with a low pitch."
✨Remixed Mixture C	Target Mixture C	✨Remixed Mixture D	Target Mixture D

2 Audio (VGGSound)

🔉Input Mixture #8 consists of	playing tabla	missile launch

✨Remixed Mixture A	Target Mixture A	✨Remixed Mixture B	Target Mixture B
✏️"Please turn up the volume of the playing tabla sound and remove the missile launch sound."		✏️"Kindly turn down the sound of the rocket being launched."

✏️"Extract the tabla music for me."		✏️"Could you decrease the volume for both the missile launch and the playing tabla sounds?"
✨Remixed Mixture C	Target Mixture C	✨Remixed Mixture D	Target Mixture D

🔉Input Mixture #9 consists of	fireworks banging	vacuum cleaner cleaning floors

✨Remixed Mixture A	Target Mixture A	✨Remixed Mixture B	Target Mixture B
✏️"Is it possible to remove the noise from the vacuum and increase the volume of the fireworks?"		✏️"Please take out the sound of the fireworks banging and enhance the volume of the vacuum cleaner cleaning the floors."

✏️"Could you eliminate the noise from the fireworks explosions, please?"		✏️"Just extract the firework for me."
✨Remixed Mixture C	Target Mixture C	✨Remixed Mixture D	Target Mixture D

2 Speech + 1 Audio (VGGSound)

🔉Input Mixture #10 consists of	male speaker with normal pitch, low tempo, normal energy, and surprised emotion	female speaker with high pitch, high tempo, low energy, and neutral emotion	playing banjo

✨Remixed Mixture A	Target Mixture A	✨Remixed Mixture B	Target Mixture B
✏️"Lower the sound level of the banjo playing, remove the woman with a high-paced, low-energy delivery, and increase the volume of the surprised man who speaks slowly with normal enthusiasm."		✏️"Try to delete the surprised male speaker, if you can."

✏️"Could you decrease the audio level of the female speaker with a fast tempo and low vitality who maintains a neutral emotion?"		✏️"Extract the banjo music from the audio."
✨Remixed Mixture C	Target Mixture C	✨Remixed Mixture D	Target Mixture D

🔉Input Mixture #11 consists of	male speaker with low pitch, high tempo, low energy, and neutral emotion	female speaker with high pitch, normal tempo, normal energy, and neutral emotion	wind chime

✨Remixed Mixture A	Target Mixture A	✨Remixed Mixture B	Target Mixture B
✏️"Extract the background sound but make it quieter."		✏️"Can you extract the audio of the speaker with a low-pitched voice and a brisk tempo?"

✏️"Increase the audio of the wind chime, decrease the volume of the male speaker with a fast pace and low enthusiasm, and remove the female speaker with a regular pace and average enthusiasm."		✏️"Please take out the individual with a low pitch."
✨Remixed Mixture C	Target Mixture C	✨Remixed Mixture D	Target Mixture D

1 Speech + 2 Audio (VGGSound)

🔉Input Mixture #12 consists of	male speaker with low pitch, high tempo, normal energy, and neutral emotion	cuckoo bird calling	playing oboe

✨Remixed Mixture A	Target Mixture A	✨Remixed Mixture B	Target Mixture B
✏️"I'd like to extract the audio of an oboe being played, please."		✏️"Please remove all non-human sounds."

✏️"Enhance the sound level of the cuckoo bird's call, please."		✏️"First, volume up the man with a high tempo and regular enthusiasm. Second, volume down the cuckoo bird's calling. Third, remove the oboe music."
✨Remixed Mixture C	Target Mixture C	✨Remixed Mixture D	Target Mixture D

🔉Input Mixture #13 consists of	female speaker with high pitch, high tempo, low energy, and neutral emotion	dog growling	chicken crowing

✨Remixed Mixture A	Target Mixture A	✨Remixed Mixture B	Target Mixture B
✏️"Kindly remove the menacing growl produced by the canine."		✏️"Could you extract the sound of the woman speaking and the chicken, and then decrease the chicken's sound?"

✏️"Please quieten down the woman's voice, but make the dog and chicken's voices louder."		✏️"I only want to keep the animal voices of the dog and chicken in the mix."
✨Remixed Mixture C	Target Mixture C	✨Remixed Mixture D	Target Mixture D

2 Speech + 2 Audio (FSD50K, seen audio labels)

🔉Input Mixture #14 consists of	male speaker with low pitch, normal tempo, low energy, and neutral emotion	female speaker with high pitch, normal tempo, normal energy, and neutral emotion	acoustic guitar	(dog) bark

✨Remixed Mixture A	Target Mixture A	✨Remixed Mixture B	Target Mixture B
✏️"Make the conversation as clean as possible."		✏️"Boost the volume of the conversation, and also quieten down those distracting background sounds."

✏️"Could you pull out the dog barking sound for me? Thanks."		✏️"Can you isolate the speaker with a deep tone and low enthusiasm?"
✨Remixed Mixture C	Target Mixture C	✨Remixed Mixture D	Target Mixture D

✏️"I'd like you to turn down the volume of the low-pitch male speaker and remove the dog barking noise."		✏️"Please extract the sound of the guitar and the dog's barking."
✨Remixed Mixture E	Target Mixture E	✨Remixed Mixture F	Target Mixture F

🔉Input Mixture #15 consists of	female speaker with high pitch, normal tempo, normal energy, and neutral emotion	female speaker with normal pitch, normal tempo, normal energy, and neutral emotion	toilet flush	siren

✨Remixed Mixture A	Target Mixture A	✨Remixed Mixture B	Target Mixture B
✏️"Let's remove the annoying siren sound."		✏️"Can you edit the audio to extract the speaker characterized by standard pitch?"

✏️"Could you extract the high-pitched speaker and the wailing siren?"		✏️"I want you to single out the siren sound."
✨Remixed Mixture C	Target Mixture C	✨Remixed Mixture D	Target Mixture D

✏️"I'd like you to decrease the siren volume, lower the sound of the toilet flushing, reduce the volume of the speaker with normal pitch, and boost the volume of the speaker with high pitch."		✏️"Please raise the sound level of the high-pitched speaker, remove the speaker with typical pitch, and erase the siren sound."
✨Remixed Mixture E	Target Mixture E	✨Remixed Mixture F	Target Mixture F

2 Speech + 2 Audio (FSD50K, unseen audio labels)

🔉Input Mixture #16 consists of	male speaker with low pitch, low tempo, high energy, and neutral emotion	female speaker with high pitch, normal tempo, high energy, and neutral emotion	scissors	bowed string instrument

✨Remixed Mixture A	Target Mixture A	✨Remixed Mixture B	Target Mixture B
✏️"Please get rid of the sound of scissors."		✏️"Pump up the volume on the talks and reduce other noises."

✏️"Please separate the part featuring someone playing a string instrument."		✏️"I'd like you to isolate both the female speaker and the bowed string instrument sound from the mixture."
✨Remixed Mixture C	Target Mixture C	✨Remixed Mixture D	Target Mixture D

✏️"Remove all speakers from the audio, can you?"		✏️"Is it possible to turn up the volume of the string instrument and reduce the volume of the scissors?"
✨Remixed Mixture E	Target Mixture E	✨Remixed Mixture F	Target Mixture F

🔉Input Mixture #17 consists of	male speaker with low pitch, normal tempo, normal energy, and neutral emotion	female speaker with normal pitch, normal tempo, high energy, and neutral emotion	musical keyboard	seagull

✨Remixed Mixture A	Target Mixture A	✨Remixed Mixture B	Target Mixture B
✏️"Is it possible to extract the song of the seagull?"		✏️"Add more volume to the surrounding, and decrease the speech volume."