Audio Samples of Listen, Chat, and Remix: Text-Guided Soundscape Remixing for Enhanced Auditory Experience

Abstract

In daily life, we encounter a variety of sounds, both desirable and undesirable, with limited control over their presence and volume. Our work introduces “Listen, Chat, and Remix” (LCR), a novel multimodal sound remixer that controls each sound source in a mixture based on user-provided text instructions. LCR distinguishes itself with a user-friendly text interface and its unique ability to remix multiple sound sources simultaneously within a mixture, without needing to separate them. Users input open-vocabulary text prompts, which are interpreted by a large language model to create a semantic filter for remixing the sound mixture. The system then decomposes the mixture into its components, applies the semantic filter, and reassembles the filtered components into the desired output. We developed a 160-hour dataset with over 100k mixtures, including speech and various audio sources, along with text prompts for diverse remixing tasks, including extraction, removal, and volume control of single or multiple sources. Our experiments demonstrate significant improvements in signal quality across all remixing tasks and robust performance in zero-shot scenarios with varying numbers and types of sound sources.
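As a rough illustration of what a remixing instruction amounts to, the sketch below treats each task (extraction, removal, volume control) as a per-source gain applied before the sources are summed again. This is only a toy model under stated assumptions: the `ACTION_GAINS` table, its gain values, the `remix` function, and the tiny waveforms are all hypothetical, and the sketch works on already separated sources, whereas the abstract states that LCR does not need to separate them.

```python
from typing import Dict, List

# Hypothetical gains per action; these values are illustrative only and are
# not taken from the paper.
ACTION_GAINS: Dict[str, float] = {
    "keep": 1.0,
    "remove": 0.0,
    "volume_up": 2.0,
    "volume_down": 0.5,
}


def remix(sources: Dict[str, List[float]], actions: Dict[str, str]) -> List[float]:
    """Scale each source by the gain of its action, then sum all sources."""
    length = len(next(iter(sources.values())))
    out = [0.0] * length
    for name, samples in sources.items():
        # Sources without an explicit action are left unchanged.
        gain = ACTION_GAINS[actions.get(name, "keep")]
        for i, x in enumerate(samples):
            out[i] += gain * x
    return out


# Toy four-sample "waveforms" standing in for the sources of Input Mixture #1.
sources = {
    "helicopter": [0.2, -0.2, 0.2, -0.2],
    "turkey": [0.1, 0.1, -0.1, -0.1],
    "male_speaker": [0.3, 0.0, -0.3, 0.0],
    "female_speaker": [0.0, 0.3, 0.0, -0.3],
}

# One possible reading of text prompt D for Input Mixture #1: remove the man
# and the turkey, and reduce the helicopter.
actions = {
    "helicopter": "volume_down",
    "turkey": "remove",
    "male_speaker": "remove",
}

remixed = remix(sources, actions)
```

Under this reading, extraction is the special case where every unnamed source is set to "remove" instead of "keep".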

This page contains a set of audio samples in support of the paper.

We provide five samples for sound mixtures consisting of 2 Speech (TextrolSpeech) + 2 Audio (VGGSound), and two samples for each of the zero-shot sound mixture compositions. For every sample, we write 4 to 6 text prompts and show the remixed and target mixtures for each prompt.
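Since each synthetic sample pairs a remixed mixture with its target mixture, the two can also be compared numerically rather than only by ear. The sketch below computes a plain signal-to-noise ratio in decibels between a target and an estimate. This page does not say which metric the paper reports, so SNR is shown here only as one common choice, and the function name `snr_db` is hypothetical.

```python
import math
from typing import Sequence


def snr_db(target: Sequence[float], estimate: Sequence[float]) -> float:
    """Signal-to-noise ratio in dB, treating (target - estimate) as noise."""
    signal = sum(t * t for t in target)
    noise = sum((t - e) ** 2 for t, e in zip(target, estimate))
    if noise == 0.0:
        return math.inf   # estimate matches the target exactly
    if signal == 0.0:
        return -math.inf  # silent target but non-silent estimate
    return 10.0 * math.log10(signal / noise)
```

A higher value means the remixed output is closer to the target. For example, calling `snr_db(target_samples, remixed_samples)` on two equal-length lists of samples gives a single score per prompt.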

We also provide five real mixtures from AudioSet; for these, only the remixed outputs are shown.

We recommend opening this website with Chrome and wearing headphones for the best audio experience.

Sound Mixture Compositions

In-distribution
  • 2 Speech + 2 Audio (VGGSound)

Zero-shot
  • 2 Speech
  • 2 Audio (VGGSound)
  • 2 Speech + 1 Audio (VGGSound)
  • 1 Speech + 2 Audio (VGGSound)
  • 2 Speech + 2 Audio (FSD50K, seen audio labels)
  • 2 Speech + 2 Audio (FSD50K, unseen audio labels)

AudioSet In-the-wild Mixtures

    2 Speech + 2 Audio (VGGSound)


    Input Mixture #1: female speaker with high pitch, normal tempo, high energy, and neutral emotion; male speaker with low pitch, high tempo, normal energy, and neutral emotion; helicopter; turkey gobbling

    Text prompt A: "Increase the volume of the speeches and decrease the volume of the background sounds."

    Text prompt B: "Let's pull out the sound of the fast-talking man and the turkey."

    Remixed Mixture A Target Mixture A Remixed Mixture B Target Mixture B

    Text prompt C: "Remove all people talking."

    Text prompt D: "Why not get rid of the man's voice and the turkey's noise, and reduce the helicopter's volume?"

    Remixed Mixture C Target Mixture C Remixed Mixture D Target Mixture D

    Text prompt E: "Could you kindly eliminate the sound of the helicopter? I appreciate it."

    Text prompt F: "Please extract the person with an elevated tone."

    Remixed Mixture E Target Mixture E Remixed Mixture F Target Mixture F

    Input Mixture #2: male speaker with low pitch, low tempo, low energy, and sad emotion; female speaker with high pitch, normal tempo, normal energy, and neutral emotion; playing accordion; playing drum kit

    Text prompt A: "Is it possible to single out the accordion's performance?"

    Text prompt B: "Lower the volume of the live accordion music that is currently being played, please."

    Remixed Mixture A Target Mixture A Remixed Mixture B Target Mixture B

    Text prompt C: "Could you raise the decibel level of the gloomy speaker that has a subdued tone?"

    Text prompt D: "Please raise the sound for the female speaker with a standard tempo, amplify the playing accordion, reduce the playing drum kit, and decrease the volume for the male speaker with a sluggish pace."

    Remixed Mixture C Target Mixture C Remixed Mixture D Target Mixture D

    Text prompt E: "I'd like you to exclude the speaker with a high-frequency voice and average vitality, conveying a neutral tone."

    Text prompt F: "Make everything quieter."

    Remixed Mixture E Target Mixture E Remixed Mixture F Target Mixture F

    Input Mixture #3: male speaker with low pitch, normal tempo, normal energy, and neutral emotion; male speaker with low pitch, high tempo, normal energy, and neutral emotion; underwater bubbling; train horning

    Text prompt A: "Enhance this recording by removing all the noises."

    Text prompt B: "Could you raise the audio level of the underwater bubbling sound exclusively?"

    Remixed Mixture A Target Mixture A Remixed Mixture B Target Mixture B

    Text prompt C: "Can you adjust the sound so that both speakers are louder, the train horn is quieter, and the underwater bubbling is completely removed from the recording?"

    Text prompt D: "Is it possible to turn down the speakers' volume and crank up the background ambiance?"

    Remixed Mixture C Target Mixture C Remixed Mixture D Target Mixture D

    Text prompt E: "I'd like you to edit out the speaker characterized by a faster tempo and the train horn sound altogether."

    Text prompt F: "Can you remove the speaker with a rapid rhythm?"

    Remixed Mixture E Target Mixture E Remixed Mixture F Target Mixture F

    Input Mixture #4: female speaker with normal pitch, normal tempo, normal energy, and neutral emotion; male speaker with low pitch, normal tempo, high energy, and neutral emotion; playing hammond organ; rain

    Text prompt A: "Can you edit the recording to extract the sound of the organ and rainfall?"

    Text prompt B: "Can you modify the sound so that the rain and Hammond organ are quieter, the female speaker with normal pitch and energy is louder, and the male speaker with low pitch and high energy is entirely eliminated from the recording?"

    Remixed Mixture A Target Mixture A Remixed Mixture B Target Mixture B

    Text prompt C: "I'd like you to turn down the volume for the lady with the average pitch."

    Text prompt D: "Please remove the the organ music and both the female and male speakers in the audio track."

    Remixed Mixture C Target Mixture C Remixed Mixture D Target Mixture D

    Text prompt E: "Please make the organ music louder."

    Text prompt F: "Let's extract the part featuring the speaker characterized by typical tone?"

    Remixed Mixture E Target Mixture E Remixed Mixture F Target Mixture F

    Input Mixture #5: female speaker with normal pitch, normal tempo, low energy, and neutral emotion; female speaker with normal pitch, low tempo, normal energy, and neutral emotion; church bell ringing; playing theremin

    Text prompt A: "Extract the bell ringing from the rest of the audio."

    Text prompt B: "Could you amplify the audio level of the speaker with normal energy and slow tempo, and also raise the church bell ringing sound, but lower the volume of the speaker with low vitality?"

    Remixed Mixture A Target Mixture A Remixed Mixture B Target Mixture B

    Text prompt C: "Let's extract all human voices from the recording."

    Text prompt D: "Reduce the background audio and turn up the volume on the talking parts."

    Remixed Mixture C Target Mixture C Remixed Mixture D Target Mixture D

    Text prompt E: "Is it feasible to erase the theremin's playing sound?"

    Text prompt F: "I'd like you to amplify the playing theremin sound and reduce the church bell ringing sound."

    Remixed Mixture E Target Mixture E Remixed Mixture F Target Mixture F

    2 Speech


    Input Mixture #6: female speaker with normal pitch, normal tempo, low energy, and contempt emotion; female speaker with normal pitch, normal tempo, normal energy, and neutral emotion

    Text prompt A: "Can you extract the speaker characterized by their contemptuous manner?"

    Text prompt B: "Is it possible to turn up the volume of the speaker exhibiting typical enthusiasm and reduce the volume of the speaker showing low energy?"

    Remixed Mixture A Target Mixture A Remixed Mixture B Target Mixture B

    Text prompt C: "Could you isolate the speaker without emotion and speaking in a normal volume?"

    Text prompt D: "Why not turn up the sound of the contemptuous speaker while removing the speaker maintaining a neutral emotional state?"

    Remixed Mixture C Target Mixture C Remixed Mixture D Target Mixture D

    Input Mixture #7: female speaker with normal pitch, high tempo, normal energy, and neutral emotion; male speaker with low pitch, low tempo, high energy, and neutral emotion

    Text prompt A: "I'd appreciate it if you could eliminate the speaker who is speaking at a rapid pace."

    Text prompt B: "The female speaker talks is loud. Could you turn down the volume?"

    Remixed Mixture A Target Mixture A Remixed Mixture B Target Mixture B

    Text prompt C: "Remove the gentleman with a deep tone."

    Text prompt D: "Begin by decreasing the volume of the female speaker with a fast tempo, and then increase the volume of the male speaker with a low pitch."

    Remixed Mixture C Target Mixture C Remixed Mixture D Target Mixture D

    2 Audio (VGGSound)


    Input Mixture #8: playing tabla; missile launch

    Text prompt A: "Please turn up the volume of the playing tabla sound and remove the missile launch sound."

    Text prompt B: "Kindly turn down the sound of the rocket being launched."

    Remixed Mixture A Target Mixture A Remixed Mixture B Target Mixture B

    Text prompt C: "Extract the tabla music for me."

    Text prompt D: "Could you decrease the volume for both the missile launch and the playing tabla sounds?"

    Remixed Mixture C Target Mixture C Remixed Mixture D Target Mixture D

    Input Mixture #9: fireworks banging; vacuum cleaner cleaning floors

    Text prompt A: "Is it possible to remove the noise from the vacuum and increase the volume of the fireworks?"

    Text prompt B: "Please take out the sound of the fireworks banging and enhance the volume of the vacuum cleaner cleaning the floors."

    Remixed Mixture A Target Mixture A Remixed Mixture B Target Mixture B

    Text prompt C: "Could you eliminate the noise from the fireworks explosions, please?"

    Text prompt D: "Just extract the firework for me."

    Remixed Mixture C Target Mixture C Remixed Mixture D Target Mixture D

    2 Speech + 1 Audio (VGGSound)


    Input Mixture #10: male speaker with normal pitch, low tempo, normal energy, and surprised emotion; female speaker with high pitch, high tempo, low energy, and neutral emotion; playing banjo

    Text prompt A: "Lower the sound level of the banjo playing, remove the woman with a high-paced, low-energy delivery, and increase the volume of the surprised man who speaks slowly with normal enthusiasm."

    Text prompt B: "Try to delete the surprised male speaker, if you can."

    Remixed Mixture A Target Mixture A Remixed Mixture B Target Mixture B

    Text prompt C: "Could you decrease the audio level of the female speaker with a fast tempo and low vitality who maintains a neutral emotion?"

    Text prompt D: "Extract the banjo music from the audio."

    Remixed Mixture C Target Mixture C Remixed Mixture D Target Mixture D

    Input Mixture #11: male speaker with low pitch, high tempo, low energy, and neutral emotion; female speaker with high pitch, normal tempo, normal energy, and neutral emotion; wind chime

    Text prompt A: "Extract the background sound but make it quieter."

    Text prompt B: "Can you extract the audio of the speaker with a low-pitched voice and a brisk tempo?"

    Remixed Mixture A Target Mixture A Remixed Mixture B Target Mixture B

    Text prompt C: "Increase the audio of the wind chime, decrease the volume of the male speaker with a fast pace and low enthusiasm, and remove the female speaker with a regular pace and average enthusiasm."

    Text prompt D: "Please take out the individual with a low pitch."

    Remixed Mixture C Target Mixture C Remixed Mixture D Target Mixture D

    1 Speech + 2 Audio (VGGSound)


    Input Mixture #12: male speaker with low pitch, high tempo, normal energy, and neutral emotion; cuckoo bird calling; playing oboe

    Text prompt A: "I'd like to extract the audio of an oboe being played, please."

    Text prompt B: "Please remove all non-human sounds."

    Remixed Mixture A Target Mixture A Remixed Mixture B Target Mixture B

    Text prompt C: "Enhance the sound level of the cuckoo bird's call, please."

    Text prompt D: "First, volume up the man with a high tempo and regular enthusiasm. Second, volume down the cuckoo bird's calling. Third, remove the oboe music."

    Remixed Mixture C Target Mixture C Remixed Mixture D Target Mixture D

    Input Mixture #13: female speaker with high pitch, high tempo, low energy, and neutral emotion; dog growling; chicken crowing

    Text prompt A: "Kindly remove the menacing growl produced by the canine."

    Text prompt B: "Could you extract the sound of the woman speaking and the chicken, and then decrease the chicken's sound?"

    Remixed Mixture A Target Mixture A Remixed Mixture B Target Mixture B

    Text prompt C: "Please quieten down the woman's voice, but make the dog and chicken's voices louder."

    Text prompt D: "I only want to keep the animal voices of the dog and chicken in the mix."

    Remixed Mixture C Target Mixture C Remixed Mixture D Target Mixture D

    2 Speech + 2 Audio (FSD50K, seen audio labels)


    Input Mixture #14: male speaker with low pitch, normal tempo, low energy, and neutral emotion; female speaker with high pitch, normal tempo, normal energy, and neutral emotion; acoustic guitar; (dog) bark

    Text prompt A: "Make the conversation as clean as possible."

    Text prompt B: "Boost the volume of the conversation, and also quieten down those distracting background sounds."

    Remixed Mixture A Target Mixture A Remixed Mixture B Target Mixture B

    Text prompt C: "Could you pull out the dog barking sound for me? Thanks."

    Text prompt D: "Can you isolate the speaker with a deep tone and low enthusiasm?"

    Remixed Mixture C Target Mixture C Remixed Mixture D Target Mixture D

    Text prompt E: "I'd like you to turn down the volume of the low-pitch male speaker and remove the dog barking noise."

    Text prompt F: "Please extract the sound of the guitar and the dog's barking."

    Remixed Mixture E Target Mixture E Remixed Mixture F Target Mixture F

    Input Mixture #15: female speaker with high pitch, normal tempo, normal energy, and neutral emotion; female speaker with normal pitch, normal tempo, normal energy, and neutral emotion; toilet flush; siren

    Text prompt A: "Let's remove the annoying siren sound."

    Text prompt B: "Can you edit the audio to extract the speaker characterized by standard pitch?"

    Remixed Mixture A Target Mixture A Remixed Mixture B Target Mixture B

    Text prompt C: "Could you extract the high-pitched speaker and the wailing siren?"

    Text prompt D: "I want you to single out the siren sound."

    Remixed Mixture C Target Mixture C Remixed Mixture D Target Mixture D

    Text prompt E: "I'd like you to decrease the siren volume, lower the sound of the toilet flushing, reduce the volume of the speaker with normal pitch, and boost the volume of the speaker with high pitch."

    Text prompt F: "Please raise the sound level of the high-pitched speaker, remove the speaker with typical pitch, and erase the siren sound."

    Remixed Mixture E Target Mixture E Remixed Mixture F Target Mixture F

    2 Speech + 2 Audio (FSD50K, unseen audio labels)


    Input Mixture #16: male speaker with low pitch, low tempo, high energy, and neutral emotion; female speaker with high pitch, normal tempo, high energy, and neutral emotion; scissors; bowed string instrument

    Text prompt A: "Please get rid of the sound of scissors."

    Text prompt B: "Pump up the volume on the talks and reduce other noises."

    Remixed Mixture A Target Mixture A Remixed Mixture B Target Mixture B

    Text prompt C: "Please separate the part featuring someone playing a string instrument."

    Text prompt D: "I'd like you to isolate both the female speaker and the bowed string instrument sound from the mixture."

    Remixed Mixture C Target Mixture C Remixed Mixture D Target Mixture D

    Text prompt E: "Remove all speakers from the audio, can you?"

    Text prompt F: "Is it possible to turn up the volume of the string instrument and reduce the volume of the scissors?"

    Remixed Mixture E Target Mixture E Remixed Mixture F Target Mixture F

    Input Mixture #17: male speaker with low pitch, normal tempo, normal energy, and neutral emotion; female speaker with normal pitch, normal tempo, high energy, and neutral emotion; musical keyboard; seagull

    Text prompt A: "Is it possible to extract the song of the seagull?"

    Text prompt B: "Add more volume to the surrounding, and decrease the speech volume."

    Remixed Mixture A Target Mixture A Remixed Mixture B Target Mixture B

    Text prompt C: "Could you eliminate the music from a keyboard?"

    Text prompt D: "Identify and isolate the lady speaking in a high energy."

    Remixed Mixture C Target Mixture C Remixed Mixture D Target Mixture D

    Text prompt E: "Eliminate any non-speech sounds in the surroundings."

    Text prompt F: "Can you edit the audio to increase the volume of the female speaker with normal pitch and high energy, decrease the sound of the male speaker with low pitch and normal energy, raise the volume of the seagull noise, and lower the volume of the musical keyboard?"

    Remixed Mixture E Target Mixture E Remixed Mixture F Target Mixture F

    AudioSet In-the-wild Mixtures


    Input Mixture #18

    Text prompt A: "Eliminate any non-speech sounds in the surroundings."

    Text prompt B: "Extract the cat sound."

    Text prompt C: "Remove the cat sound."

    Text prompt D: "Remove both the human talker and the cat sound."

    Remixed Mixture A Remixed Mixture B Remixed Mixture C Remixed Mixture D

    Input Mixture #19

    Text prompt A: "Eliminate any non-speech sounds in the surroundings."

    Text prompt B: "Extract the animal sound."

    Text prompt C: "Remove the female speaker with a high pitch."

    Text prompt D: "Extract the sound of the animal and the bird singing."

    Remixed Mixture A Remixed Mixture B Remixed Mixture C Remixed Mixture D

    Input Mixture #20

    Text prompt A: "Eliminate any non-speech sounds in the surroundings."

    Text prompt B: "Extract the music."

    Text prompt C: "Remove the animal quacking."

    Text prompt D: "Extract the animal quacking sound and the music."

    Remixed Mixture A Remixed Mixture B Remixed Mixture C Remixed Mixture D

    Input Mixture #21

    Text prompt A: "Eliminate any non-speech sounds in the surroundings."

    Text prompt B: "Extract the voice of a person with a high pitch."

    Text prompt C: "Remove the background music."

    Text prompt D: "Remove the voice of a person with a high pitch and the music."

    Remixed Mixture A Remixed Mixture B Remixed Mixture C Remixed Mixture D

    Input Mixture #22

    Text prompt A: "Eliminate any non-speech sounds in the surroundings."

    Text prompt B: "Extract the dog barking."

    Text prompt C: "Remove the male shouting."

    Text prompt D: "Extract the dog barking and the male shouting."

    Remixed Mixture A Remixed Mixture B Remixed Mixture C Remixed Mixture D