This page contains a set of audio samples in support of the paper.
We provide five samples for sound mixtures consisting of 2 Speech (TextrolSpeech) + 2 Audio (VGGSound) sources, and two samples for each of the zero-shot sound mixture compositions. For every sample, we write 4 to 6 text prompts and show the edited (remixed) and target sound mixtures for each prompt.
We also provide five real mixtures from AudioSet.
We recommend opening this website with Chrome and wearing headphones for the best audio experience.
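As a rough illustration of what the text prompts on this page ask for, each instruction can be read as assigning one action per source (keep, remove, volume up, or volume down), and a target mixture can then be thought of as a gain-weighted sum of the individual sources. The sketch below is a toy under stated assumptions, not the pipeline from the paper: the `GAINS` values, the `remix` function, and the sample values are all hypothetical, and this page does not state the actual scaling factors used.

```python
# Hypothetical per-action gains; the actual scaling factors are not given on this page.
GAINS = {
    "keep": 1.0,
    "remove": 0.0,
    "volume_up": 2.0,    # assumed value, roughly +6 dB
    "volume_down": 0.5,  # assumed value, roughly -6 dB
}


def remix(sources, actions):
    """Return the sum of all sources after scaling each one by its action's gain.

    sources: dict mapping a source name to a list of float samples (equal lengths).
    actions: dict mapping a source name to a key of GAINS; unlisted sources are kept.
    """
    length = len(next(iter(sources.values())))
    mixture = [0.0] * length
    for name, samples in sources.items():
        if len(samples) != length:
            raise ValueError("all sources must have the same length")
        gain = GAINS[actions.get(name, "keep")]
        for i, sample in enumerate(samples):
            mixture[i] += gain * sample
    return mixture


# Toy four-sample "signals" standing in for the sources of Input Mixture #1.
sources = {
    "female speaker": [0.5, 0.0, -0.5, 0.0],
    "male speaker": [0.0, 0.25, 0.0, -0.25],
    "helicopter": [0.2, -0.2, 0.2, -0.2],
    "turkey gobbling": [0.1, 0.1, -0.1, -0.1],
}

# One possible reading of text prompt A for Input Mixture #1:
# louder speech, quieter background sounds.
prompt_a = {
    "female speaker": "volume_up",
    "male speaker": "volume_up",
    "helicopter": "volume_down",
    "turkey gobbling": "volume_down",
}

input_mixture = remix(sources, {})         # every source kept unchanged
target_mixture = remix(sources, prompt_a)  # sources rescaled per the prompt
```

Under this reading, an "extract X" prompt is simply "remove" applied to every source other than X.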
Input Mixture #1 | female speaker with high pitch, normal tempo, high energy, and neutral emotion | male speaker with low pitch, high tempo, normal energy, and neutral emotion | helicopter | turkey gobbling |
---|---|---|---|---|
Text prompt A: "Increase the volume of the speeches and decrease the volume of the background sounds." |
Text prompt B: "Let's pull out the sound of the fast-talking man and the turkey." |
||
Remixed Mixture A | Target Mixture A | Remixed Mixture B | Target Mixture B |
---|---|---|---|
Text prompt C: "Remove all people talking." |
Text prompt D: "Why not get rid of the man's voice and the turkey's noise, and reduce the helicopter's volume?" |
||
Remixed Mixture C | Target Mixture C | Remixed Mixture D | Target Mixture D |
Text prompt E: "Could you kindly eliminate the sound of the helicopter? I appreciate it." |
Text prompt F: "Please extract the person with an elevated tone." |
||
Remixed Mixture E | Target Mixture E | Remixed Mixture F | Target Mixture F |
Input Mixture #2 | male speaker with low pitch, low tempo, low energy, and sad emotion | female speaker with high pitch, normal tempo, normal energy, and neutral emotion | playing accordion | playing drum kit |
---|---|---|---|---|
Text prompt A: "Is it possible to single out the accordion's performance?" |
Text prompt B: "Lower the volume of the live accordion music that is currently being played, please." |
||
Remixed Mixture A | Target Mixture A | Remixed Mixture B | Target Mixture B |
---|---|---|---|
Text prompt C: "Could you raise the decibel level of the gloomy speaker that has a subdued tone?" |
Text prompt D: "Please raise the sound for the female speaker with a standard tempo, amplify the playing accordion, reduce the playing drum kit, and decrease the volume for the male speaker with a sluggish pace." |
||
Remixed Mixture C | Target Mixture C | Remixed Mixture D | Target Mixture D |
Text prompt E: "I'd like you to exclude the speaker with a high-frequency voice and average vitality, conveying a neutral tone." |
Text prompt F: "Make everything quieter." |
||
Remixed Mixture E | Target Mixture E | Remixed Mixture F | Target Mixture F |
Input Mixture #3 | male speaker with low pitch, normal tempo, normal energy, and neutral emotion | male speaker with low pitch, high tempo, normal energy, and neutral emotion | underwater bubbling | train horning |
---|---|---|---|---|
Text prompt A: "Enhance this recording by removing all the noises." |
Text prompt B: "Could you raise the audio level of the underwater bubbling sound exclusively?" |
||
Remixed Mixture A | Target Mixture A | Remixed Mixture B | Target Mixture B |
---|---|---|---|
Text prompt C: "Can you adjust the sound so that both speakers are louder, the train horn is quieter, and the underwater bubbling is completely removed from the recording?" |
Text prompt D: "Is it possible to turn down the speakers' volume and crank up the background ambiance?" |
||
Remixed Mixture C | Target Mixture C | Remixed Mixture D | Target Mixture D |
Text prompt E: "I'd like you to edit out the speaker characterized by a faster tempo and the train horn sound altogether." |
Text prompt F: "Can you remove the speaker with a rapid rhythm?" |
||
Remixed Mixture E | Target Mixture E | Remixed Mixture F | Target Mixture F |
Input Mixture #4 | female speaker with normal pitch, normal tempo, normal energy, and neutral emotion | male speaker with low pitch, normal tempo, high energy, and neutral emotion | playing hammond organ | rain |
---|---|---|---|---|
Text prompt A: "Can you edit the recording to extract the sound of the organ and rainfall?" |
Text prompt B: "Can you modify the sound so that the rain and Hammond organ are quieter, the female speaker with normal pitch and energy is louder, and the male speaker with low pitch and high energy is entirely eliminated from the recording?" |
||
Remixed Mixture A | Target Mixture A | Remixed Mixture B | Target Mixture B |
---|---|---|---|
Text prompt C: "I'd like you to turn down the volume for the lady with the average pitch." |
Text prompt D: "Please remove the the organ music and both the female and male speakers in the audio track." |
||
Remixed Mixture C | Target Mixture C | Remixed Mixture D | Target Mixture D |
Text prompt E: "Please make the organ music louder." |
Text prompt F: "Let's extract the part featuring the speaker characterized by typical tone?" |
||
Remixed Mixture E | Target Mixture E | Remixed Mixture F | Target Mixture F |
Input Mixture #5 | female speaker with normal pitch, normal tempo, low energy, and neutral emotion | female speaker with normal pitch, low tempo, normal energy, and neutral emotion | church bell ringing | playing theremin |
---|---|---|---|---|
Text prompt A: "Extract the bell ringing from the rest of the audio." |
Text prompt B: "Could you amplify the audio level of the speaker with normal energy and slow tempo, and also raise the church bell ringing sound, but lower the volume of the speaker with low vitality?" |
||
Remixed Mixture A | Target Mixture A | Remixed Mixture B | Target Mixture B |
---|---|---|---|
Text prompt C: "Let's extract all human voices from the recording." |
Text prompt D: "Reduce the background audio and turn up the volume on the talking parts." |
||
Remixed Mixture C | Target Mixture C | Remixed Mixture D | Target Mixture D |
Text prompt E: "Is it feasible to erase the theremin's playing sound?" |
Text prompt F: "I'd like you to amplify the playing theremin sound and reduce the church bell ringing sound." |
||
Remixed Mixture E | Target Mixture E | Remixed Mixture F | Target Mixture F |
Input Mixture #6 | female speaker with normal pitch, normal tempo, low energy, and contempt emotion | female speaker with normal pitch, normal tempo, normal energy, and neutral emotion |
---|---|---|
Text prompt A: "Can you extract the speaker characterized by their contemptuous manner?" |
Text prompt B: "Is it possible to turn up the volume of the speaker exhibiting typical enthusiasm and reduce the volume of the speaker showing low energy?" |
||
Remixed Mixture A | Target Mixture A | Remixed Mixture B | Target Mixture B |
---|---|---|---|
Text prompt C: "Could you isolate the speaker without emotion and speaking in a normal volume?" |
Text prompt D: "Why not turn up the sound of the contemptuous speaker while removing the speaker maintaining a neutral emotional state?" |
||
Remixed Mixture C | Target Mixture C | Remixed Mixture D | Target Mixture D |
Input Mixture #7 | female speaker with normal pitch, high tempo, normal energy, and neutral emotion | male speaker with low pitch, low tempo, high energy, and neutral emotion |
---|---|---|
Text prompt A: "I'd appreciate it if you could eliminate the speaker who is speaking at a rapid pace." |
Text prompt B: "The female speaker talks is loud. Could you turn down the volume?" |
||
Remixed Mixture A | Target Mixture A | Remixed Mixture B | Target Mixture B |
---|---|---|---|
Text prompt C: "Remove the gentleman with a deep tone." |
Text prompt D: "Begin by decreasing the volume of the female speaker with a fast tempo, and then increase the volume of the male speaker with a low pitch." |
||
Remixed Mixture C | Target Mixture C | Remixed Mixture D | Target Mixture D |
Input Mixture #8 | playing tabla | missile launch |
---|---|---|
Text prompt A: "Please turn up the volume of the playing tabla sound and remove the missile launch sound." |
Text prompt B: "Kindly turn down the sound of the rocket being launched." |
||
Remixed Mixture A | Target Mixture A | Remixed Mixture B | Target Mixture B |
---|---|---|---|
Text prompt C: "Extract the tabla music for me." |
Text prompt D: "Could you decrease the volume for both the missile launch and the playing tabla sounds?" |
||
Remixed Mixture C | Target Mixture C | Remixed Mixture D | Target Mixture D |
Input Mixture #9 | fireworks banging | vacuum cleaner cleaning floors |
---|---|---|
Text prompt A: "Is it possible to remove the noise from the vacuum and increase the volume of the fireworks?" |
Text prompt B: "Please take out the sound of the fireworks banging and enhance the volume of the vacuum cleaner cleaning the floors." |
||
Remixed Mixture A | Target Mixture A | Remixed Mixture B | Target Mixture B |
---|---|---|---|
Text prompt C: "Could you eliminate the noise from the fireworks explosions, please?" |
Text prompt D: "Just extract the firework for me." |
||
Remixed Mixture C | Target Mixture C | Remixed Mixture D | Target Mixture D |
Input Mixture #10 | male speaker with normal pitch, low tempo, normal energy, and surprised emotion | female speaker with high pitch, high tempo, low energy, and neutral emotion | playing banjo |
---|---|---|---|
Text prompt A: "Lower the sound level of the banjo playing, remove the woman with a high-paced, low-energy delivery, and increase the volume of the surprised man who speaks slowly with normal enthusiasm." |
Text prompt B: "Try to delete the surprised male speaker, if you can." |
||
Remixed Mixture A | Target Mixture A | Remixed Mixture B | Target Mixture B |
---|---|---|---|
Text prompt C: "Could you decrease the audio level of the female speaker with a fast tempo and low vitality who maintains a neutral emotion?" |
Text prompt D: "Extract the banjo music from the audio." |
||
Remixed Mixture C | Target Mixture C | Remixed Mixture D | Target Mixture D |
Input Mixture #11 | male speaker with low pitch, high tempo, low energy, and neutral emotion | female speaker with high pitch, normal tempo, normal energy, and neutral emotion | wind chime |
---|---|---|---|
Text prompt A: "Extract the background sound but make it quieter." |
Text prompt B: "Can you extract the audio of the speaker with a low-pitched voice and a brisk tempo?" |
||
Remixed Mixture A | Target Mixture A | Remixed Mixture B | Target Mixture B |
---|---|---|---|
Text prompt C: "Increase the audio of the wind chime, decrease the volume of the male speaker with a fast pace and low enthusiasm, and remove the female speaker with a regular pace and average enthusiasm." |
Text prompt D: "Please take out the individual with a low pitch." |
||
Remixed Mixture C | Target Mixture C | Remixed Mixture D | Target Mixture D |
Input Mixture #12 | male speaker with low pitch, high tempo, normal energy, and neutral emotion | cuckoo bird calling | playing oboe |
---|---|---|---|
Text prompt A: "I'd like to extract the audio of an oboe being played, please." |
Text prompt B: "Please remove all non-human sounds." |
||
Remixed Mixture A | Target Mixture A | Remixed Mixture B | Target Mixture B |
---|---|---|---|
Text prompt C: "Enhance the sound level of the cuckoo bird's call, please." |
Text prompt D: "First, volume up the man with a high tempo and regular enthusiasm. Second, volume down the cuckoo bird's calling. Third, remove the oboe music." |
||
Remixed Mixture C | Target Mixture C | Remixed Mixture D | Target Mixture D |
Input Mixture #13 | female speaker with high pitch, high tempo, low energy, and neutral emotion | dog growling | chicken crowing |
---|---|---|---|
Text prompt A: "Kindly remove the menacing growl produced by the canine." |
Text prompt B: "Could you extract the sound of the woman speaking and the chicken, and then decrease the chicken's sound?" |
||
Remixed Mixture A | Target Mixture A | Remixed Mixture B | Target Mixture B |
---|---|---|---|
Text prompt C: "Please quieten down the woman's voice, but make the dog and chicken's voices louder." |
Text prompt D: "I only want to keep the animal voices of the dog and chicken in the mix." |
||
Remixed Mixture C | Target Mixture C | Remixed Mixture D | Target Mixture D |
Input Mixture #14 | male speaker with low pitch, normal tempo, low energy, and neutral emotion | female speaker with high pitch, normal tempo, normal energy, and neutral emotion | acoustic guitar | (dog) bark |
---|---|---|---|---|
Text prompt A: "Make the conversation as clean as possible." |
Text prompt B: "Boost the volume of the conversation, and also quieten down those distracting background sounds." |
||
Remixed Mixture A | Target Mixture A | Remixed Mixture B | Target Mixture B |
---|---|---|---|
Text prompt C: "Could you pull out the dog barking sound for me? Thanks." |
Text prompt D: "Can you isolate the speaker with a deep tone and low enthusiasm?" |
||
Remixed Mixture C | Target Mixture C | Remixed Mixture D | Target Mixture D |
Text prompt E: "I'd like you to turn down the volume of the low-pitch male speaker and remove the dog barking noise." |
Text prompt F: "Please extract the sound of the guitar and the dog's barking." |
||
Remixed Mixture E | Target Mixture E | Remixed Mixture F | Target Mixture F |
Input Mixture #15 | female speaker with high pitch, normal tempo, normal energy, and neutral emotion | female speaker with normal pitch, normal tempo, normal energy, and neutral emotion | toilet flush | siren |
---|---|---|---|---|
Text prompt A: "Let's remove the annoying siren sound." |
Text prompt B: "Can you edit the audio to extract the speaker characterized by standard pitch?" |
||
Remixed Mixture A | Target Mixture A | Remixed Mixture B | Target Mixture B |
---|---|---|---|
Text prompt C: "Could you extract the high-pitched speaker and the wailing siren?" |
Text prompt D: "I want you to single out the siren sound." |
||
Remixed Mixture C | Target Mixture C | Remixed Mixture D | Target Mixture D |
Text prompt E: "I'd like you to decrease the siren volume, lower the sound of the toilet flushing, reduce the volume of the speaker with normal pitch, and boost the volume of the speaker with high pitch." |
Text prompt F: "Please raise the sound level of the high-pitched speaker, remove the speaker with typical pitch, and erase the siren sound." |
||
Remixed Mixture E | Target Mixture E | Remixed Mixture F | Target Mixture F |
Input Mixture #16 | male speaker with low pitch, low tempo, high energy, and neutral emotion | female speaker with high pitch, normal tempo, high energy, and neutral emotion | scissors | bowed string instrument |
---|---|---|---|---|
Text prompt A: "Please get rid of the sound of scissors." |
Text prompt B: "Pump up the volume on the talks and reduce other noises." |
||
Remixed Mixture A | Target Mixture A | Remixed Mixture B | Target Mixture B |
---|---|---|---|
Text prompt C: "Please separate the part featuring someone playing a string instrument." |
Text prompt D: "I'd like you to isolate both the female speaker and the bowed string instrument sound from the mixture." |
||
Remixed Mixture C | Target Mixture C | Remixed Mixture D | Target Mixture D |
Text prompt E: "Remove all speakers from the audio, can you?" |
Text prompt F: "Is it possible to turn up the volume of the string instrument and reduce the volume of the scissors?" |
||
Remixed Mixture E | Target Mixture E | Remixed Mixture F | Target Mixture F |
Input Mixture #17 | male speaker with low pitch, normal tempo, normal energy, and neutral emotion | female speaker with normal pitch, normal tempo, high energy, and neutral emotion | musical keyboard | seagull |
---|---|---|---|---|
Text prompt A: "Is it possible to extract the song of the seagull?" |
Text prompt B: "Add more volume to the surrounding, and decrease the speech volume." |
||
Remixed Mixture A | Target Mixture A | Remixed Mixture B | Target Mixture B |
---|---|---|---|
Text prompt C: "Could you eliminate the music from a keyboard?" |
Text prompt D: "Identify and isolate the lady speaking in a high energy." |
||
Remixed Mixture C | Target Mixture C | Remixed Mixture D | Target Mixture D |
Text prompt E: "Eliminate any non-speech sounds in the surroundings." |
Text prompt F: "Can you edit the audio to increase the volume of the female speaker with normal pitch and high energy, decrease the sound of the male speaker with low pitch and normal energy, raise the volume of the seagull noise, and lower the volume of the musical keyboard?" |
||
Remixed Mixture E | Target Mixture E | Remixed Mixture F | Target Mixture F |
Input Mixture #18 |
---|
Text prompt A: "Eliminate any non-speech sounds in the surroundings." |
Text prompt B: "Extract the cat sound." | Text prompt C: "Remove the cat sound." |
Text prompt D: "Remove both the human talker and the cat sound." |
Remixed Mixture A | Remixed Mixture B | Remixed Mixture C | Remixed Mixture D |
---|---|---|---|
Input Mixture #19 |
---|
Text prompt A: "Eliminate any non-speech sounds in the surroundings." |
Text prompt B: "Extract the animal sound." | Text prompt C: "Remove the female speaker with a high pitch." |
Text prompt D: "Extract the sound of the animal and the bird singing." |
Remixed Mixture A | Remixed Mixture B | Remixed Mixture C | Remixed Mixture D |
---|---|---|---|
Input Mixture #20 |
---|
Text prompt A: "Eliminate any non-speech sounds in the surroundings." |
Text prompt B: "Extract the music." | Text prompt C: "Remove the animal quacking." |
Text prompt D: "Extract the animal quacking sound and the music." |
Remixed Mixture A | Remixed Mixture B | Remixed Mixture C | Remixed Mixture D |
---|---|---|---|
Input Mixture #21 |
---|
Text prompt A: "Eliminate any non-speech sounds in the surroundings." |
Text prompt B: "Extract the voice of a person with a high pitch." | Text prompt C: "Remove the background music." |
Text prompt D: "Remove the voice of a person with a high pitch and the music." |
Remixed Mixture A | Remixed Mixture B | Remixed Mixture C | Remixed Mixture D |
---|---|---|---|
Input Mixture #22 |
---|
Text prompt A: "Eliminate any non-speech sounds in the surroundings." |
Text prompt B: "Extract the dog barking." | Text prompt C: "Remove the male shouting." |
Text prompt D: "Extract the dog barking and the male shouting." |
Remixed Mixture A | Remixed Mixture B | Remixed Mixture C | Remixed Mixture D |
---|---|---|---|