The following samples were generated by the evaluated baselines and by our proposed model (right-most column). Each sample consists of 10 seconds of audio, divided in the middle by a "beep" sound. The first 5 seconds are the ground-truth audio used to prime the audio language model. The last 5 seconds are the continuation generated by the audio language model.
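The sample layout described above can be sketched as follows. This is only an illustration of the prompt / beep / continuation structure, not the code used to produce the samples: the sample rate, the beep length and pitch, and the function names `make_beep` and `assemble_sample` are all assumptions.

```python
import math

SAMPLE_RATE = 16_000   # assumed; the sample rate is not stated here
BEEP_SECONDS = 0.2     # assumed beep length
BEEP_HZ = 1_000.0      # assumed beep pitch
SEGMENT_SECONDS = 5    # from the text: 5 s prompt, 5 s continuation


def make_beep(sample_rate=SAMPLE_RATE, seconds=BEEP_SECONDS, freq=BEEP_HZ):
    """Return a sine-tone beep as a list of float samples (hypothetical helper)."""
    n = round(sample_rate * seconds)
    return [0.5 * math.sin(2.0 * math.pi * freq * i / sample_rate) for i in range(n)]


def assemble_sample(prompt, continuation, sample_rate=SAMPLE_RATE):
    """Concatenate 5 s of ground-truth prompt, a beep, and 5 s of generated audio."""
    seg = SEGMENT_SECONDS * sample_rate
    if len(prompt) < seg or len(continuation) < seg:
        raise ValueError("each segment must contain at least 5 seconds of audio")
    # Truncate both segments to exactly 5 s and place the beep between them.
    return prompt[:seg] + make_beep(sample_rate) + continuation[:seg]
```

Here `prompt` would hold the ground-truth waveform and `continuation` the waveform generated by a given model. In this sketch the beep is inserted between the two 5-second segments, so the assembled clip is slightly longer than 10 seconds; whether the published samples do the same is not specified.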
ground-truth | CPC | wav2vec2 | logmel-2ms | logmel-10ms | ours-2ms-2048vocab |
---|---|---|---|---|---|