The following samples were generated by the evaluated baselines and our proposed model (right-most column). Each sample consists of 10 seconds of audio, divided by a "beep" sound in the middle. The first 5 seconds are the ground-truth audio used to prime the audio language model. The second 5 seconds are the continuation generated by the audio language model.
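
For readers who want to examine the prompt and the continuation separately, the layout described above can be sketched as a simple split at the 5-second mark. This is only an illustrative sketch: the function name `split_sample`, the 16 kHz sample rate, and the silent stand-in waveform are assumptions, not details taken from this page. The duration of the beep is not stated here, so the sketch leaves it in place rather than trimming it.

```python
from typing import Sequence, Tuple


def split_sample(
    waveform: Sequence[float],
    sample_rate: int,
    prompt_seconds: float = 5.0,
) -> Tuple[Sequence[float], Sequence[float]]:
    """Split a mono waveform into (prompt, continuation) at prompt_seconds.

    Hypothetical helper: the beep that marks the boundary is not removed,
    because its length is not given on this page.
    """
    split_index = int(round(prompt_seconds * sample_rate))
    if not 0 < split_index < len(waveform):
        raise ValueError("split point falls outside the waveform")
    return waveform[:split_index], waveform[split_index:]


# Synthetic 10-second stand-in for one demo sample (silence, for illustration).
sample_rate = 16_000  # assumed sample rate, not stated on this page
waveform = [0.0] * (10 * sample_rate)

prompt, continuation = split_sample(waveform, sample_rate)
```

With a real file, `waveform` would be replaced by the decoded samples and `sample_rate` by the rate reported by whichever audio loader is used.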

ground-truth CPC wav2vec2 logmel-2ms logmel-10ms ours-2ms-2048vocab