Samples

The following samples were generated by the evaluated baselines and by our proposed model (right-most column). Each sample consists of 10 seconds of audio, divided by a "beep" sound in the middle. The first 5 seconds are the ground-truth audio used to prime the audio language model. The second 5 seconds are the continuation generated by the audio language model.
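To make the sample layout concrete, here is a minimal sketch of how one such clip could be assembled from a prompt and a continuation. It is an illustration only, not the code used to produce these samples: the function names, the 16 kHz sample rate, and the beep length and pitch are all assumptions, and it assumes the beep is inserted between the two 5-second segments rather than overwriting part of them.

```python
import math


def make_beep(sample_rate, seconds=0.2, freq_hz=1000.0, amplitude=0.5):
    """Return a sine-tone beep as a list of floats in [-1, 1].

    The duration, pitch, and amplitude are assumed values for illustration.
    """
    n = round(sample_rate * seconds)
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
        for i in range(n)
    ]


def assemble_sample(prompt, continuation, sample_rate=16_000, segment_seconds=5):
    """Concatenate prompt, beep, and continuation into one clip.

    `prompt` is the ground-truth audio used to prime the model and
    `continuation` is the generated audio; both are sequences of floats.
    Each is trimmed to `segment_seconds` before concatenation.
    """
    n = segment_seconds * sample_rate
    if len(prompt) < n or len(continuation) < n:
        raise ValueError("prompt and continuation must each cover one full segment")
    return list(prompt[:n]) + make_beep(sample_rate) + list(continuation[:n])
```

With these assumed defaults, the result holds the 5-second prompt, a short beep, and the 5-second continuation, in that order.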

ground-truth	CPC	wav2vec2	logmel-2ms	logmel-10ms	ours-2ms-2048vocab