How can I use this model in inference?
Hi Anton,
Thanks for sharing this model.
I have a question about using this model in inference. Suppose I have a single-channel audio that has 2 speakers talking without overlapping. What kind of output should I expect from this model? How can I tell which part of the audio is spoken by which speaker?
Looking forward to your reply.
Thanks,
Qiantong
The output will be a 2D tensor of shape [sequence_length, 2]. My understanding is that I can get the frame-level predictions by applying torch.sigmoid(output) > 0.5 — is that right? However, the output looks a bit messy on my test sample.
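For reference, the thresholding step described above can be sketched like this. The logits below are made up for illustration, and NumPy's sigmoid is used instead of torch.sigmoid just to keep the snippet self-contained; the decision rule is the same:

```python
import numpy as np

# Hypothetical model output: (num_frames, num_speakers) raw logits.
# These values are invented for illustration only.
logits = np.array([[ 3.0, -4.0],
                   [ 2.5, -3.5],
                   [-4.0,  2.0],
                   [-3.0,  3.0]])

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Frame-level binary decisions: True where a speaker is predicted active.
active = sigmoid(logits) > 0.5
print(active)
```

Each row is one frame; each column says whether that speaker is active in that frame.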
Hi Anton,
Thanks a lot for sharing this model. I am posting my question here since it is related to what Qiantong has asked.
I can see that the output has shape (num_frames, num_speakers). Could you please guide us on how to map each frame to its corresponding timestamp? Or, put differently, how to chunk the audio into pieces based on who is speaking.
Best
Chakka
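Not the author, but the frame-to-timestamp mapping usually comes down to the model's stride: one output frame per fixed number of input samples. A minimal sketch — the sample rate and hop size below are assumptions for illustration, so check the actual values in the model's documentation or config:

```python
# Assumed values for illustration only; the real ones depend on the
# model's stride and expected input sample rate.
sample_rate = 16000  # Hz (assumed input rate)
hop = 320            # input samples between consecutive output frames (assumed)

def frame_to_time(frame_index):
    """Start time in seconds of the given output frame."""
    return frame_index * hop / sample_rate

print(frame_to_time(50))  # -> 1.0 with these assumed values
```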
Hi,
Did anyone find a solution for mapping the model outputs to timestamps?
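In case it helps anyone landing here: once you have a per-frame speaker decision and know the duration of one frame, grouping consecutive frames into labelled segments is straightforward. A minimal sketch in plain Python — the None-for-silence convention and the frame_duration value are my own assumptions, not part of the model's API:

```python
def frames_to_segments(active, frame_duration):
    """Merge consecutive frames with the same active speaker into
    (start_sec, end_sec, speaker) segments.

    active: per-frame speaker label (e.g. 0, 1, or None for silence),
            e.g. derived from the thresholded sigmoid outputs.
    frame_duration: seconds covered by one output frame (model-dependent;
            an assumption here -- check the model's stride).
    """
    segments = []
    start, current = None, None
    for i, spk in enumerate(active):
        if spk != current:
            if current is not None:  # close the previous speaker run
                segments.append((start * frame_duration, i * frame_duration, current))
            start, current = i, spk
    if current is not None:  # close the final run
        segments.append((start * frame_duration, len(active) * frame_duration, current))
    return segments

# Example: speaker 0, then speaker 1, one silent frame, then speaker 1
# again, with an assumed 20 ms frame step.
print(frames_to_segments([0, 0, 1, 1, None, 1, 1], 0.02))
```

Silent frames simply end the current segment without producing one of their own, so the result is a list of (start, end, speaker) triples you can use to chunk the audio.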