How are you running it? Locally with Python on your own GPU? With whisper.cpp? A Colab instance?

When I run Whisper, why does the output start repeating the same line around the 4-minute mark and keep going like that until the end?
264
00:39:14,000 --> 00:39:24,000
OK?
265
00:39:24,000 --> 00:39:32,000
OK?
266
00:39:32,000 --> 00:39:42,000
OK?
It starts at around the 4:00 mark.
Whisper conditions on previous tokens for context, and if there is a lot of non-standard dialogue (e.g. sex) that can throw a wrench in things. There are a few things to try: 1) try different models, 2) use a VAD, 3) break larger wav files into chunks and stitch the results back together.
Personally I've found fine-tuned models from Japanese researchers perform the best, and I break larger wav files into smaller ones. Some movies just transcribe better than others. If there are a lot of people talking, if there isn't much dialogue, or if there's a shower running or some other distracting noise, I get sub-optimal results.
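The chunk-and-stitch approach (option 3) can be sketched roughly like this. This is a minimal, illustrative sketch, not anyone's actual pipeline: it assumes you transcribe each chunk separately (with Whisper or whatever tool you use) and get back segments with chunk-relative start/end times; the 300-second chunk size and 2-second overlap are arbitrary values you'd tune yourself.

```python
def chunk_spans(total_seconds, chunk_seconds=300.0, overlap=2.0):
    """Split a long recording into overlapping (start, end) spans in seconds."""
    spans = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        spans.append((start, end))
        if end >= total_seconds:
            break
        start = end - overlap  # small overlap so words at the boundary aren't cut off
    return spans


def stitch(chunk_offsets, chunk_segments):
    """Shift chunk-relative segment times to global times and merge.

    chunk_offsets:  start time (seconds) of each chunk in the original file.
    chunk_segments: one list of segments per chunk; each segment is a dict
                    with chunk-relative 'start', 'end', and 'text'.
    Segments that begin inside the region already covered by an earlier
    chunk (the overlap) are dropped so lines aren't duplicated.
    """
    merged, covered_until = [], 0.0
    for offset, segments in zip(chunk_offsets, chunk_segments):
        for seg in segments:
            start, end = seg["start"] + offset, seg["end"] + offset
            if start < covered_until:
                continue  # duplicate from the overlap region
            merged.append({"start": start, "end": end, "text": seg["text"]})
            covered_until = end
    return merged
```

The nice side effect is that each chunk starts with a fresh context window, so a hallucination loop in one chunk can't poison the rest of the file.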