Can't say I'm a fan of those settings. The VAD settings aren't really meant to be touched, apart from the threshold and the VAD method (pyannote vs. silero). They're already optimized for most content. Yours are so far off the defaults that I'm really wondering how you arrived at them. Your threshold is also so low that you'll get a lot of hallucinations because of it.
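To make the threshold point concrete, here is a rough sketch of how a VAD threshold gates audio. It is not how pyannote or silero are actually implemented (real VADs also apply minimum durations and padding); the `speech_spans` helper, the frame probabilities, and the 0.5 / 0.1 values are all made up for illustration.

```python
from typing import List, Tuple


def speech_spans(probs: List[float], threshold: float) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges where the speech probability is
    at or above the threshold. The end index is exclusive."""
    spans = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i  # a speech region begins
        elif p < threshold and start is not None:
            spans.append((start, i))  # the speech region ends
            start = None
    if start is not None:
        spans.append((start, len(probs)))  # region runs to the end of the audio
    return spans


# Hypothetical per-frame speech probabilities (invented for this example).
frame_probs = [0.05, 0.20, 0.90, 0.95, 0.30, 0.15, 0.60, 0.10]

default_like = speech_spans(frame_probs, threshold=0.5)  # stand-in "normal" value
very_low = speech_spans(frame_probs, threshold=0.1)      # stand-in "very low" value

print(default_like)
print(very_low)
```

With the lower value nearly every frame counts as speech, so Whisper gets handed noise and near-silence to transcribe, and that is where it tends to invent text.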
You also can't really chat with or prompt these models; they're designed to do one thing and one thing only (transcribe/translate). AFAIK, prompting them can only cause more hallucinations. Just saying it's adult content might be harmless, though.
I'm not an expert by any means, but this is what I've picked up from nosing around the GitHub repos.
I'm also of the opposite opinion on hallucinations: I don't mind them. I'd rather have a few more hallucinations than miss out on dialogue. It's usually pretty obvious when a single word is wrongly transcribed, and it's very rare for me that a whole sentence is wrong unless I put the VAD threshold very low. I'm sure that with good prompting LLMs can also catch some of these bad transcriptions, but I haven't gone too deep into comparing results. What I use in my prompt to make LLMs catch some of these is: "Handle Whisper ASR transcription errors contextually".
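As a rough sketch of that post-processing idea, the quoted instruction could be wrapped around the subtitle lines before sending them to an LLM. Only the quoted sentence comes from the post; the `build_cleanup_prompt` helper and the rest of the prompt wording are assumptions, and no particular LLM API is implied.

```python
from typing import List

# The instruction quoted in the post; everything else here is illustrative.
ASR_HINT = "Handle Whisper ASR transcription errors contextually"


def build_cleanup_prompt(lines: List[str]) -> str:
    """Build a plain-text prompt asking an LLM to fix ASR mistakes line by line."""
    numbered = "\n".join(f"{i}: {text}" for i, text in enumerate(lines, start=1))
    return (
        "You are cleaning up subtitles produced by Whisper.\n"
        f"{ASR_HINT}.\n"
        "Keep the line numbers and return one line per input line.\n\n"
        f"{numbered}"
    )


# Placeholder subtitle lines, not real transcription output.
prompt = build_cleanup_prompt(["hello", "world"])
print(prompt)
```

Numbering the lines makes it easier to map the LLM's answer back onto the original subtitle timings.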
Thanks for the feedback, T221152. I have tweaked the crap out of WhisperAI. I could tell you horror stories of getting a decent result, tweaking a little more, forgetting the original settings, and then going back to square one, removing all of the components and starting again. I had a nightmare trying to get Faster-Whisper to function. In any case, we are of differing opinions on the hallucination issue, and I see your point.
My various configuration settings are the result of endless experimentation to try to catch dialog that is whispered or that would otherwise be missed by the default settings. The other issue, for me, is the hallucination problem, and this is a constant battle: accuracy at the expense of likely hallucinations. The more specifically you tune your settings, the more likely it is that you will get hallucinations. You have said that you don't mind hallucinations, and that is where we will have to disagree. Ultimately, though, since Whisper is not perfect, not by a long shot, I just have to go with the best that I can get: not perfect, not even close, but good enough for me.
I have also tried all of the models in a variety of scenarios, and one of my annoyances is that once a hallucination starts, it will often continue over very clear dialog, so that Whisper completely skips that dialog. That particular aspect of hallucination really drives me nuts. In any case, for me, it was more of a situation where I was not completely satisfied with the results but they were "good enough".
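For the "hallucination keeps going over clear dialog" problem, one knob worth knowing about is `condition_on_previous_text`, which both openai-whisper and faster-whisper accept in `transcribe()`. When it is on, the previous window's output is fed to the model as the prompt for the next window, so a bad window can carry over. Below is a minimal sketch of keyword arguments, assuming faster-whisper; the values are placeholders, not recommended or tested settings.

```python
# Keyword arguments for faster-whisper's WhisperModel.transcribe().
# The values are illustrative assumptions, not tuned settings.
transcribe_kwargs = {
    # Do not feed the previous window's text into the next window, so a
    # hallucinated line is less likely to carry over into later dialog.
    "condition_on_previous_text": False,
    # Use the built-in VAD filter with its default parameters.
    "vad_filter": True,
}

# Usage, assuming faster-whisper is installed and "audio.wav" exists
# (both are assumptions, so the calls are left commented out):
# from faster_whisper import WhisperModel
# model = WhisperModel("large-v3")
# segments, info = model.transcribe("audio.wav", **transcribe_kwargs)

print(transcribe_kwargs)
```

The trade-off is that text can become less consistent from one window to the next, so it is something to compare against your own results rather than a guaranteed fix.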
As for the Large V2 vs. Large V3 discussion, I have never completely convinced myself of one over the other, so I simply went with V3. In essence, the technology is not yet reliable. It gives you an approximation that, for me, is good enough but not perfect. My final analysis is that almost all of the dialog in a JAV title is just a rehashing of the same old thing over and over again. Where there is clearly a conversation, that is where I find that my settings yield a pretty accurate result. I say "pretty accurate" because I have not found a configuration that gives me a perfect result.
That said, I always pay close attention to your posts because you seem to have a better handle on the specifics than I do, so I really do appreciate your feedback.