For some reason, many subs generated via Whisper have a bunch of subtitles that start at exactly 30 seconds and are then rushed through and offset. Most of the time the timing is great, but I've seen this initial rush a few times and I can't tell how or why it happens.
For the most part, the timings are good compared to most JAV subtitles I've seen. They aren't anime quality, but for automatically generated subs they hold up well.
I think I'm in a spot where I can write a Whisper intro thread, but first I need to understand logprob_threshold. I understand the no_speech_threshold mechanic just fine, but I don't have a good sense for logprob_threshold. compression_ratio_threshold is confusing as well, and I don't know when to set it higher or lower than the default of 2.4.
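For anyone else puzzling over these three thresholds, here is a rough sketch of how they interact, as far as I can tell from the reference openai-whisper code. The function names needs_fallback and should_skip_as_silence are my own, made up for illustration, and the defaults (2.4, -1.0, 0.6) are the stock ones. Treat it as an approximation of the logic, not the library's actual code.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Raw bytes divided by zlib-compressed bytes. Repetitive text scores high."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def needs_fallback(text, avg_logprob,
                   compression_ratio_threshold=2.4,
                   logprob_threshold=-1.0):
    """Hypothetical helper: should this segment be re-decoded at a higher temperature?"""
    # Too repetitive: the text compresses too well, which usually means a loop.
    if (compression_ratio_threshold is not None
            and compression_ratio(text) > compression_ratio_threshold):
        return True
    # Too uncertain: the average token log-probability is below the threshold.
    if logprob_threshold is not None and avg_logprob < logprob_threshold:
        return True
    return False


def should_skip_as_silence(no_speech_prob, avg_logprob,
                           no_speech_threshold=0.6,
                           logprob_threshold=-1.0):
    """Hypothetical helper: should this window be dropped as containing no speech?"""
    if no_speech_threshold is None:
        return False
    skip = no_speech_prob > no_speech_threshold
    # A confident transcription overrides the no-speech guess.
    if logprob_threshold is not None and avg_logprob > logprob_threshold:
        skip = False
    return skip


looped = "I'm coming. " * 20
normal = "The timings are pretty good for automatically generated subtitles."
print(round(compression_ratio(looped), 2), round(compression_ratio(normal), 2))
```

Read this way, raising compression_ratio_threshold tolerates more repetition before a retry, and lowering it retries sooner. Moving logprob_threshold closer to 0 makes Whisper stricter: more segments get re-decoded, and fewer low-confidence windows are rescued from being skipped as silence.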
For condition_on_previous_text, I think I have settled on just setting it to False and keeping it there. You'll get scattered, totally left-field lines, but they can be deleted or replaced in the edit. The upside is that it gets stuck in loops way less often and is more willing to resort to descriptive emotes like "(moaning)", "(crying)", "(Heavy breathing)", "(gagging)", or "*panting*".
When you condition on previous text, it fits itself to one specific style of transcription. With it set to False, if it sounds like some woman is gagging on something, it effectively falls back on a subtitle from those 680k hours of training data where some woman sounds like she's gagging, and uses the corresponding subtitle that noted (gagging).
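For reference, this is roughly how I pass the settings in with the openai-whisper Python package. It's a minimal sketch: build_transcribe_options and transcribe_file are helper names I made up, the "medium" model is just a placeholder, and the three thresholds are left at their defaults.

```python
def build_transcribe_options():
    """Hypothetical helper: keyword arguments for Whisper's transcribe()."""
    return {
        # Each 30-second window is decoded without the previous window's text
        # as a prompt, which is what cuts down on loops.
        "condition_on_previous_text": False,
        # Stock defaults, listed so they are easy to tweak later.
        "no_speech_threshold": 0.6,
        "logprob_threshold": -1.0,
        "compression_ratio_threshold": 2.4,
    }


def transcribe_file(path, model_name="medium"):
    """Run Whisper on one file. Needs the openai-whisper package and ffmpeg."""
    import whisper  # imported here so the options helper works without it

    model = whisper.load_model(model_name)
    return model.transcribe(path, **build_transcribe_options())
```

The result is a dict whose "segments" list carries start, end, and text for each line, which is what you'd turn into an SRT for editing.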