Whisper and its many forms


SamKook

Grand Wizard
Staff member
Super Moderator
Uploader
May 10, 2009
3,554
4,919
The AI is just guesswork, so sometimes it thinks noise is people speaking and sometimes it thinks people speaking is noise. That can happen even if the speech is loud enough.
AI as we have it (which is machine learning, not actual intelligence, since it relies 100% on its training being accurate) will never be perfect, so manual editing/fiddling will always be required.

If you don't tell avidemux to increase the volume, it just copies it as-is, which is the point of the demuxing part of my tutorial.
A problem with increasing the volume across the whole file is that it might make some parts too loud to be recognized as speech, so that should really only be used for specific portions.
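Boosting only specific portions can be sketched outside avidemux with ffmpeg's volume filter and its timeline `enable` option. This is a rough illustration, not the method from the tutorial: the helper names, the 6 dB gain, and the file names are made up for the example.

```python
def boost_filter(ranges, gain_db=6.0):
    """Build an ffmpeg -af string that raises the volume only inside
    the given (start, end) ranges, expressed in seconds."""
    parts = [
        f"volume={gain_db}dB:enable='between(t,{start},{end})'"
        for start, end in ranges
    ]
    return ",".join(parts)


def boost_command(src, dst, ranges, gain_db=6.0):
    """Return the ffmpeg argument list; run it with subprocess.run(cmd, check=True)."""
    return ["ffmpeg", "-i", src, "-af", boost_filter(ranges, gain_db), dst]


# Hypothetical file names and time ranges, for illustration only.
cmd = boost_command("audio.m4a", "audio_boosted.wav", [(65, 80), (300, 312)])
```

Everything outside the listed ranges is left at its original level, so loud parts are not pushed any further.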

If you want more speech to be recognized, play with the whisper settings themselves. There are no perfect settings to always use, or they would be the defaults, so it'll depend on what you want. More speech recognized also means more noise recognized as speech.

From what I saw, large-v3 recognizes more speech than large-v2, but the quality of its translation doesn't seem to be as good, so an easy approach would be to use v3 to fill in what v2 misses.
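One way to picture "use v3 to fill in what v2 misses" is a small merge step over the two subtitle lists. This is only a sketch under assumptions: segments are taken to be dicts with `start`, `end` and `text` keys (the shape whisper's Python output uses), and "missed" is interpreted as "no v2 line overlaps this v3 line in time".

```python
def overlaps(a, b):
    """True if two segments share any span of time."""
    return a["start"] < b["end"] and b["start"] < a["end"]


def fill_gaps(primary, secondary):
    """Keep every primary (e.g. large-v2) segment and add only those
    secondary (e.g. large-v3) segments that overlap none of them."""
    extra = [s for s in secondary if not any(overlaps(s, p) for p in primary)]
    return sorted(primary + extra, key=lambda s: s["start"])


v2 = [{"start": 0.0, "end": 2.0, "text": "a"}]
v3 = [
    {"start": 0.5, "end": 1.5, "text": "x"},  # overlaps v2, so it is dropped
    {"start": 5.0, "end": 6.0, "text": "b"},  # fills a gap, so it is kept
]
merged = fill_gaps(v2, v3)
```

The added lines still need a manual check, since more recognized speech also means more noise recognized as speech.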

There is a thread that discusses tuning the settings that you may want to look at here: https://www.akiba-online.com/thread...age-an-intro-guide-to-subtitling-jav.2115103/
 

composite

Active Member
Jul 25, 2015
222
146
Many thanks for the reply!
 

mei2

Well-Known Member
Dec 6, 2018
221
362
I use the colab and it works great. But sometimes it doesn't transcribe all the audio. Someone is speaking but there's no translated text.

Does anyone have suggestions how to fix that? Does avidemux not make the volume loud enough for the translation program to hear the audio? Or something else? thanks.

I second the earlier post by @SamKook. There are some techniques that can help you capture dialog that is hard to pick up or goes missing, but they all come with a trade-off. Here are some suggestions:

  1. Make sure your audio is good and there are no dropped frames or bad headers for the parts that you need to transcribe
  2. Decrease the VAD threshold. A value of 0.2 usually is good for details. I sometimes dial it down to 0.18
  3. Increase the values for beam_size and best_of
  4. Increase temperature value
  5. Increase the value for patience
  6. Normalise the audio volume. This one is tricky: you don't want to distort the spectrogram. I suggest keeping it under -1 dB.
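For suggestion 6, a minimal peak-normalisation sketch is shown here. It assumes "under -1 dB" means the loudest sample should end up at -1 dBFS, and that the audio is already decoded to floats in the range -1.0 to 1.0; the function name is made up for the example.

```python
def peak_normalize(samples, target_dbfs=-1.0):
    """Scale samples so the loudest one sits at target_dbfs.
    One gain is applied to the whole signal, so the waveform shape
    (and therefore the spectrogram's relative levels) is unchanged."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # pure silence: nothing to scale
    target = 10 ** (target_dbfs / 20)  # dBFS to linear amplitude
    gain = target / peak
    return [s * gain for s in samples]


normalized = peak_normalize([0.1, -0.5, 0.25])
```

Because no sample exceeds the target, nothing clips; compressors or limiters would be a different technique with a different effect on the spectrogram.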

The trade-off with many of these is more hallucination and lower speed.
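If you run the openai-whisper Python package directly rather than the colab form, suggestions 3 to 5 map onto keyword arguments of `model.transcribe`. The sketch below is not the colab's code: the helper names and the specific values are illustrative, apart from patience=2, which comes up later in this thread.

```python
def detail_options(beam_size=8, best_of=8, patience=2.0,
                   temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    """Decode options aimed at catching more dialog (slower, more hallucination).
    beam_size and patience apply at temperature 0; best_of applies to the
    sampled fallback temperatures above 0."""
    return {
        "beam_size": beam_size,
        "best_of": best_of,
        "patience": patience,
        "temperature": temperature,
    }


def transcribe(audio_path, model_name="large-v2", **options):
    """Translate Japanese audio to English with the given decode options."""
    import whisper  # openai-whisper; imported here so the helper above works without it

    model = whisper.load_model(model_name)
    return model.transcribe(audio_path, language="ja", task="translate", **options)


options = detail_options()
# result = transcribe("audio.wav", **options)  # hypothetical file name
```

The VAD threshold from suggestion 2 is not a whisper option; it belongs to whichever VAD runs before whisper, so it is left out here.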

Reading your post, I suspect the missing dialog/lines are a side effect of the VAD, especially if you're using Silero VAD. In my case I have started to drop VAD entirely. The developer of Silero VAD is working on a new version that is supposed to address the issue of dropped dialog. I have been waiting for it for some time now, and I hope he can finish his work soon.

Also, a quick workaround for missing dialog is to use SubtitleEdit. Just select the parts that are missing and transcribe those using, say, whispercpp.
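SubtitleEdit handles the selection for you, but the same idea can be done by hand: cut the missing part out with ffmpeg and transcribe only that clip. A rough sketch with made-up file names and times; the 16 kHz mono WAV output is chosen because that is the input format whisper.cpp expects.

```python
def clip_command(src, start, end, dst):
    """ffmpeg arguments that extract [start, end] seconds of audio as 16 kHz mono.
    Run with subprocess.run(cmd, check=True)."""
    return [
        "ffmpeg", "-i", src,
        "-ss", str(start), "-to", str(end),  # keep only the missing part
        "-vn",                               # drop any video stream
        "-ac", "1", "-ar", "16000",          # mono, 16 kHz
        dst,
    ]


# Hypothetical input file and time range.
cmd = clip_command("movie.mp4", 754.0, 790.5, "missing_part.wav")
```

Subtitle times from the clip start at zero, so they need to be shifted by the clip's start time before merging them back.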
 

composite

Looking at the colab, there are only 4 "required" settings:

audio_path, model_size, language and translation_mode.

Then 7 "advanced" settings:

deepl_authkey, source_separation, vad_threshold, chunk_threshold, deepl_target_lang, max_attempts and initial_prompt

How do I change the values for beam_size and best_of? And then temperature value and patience?
 

mei2


I just made a fork of the WhisperWithVAD that exposes those parameters. Here:

WhisperWithVAD-PRO


I needed to call it something other than the original name, so I went with PRO to indicate that there are more advanced settings for users :)
I have tested it on only 2 test audios. Let me know if you run into any issues.
If I make any new updates I will keep them in this repository:

 

ArtemisINFJ

God Slayer, Dawnbreaker
Nov 5, 2022
68
84

I wanted to say thank you for your contributions, especially WhisperJAV. Ngl, it really helps a lot with creating early versions of subs for my fav JAV. What's your plan for the next iteration?
 

mei2


Yes, speed was the main purpose of WhisperJAV: to make subbing possible for same-day releases. Speed was primary while quality was secondary.
I would like the next iteration/project to be about quality. I think there is a lot of room to improve the quality of subs by preparing the audio better. At any rate, I am open to suggestions, ideas, and feature requests.

PS. I looked into training / fine-tuning Whisper for JAV, but my early research shows that the loss would be higher than the gain.
 

ArtemisINFJ

I don't have any good ideas at the moment. But I strongly support improving quality in the next iteration, since I find that the existing models we commonly use aren't living up to expectations. The model usually gets messed up on AV with multiple complex scenes, which really takes away from the subtitling experience.

PS. I have tried your WhisperVAD Pro version. In my early tests the results come out better than with the existing WhisperVAD, but the processing seems to take a bit longer than usual.

Could you share your early research results on fine-tuning a model for JAV? I don't have strong skills in this area, but maybe I can help with other things, such as variables.
 

mei2


Yes, the PRO version is expected to be slower. There are 3 main reasons for that:
  • it uses word timestamps for more accurate timing,
  • it uses the new option provided by whisper to reduce hallucination and repetition (hallucination_silence_threshold),
  • it uses a higher value for patience to get more accurate word predictions (patience=2).

The whisper option hallucination_silence_threshold is still under development/refinement. It has a tendency toward false positives. If you see that some obvious lines are missing, you can reduce that value or remove it. As always, every option in whisper comes with a trade-off :)
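For anyone calling the openai-whisper package directly, the three options above can be sketched as keyword arguments to `model.transcribe`. This is not the PRO notebook's code: the helper name, the 2.0 second threshold and beam_size=5 are assumptions for the example. Two constraints come from whisper itself: hallucination_silence_threshold only takes effect with word_timestamps=True, and patience requires beam_size to be set.

```python
def pro_style_options(silence_threshold=2.0):
    """Options mirroring the three points above.
    Pass silence_threshold=None to leave the hallucination filter off,
    e.g. when obvious lines are going missing."""
    options = {
        "word_timestamps": True,  # needed for hallucination_silence_threshold
        "beam_size": 5,           # patience is only valid with beam search
        "patience": 2.0,
    }
    if silence_threshold is not None:
        options["hallucination_silence_threshold"] = silence_threshold
    return options


with_filter = pro_style_options()
without_filter = pro_style_options(silence_threshold=None)
# result = model.transcribe("audio.wav", **with_filter)  # model from whisper.load_model(...)
```

Comparing a run with and without the filter on the same file is a quick way to see whether it is the cause of dropped lines.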
 

Dom047

New Member
May 5, 2016
9
4
test.JPG

I keep getting an ffmpeg error on both the PRO and regular VAD versions.
Before this, the PRO version also gave me an error that "source separation" wasn't defined, even though it's checked.

Anyways hope someone can help me solve this issue.
 

SamKook


My first guess would be that you didn't fill out the audio path properly. What did you put in that field?
 

Dom047

I usually do a file upload, but this time I tried linking my Google Drive and providing the path that way. I'll try to run it again later and verify whether it's a path issue; I thought an ffmpeg error would have meant something else. I don't know much about this stuff, but I appreciate the input.
 

SamKook

If you look closely at the line that raises the error, you'll see it's where ffmpeg loads the audio as an input (audio_path is the variable that holds the path you provide) to split it into chunks for the VAD system (or to pre-process it; the code is cut off there and I haven't looked at what it does exactly).

It could be an issue with ffmpeg itself, but then everyone would have the same issue. The other part of the equation is the only thing that requires user input, so it's much more likely to be a user input issue. It could be something else, but without more in-depth information, it's always safer to go with the more likely option.
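A quick way to rule the path in or out is to check it in a separate cell before running the notebook. This is a generic sketch, not part of the colab; the function name is made up, and the Drive path in the comment is only the usual mount location, which may differ in your session.

```python
from pathlib import Path


def check_audio_path(audio_path):
    """Return the path if it points at an existing file, else raise a clear error."""
    path = Path(audio_path.strip())  # stray spaces from copy/paste are a common culprit
    if not path.is_file():
        raise FileNotFoundError(f"No file at {path}; check the audio_path field")
    return path


# Typical Google Drive location in Colab after mounting (assumption, adjust as needed):
# check_audio_path("/content/drive/MyDrive/audio.mp3")
```

If this raises, ffmpeg never had a file to open, so the ffmpeg error was a path problem after all.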
 

granca

Member
Mar 4, 2017
62
78
In my experience, shortening the audio file dramatically reduces hallucination, especially since the large model tends to hallucinate quite a lot on long files (compared to the base model). Scripts based on the old Google model used to do that quite effectively. If I remember correctly, the code I was using also automated a lot of the prep work on the audio file. I might be able to dig up the code from GitHub if you are interested.
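This is not the old script, just a minimal sketch of the splitting idea: compute chunk boundaries with a small overlap so a line that straddles a cut is not lost. The 10 minute chunk length and 2 second overlap are arbitrary example values.

```python
def chunk_bounds(duration, chunk_len=600.0, overlap=2.0):
    """Split [0, duration] seconds into (start, end) chunks of at most
    chunk_len seconds, each starting `overlap` seconds before the previous end."""
    if chunk_len <= overlap:
        raise ValueError("chunk_len must be larger than overlap")
    bounds = []
    start = 0.0
    while start < duration:
        end = min(start + chunk_len, duration)
        bounds.append((start, end))
        if end >= duration:
            break
        start = end - overlap
    return bounds


bounds = chunk_bounds(25, chunk_len=10, overlap=2)
```

Each (start, end) pair can then be cut out and transcribed separately, with the resulting subtitle times shifted by the chunk's start.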