Whisper and its many forms


SamKook

Grand Wizard
Staff member
Super Moderator
Uploader
May 10, 2009
3,566
4,946
The AI is just guesswork, so sometimes it thinks noise is people speaking and sometimes it thinks people speaking is noise. That can happen even if the audio is loud enough.
AI as we currently have it (which is machine learning, not actual intelligence, since it relies entirely on its training being accurate) will never be perfect, so manual editing/fiddling will always be required.

If you don't tell avidemux to increase the volume, it just copies the audio as-is, which is the point of the demuxing part of my tutorial.
A problem with increasing the volume across the whole file is that it can make some parts too loud to be recognized as speech, so a volume boost should really only be applied to specific portions.
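Boosting only a chosen window rather than the whole track can be done with ffmpeg's timeline editing on the volume filter. A minimal sketch that just builds the command (the file names and values are placeholders, not part of the tutorial):

```python
def boost_segment_cmd(src, dst, start_s, end_s, gain_db):
    """Build an ffmpeg command that raises the volume only between
    start_s and end_s (seconds), leaving the rest of the audio untouched."""
    # enable='between(t,a,b)' applies the volume filter only inside
    # that time window; everything else passes through unchanged.
    audio_filter = f"volume={gain_db}dB:enable='between(t,{start_s},{end_s})'"
    return ["ffmpeg", "-y", "-i", src, "-af", audio_filter, dst]

# subprocess.run(boost_segment_cmd(...)) would execute it, assuming
# ffmpeg is installed and on the PATH.
print(" ".join(boost_segment_cmd("movie_audio.wav", "boosted.wav", 300, 330, 6)))
```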

What you want, if you want more speech to be recognized, is to play with the whisper settings themselves. There are no perfect settings to always use, or they would be the default, so it will depend on what you want. More speech recognized also means more noise recognized as speech.

large-v3 will recognize more speech than large-v2 from what I saw, but the quality of the translation doesn't seem to be as good, so an easy approach is to use v3 to fill in what v2 misses.

There is a thread that discusses tuning the settings that you may want to look at here: https://www.akiba-online.com/thread...age-an-intro-guide-to-subtitling-jav.2115103/
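The "use v3 to fill in what v2 misses" idea can be sketched as a merge of the two segment lists. This is an illustrative helper, not part of any tool mentioned here; each segment dict mirrors the start/end/text fields of whisper's result['segments']:

```python
def fill_gaps(v2_segments, v3_segments, min_len=1.0):
    """Keep all large-v2 segments and add any large-v3 segment that
    falls entirely inside a silence gap left by v2.

    Each segment is a dict with 'start', 'end' (seconds) and 'text'.
    """
    def overlaps(seg, others):
        return any(seg["start"] < o["end"] and seg["end"] > o["start"]
                   for o in others)

    merged = list(v2_segments)
    for seg in v3_segments:
        # Only adopt v3 lines that don't collide with an existing v2 line
        # and are long enough to be worth keeping.
        if not overlaps(seg, v2_segments) and seg["end"] - seg["start"] >= min_len:
            merged.append(seg)
    return sorted(merged, key=lambda s: s["start"])
```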
 

composite

Active Member
Jul 25, 2015
222
146
Many thanks for the reply!
 

mei2

Well-Known Member
Dec 6, 2018
225
370
I use the colab and it works great, but sometimes it doesn't transcribe all the audio: someone is speaking but there's no translated text.

Does anyone have suggestions on how to fix that? Does avidemux not make the volume loud enough for the translation program to hear the audio? Or is it something else? Thanks.

I second the earlier post by @SamKook. There are some techniques that can help you capture the dialog that is difficult to capture or missing; however, they all have a trade-off. Here are some suggestions:

  1. Make sure your audio is good and there are no dropped frames or bad headers in the parts that you need to transcribe.
  2. Decrease the VAD threshold. A value of 0.2 is usually good for detail; I sometimes dial it down to 0.18.
  3. Increase the values for beam_size and best_of.
  4. Increase the temperature value.
  5. Increase the value for patience.
  6. Normalise the audio volume. This one is tricky: you don't want to distort the spectrogram, so I suggest keeping the peak under -1 dB.

The trade-off with many of these is hallucination, and speed.
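Points 2-5 map onto real decoding options of the openai-whisper package; the values below are illustrative starting points, not recommendations:

```python
# Decoding options that make whisper search harder for speech.
# The keys are genuine model.transcribe() parameters in openai-whisper;
# the values here are only examples to tweak from.
harder_search = dict(
    beam_size=10,                                 # default is 5
    best_of=10,                                   # candidates when sampling
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # fallback schedule
    patience=2.0,                                 # beam-search patience
)

# Usage (requires `pip install openai-whisper` and a model download):
#   import whisper
#   model = whisper.load_model("large-v2")
#   result = model.transcribe("audio.wav", language="ja",
#                             task="translate", **harder_search)
```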

Reading your post, I suspect the missing dialog/lines are a side effect of the VAD, especially if you're using Silero VAD. In my case, I have started to drop VAD entirely. The developer of Silero VAD is working on a new version which is supposed to address the issue of dropped dialog. I have been waiting for the new version for some time now; I hope he can finish his work soon.

Also, a quick workaround for the missing dialog is to use SubtitleEdit: just select the parts that are missing and transcribe those using, say, whispercpp.
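The -1 dB target in point 6 can be checked numerically before touching the audio. A small sketch (hypothetical helper, assuming samples already decoded to floats in [-1.0, 1.0]):

```python
import math

def peak_normalize_gain_db(samples, target_dbfs=-1.0):
    """Gain in dB that would bring the loudest sample to target_dbfs."""
    peak = max(abs(s) for s in samples)
    peak_dbfs = 20 * math.log10(peak)   # 1.0 is full scale (0 dBFS)
    return target_dbfs - peak_dbfs

# A tone peaking at 0.1 sits at -20 dBFS, so it needs +19 dB to reach -1 dBFS.
tone = [0.1 * math.sin(2 * math.pi * i / 1000) for i in range(1000)]
print(round(peak_normalize_gain_db(tone), 1))  # → 19.0
```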
 

composite

Active Member
Jul 25, 2015
222
146
Looking at the colab, there are only 4 "required" settings:

audio_path, model_size, language and translation_mode.

Then 7 "advanced" settings:

deepl_authkey, source_separation, vad_threshold, chunk_threshold, deepl_target_lang, max_attempts and initial_prompt

How do I change the values for beam_size and best_of? And the temperature and patience values?
 

mei2

Well-Known Member
Dec 6, 2018
225
370
Looking at the colab, there are only 4 "required" settings:
How do I change the values for beam_size and best_of? And then temperature value and patience?

I just made a fork of the WhisperWithVAD that exposes those parameters. Here:

WhisperWithVAD-PRO


I needed to call it something other than the original name, so I went with PRO to indicate that there are more advanced settings for the users :)
I have tested it on only 2 test audios. Let me know if there are any issues.
If I make any new updates I will keep them in this repository:

 

ArtemisINFJ

God Slayer, Dawnbreaker
Nov 5, 2022
68
84
I just made a fork of the WhisperWithVAD that exposes those parameters. Here:

WhisperWithVAD-PRO


I needed to call it something else than the original name, I went with PRO to indicate that there are more advanced settings for the users :)
I have tested it only on 2 test audios. Let me know if any issues.
If I make any new updates I will keep them in this repository:

I've wanted to say thank you for your contributions, especially WhisperJAV. Ngl, it really helps a lot with creating early versions of subs for my fav JAVs. What's your plan for the next update/future iteration?
 

mei2

Well-Known Member
Dec 6, 2018
225
370
I've wanted to say thank you for your contributions, especially WhisperJAV. Ngl, it really helps a lot with creating early versions of subs for my fav JAVs. What's your plan for the next update/future iteration?

Yes, speed was the main purpose of WhisperJAV, to make subbing possible for same-day releases. Speed was primary while quality was secondary.
I would like the next iteration/project to be about quality. I think there is a lot of room to improve the quality of subs by preparing the audio better. At any rate, I am open to suggestions and feature requests.

PS. I looked into training/fine-tuning Whisper for JAV, but my early research shows that the loss would be higher than the gain.
 

ArtemisINFJ

God Slayer, Dawnbreaker
Nov 5, 2022
68
84
I don't have any good ideas at the moment, but I highly support you in improving the quality of the next iteration, since I find the existing models we commonly use aren't living up to their standard. The models usually get messed up on AV with multiple complex scenes, which really takes away from the subtitling experience.

PS. I have looked at and tried your WhisperVAD PRO version. My early tests come out better than the existing WhisperVAD, but the transcoding process seems to take a bit longer than usual.

Could you share your early research results on fine-tuning a model for JAV? I don't have high skills in this matter, but maybe I can help with other things, such as variables, etc.
 

mei2

Well-Known Member
Dec 6, 2018
225
370
PS. I have looked at and tried your WhisperVAD PRO version. My early tests come out better than the existing WhisperVAD, but the transcoding process seems to take a bit longer than usual.

Yes, the PRO version is expected to be slower. There are 3 main reasons for that:
  • it uses word timestamps for more accurate timing,
  • it uses the new option provided by whisper to reduce hallucination and repetition (hallucination_silence_threshold),
  • it uses a higher value for patience to get more accurate word predictions (patience=2).

The whisper option hallucination_silence_threshold is still under development/refinement and has a tendency for false positives. If you see that some obvious lines are missing, you can reduce that value or remove it. As always, every option in whisper comes with a trade-off :)
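For reference, those three reasons correspond to real transcribe() parameters in recent openai-whisper releases; the threshold value below is illustrative:

```python
# Options matching the three reasons above. Note that
# hallucination_silence_threshold only takes effect when
# word_timestamps=True, and needs a recent openai-whisper release.
pro_options = dict(
    word_timestamps=True,                 # per-word timing
    hallucination_silence_threshold=2.0,  # seconds of silence to skip
    patience=2.0,                         # more thorough beam search
)

# Usage sketch:
#   import whisper
#   model = whisper.load_model("large-v2")
#   result = model.transcribe("audio.wav", **pro_options)
```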
 

Dom047

New Member
May 5, 2016
9
4
[Attached screenshot: test.JPG]

I keep getting an ffmpeg error on both the PRO and regular VAD versions.
The PRO also gave me an error before this that "source separation" wasn't defined, even though it's checked.

Anyway, I hope someone can help me solve this issue.
 

SamKook

Grand Wizard
Staff member
Super Moderator
Uploader
May 10, 2009
3,566
4,946
I keep getting an ffmpeg error on both the PRO and regular VAD versions.
The PRO also gave me an error before this that "source separation" wasn't defined, even though it's checked.

Anyway, I hope someone can help me solve this issue.

My first guess would be that you didn't fill out the audio path properly. What did you put in that field?
 

Dom047

New Member
May 5, 2016
9
4
My first guess would be that you didn't fill out the audio path properly. What did you put in that field?
I usually do a file upload, but this time I tried linking my Google Drive and then providing the path that way. I'll try to run it again later and verify whether it's a path issue; I thought the ffmpeg error would have meant something else. I don't know much about this stuff, but I appreciate the input.
 

SamKook

Grand Wizard
Staff member
Super Moderator
Uploader
May 10, 2009
3,566
4,946
If you look closely at the line that creates the error, you see it's when ffmpeg loads the audio as an input (audio_path is the variable name that holds the path you provide) for splitting it into chunks (or pre-processing it; the code cuts off there and I haven't looked at exactly what it does) for the VAD system.

It could be an issue with ffmpeg itself, but everyone would have the same issue if it were, and the other part of the equation is the only thing that requires user input, so it's much more likely to be a user-input issue. It could be something else, but without more in-depth information, it's always safer to go with the more likely option.
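A quick way to rule the path in or out before re-running the whole notebook is to probe it the same way: have ffmpeg read the file and discard the output. This is a hypothetical helper, not part of the notebook:

```python
import os
import subprocess

def check_audio_path(audio_path):
    """Return True if the file exists and ffmpeg can read it."""
    if not os.path.isfile(audio_path):
        print(f"Not found: {audio_path!r} - check for typos or a "
              "missing Google Drive mount.")
        return False
    # Ask ffmpeg to parse the file without writing any output file.
    proc = subprocess.run(
        ["ffmpeg", "-v", "error", "-i", audio_path, "-f", "null", "-"],
        capture_output=True, text=True,
    )
    if proc.returncode != 0:
        print("ffmpeg could not read the file:", proc.stderr.strip())
        return False
    return True
```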
 

granca

Member
Mar 4, 2017
62
78
Shortening the audio file dramatically improves hallucination reduction in my experience, especially since the large models tend to hallucinate quite a lot on long files (compared to the base model). A script based on the old Google model used to do that quite effectively. The code I was using also automated a lot of the prep work on the audio file, if I remember correctly. I might be able to dig up the code from GitHub if you are interested.
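The splitting step described above can be sketched like this (pure Python over an already-decoded sample array; the 30-second chunk length is an arbitrary example):

```python
def split_into_chunks(samples, sample_rate, chunk_seconds=30):
    """Split a flat list of PCM samples into fixed-length chunks.

    Shorter inputs give the model less room to drift into repeated or
    hallucinated text on long files; each chunk would be transcribed on
    its own and the results concatenated afterwards.
    """
    chunk_len = chunk_seconds * sample_rate
    return [samples[i:i + chunk_len]
            for i in range(0, len(samples), chunk_len)]

# 75 seconds of fake audio at 16 kHz -> three chunks (30 s, 30 s, 15 s).
fake = [0.0] * (75 * 16000)
chunks = split_into_chunks(fake, 16000)
print([len(c) // 16000 for c in chunks])  # → [30, 30, 15]
```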