Whisper (OpenAI) - Automatic English Subtitles for Any Film in Any Language - An Intro & Guide to Subtitling JAV

I am trying to transcribe NEO-017.mp4. Is it normal to wait longer than an hour?
Performance depends on your hardware and the specific software you're using (there are tons of Whisper variations with different models), so that's impossible to answer unless you share that information first.
 
Edited: Depends on what pipeline you're using. NEO-017.mp4 is 144 minutes. As SamKook said, it is impossible to say for sure. But if I had to guess based on the T4 Colab env, I would say:

Fidelity | Aggressive : 40min - 50min
Balanced | Aggressive : 15min - 20min

Plus it usually takes 5-10min to do the install and load.
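
As a rough way to apply these numbers to a file of a different length, the sketch below scales the estimates linearly with duration. The linear scaling and the name `estimate_minutes` are assumptions for illustration, not something the tool reports; real run time also depends on how much speech the file contains.

```python
# Reference timings quoted above for a 144-minute file on a Colab T4.
REFERENCE_DURATION_MIN = 144
REFERENCE_TIMINGS_MIN = {
    "fidelity/aggressive": (40, 50),
    "balanced/aggressive": (15, 20),
}
SETUP_MIN = (5, 10)  # install and model load, treated as constant per run


def estimate_minutes(duration_min, pipeline):
    """Scale the reference range linearly to another duration (an assumption)."""
    low, high = REFERENCE_TIMINGS_MIN[pipeline]
    scale = duration_min / REFERENCE_DURATION_MIN
    return (low * scale + SETUP_MIN[0], high * scale + SETUP_MIN[1])


if __name__ == "__main__":
    for name in REFERENCE_TIMINGS_MIN:
        low, high = estimate_minutes(144, name)
        print(f"{name}: {low:.0f}-{high:.0f} min including setup")
```

On these numbers, a 144-minute file should finish in about an hour or less on a T4, setup included.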
 
Hi,

I tested it with 4 different movies and everything works well. It took around 43 min on Fidelity/Aggressive for each movie. Thank you for the fix @mei2

Note: During downtime when they are not talking, Whisper AI generates a full Japanese recipe ahaha. Example: "I used the leftover rice and mixed it with soy sauce and ginger", and more sentences of that type.

Am I using the wrong setting, and am I the only one who gets these hallucinations from Whisper? Not a big deal tbh, I'm just wondering

-Besh
 
The aggressive sensitivity does carry more risk of hallucination. The floodgates are open with aggressive, and I actually widened them even more for the latest release (which I think went too far). I suggest you use Auditok scene detection and the TEN VAD segmenter with the aggressive sensitivity.
 
I got a bread recipe once, but the most common hallucinations I'm seeing are "Please subscribe to my Channel! Follow me on Twitter!"
 
Hhmm, the sanitization filters should have caught those two hallucinations. Do you get those in Japanese transcription or in direct to English?
Using the Transcription Mode Direct To English.
I ran the same video through the three different Balanced modes. The video has a pretty long music interlude at the start (over 2 minutes), and the first dialogue is a phone call (you only hear the voice on the phone, and it's purposefully fuzzy to indicate it's a phone call).

During the interlude (no dialogue), on Aggressive I got a long, weird set of doll-making instructions or something, 13 lines long! "Put the pins in and attach the head part to the body", etc. It goes on for a while, lol. But it also captured the fuzzy phone conversation most accurately.

On Balanced I got "Please subscribe to our channel and follow us on Twitter", and "Please subscribe to our channel" and "Please subscribe to my channel and give it a high rating." It also captured the phone conversation somewhat.

On Conservative I got "Please subscribe to my channel!" just twice, but it also completely missed the phone conversation.

Seems like this video would be a good test of your Ensemble mode. Possibly running Aggressive and then Balanced would catch the phone conversation and eliminate the weird doll-making instructions. However, I can't get it to translate in Ensemble mode (I can't get Local LLM to work, and don't have a subscription to any of the services yet).

Edit: Hey, just checked your site. You are up to 1.8.10? I'm still on 1.8.8 on the Mac; I'll update.
 
Yes, please send me the link to video or the ID.

My sanitization filters for direct to English are not as strong as those for Japanese. If you keep track of the hallucination phrases, please send them to me and I'll add them to the filter.
 
Done!
"Please subscribe to my channel!" is really common with all three Balanced modes, whenever there is dead space or just music.
When I come across more, I'll send them to you either here or on your GitHub.
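
For anyone who wants to clean existing subtitles in the meantime, here is a minimal sketch of a phrase-based filter built from the lines reported above. It is not WhisperJAV's actual sanitization filter; the pattern list and the names `is_hallucination` and `filter_subtitles` are hypothetical.

```python
import re

# Phrases reported in this thread; extend the list as new ones turn up.
HALLUCINATION_PATTERNS = [
    r"please subscribe to (my|our) channel",
    r"follow (me|us) on twitter",
    r"give it a high rating",
]
_COMPILED = [re.compile(p, re.IGNORECASE) for p in HALLUCINATION_PATTERNS]


def is_hallucination(text):
    """Return True if the subtitle text matches a known hallucination phrase."""
    return any(p.search(text) for p in _COMPILED)


def filter_subtitles(lines):
    """Drop subtitle texts that match a known hallucination phrase."""
    return [line for line in lines if not is_hallucination(line)]


if __name__ == "__main__":
    sample = ["Please subscribe to my channel!", "Hello? Can you hear me?"]
    print(filter_subtitles(sample))
```

A fixed phrase list only catches the repeated stock lines. Free-form hallucinations such as the recipe or the doll-making instructions would need a different approach, for example checking whether the line falls in a stretch with no detected speech.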
 
It would be a really good idea to have a collection of movies along with the settings that work best for each, or a place where we can contribute our best results, if you know what I mean :).
 
How big is the difference in subs quality between balanced and fidelity? Are there just small differences or are the results a lot better using fidelity? I have loads of VRAM so I could run either, but it does seem to be slower, so I'm wondering whether it is worth it.
 
How's the quality of the Japanese to English translations? It's been a while since I last followed this thread. Last time I used AI translation it did a below average job. I think I used openwhisper; it's been a while and I can't recall exactly. Hopefully things have come a long way and the updates have made it better. I would like to hear what your experiences have been like with the translations.
 
There are so many options with WhisperJAV, I'm really just starting to experiment with some of the possibilities.
The Transcription Mode/Balanced is probably fine for basic translation. I've gotten better results than the typical Chinese->Japanese->English route that most SubTitleCat subs seem to have.

But some of the options are pretty wild.
There is a Tone setting, Standard or Adult/Explicit, which seems like it would maybe enhance language during sex scenes.

This one video I'm translating starts with a couple watching a YouTube workout video. The "Standard" translation (which, based on my basic Japanese language skills, is pretty accurate) has the instructor shouting out typical exercise commands, but the "Adult/Explicit" one has changed them all to be completely suggestive and raunchy, lol.
 
It's an AI and it will continue to improve over time. With that said, I do think this awesome collaboration is worth it and has improved a lot since the beginning. It will never be on par with the work done by a native Japanese speaker who translates a movie you pay $250 to have translated.

But for the average Joe who doesn't understand a word of Japanese and just wants to know what's going on with the story, it is the best tool available. On top of that, it's free and takes 45 minutes.

Yes, there are hallucinations — yes, I sometimes get a recipe for making bread, or a prompt to subscribe to some Twitter account and leave a good review lol, but I still cum by the end of the scene, which is the point with adult content.

-Besh
 
I have started experimenting with the WhisperJAV GUI. For plot-heavy Attackers movies I am not getting good results yet. There is a lot to potentially control.

What translation layer do people use if they are using Anime Whisper for transcription to Japanese?

Since Attackers stuff has a lot of plot, it feels like the Adult option is way too filthy in what should be normal plot scenes.

Are there suggestions on what scene detection to use, or on the translation prompt regarding the plot? Scene detection producing summaries is really interesting, but I don't know the best way to leverage it. The documentation does not say much about what the scene summary is, how it is generated, or how the coarse plots are identified.

Is there any audio-only format that works, or does it always require a full video file? For my usual faster whisper translations I can always use a .mka file that is just the audio extracted from a video with MKVToolNix.

EDIT: I think one of the issues is that the translation just ignores scenes below the specified threshold... but that is crazy? I don't know what combination of scene lengths or other settings to use to prevent the skipped lines. I also have no idea why I would get blank lines or "X untranslated" output, and the documentation and discussions on the repo don't describe them. I can't find anything on these errors, either. It just looks like running a separate AI translation of the Anime Whisper output always ends up dropping, missing, or ignoring a lot of lines that the translate option in SubtitleEdit does translate properly.

My intended flow is to supplement the Anime Whisper transcriptions with the extra scene logic and prompts that WhisperJAV offers.
 
I tried to plot the trade-off between the balanced and fidelity pipelines here. The actual trade-off depends very much on the characteristics of the movie. The plot is based on the average test cases that I use, which are by no means exhaustive or extensive :)

(attached image: 1775854036240.png)

The magic word is trade-off :)

(attached image: 1775854257041.png)
 
I don't have much experience with subbing Attackers. In general, I'd say:

- Auditok + TEN, if the movie alternates between loud and quiet passages (and has no constant background noise).
- Semantic + Silero v3.1, if the dialog is low volume or if the background noise ratio is high.

For AI translation I would use the standard profile.
 
I appreciate your work! I think the transcriptions are working, it is specifically the AI translation part that is constantly failing. The intermediate transcripts look good. Attackers movies often start with serious drama and plot and then the eventual sex scenes often have dialogue interspersed with what models often just dismiss as sex noises and moaning.

Do any of the Cloud AI translations not need API keys? I am trying local Ollama, but getting the same issues with a variety of models. Now trying gemma3:4b, as it is smaller and runs a bit faster, so I can see the errors sooner. I will try to update Python, I guess; I am getting a warning, but not an error, for using 3.10. Maybe the formatting requirements are too demanding for the local models to match reliably without failing. Any suggestions on which models are the most stable would be helpful. There is some discussion here that has been helpful to me, but it seems inconclusive: https://github.com/meizhong986/WhisperJAV/issues/271
Maybe upping the number of retries will work.

My logs:

[TRANSLATE] If using local LLM, watch for timeout errors (consider smaller batch size).
Resuming translation
Translating 864 lines in 1 scenes
Found fuzzy match for line 1 in translations
Found fuzzy match for line 2 in translations
Found fuzzy match for line 3 in translations
Found fuzzy match for line 4 in translations
Found fuzzy match for line 5 in translations
Found fuzzy match for line 6 in translations
Found fuzzy match for line 7 in translations
Found fuzzy match for line 8 in translations
Found fuzzy match for line 9 in translations
Found fuzzy match for line 10 in translations
Scene 1 batch 1: 0 lines and 0 untranslated.
Summary: Introduction of the characters and the setting – Sawamura, Takagi, and Miyuki, with the establishment of the publishing house.
WARNING:root:Scene 1 batch 1 failed validation, requesting retranslation
Found fuzzy match for line 1 in translations
Found fuzzy match for line 2 in translations
Found fuzzy match for line 3 in translations
Found fuzzy match for line 4 in translations
Found fuzzy match for line 5 in translations
Found fuzzy match for line 6 in translations
Found fuzzy match for line 7 in translations
Found fuzzy match for line 8 in translations
Found fuzzy match for line 9 in translations
Found fuzzy match for line 10 in translations
Scene 1 batch 1: 0 lines and 0 untranslated.
Retry failed validation: No translation found for 10 lines
Errors encountered translating scene 1 batch 1
Scene 1 batch 2: 10 lines and 0 untranslated.
Summary: Sawamura, Takagi, and Miyuki discuss their father’s past, including his career in construction and current hospitalization, reassuring each other about the new apartment they’ve rented.
Scene 1 batch 3: 10 lines and 0 untranslated.
Summary: Sawamura, Takagi, and Miyuki discuss their father’s hospitalization and financial difficulties, hinting at unresolved debts and a strained relationship.
Scene 1 batch 4: 10 lines and 0 untranslated.
Scene 1 batch 5: 10 lines and 0 untranslated.
Summary: The characters are discussing Satou, with Satsuki expressing appreciation for Miyuki’s articles and requesting her to work at her place.
Summary was truncated from 162 to 145 characters
Scene 1 batch 6: 10 lines and 0 untranslated.
Scene 1 batch 7: 10 lines and 0 untranslated.
Scene 1 batch 8: 10 lines and 0 untranslated.
Summary: Takagi is expressing her desire for intimacy and hinting at a potential affair, while also inquiring about Sawamura's relationship with Minami-kun.
WARNING:root:1 lines were empty and were not written to the output file
Scene 1 batch 9: 8 lines and 0 untranslated.
Summary: Takagi expresses satisfaction with the progress of settling the situation, subtly hinting at his desire to reward Sawamura for his efforts and perhaps more.
WARNING:root:1 lines were empty and were not written to the output file
Scene 1 batch 10: 10 lines and 0 untranslated.
WARNING:root:1 lines were empty and were not written to the output file
Scene 1 batch 11: 11 lines and 0 untranslated.
WARNING:root:1 lines were empty and were not written to the output file
Unable to match 0 lines with a source line
Scene 1 batch 12: 10 lines and 7 untranslated.
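
The failure pattern in the log above (a batch comes back with fewer lines than were sent, fails validation, and then fails again on retry) can be sketched roughly as follows. This is only my guess at the general shape of such a check, not WhisperJAV's actual code; `validate_batch`, `translate_with_retries`, and `max_retries` are hypothetical names.

```python
def validate_batch(source_lines, translated_lines):
    """Return a list of problems; an empty list means the batch passed."""
    problems = []
    if len(translated_lines) != len(source_lines):
        problems.append(
            f"expected {len(source_lines)} lines, got {len(translated_lines)}"
        )
    empty = sum(1 for line in translated_lines if not line.strip())
    if empty:
        problems.append(f"{empty} empty lines")
    return problems


def translate_with_retries(source_lines, translate, max_retries=3):
    """Call translate(source_lines) until the result validates or retries run out.

    `translate` is any callable returning a list of strings, for example a
    wrapper around a local LLM. Returns (lines, problems).
    """
    problems = ["not attempted"]
    translated = []
    for _ in range(max_retries + 1):
        translated = translate(source_lines)
        problems = validate_batch(source_lines, translated)
        if not problems:
            break
    return translated, problems
```

If small local models keep breaking the expected output format, more retries only help when the failures are random. A model that fails the same way every time would more likely need smaller batches or a different model.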
 
Just loading up previously working .srt files and trying to use Ollama is not working any better, so maybe the errors are just an issue with the GUI/Ollama interaction that appears to be new. I can use the translation functionality just fine, but I was hoping to leverage the instructions to the translation LLMs to better guide the plot/tone of the plot-heavy movies.
 