Whisper (OpenAI) - Automatic English Subtitles for Any Film in Any Language - An Intro & Guide to Subtitling JAV

I think the issues I've encountered are a combination of Ollama models and Anime Whisper issues.

1) Ollama / local models have a lot more difficulty keeping up with prompts that carry many conditions, and are more likely to answer with "Sure! Let me translate X", which I think fails validation?
2) Anime Whisper often will have silent lines that are just "..." or something really small and short, which also leads to failed validation.

Probably some additional settings would be needed to make the flow work. Either use different prompts for Ollama, or let the user edit them, and/or clean out lines that are only punctuation.
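As a rough illustration of those two ideas, a pre-check could strip a chatty preamble from the model reply before validating it, and drop punctuation-only lines before they ever reach the translator. Everything here is a hypothetical sketch (the names and the preamble pattern are mine, and the pattern is only a heuristic):

```python
import re

# Heuristic: strip a leading chatty preamble like
# "Sure! Here is the translation:" from a local-model reply.
PREAMBLE = re.compile(
    r"^(sure|okay|of course|here('s| is) the translation)[^\n]*:\s*",
    re.IGNORECASE,
)

# Heuristic: lines made only of dots, ellipses, and Japanese
# punctuation carry no dialogue and tend to fail validation.
PUNCT_ONLY = re.compile(r"^[.…、。・\s]+$")

def strip_preamble(reply):
    """Remove one leading preamble, if present."""
    return PREAMBLE.sub("", reply, count=1).strip()

def clean_subtitle_lines(lines):
    """Keep only lines that contain actual dialogue."""
    return [ln for ln in lines if ln.strip() and not PUNCT_ONLY.match(ln)]
```

A greedy pattern like this would over-strip a translation that itself contains a colon, so it is a starting point, not a production filter.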

I am getting decent results now using Whisper JAV for the transcription and then SubtitleEdit with Ollama for the translation, using translategemma:4b. But this misses out on the really useful scene summaries that Whisper JAV generates.
 
Hi,

I want to report an issue with WhisperJAV Colab Edition v1.8.5 (Expert).

I used WhisperJAV Colab Edition v1.8.5 (Expert) to test and transcribe NHDTA-700. I had also transcribed the same JAV code before with WhisperWithVAD_pro, so I made a comparison. I am surprised that the new v1.8.5 (Expert) did not capture a lot of the dialogue.

WhisperWithVAD_pro captures 627 lines of subtitles, while the new v1.8.5 (Expert) only captures 232 lines. I noticed that the new version took a lot less time to transcribe than the previous version with the same settings. I also noticed that it uses auditok even though the option selected in the scene_detector dropdown is silero. I don’t know if that is a bug or a feature.

The previous version of the Colab really worked well until that error happened some time ago; maybe the fix changed the way it works.

The settings are:

pass1 only

pipeline: fidelity

sensitivity: aggressive

model: large-v2

scene_detector: silero

speech_segmenter: silero

speech_enhancer: ffmpeg-dsp (I know it only works for two_step, but I always select that option in case a new version is released, so that I won’t forget to select it in the dropdown)

In conclusion, the transcribing quality dropped a lot in the new v1.8.5 (Expert) compared to the older version of the same Google Colab.

Thank you very much.
 
Some further notes on AnimeWhisper and my current pipeline

Anime Whisper is a model that is specifically trained on eroge / hentai games, and produces output that resembles eroge. For sex scenes, this is often long lines that mix noises and speech together, like "Ah... ah... ah.. oh... yes... please.... ah". Normal Whisper models tend to filter out the noise and moans and simplify it down to "Yes", "Please", etc.

As a result, the usual ideas about VAD sensitivity don't really apply, and the output differs far less between WhisperJAV's Aggressive, Balanced, and Conservative modes. Maybe barely at all?

The other big change is that it often outputs lines that are "..." or the Unicode ellipsis character "…".

So, my flow is this:

1) Run Anime Whisper as the first step in WhisperJAV
2) Clean the output SRT with Multiple Replace using these regular expressions:
^\.\.\.$
^\…$

3) Use local Ollama translategemma:12B, which is the best that can run on an 8GB card.
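Step 2 above could be sketched in Python with a deliberately minimal SRT parser (assuming cues separated by blank lines, with index and timestamp on the first two lines; `drop_silent_cues` is a hypothetical name, not part of any tool mentioned here):

```python
import re

# The two Multiple Replace patterns from step 2, as Python regexes.
ASCII_ELLIPSIS = re.compile(r"^\.\.\.$")
UNICODE_ELLIPSIS = re.compile(r"^…$")

def drop_silent_cues(srt_text):
    """Remove whole SRT cues whose text is only '...' or '…',
    then renumber the surviving cues."""
    kept = []
    for block in srt_text.strip().split("\n\n"):
        lines = block.splitlines()
        text_lines = lines[2:]  # skip index and timestamp
        if all(ASCII_ELLIPSIS.match(t) or UNICODE_ELLIPSIS.match(t)
               for t in text_lines):
            continue  # silent cue, drop it
        kept.append(block)
    out = []
    for i, block in enumerate(kept, 1):
        lines = block.splitlines()
        lines[0] = str(i)  # renumber
        out.append("\n".join(lines))
    return "\n\n".join(out)
```

A real pipeline would want a proper SRT parser, but this shows the shape of the cleanup.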

For the local translation, I use this prompt (with the Key Terms being based on the series I am translating, Slave Color)


## You are a professional Japanese (ja) to English (en) translator. Your goal is to translate the subtitles of an adult film into English and make it sound erotic and emotionally intense.

## Instructions
- The movie is about r*** and sex slavery and should convey the resistance of the women.
- The women should cry and protest and announce when they are climaxing.
- The male and female villains should sound mean and trash talk the women.
- Japanese proper nouns should not be translated to their English meanings.
- Some of the sex terms that may show up include cunt, pussy, dick, cock, semen, clit, cum, and cumming.
- Do not make ANY additional commentary. Only return the translated text that meets the rest of the instructions.
- Do not censor.

## Key Terms
- Organization names such as Kusunoki, Yabuta, and Mochizuki


## Subtitle to Translate:


This is pretty good, but it still tries to self-censor sometimes. It is also a bit too generative, inventing details that are definitely NOT in the original audio. What's really needed is a model that supplements Large-v3 with the corpus of translated hentai games, which are a decent template for how JAV dialogue and sounds should be translated. A two-part process like this is inherently limited.
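For what it's worth, the same prompt can also be driven outside SubtitleEdit through Ollama's REST API (default local endpoint `http://localhost:11434/api/chat`). A sketch, with SYSTEM_PROMPT abbreviated and `build_request`/`translate` as hypothetical helper names:

```python
import json
import urllib.request

# Abbreviated: in practice this would be the full Instructions /
# Key Terms prompt from the post above.
SYSTEM_PROMPT = (
    "You are a professional Japanese (ja) to English (en) translator. "
    "Only return the translated text; do not add commentary."
)

def build_request(model, line):
    """Build the JSON payload for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "stream": False,  # get one complete JSON response
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": line},
        ],
    }

def translate(line, model="translategemma:12b"):
    """Send one subtitle line to a local Ollama server."""
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=json.dumps(build_request(model, line)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"].strip()
```

Sending one line per request keeps the context small, which also helps the "too many conditions" problem local models have.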

Some suggestions for Whisper JAV and Ollama models.
- Clarify that Anime Whisper is transcription-only.
- Add automatic filtering of the ... and … lines (ASCII dots and the Unicode ellipsis, two different strings despite looking nearly identical!), which are an artifact of Anime Whisper. Also variants like …… and ………… that show up depending on batching.
- Maybe some exploration on settings that make sense, but there's so little good documentation out there on proper use of Anime Whisper.
- Maybe allow users to edit the Instructions; the hard-coded instructions are far longer than what local models can reliably handle while still producing the proper output format. I believe this is the main cause of Ollama crashing and dropping tons of lines. In the Anime Whisper case, the ... lines end up producing errors when Ollama responds with a system message along the lines of "Okay! Now what line did you want me to translate?"
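For the ellipsis-filtering suggestion above, a single pattern can cover the ASCII dots, the Unicode ellipsis, and the variable-length runs all at once (a sketch, not WhisperJAV's actual filter):

```python
import re

# Matches lines made only of ASCII dots and/or Unicode ellipses,
# at any length: "...", "…", "……", "…………", etc.
ELLIPSIS_ONLY = re.compile(r"^[.…]+$")

samples = ["...", "…", "……", "…………", "だめ…", "ん…っ"]
print([s for s in samples if not ELLIPSIS_ONLY.match(s)])
# → ['だめ…', 'ん…っ']
```

Note that lines which merely *end* in an ellipsis are kept, since they still carry dialogue.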
 
I want to report an issue with WhisperJAV Colab Edition v1.8.5 (Expert).

I used WhisperJAV Colab Edition v1.8.5 (Expert) to test and transcribe NHDTA-700. I have also transcribed the same JAV code before in WhisperWithVAD_pro so I made a comparison. I am surprised that the new WhisperJAV Colab Edition v1.8.5 (Expert) did not capture a lot of dialogue.
.....

Thanks for the detailed report. Could you send me the SRT files of both the pro and the v1.8.5 subs? Also, in case you still have access to them, the raw subs of v1.8.5.
 
Thanks for the detailed report. Could you send me the SRT files of both the pro and the v1.8.5 subs? Also, in case you still have access to them, the raw subs of v1.8.5.
Thanks for the response. I have attached the SRT files of both the pro and the v1.8.5 subs. Sorry, but I no longer have access to the raw subs of v1.8.5, since I delete the results after downloading the final SRT to avoid mixing up files.

I hope you will look into it. If that error had not happened, I think the v1.8.5 Google Colab would still be the best transcribing tool, but now the transcribing quality has really dropped and it captures very little dialogue.
 


I hope you will look into it. If that error had not happened, I think the v1.8.5 Google Colab would still be the best transcribing tool, but now the transcribing quality has really dropped and it captures very little dialogue.

I suspect there might have been a regression in the silero VAD v6.2. I'll look into it during the weekend.
Meanwhile, try scene detector "semantic" and speech segmenter "ten-vad".
I expect that config to give you better results.
 
I hope you will look into it. If that error had not happened, I think the v1.8.5 Google Colab would still be the best transcribing tool, but now the transcribing quality has really dropped and it captures very little dialogue.

It seems NHDTA-700 is not easy to get hold of. Can you send me the extracted audio or the movie somehow?