It's been a long time since I posted. I recently discovered how good DeepSeek is, and I got excited. I spent ~30 hours on this Python script, which translates Chinese SRT files to English using DeepSeek V3. Why use it? The script translates in batches, so the model has the context of previous lines.

To get started, create an account on Fireworks AI, click on your profile picture and get your API key. It comes with $1 of free credit, and you can check usage under billing. $1 can translate at least 100 subtitles. I wouldn't top up, as Fireworks isn't the cheapest option. I plan to convert the script to use the official DeepSeek API, which is cheaper, but I'll wait for my ChatGPT o3 quota to refresh before coding again. I couldn't make more subs because I spent all my credit testing the same subs over and over, looking for differences whenever I changed the system prompt and temperature values.
Set your API key in PowerShell and restart the terminal so it takes effect. I put some instructions at the top of the script, but if you get stuck just ask AI. The script is ready to use, with the best defaults I found already set:

Temperature: 0.9 is the upper limit I found for explicit dirty language before responses start to ignore the rules and format. If translations don't line up with the timecodes, lower the temperature by 0.1 each run.
Top_p: 0.95 is good.
SYS_MSG: already good, but feel free to change the erotic instructions.
BATCH_SIZE_DEFAULT: 500 is a good upper limit. Too high and a batch will exceed the max tokens; too low and the script makes more API calls, which increases costs.

If you need to change the source language to Japanese, rewrite SYS_MSG and the example in it.
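For anyone who wants to see how these settings fit together, here is a rough sketch, not the actual script. The environment variable name FIREWORKS_API_KEY, the constant names TEMPERATURE and TOP_P, and the helpers get_api_key, make_batches and next_temperature are all made up for illustration; only BATCH_SIZE_DEFAULT and the values come from the post above.

```python
import os

# Values quoted in the post. TEMPERATURE and TOP_P are assumed names.
TEMPERATURE = 0.9
TOP_P = 0.95
BATCH_SIZE_DEFAULT = 500


def get_api_key(env_var="FIREWORKS_API_KEY"):
    """Read the API key from the environment (variable name is an assumption)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set; set it and restart the terminal")
    return key


def make_batches(lines, batch_size=BATCH_SIZE_DEFAULT):
    """Split subtitle lines into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [lines[i:i + batch_size] for i in range(0, len(lines), batch_size)]


def next_temperature(current, step=0.1, floor=0.0):
    """Lower the temperature by one step for the next run, never below floor."""
    return max(floor, round(current - step, 2))
```

With 1200 subtitle lines and the default batch size, make_batches gives three batches (500, 500, 200), so three API calls. A smaller batch size means more calls, which is where the extra cost comes from.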
I'll talk about the challenges and how the code works in case anyone wants to play with it. Fixing bugs caused by problematic responses, plus testing, took the majority of the time.

"SEQ_NUMBER\nORIGINAL_TEXT <eol>\n" is the format I send each line in for translation. I omit the timecode to save tokens. Before this I tried a <space> between the number and the text, but that structure wasn't robust and I got line merges all the time, so don't do this. The <eol> is a safety net that reinforces the structure.

When the temperature is too high, responses are less likely to follow the rules. Problematic responses can contain ">" after the number, the wrong number, an English translation that merges two Chinese lines into one (this is the biggest problem, because it causes timecode mismatches when the subtitle is reassembled), and other oddities. I accounted for nearly all of this in the code. The map_translated function parses the response and cleans it up. Suspiciously short lines and missing lines are sent back to the model for re-translation. If you need to fix bugs, the log files are useful for checking responses.

Things I wanted to test but got lazy about: different values of top_p between 0.96 and 1 combined with different temperatures, and running the code without inserting <eol>. If we truly don't need <eol>, we can save a few tokens per line, but I'm not 100% sure whether it helps prevent line merges. Anyway, you're free to do whatever you want with the code.
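To make the format concrete, here is a minimal sketch of building that payload and parsing a reply. It is not the script's map_translated function; build_payload, parse_response and missing_numbers are hypothetical names, and the cleanup only covers the stray ">" and missing-line cases mentioned above.

```python
import re

EOL = "<eol>"


def build_payload(entries):
    """entries: list of (seq_number, text). Timecodes are left out to save tokens."""
    return "".join(f"{seq}\n{text} {EOL}\n" for seq, text in entries)


def parse_response(response):
    """Return {seq_number: text}, tolerating a stray '>' after the number."""
    result = {}
    for chunk in response.split(EOL):
        chunk = chunk.strip()
        if not chunk:
            continue
        # First line should be the sequence number, the rest is the translation.
        head, _, body = chunk.partition("\n")
        match = re.fullmatch(r"(\d+)\s*>?", head.strip())
        if match and body.strip():
            result[int(match.group(1))] = " ".join(body.split())
    return result


def missing_numbers(entries, translated):
    """Sequence numbers that were sent but did not come back; candidates for a retry."""
    return [seq for seq, _ in entries if seq not in translated]
```

For example, if lines 1 and 2 are sent and the reply is "1>\nHello <eol>\n3\nGoodbye <eol>\n", parse_response returns entries for 1 and 3, and missing_numbers reports 2 as needing re-translation. Detecting merged lines takes more than this; a missing number is just the simplest signal that one happened.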
Subs of Mori Hinako (favorite actress), Akai Miki, ROE-168 (mother-son), MIAA-750 (female slutty boss, amazing loud plopping cowgirl), YUJ-031 (high energy, enthusiastic girl that kisses a lot) and more!