Intriguing. I would say it is technically possible but practically impossible. I have tried using LTX desktop with a 16 GB Nvidia GPU. You start with an image, give the model a prompt, and it generates a video from that point. For 1080p the maximum length is 5 seconds.
For my test, I started with this image.
and prompted for the man to slap her arse a couple of times whilst continuing to **** her.
Problem 1 - the model pretty quickly deleted the gentleman's ding dong, rendering him incapable of any further action.
A further problem is that the original image does not contain the man's face, so it is hallucinated...
I guess what this means in practical terms is that every 5-second snippet will have to start from an image that shows the participants' faces clearly.
On a side note, it is VERY difficult (if not impossible) to get the model to follow the prompt. In the example above, he did not once slap her arse.
I have also been working on this image ...
trying to get the man to grab the woman's right ankle. I have tried more than 20 prompts so far with no success. At low resolutions, clip generation takes about 20-30 seconds. If you get a good result and want to improve the resolution, you need to use another tool, as it is apparently impossible to get LTX to generate exactly the same video even with the same source and the same prompt.
Basically, the statistical bias of the training data will outweigh the prompt almost every time. Because the model has presumably seen few or no examples of a man grabbing a woman's ankle, all such instructions fail. Instead, disappointingly, there is a lot of unprompted neck grabbing.
Overall, the short-term future does not offer much promise.