I don't know exactly how those settings work, but what we currently call AI doesn't think at all. It makes educated guesses based on associations it has seen over and over again (when I see this, that is the expected result). So if the data it's trained on has issues, those issues are passed on to your results. It doesn't know it's just guessing, because for the model, what it has learned is all there is.
Since most YouTube videos have that sentence or something similar at the end, the AI learns that this is the normal thing to say at the end of something. And if you use a VAD (voice activity detection), the audio is all split into chunks, so you end up with a lot of ends where that sentence can show up.
The models are getting better, but it's still happening.
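If you want a workaround in the meantime, one option is to filter the text of each chunk after transcription. Here is a rough Python sketch of that idea, not something any particular tool ships. The phrase list and the function names (`CLOSING_PHRASES`, `strip_closing_phrase`, `clean_chunks`) are made up for illustration, and I'm assuming you already have one transcript string per VAD chunk.

```python
import re

# Hypothetical phrase list: adjust it to whatever your model keeps
# hallucinating. These entries are examples, not taken from any tool.
CLOSING_PHRASES = [
    "thanks for watching",
    "thank you for watching",
    "please subscribe",
]


def _normalize(text):
    """Lowercase and drop punctuation so comparisons are forgiving."""
    return re.sub(r"[^a-z0-9 ]+", "", text.lower()).strip()


def strip_closing_phrase(chunk_text):
    """Remove trailing sentences that start with a known closing phrase."""
    # Split into sentences on ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", chunk_text.strip())
    # Pop sentences off the end for as long as they match a phrase.
    while sentences and any(
        _normalize(sentences[-1]).startswith(phrase)
        for phrase in CLOSING_PHRASES
    ):
        sentences.pop()
    return " ".join(sentences)


def clean_chunks(chunk_texts):
    """Apply the filter to every chunk's transcript (one string per chunk)."""
    return [strip_closing_phrase(text) for text in chunk_texts]
```

Keep in mind this also removes the sentence when the speaker really did say it, so you may only want to apply it to chunks in the middle of the recording and leave the final one alone.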