I can't speak for the Colab version since I haven't looked at its code much, but I know it splits the audio into many parts, so it has to modify it.
That said, Whisper itself internally converts the audio to 16 kHz mono wav, so no matter what you feed it, it ends up in that format, and since wav...
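
For anyone who wants to see what that conversion amounts to, here's a rough sketch of doing the same thing yourself with ffmpeg from Python. To be clear, this is my own illustration and not Whisper's actual code: the function names (`ffmpeg_16k_mono_cmd`, `pcm16_to_floats`, `load_16k_mono`) are made up, and it assumes `ffmpeg` is on your PATH. The ffmpeg flags themselves (`-ac 1` for mono, `-ar 16000` for the sample rate, `-f s16le` for raw 16-bit samples) are standard ones.

```python
import subprocess
import sys
from array import array

SAMPLE_RATE = 16000  # 16 kHz, the rate mentioned above


def ffmpeg_16k_mono_cmd(path, sample_rate=SAMPLE_RATE):
    """Build an ffmpeg command that decodes `path` to raw mono 16-bit PCM on stdout."""
    return [
        "ffmpeg", "-nostdin",
        "-i", path,
        "-f", "s16le",           # raw signed 16-bit little-endian samples
        "-acodec", "pcm_s16le",
        "-ac", "1",              # downmix to a single channel
        "-ar", str(sample_rate), # resample to the target rate
        "-",                     # write the result to stdout
    ]


def pcm16_to_floats(raw):
    """Convert raw little-endian 16-bit PCM bytes to floats in [-1.0, 1.0)."""
    samples = array("h")
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # the input is little-endian regardless of host
    return [s / 32768.0 for s in samples]


def load_16k_mono(path):
    """Hypothetical helper: decode any input file to 16 kHz mono float samples."""
    result = subprocess.run(ffmpeg_16k_mono_cmd(path), capture_output=True, check=True)
    return pcm16_to_floats(result.stdout)
```

Whatever the input file is (stereo, 44.1 kHz, mp3, and so on), `load_16k_mono("input.mp3")` hands back the same kind of 16 kHz mono samples, which is the point made above.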