What is a Vision Transformer (ViT) and how is it used in computer vision tasks?
How does the Vision Transformer architecture differ from traditional convolutional neural networks?
What role do image patches and self-attention mechanisms play in ViT models?
What are the advantages of using Vision Transformers for image classification and recognition?
What are the challenges and limitations of Vision Transformers in deep learning applications?
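
For the question about image patches and self-attention, a toy sketch may help make the two ideas concrete. It is not a real ViT: it assumes a tiny single-channel 4x4 image, a patch size of 2, and identity query/key/value projections, and it leaves out the learned linear patch embedding, the class token, position embeddings, multiple heads, and the MLP blocks that an actual model would use. The names `split_into_patches` and `self_attention` are illustrative only and do not come from any library.

```python
import math


def split_into_patches(image, patch):
    """Cut an H x W grid (list of rows) into flattened, non-overlapping
    patch x patch tiles, returned in row-major order."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption for brevity: queries, keys, and values are the tokens
    themselves (identity projections); a trained model learns these maps.
    Returns the attended tokens and the attention weight matrix.
    """
    d = len(tokens[0])
    outputs, all_weights = [], []
    for q in tokens:
        # Similarity of this token to every token, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        weights = softmax(scores)
        all_weights.append(weights)
        # Each output is a weighted average of all value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, tokens))
                        for j in range(d)])
    return outputs, all_weights


# Toy 4x4 "image" with pixel values in [0, 1).
image = [[(r * 4 + c) / 16 for c in range(4)] for r in range(4)]
patches = split_into_patches(image, 2)           # 4 tokens, 4 values each
attended, weights = self_attention(patches)
```

Each patch acts as one token, and every row of `weights` describes how strongly that patch attends to all patches, including distant ones. This global mixing within a single layer is the main structural contrast with a convolution, which only combines pixels inside a local window.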