What is gradient descent, and how is it used to optimize deep learning models? How does gradient descent update weights to minimize the loss function? What are the main variants, such as batch, stochastic, and mini-batch gradient descent, and how do they differ? What challenges, such as local minima and learning-rate selection, arise when using gradient descent, and how can they be addressed to improve model performance?
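For concreteness, the update rule and the batch/stochastic/mini-batch distinction being asked about can be sketched in a few lines. This is a minimal illustration assuming a simple linear least-squares loss (not a deep network); the function name and parameters are made up for the example:

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, epochs=200, batch_size=None, seed=0):
    """Minimize the loss 0.5 * mean((X @ w - y)**2) by gradient descent.

    batch_size=None       -> batch GD (full dataset per step)
    batch_size=1          -> stochastic GD (one sample per step)
    1 < batch_size < n    -> mini-batch GD
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)                      # initial weights
    bs = n if batch_size is None else batch_size
    for _ in range(epochs):
        idx = rng.permutation(n)         # shuffle each epoch
        for start in range(0, n, bs):
            batch = idx[start:start + bs]
            Xb, yb = X[batch], y[batch]
            # gradient of 0.5 * mean((Xb @ w - yb)**2) with respect to w
            grad = Xb.T @ (Xb @ w - yb) / len(batch)
            w -= lr * grad               # the update: w <- w - lr * grad
    return w
```

Varying `batch_size` trades gradient accuracy against per-step cost: the full batch gives the exact gradient but is expensive per step, while stochastic and mini-batch updates are noisier but cheaper, and the noise can help escape shallow local minima. The learning rate `lr` controls step size; too large diverges, too small converges slowly.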