What is mini-batch gradient descent, and how does it differ from batch and stochastic gradient descent?
How are mini-batches created and used to update model parameters?
What are the advantages of using mini-batch gradient descent when training large-scale machine learning models?
What challenges, such as choosing the batch size and convergence issues, can arise with mini-batch gradient descent?
How does mini-batch gradient descent affect model performance and training efficiency?
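To ground the questions above, here is a minimal sketch of mini-batch gradient descent on a toy one-dimensional linear-regression problem (the data, learning rate, and batch size are illustrative assumptions, not prescribed values). It shows how mini-batches are formed by shuffling and slicing the dataset, and how each batch's gradient drives one parameter update:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed for illustration): y = 3*x + noise.
X = rng.normal(size=(1000,))
y = 3.0 * X + 0.1 * rng.normal(size=1000)

w = 0.0          # single weight to learn
lr = 0.1         # learning rate
batch_size = 32  # a key hyperparameter; 1 would be stochastic GD, len(X) would be batch GD

for epoch in range(20):
    # Shuffle once per epoch so each mini-batch is a fresh random sample.
    perm = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        # Gradient of mean squared error computed on this mini-batch only.
        grad = 2.0 * np.mean((w * xb - yb) * xb)
        # One parameter update per mini-batch, not per epoch.
        w -= lr * grad
```

After training, `w` should be close to the true slope of 3.0; setting `batch_size = 1` or `batch_size = len(X)` in the same loop recovers stochastic and full-batch gradient descent, respectively, which is one way to compare their convergence behavior.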