What is Stochastic Gradient Descent (SGD), and how does it differ from traditional (full-batch) gradient descent?
How does SGD update model parameters using individual data samples?
What are the advantages of using SGD in large-scale machine learning problems?
What challenges, such as gradient noise and convergence issues, arise with SGD?
How can these challenges be managed to improve model performance?
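The per-sample update at the heart of these questions can be sketched in a few lines. This is a minimal illustration, not a production implementation: it fits a 1-D linear model y ≈ w·x + b by stochastic gradient descent, where each step uses the gradient of the squared error on a single sample. The function name, data, and hyperparameters (learning rate, epoch count) are hypothetical choices for illustration.

```python
import random

def sgd_linear_fit(data, lr=0.01, epochs=200, seed=0):
    """Fit y ~ w*x + b by per-sample SGD (illustrative sketch)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)            # revisit samples in a fresh random order
        for x, y in data:
            err = (w * x + b) - y    # prediction error on ONE sample
            # Gradient of 0.5 * err**2 w.r.t. w and b; step downhill.
            w -= lr * err * x
            b -= lr * err
    return w, b

# Noise-free data from y = 2x + 1, so the recovered parameters are easy to check.
data = [(k / 10, 2 * (k / 10) + 1) for k in range(-10, 11)]
w, b = sgd_linear_fit(list(data))
```

Because each step sees only one sample, the trajectory is noisy; common mitigations include mini-batches (averaging gradients over a few samples), a decaying learning rate, and momentum, any of which can be layered onto this loop.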