What is the Adam optimizer and how does it work in deep learning model training? How does Adam combine the advantages of momentum and adaptive learning rates? What are the key parameters used in the Adam optimization algorithm? Why is Adam widely used compared to other optimizers like SGD and RMSprop? What are the benefits and limitations of using the Adam optimizer in neural network training?
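As a point of reference for these questions, here is a minimal sketch of the Adam update rule in plain Python for a single scalar parameter. It shows how the first-moment estimate (the momentum-like part) and the second-moment estimate (the adaptive, RMSprop-like scaling) are combined, and where the usual hyperparameters (learning rate, beta1, beta2, epsilon) enter. The function name `adam_minimize` and the toy quadratic objective are illustrative assumptions, not part of any library; the default values are the commonly cited ones.

```python
import math


def adam_minimize(grad, x0, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Minimize a scalar function, given its gradient, with the Adam update rule.

    Hypothetical helper for illustration only; real frameworks apply the same
    update element-wise to every parameter tensor.
    """
    x = x0
    m = 0.0  # first moment: exponential moving average of gradients (momentum)
    v = 0.0  # second moment: exponential moving average of squared gradients
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Bias correction: m and v start at zero, so early estimates are too small.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # Step scaled per parameter by the root of the second-moment estimate.
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


# Toy objective (an assumption for the demo): f(x) = (x - 3)^2, gradient 2 * (x - 3).
x_min = adam_minimize(lambda x: 2 * (x - 3), x0=0.0, lr=0.1, steps=1000)
```

In this sketch the very first step has magnitude close to `lr` regardless of how large the gradient is, because the bias-corrected ratio `m_hat / sqrt(v_hat)` is roughly the sign of the gradient. That scale invariance is one reason Adam tends to need less learning-rate tuning than plain SGD, though it does not by itself settle the generalization trade-offs the last question asks about.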