What is Adagrad and how does it work as an optimization algorithm in deep learning? How does Adagrad adapt learning rates for different parameters during training? What problems does Adagrad solve compared to standard gradient descent methods? How does Adagrad differ from optimizers like RMSprop and Adam? What are the advantages and limitations of using Adagrad in neural network training?
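As a concrete reference point for these questions, here is a minimal sketch of the standard Adagrad update rule, not a definitive implementation. It assumes parameters held in plain Python lists. The names `adagrad_step`, `accum`, `lr`, `eps`, and the toy objective are illustrative assumptions and do not come from any library.

```python
import math

def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    """One Adagrad update on plain Python lists; returns (new_params, new_accum)."""
    new_params, new_accum = [], []
    for p, g, a in zip(params, grads, accum):
        a = a + g * g                     # per-parameter running sum of squared gradients
        step = lr / (math.sqrt(a) + eps)  # effective learning rate shrinks as the sum grows
        new_params.append(p - step * g)
        new_accum.append(a)
    return new_params, new_accum

# Hypothetical toy objective: f(x, y) = 10*x**2 + 0.1*y**2,
# chosen so the two coordinates have very different gradient scales.
def grad(params):
    x, y = params
    return [20.0 * x, 0.2 * y]

params, accum = [1.0, 1.0], [0.0, 0.0]
for _ in range(100):
    params, accum = adagrad_step(params, grad(params), accum)
```

Because each gradient is divided by the root of its own accumulated squares, the two coordinates follow nearly the same trajectory even though their gradients differ by a factor of 100. The accumulator never decreases, so the effective step size only shrinks over training. RMSprop and Adam replace this running sum with an exponentially decaying average.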