Mind you that the gradient is the vector of first derivatives of a function. The matrix of second derivatives, called the Hessian, is a whole other beast that provides further information about the max./min. points of a function. When dealing
with vectors and matrices, Calculus overlaps with Linear Algebra, the field that studies these objects.
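To make the gradient and Hessian concrete, here is a minimal sketch that approximates both numerically with finite differences (the function `f`, the step sizes, and the helper names are illustrative choices, not from any particular library):

```python
import numpy as np

def grad(f, x, h=1e-5):
    """Approximate the gradient (vector of first derivatives) of f at x
    using central differences."""
    g = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        e = np.zeros_like(x, dtype=float)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def hessian(f, x, h=1e-4):
    """Approximate the Hessian (matrix of second derivatives) of f at x."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return H

# Example: f(x, y) = x^2 + 3y^2 has gradient (2x, 6y) and Hessian diag(2, 6).
f = lambda v: v[0] ** 2 + 3 * v[1] ** 2
```

For this function the Hessian is positive definite everywhere, which is exactly the kind of extra information it gives us: the critical point at the origin is a minimum, not a maximum or a saddle point.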
Complex statistical models that lack closed-form solutions (read: nice, explicit formulas) generally require numerical methods to solve them, and by solving we mean finding the optimal parameters that minimize the errors
of that model (read: best fit).
Derivatives provide a linear approximation of a function, and that approximation is what makes these numerical methods computationally feasible! This is very desirable, as we do not want to spend days on end waiting for our computer to fit our statistical
models, or do we?
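The workhorse of this idea is gradient descent: repeatedly step in the direction the derivative says is downhill. A minimal sketch, fitting a toy line y ≈ a·x + b by minimizing the mean squared error (the data, learning rate, and iteration count are made up for illustration):

```python
import numpy as np

# Toy data roughly following y = 2x + 1 with a little noise.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

def mse_grad(a, b):
    """Gradient of the mean squared error of y ≈ a*x + b w.r.t. (a, b)."""
    r = a * x + b - y                 # residuals
    return np.array([2 * np.mean(r * x), 2 * np.mean(r)])

a, b = 0.0, 0.0                       # arbitrary starting guess
lr = 0.05                             # step size, chosen by hand here
for _ in range(5000):
    ga, gb = mse_grad(a, b)
    a -= lr * ga                      # step downhill along the gradient
    b -= lr * gb
```

Each iteration only needs the gradient, i.e. the local linear approximation, which is why the method stays cheap even when the model has no closed-form solution.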
But like everything in life, things get complicated down the road. Firstly, finding the global max./min. is not as straightforward as it seems. Since our algorithms look for points where the derivative equals 0, they might mistake
a local max./min. for the global max./min. (think of a function with several hills, one bigger than the others). Another issue for statistical models is how we define the "error" that we wish to minimize. What will we consider
an "error"? What metric will we use to measure it? (Remember, a metric is a mathematical function that defines the distance between two points. As long as it satisfies three important conditions, namely identity, symmetry, and the triangle inequality, we may define a multitude
of metrics suited to our problem.)
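As a small sanity check of those three conditions, here is a sketch that spot-checks two familiar distance functions on a handful of points (the helper names and the sample points are illustrative, and checking finitely many points can only refute the axioms, not prove them):

```python
import numpy as np

def euclidean(p, q):
    """Straight-line distance between points p and q."""
    return float(np.sqrt(np.sum((np.asarray(p) - np.asarray(q)) ** 2)))

def manhattan(p, q):
    """Sum of absolute coordinate differences between p and q."""
    return float(np.sum(np.abs(np.asarray(p) - np.asarray(q))))

def satisfies_metric_axioms(d, points, tol=1e-12):
    """Spot-check the three metric axioms on a finite set of points:
    identity (d(p, q) = 0 exactly when p = q, and never negative),
    symmetry (d(p, q) = d(q, p)), and the triangle inequality."""
    for p in points:
        for q in points:
            if d(p, q) < 0 or (d(p, q) == 0) != (list(p) == list(q)):
                return False
            if abs(d(p, q) - d(q, p)) > tol:
                return False
            for r in points:
                if d(p, r) > d(p, q) + d(q, r) + tol:
                    return False
    return True

pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 3.0)]
```

Both functions pass on these points, yet they measure distance differently, which is the whole point: the metric we pick shapes which model ends up counting as the "best fit".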