I'm reading a book on deep learning and I'm a bit confused about one of the ideas the author mentioned. This is from the book Deep Learning with Python by Francois Chollet:

A gradient is the derivative of a tensor operation. It’s the generalization of the concept of derivatives to functions of multidimensional inputs: that is, to functions that take tensors as inputs.

Consider an input vector x, a matrix W, a target y, and a loss function loss. You can use W to compute a target candidate y_pred, and compute the loss, or mismatch, between the target candidate y_pred and the target y:

y_pred = dot(W, x)

loss_value = loss(y_pred, y)
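To make the passage concrete, here is a minimal sketch of those two steps in NumPy (my own, not code from the book), assuming a mean squared error as the loss function and small arbitrary shapes:

import numpy as np

np.random.seed(0)
W = np.random.randn(3, 4)        # weight matrix, shape (3, 4)
x = np.random.randn(4)           # input vector, shape (4,)
y = np.random.randn(3)           # target vector, shape (3,)

def loss(y_pred, y):
    # mean squared error between the prediction and the target
    return np.mean((y_pred - y) ** 2)

y_pred = np.dot(W, x)            # target candidate, shape (3,)
loss_value = loss(y_pred, y)     # a single scalar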

If the data inputs x and y are frozen, then this can be interpreted as a function mapping values of W to loss values:

loss_value = f(W)

Let’s say the current value of W is W0. Then the derivative of f at the point W0 is a tensor gradient(f)(W0) with the same shape as W, where each coefficient gradient(f)(W0)[i,j] indicates the direction and magnitude of the change in loss_value you observe when modifying W0[i,j]. That tensor gradient(f)(W0) is the gradient of the function f(W) = loss_value at W0.
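Continuing the sketch above (again my own, not the book's code), each coefficient of the gradient can be estimated numerically by nudging one entry of W and watching how loss_value reacts, which is exactly the "direction and magnitude of the change" the passage describes:

def f(W):
    # the loss viewed as a function of W alone, with x and y frozen
    return loss(np.dot(W, x), y)

def numerical_gradient(f, W0, eps=1e-6):
    # finite-difference estimate of gradient(f)(W0); same shape as W0
    grad = np.zeros_like(W0)
    for i in range(W0.shape[0]):
        for j in range(W0.shape[1]):
            W_nudged = W0.copy()
            W_nudged[i, j] += eps
            grad[i, j] = (f(W_nudged) - f(W0)) / eps
    return grad

W0 = W.copy()
grad_W0 = numerical_gradient(f, W0)   # same shape as W, one entry per W0[i, j]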

You saw earlier that the derivative of a function f(x) of a single coefficient can be interpreted as the slope of the curve of f. Likewise, gradient(f)(W0) can be interpreted as the tensor describing the curvature of f(W) around W0.

For this reason, in much the same way that, for a function f(x), you can reduce the value of f(x) by moving x a little in the opposite direction from the derivative, with a function f(W) of a tensor, you can reduce f(W) by moving W in the opposite direction from the gradient: for example, W1 = W0 - step * gradient(f)(W0) (where step is a small scaling factor). That means going against the curvature, which intuitively should put you lower on the curve. Note that the scaling factor step is needed because gradient(f)(W0) only approximates the curvature when you’re close to W0, so you don’t want to get too far from W0.
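Putting it together, the update the passage describes would look like this in the same sketch (step here is just a hypothetical small learning rate I picked, not a value from the book); a single step against the gradient should leave the loss slightly lower:

step = 0.01                                  # small scaling factor
W1 = W0 - step * numerical_gradient(f, W0)   # move W against the gradient
print(f(W0), f(W1))                          # f(W1) should come out slightly lower than f(W0)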

I don't understand why we subtract step * gradient(f)(W0) from the weights and not just step, since step * gradient(f)(W0) represents a loss value, while step is the parameter (i.e. the x value, i.e. a small change in the weights).
