Screen Time Solution¶
# Initialize m and b
m = torch.tensor([0], requires_grad=True, dtype=torch.float32)
b = torch.tensor([0], requires_grad=True, dtype=torch.float32)

# Convert sleep and screentime to tensors
x = torch.from_numpy(screentime)
y = torch.from_numpy(sleep)

# Run 10K iterations
for i in range(10_000):

    # Calculate yhat
    yhat = m*x + b

    # Calculate the loss
    loss = torch.sqrt(torch.mean((yhat - y)**2))

    # "Reverse mode differentiation"
    loss.backward()

    # Update m and b (gradient step)
    with torch.no_grad():
        m -= 0.02*m.grad
        b -= 0.02*b.grad

    # Zero out the gradients
    m.grad = None
    b.grad = None

print(f"(m, b) = ({m.item()}, {b.item()})")
# (m, b) = (-0.755, 11.241)
Plot
fig, ax = plt.subplots(layout='tight')
ax.scatter(screentime, sleep)
xlims = np.array(ax.get_xlim())
yvals = m.item() * xlims + b.item()
ax.plot(xlims, yvals, c='tab:red', linewidth=3)
ax.set_xlabel('screentime')
ax.set_ylabel('sleep')
Explanation¶
We'll flesh out the algorithm through a sequence of improving drafts.
Draft 1

# Initialize m and b
m = torch.tensor([0], requires_grad=True, dtype=torch.float32)
b = torch.tensor([0], requires_grad=True, dtype=torch.float32)

# Convert sleep and screentime to tensors
x = torch.from_numpy(screentime)
y = torch.from_numpy(sleep)

# Calculate yhat
yhat = m*x + b

# Calculate the loss
loss = torch.sqrt(torch.mean((yhat - y)**2))

# "Reverse mode differentiation"
loss.backward()

# Update m and b (gradient step)
m = m - 0.02*m.grad
b = b - 0.02*b.grad

1. Initialize m and b.

m = torch.tensor([0], requires_grad=True, dtype=torch.float32)
b = torch.tensor([0], requires_grad=True, dtype=torch.float32)
print(m) # tensor([0.], requires_grad=True)
print(b) # tensor([0.], requires_grad=True)

Here we initialize m as a tensor with a single float32 value, 0. Then we do the same for b.

Since we initialize m and b "from scratch", these are considered leaf tensors.

print(m.is_leaf) # True
print(b.is_leaf) # True

2. Convert sleep and screentime to tensors.

x = torch.from_numpy(screentime)
y = torch.from_numpy(sleep)

These are also leaf tensors.

print(x.is_leaf) # True
print(y.is_leaf) # True
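One detail worth knowing (an aside, not needed for this problem): torch.from_numpy() shares memory with the source NumPy array, so mutating the array mutates the tensor.

arr = np.array([1.0, 2.0])
t = torch.from_numpy(arr)
arr[0] = 99.0
print(t) # tensor([99., 2.], dtype=torch.float64)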

3. Calculate yhat, the predicted sleep values based on the current m and b.

yhat = m*x + b
print(yhat) # tensor([0., 0., ... 0., 0., 0.], dtype=torch.float64, grad_fn=<AddBackward0>)

Since yhat is created from existing tensors, it is not a leaf tensor.

print(yhat.is_leaf) # False

4. Calculate the loss (root mean square error).

loss = torch.sqrt(torch.mean((yhat - y)**2)) # (1)!
print(loss) # tensor(4.7595, dtype=torch.float64, grad_fn=<SqrtBackward0>)

Alternatively,

loss = (((yhat - y)**2).mean()).sqrt()
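If you'd rather lean on torch's built-in helpers, the same value can be computed with torch.nn.functional.mse_loss (a sketch; mse_loss averages the squared errors, and sqrt turns that into RMSE).

import torch.nn.functional as F
loss = F.mse_loss(yhat, y).sqrt() # mean squared error, then square root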


5. Calculate the gradient of the loss with respect to m and b.

Here's an oversimplified depiction of our computation graph: m, b, x, and y are leaves and loss is the root. Therefore, we can call loss.backward(), and PyTorch will automatically calculate the gradient of loss with respect to m and b.

loss.backward()
print(m.grad) # tensor([-6.2051])
print(b.grad) # tensor([-0.8798])

This tells us that increasing m should decrease the loss (and the same goes for b).

What about x, y, and yhat?

x and y

Even though x and y are leaves in the computational graph, their gradients won't be computed because their requires_grad attribute is turned off.

print(x.requires_grad) # False
print(y.requires_grad) # False

This is because x and y were created using torch.from_numpy(), which does not turn on requires_grad.

This is the behavior we desire, but if you wanted x and y gradients to be calculated, you should create them like this.

x = torch.tensor(screentime, requires_grad=True)
y = torch.tensor(sleep, requires_grad=True)
yhat

yhat's gradients are calculated but not retained. We know this because yhat.requires_grad is True and yhat.retains_grad is False. Furthermore, if you try to print yhat.grad, PyTorch returns None with a warning.

print(yhat.grad) # None

UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor...

Again, this is the behavior we desire, but if you wanted yhat gradients to be retained, call yhat.retain_grad() right after creating it.

yhat = m*x + b
yhat.retain_grad()
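With that change, yhat.grad gets populated during the backward pass. A minimal sketch, assuming the same m, b, x, and y as above:

yhat = m*x + b
yhat.retain_grad()
loss = torch.sqrt(torch.mean((yhat - y)**2))
loss.backward()
print(yhat.grad) # now a tensor of d(loss)/d(yhat) values, instead of None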
Terminology note
In the context of neural networks, this is what people call backpropagation. But in the more general sense, this is known as reverse mode differentiation (aka reverse accumulation).
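As a quick sanity check on these gradients (a sketch, assuming the m, b, x, and y defined above), you can approximate m.grad with a central finite difference:

eps = 1e-4
with torch.no_grad():
    loss_plus = torch.sqrt(torch.mean(((m + eps)*x + b - y)**2))
    loss_minus = torch.sqrt(torch.mean(((m - eps)*x + b - y)**2))
    print((loss_plus - loss_minus) / (2*eps)) # ≈ m.grad, i.e. roughly -6.2051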

6. Update m and b (gradient step).

In step 5, we learned that loss should decrease if we make small increases to m and b. Here we update m and b accordingly, using a gradient step with alpha (aka "step size") equal to 0.02.

m = m - 0.02*m.grad
b = b - 0.02*b.grad
print(m.item(), b.item()) # 0.1241 0.0176

Important

The -= operator won't work here! (PyTorch doesn't allow in-place operations on leaf tensors that require gradients.)

m -= 0.02*m.grad
b -= 0.02*b.grad

RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
Problem
This is the first iteration of our gradient descent algorithm. See what happens when we run the next iteration...
yhat = m*x + b
loss = torch.sqrt(torch.mean((yhat - y)**2))
loss.backward()
m = m - 0.02*m.grad # <-- error!
TypeError: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed...
On this second iteration of gradient descent, it turns out m is not a leaf tensor! That's because, in step 6 above, m = m - 0.02*m.grad actually copies m's data into a new block of memory. You can see this by printing m's memory address before and after the update.
print(f'address before: {m.data_ptr()}')
m = m - 0.02*m.grad
print(f'address after: {m.data_ptr()}')
# address before: 94445597513088
# address after: 94445597510720
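You can also confirm that the reassigned m is no longer a leaf.

print(m.is_leaf) # False: this m is the output of an operation, not a tensor we created directly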
We'll fix this in Draft 2, below.
Draft 2

# Initialize m and b
m = torch.tensor([0], requires_grad=True, dtype=torch.float32)
b = torch.tensor([0], requires_grad=True, dtype=torch.float32)

# Convert sleep and screentime to tensors
x = torch.from_numpy(screentime)
y = torch.from_numpy(sleep)

# Calculate yhat
yhat = m*x + b

# Calculate the loss
loss = torch.sqrt(torch.mean((yhat - y)**2))

# "Reverse mode differentiation"
loss.backward()

# Update m and b (gradient step)
with torch.no_grad():
    m -= 0.02*m.grad
    b -= 0.02*b.grad
Here we've changed

m = m - 0.02*m.grad
b = b - 0.02*b.grad

to

with torch.no_grad():
    m -= 0.02*m.grad
    b -= 0.02*b.grad

in order to update m and b in place as opposed to copying the tensors.
torch.no_grad is a context manager that temporarily disables gradient calculation. This allows us to use in-place operators like -= and +=.
Info

PyTorch prevents you from using in-place operators on leaf tensors with requires_grad == True, but you can use them on tensors with requires_grad == False.
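Here's a minimal illustration: inside torch.no_grad(), operations aren't recorded in the computation graph, so their outputs don't require gradients.

with torch.no_grad():
    z = m * 2
print(z.requires_grad) # False: z was created while gradient tracking was disabled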
Alternatively, we could directly access and update the underlying data using the Tensor.data attribute.
m.data = m - 0.02*m.grad
b.data = b - 0.02*b.grad
We've only implemented one iteration of gradient descent. We need to implement more iterations!
Draft 3

# Initialize m and b
m = torch.tensor([0], requires_grad=True, dtype=torch.float32)
b = torch.tensor([0], requires_grad=True, dtype=torch.float32)

# Convert sleep and screentime to tensors
x = torch.from_numpy(screentime)
y = torch.from_numpy(sleep)

# Run 10K iterations
for i in range(10_000):

    # Calculate yhat
    yhat = m*x + b

    # Calculate the loss
    loss = torch.sqrt(torch.mean((yhat - y)**2))

    # "Reverse mode differentiation"
    loss.backward()

    # Update m and b (gradient step)
    with torch.no_grad():
        m -= 0.02*m.grad
        b -= 0.02*b.grad
Here we've wrapped the gradient descent process into a for loop that runs 10,000 iterations.
Problem
Let's inspect the first 10 iterations by inserting an informative print statement in the loop.
# Initialize m and b
m = torch.tensor([0], requires_grad=True, dtype=torch.float32)
b = torch.tensor([0], requires_grad=True, dtype=torch.float32)

# Convert sleep and screentime to tensors
x = torch.from_numpy(screentime)
y = torch.from_numpy(sleep)

# Run 10K iterations
for i in range(10_000):

    # Calculate yhat
    yhat = m*x + b

    # Calculate the loss
    loss = torch.sqrt(torch.mean((yhat - y)**2))

    # "Reverse mode differentiation"
    loss.backward()

    # Print stuff
    if i < 10:
        print(
            f"iteration {i}: "
            f"(m, b) = ({round(m.item(), 3)}, {round(b.item(), 3)}), "
            f"(m.grad, b.grad) = ({round(m.grad.item(), 3)}, {round(b.grad.item(), 3)}), "
            f"loss = {round(loss.item(), 3)}"
        )

    # Update m and b (gradient step)
    with torch.no_grad():
        m -= 0.02*m.grad
        b -= 0.02*b.grad
iteration 0: (m, b) = (0.0, 0.0), (m.grad, b.grad) = (-6.635, -0.917), loss = 5.659
iteration 1: (m, b) = (0.133, 0.018), (m.grad, b.grad) = (-12.49, -1.77), loss = 4.81
iteration 2: (m, b) = (0.383, 0.054), (m.grad, b.grad) = (-15.369, -2.338), loss = 3.642
iteration 3: (m, b) = (0.69, 0.1), (m.grad, b.grad) = (-12.287, -2.216), loss = 3.666
iteration 4: (m, b) = (0.936, 0.145), (m.grad, b.grad) = (-6.342, -1.708), loss = 4.839
iteration 5: (m, b) = (1.062, 0.179), (m.grad, b.grad) = (0.346, -1.088), loss = 5.663
iteration 6: (m, b) = (1.056, 0.201), (m.grad, b.grad) = (7.018, -0.471), loss = 5.63
iteration 7: (m, b) = (0.915, 0.21), (m.grad, b.grad) = (12.883, 0.026), loss = 4.751
iteration 8: (m, b) = (0.658, 0.21), (m.grad, b.grad) = (15.652, 0.108), loss = 3.582
iteration 9: (m, b) = (0.344, 0.207), (m.grad, b.grad) = (12.409, -0.497), loss = 3.668
Notice how the parameters change wildly from iteration to iteration. This indicates that something's not quite right...
It turns out that PyTorch accumulates gradients by default!
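You can see the accumulation in a tiny standalone example: calling backward() twice without clearing the gradient doubles it.

t = torch.tensor(1.0, requires_grad=True)
(2 * t).backward()
print(t.grad) # tensor(2.)
(2 * t).backward()
print(t.grad) # tensor(4.) <- accumulated: 2 + 2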
Draft 4

# Initialize m and b
m = torch.tensor([0], requires_grad=True, dtype=torch.float32)
b = torch.tensor([0], requires_grad=True, dtype=torch.float32)

# Convert sleep and screentime to tensors
x = torch.from_numpy(screentime)
y = torch.from_numpy(sleep)

# Run 10K iterations
for i in range(10_000):

    # Calculate yhat
    yhat = m*x + b

    # Calculate the loss
    loss = torch.sqrt(torch.mean((yhat - y)**2))

    # "Reverse mode differentiation"
    loss.backward()

    # Print stuff
    if i < 10:
        print(
            f"iteration {i}: "
            f"(m, b) = ({round(m.item(), 3)}, {round(b.item(), 3)}), "
            f"(m.grad, b.grad) = ({round(m.grad.item(), 3)}, {round(b.grad.item(), 3)}), "
            f"loss = {round(loss.item(), 3)}"
        )

    # Update m and b (gradient step)
    with torch.no_grad():
        m -= 0.02*m.grad
        b -= 0.02*b.grad

    # Zero out the gradients
    m.grad = None
    b.grad = None
Because PyTorch accumulates gradients by default, we have to manually "zero out the gradients" by setting them to None.
m.grad = None
b.grad = None
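An equivalent pattern (a minor variation) zeroes the gradient tensors in place rather than replacing them with None.

m.grad.zero_()
b.grad.zero_()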
Now, the first 10 gradient descent iterations appear more stable.
iteration 0: (m, b) = (0.0, 0.0), (m.grad, b.grad) = (-6.635, -0.917), loss = 5.659
iteration 1: (m, b) = (0.133, 0.018), (m.grad, b.grad) = (-5.855, -0.853), loss = 4.81
iteration 2: (m, b) = (0.25, 0.035), (m.grad, b.grad) = (-4.765, -0.755), loss = 4.17
iteration 3: (m, b) = (0.345, 0.051), (m.grad, b.grad) = (-3.485, -0.63), loss = 3.763
iteration 4: (m, b) = (0.415, 0.063), (m.grad, b.grad) = (-2.296, -0.506), loss = 3.553
iteration 5: (m, b) = (0.461, 0.073), (m.grad, b.grad) = (-1.409, -0.41), loss = 3.463
iteration 6: (m, b) = (0.489, 0.081), (m.grad, b.grad) = (-0.832, -0.346), loss = 3.429
iteration 7: (m, b) = (0.506, 0.088), (m.grad, b.grad) = (-0.48, -0.306), loss = 3.416
iteration 8: (m, b) = (0.515, 0.094), (m.grad, b.grad) = (-0.27, -0.282), loss = 3.41
iteration 9: (m, b) = (0.521, 0.1), (m.grad, b.grad) = (-0.147, -0.268), loss = 3.407
Notice how the parameters change more smoothly instead of wildly bouncing around as they did in the previous draft.
At the end of 10,000 iterations, we get m = -0.755 and b = 11.241.
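For reference, here's the same loop written with PyTorch's built-in optimizer machinery (a sketch; torch.optim.SGD performs the same update and zero_grad() the same gradient clearing, assuming m, b, x, and y are initialized as before).

opt = torch.optim.SGD([m, b], lr=0.02)
for i in range(10_000):
    yhat = m*x + b
    loss = torch.sqrt(torch.mean((yhat - y)**2))
    opt.zero_grad()  # zero out the gradients
    loss.backward()  # reverse mode differentiation
    opt.step()       # gradient step: p -= lr * p.grad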