I’m doing a deep dive into the machine learning Attention mechanism and the Transformer architecture. In some ways, this is among the most difficult code I’ve ever come across in my entire career. A Transformer is a deep neural system that can solve natural language processing problems, like translating English to German. If a standard deep neural network is like adding 2 + 2, then a Transformer is like advanced multi-variate Calculus.
Because of the complexity, I know from painful past experience that it’s important to slowly but steadily dissect each part of the problem. Jumping ahead by skipping over things that appear simple almost always comes back to bite you, as the saying goes.
One tiny part of the crazy-complex Transformer code is tensor masking using the PyTorch masked_fill() function. You use a mask when you have a tensor and you want to convert some of the values in the tensor to something else.
Suppose you have a 2×3 tensor named “source”:
[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
And you want to change the values at [0,0] and [1,1] to 9.9 to get:
[[9.9, 2.0, 3.0], [4.0, 9.9, 6.0]]
You could easily write a function to do this. You’d iterate through the tensor and change the target cells to 9.9. But a more general approach is to use a mask. You’d create a mask named “msk” that has magic 0s at the positions you want to change, and magic 1s elsewhere:
[[0, 1, 1], [1, 0, 1]]
Then you’d call masked_fill() like so:
result = source.masked_fill(msk == 0, 9.9)
In words, “scan through the msk matrix and when there is a cell with value 0, change the corresponding cell in source to 9.9”
The values in the msk matrix must be type Boolean, which in PyTorch means type torch.bool. (Older versions of PyTorch accepted type uint8 masks, but that usage is deprecated.) The meaning of the magic 0s and 1s can be reversed; for example, you can change values when the cell in the mask is 1 rather than 0.
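Here is a minimal sketch (not part of the demo program below) showing both conventions side by side — filling where the mask is 0, and the reversed convention of filling where the mask is 1:

```python
import torch as T

source = T.tensor([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])

# mask as type bool, which is what modern PyTorch expects
msk = T.tensor([[0, 1, 1],
                [1, 0, 1]], dtype=T.bool)

# fill where the mask cell is 0 (False)
a = source.masked_fill(msk == 0, 9.9)

# reversed convention: fill where the mask cell is 1 (True)
b = source.masked_fill(msk == 1, 9.9)
```

Here tensor a changes cells [0,0] and [1,1], while tensor b changes all the other cells.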
Common mask-fill scenarios are zeroing-out the part of a matrix above or below the main diagonal, setting some matrix cells to a very small value (like minus infinity) before applying a softmax operation so those cells get zero probability, and so on.
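The minus-infinity scenario is exactly what happens in Transformer attention. A rough sketch (my own illustration, not the article's demo): build a lower-triangular Boolean mask with tril(), fill the disallowed cells with -inf, then apply softmax so those cells contribute zero probability:

```python
import torch as T

# fictitious 3x3 attention scores for illustration
scores = T.tensor([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0],
                   [7.0, 8.0, 9.0]])

# lower-triangular mask: True on and below the main diagonal
causal = T.tril(T.ones(3, 3, dtype=T.bool))

# set the "future" cells (above the diagonal) to -infinity
masked = scores.masked_fill(causal == 0, float('-inf'))

# softmax turns -inf cells into exactly 0.0 probability
probs = T.softmax(masked, dim=1)
```

After the softmax, each row of probs sums to 1.0, and the cells above the diagonal are exactly 0.0 — row 0 can only attend to position 0, row 1 to positions 0 and 1, and so on.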
The simple masked_fill() function isn’t really so simple. This is what I mean by Transformer architecture being among the most complex code I’ve ever seen: Transformer code isn’t long — maybe a few hundred lines of code — but almost every statement is like masked_fill() in the sense that there are many layers of hidden complexity.
The covid-19 pandemic has had all kinds of unexpected consequences. Left: This designer mask costs $300 — more than I spend on all my clothes for a full year (I am definitely not a snappy dresser). The sales of eye makeup have soared, which makes sense. Center: The world of fashion has had to integrate masks into their overall look. These masks are simple but quite attractive in my opinion. Right: Some people were already prepared for the pandemic, like this guy caught on a convenience store security camera, sporting a stylish hoodie plus zip-up mask while debating with the clerk about social justice and gun control.
# masked_fill_demo.py

import torch as T
import numpy as np

device = T.device("cpu")

def my_masker(tsr, msk, v):
  # explicit version of masked_fill():
  # change cells of tsr to v where msk is 0
  res = tsr.clone()
  for i in range(len(tsr)):
    for j in range(len(tsr[0])):
      if msk[i][j] == 0:
        res[i][j] = v
  return res

print("\nBegin masked_fill() demo ")

data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]], dtype=np.float32)
tsr = T.tensor(data, dtype=T.float32).to(device)
print("\nThe tensor is:")
print(tsr)

msk = np.array([[0, 1, 1],
                [1, 0, 1]], dtype=np.uint8)
msk = T.tensor(msk, dtype=T.bool)  # masks should be type bool
print("\nThe mask is: ")
print(msk)

T.set_printoptions(precision=1)

result = tsr.masked_fill(msk == 0, 9.9)
print("\nThe result of masked_fill(msk==0, 9.9) is: ")
print(result)

res = my_masker(tsr, msk, 9.9)
print("\nThe result using custom masking function is: ")
print(res)

print("\nEnd demo ")

