The current state-of-the-art neural architecture for natural language processing problems is the Transformer architecture. The Transformer architecture is extraordinarily complex, probably the most complicated software system I’ve ever encountered.
I know from experience that when learning how a complex system works, the best approach for me is to work bottom-up, looking at small pieces of the puzzle one at a time. Once I understand all the component pieces, then I have a fighting chance to figure out how the entire system works.
I refactored a TensorFlow code example I found into PyTorch
Based on past experience, I believe I am months away from fully understanding Transformer architecture. But I’m starting to identify what the key pieces of the system are. One key piece of Transformer architecture is called scaled dot product attention (SDPA). SDPA is extremely tricky by itself. I currently think of SDPA as just an abstract function — I don’t have an intuition of what SDPA means in terms of Transformer architecture.
I’ve been somewhat frustrated because I’ve seen roughly 40 blog posts about SDPA on the Internet, and almost all of them are copy-paste versions of the exact same thing, with no real explanation. After many hours of reading online, I found a short code example of SDPA in the TensorFlow documentation. Yes!
I saw the same two images for scaled dot product attention over and over and over and over. The diagram on the left and the equation on the right are equivalent, but neither is much help unless you already know how SDPA works.
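For reference, the equation in those images is the standard scaled dot product attention formula:

```latex
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V
```

Here $Q$, $K$, $V$ are the query, key, and value matrices, and $d_k$ is the dimension of the key vectors. The equation is compact, but by itself it gives no intuition for why the computation is useful.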
Because I now use mostly PyTorch rather than TensorFlow, I translated the TensorFlow documentation code to PyTorch. It wasn’t a trivial task.
I won’t try to explain what’s going on in the code, because, to be honest, I really don’t know.
At this point, I have the very first beginnings of an understanding of SDPA, but I know I have many, many more hours of exploration ahead of me. For things like SDPA, getting a simple input-output example working is the first step. You can only know software systems by coding them.
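To make the input-output behavior concrete, here is my own condensed sketch of the computation, using the same q, k, v values as the demo program below (this is not the documentation code, just a compact restatement):

```python
import torch as T

q = T.tensor([[0., 10, 0]])                  # 1 query, dim 3
k = T.tensor([[10., 0, 0], [0, 10, 0],
              [0, 0, 10], [0, 0, 10]])       # 4 keys, dim 3
v = T.tensor([[1., 0], [10, 0],
              [100, 5], [1000, 6]])          # 4 values, dim 2

wts = T.softmax((q @ k.T) / k.shape[-1] ** 0.5, dim=-1)
oupt = wts @ v

print(wts)   # weights are nearly [0, 1, 0, 0]: q matches the 2nd key
print(oupt)  # so the output is nearly the 2nd value row, [10.0, 0.0]
```

The query strongly matches the second key, so the softmax puts essentially all of its weight on the second row of v, and the output is approximately that row.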
This exploration of SDPA — one small part of the Transformer architecture — confirmed my belief that the Transformer architecture is incredibly complex, which also makes it incredibly interesting.
You can only know people by talking to them. Here are three actresses from the 1930s who I wish I could go back in time and talk to, so that I could understand what their lives were like. All three have a mysterious but happy look. I would like to have known what their hopes and dreams and passions were. Left: Myrna Loy (1905-1993). Center: Anna May Wong (1905-1961). Right: Merle Oberon (1911-1979).
# sdpa_demo.py
# PyTorch scaled dot product attention
# based on TensorFlow documentation at:
# www.tensorflow.org/tutorials/text/transformer

import numpy as np
import torch as T

def scaled_dot_prod_att(q, k, v):
  print("\nq = "); print(q)
  print("\nk = "); print(k)
  print("\nv = "); print(v)

  kt = T.transpose(k, 0, 1)            # k transpose
  q_k = T.matmul(q, kt)                # raw similarity scores
  print("\nq times k transpose = ")
  print(q_k)

  dk = k.shape[-1]  # 3
  print("\ndim k = ")
  print(dk)

  logits = q_k / np.sqrt(dk)           # scale by sqrt(dim k)
  print("\nAfter dividing by sqrt(dim k), logits = ")
  print(logits)

  att_wts = T.softmax(logits, dim=-1)  # weights sum to 1
  print("\nAfter applying softmax, att_wts = ")
  print(att_wts)

  oupt = T.matmul(att_wts, v)          # weighted sum of v rows
  print("\nAfter multiplying by v, oupt = ")
  print(oupt)

  return (oupt, att_wts)

def main():
  print("\nBegin scaled dot product attention demo \n")
  T.set_printoptions(precision=4)

  q = T.tensor([[0, 10, 0]], dtype=T.float32)
  k = T.tensor([[10, 0, 0], [0, 10, 0],
                [0, 0, 10], [0, 0, 10]], dtype=T.float32)
  v = T.tensor([[1, 0], [10, 0], [100, 5], [1000, 6]],
               dtype=T.float32)

  (oupt, att_wts) = scaled_dot_prod_att(q, k, v)

  print("\nEnd demo \n")

if __name__ == "__main__":
  main()
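As a sanity check on my translation, newer versions of PyTorch (2.0 and later, as I understand it) ship a built-in torch.nn.functional.scaled_dot_product_attention. Assuming that version is available, the demo's result can be compared against it — the built-in expects a leading batch dimension, so I add one:

```python
import torch as T
import torch.nn.functional as F

q = T.tensor([[0., 10, 0]])
k = T.tensor([[10., 0, 0], [0, 10, 0], [0, 0, 10], [0, 0, 10]])
v = T.tensor([[1., 0], [10, 0], [100, 5], [1000, 6]])

# built-in version; [None] adds a batch dimension, shapes (1, L, E)
built_in = F.scaled_dot_product_attention(q[None], k[None], v[None])[0]

# manual computation, as in the demo program above
manual = T.softmax((q @ k.T) / k.shape[-1] ** 0.5, dim=-1) @ v

print(T.allclose(built_in, manual, atol=1e-4))  # the two should agree
```

By default the built-in uses the same 1/sqrt(d_k) scaling as the demo code, so the two results should match to within floating point tolerance.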



Hello Dr. McCaffrey, I sometimes wish I could get your attention. Maybe I can reach that goal with a sophisticated idea?
What would you say if you have 10,000 outputs and you only need to calculate the backpropagation for one neuron? That’s the idea behind the ReLUmax function (a regular softmax function with ReLU activation); although there is no guarantee, it is possible.
So here it is:
https://github.com/grensen/ReLUmax
If your Transformer is alive, I hope it will be Optimus Prime.