I’ve been exploring PyTorch Transformer architecture (TA) models for sequence-to-sequence problems for several months. TA systems are among the most complicated software systems I’ve ever worked with.
I recently completed a demo implementation of my idea of the simplest possible sequence-to-sequence problem. That demo is incomplete because it trained a seq-to-seq model but did not use the trained model to make a prediction. See https://jamesmccaffreyblog.com/2022/09/09/simplest-transformer-seq-to-seq-example/.
Unlike using a relatively simple neural network, such as a multi-class classifier, using a trained seq-to-seq model is a significant challenge. So I took the trained model and wrote a demo program that uses the model to make a prediction.
My input sequence is [1, 4,5,6,7,6,5,4, 2]. The 1 is start-of-sequence and the 2 is end-of-sequence. Token 3 is for unknown and token 0 is for padding; I didn’t use 0 or 3 in my demo. The correct output is [1, 5, 6, 7, 8, 7, 6, 5, 2]. My demo didn’t do too well, but at least it emitted a legal output sequence: [1, 5, 5, 4, 5, 8, 4, 4, 2].
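For reference, here is a minimal sketch of the token scheme, using the same values as the demo. The names PAD, SOS, EOS, UNK are just illustrative labels that I'm introducing here; the demo program below works directly with the integer IDs, and the vocabulary size of 12 comes from the Embedding layer in the network definition.

# illustrative sketch of the demo's token scheme (labels are mine)
# vocab_size = 12: IDs 0-3 are special, IDs 4-11 are data tokens
PAD, SOS, EOS, UNK = 0, 1, 2, 3   # padding, start, end, unknown

src_seq     = [SOS, 4, 5, 6, 7, 6, 5, 4, EOS]  # demo input
correct_out = [SOS, 5, 6, 7, 8, 7, 6, 5, EOS]  # expected output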
There are many things that I don’t fully understand about Transformer seq-to-seq systems, including my own demo. But for difficult machine learning topics, persistence and determination are the keys to successful learning.

Transformer software systems are difficult to figure out. There are a surprisingly large number of movies where a human transforms into a snake. Here are three where the plot is difficult to figure out. Left: “The Reptile” (1966) is an English movie about a young woman who transforms into a snake because of a Malay curse. Center: “Cult of the Cobra” (1955) is a movie about six men who unintentionally witness a ceremony of an evil cult of women who can transform into snakes. You’d think they’d stay away from mysterious women with dark reptilian eyes after that, but no, they don’t. Right: “The Sorcerer and the White Snake” (2011) is a Chinese movie. The plot baffled me but there are two women who can turn into snakes.
Demo code:
# seq2seq_use.py
# Transformer seq-to-seq usage example
# PyTorch 1.12.1-CPU Anaconda3-2020.02 Python 3.7.6
# Windows 10/11
import numpy as np
import torch as T
import math
device = T.device('cpu')
T.set_num_threads(1)
# -----------------------------------------------------------
class TransformerNet(T.nn.Module):
  def __init__(self):
    # vocab_size = 12, embed_dim = d_model = 4, seq_len = 9/10
    super(TransformerNet, self).__init__()  # classic syntax
    self.embed = T.nn.Embedding(12, 4)    # word embedding
    self.pos_enc = PositionalEncoding(4)  # positional
    self.trans = T.nn.Transformer(d_model=4, nhead=2, \
      dropout=0.0, batch_first=True)  # d_model div by nhead
    self.fc = T.nn.Linear(4, 12)  # embed_dim to vocab_size

  def forward(self, src, tgt, tgt_mask):
    s = self.embed(src)
    t = self.embed(tgt)
    s = self.pos_enc(s)  # [bs,seq=10,embed]
    t = self.pos_enc(t)  # [bs,seq=9,embed]
    z = self.trans(src=s, tgt=t, tgt_mask=tgt_mask)
    z = self.fc(z)
    return z
# -----------------------------------------------------------
class PositionalEncoding(T.nn.Module):  # documentation code
  def __init__(self, d_model: int, dropout: float=0.0,
      max_len: int=5000):
    super(PositionalEncoding, self).__init__()  # old syntax
    self.dropout = T.nn.Dropout(p=dropout)
    pe = T.zeros(max_len, d_model)  # like 10x4
    position = \
      T.arange(0, max_len, dtype=T.float).unsqueeze(1)
    div_term = T.exp(T.arange(0, d_model, 2).float() * \
      (-np.log(10_000.0) / d_model))
    pe[:, 0::2] = T.sin(position * div_term)
    pe[:, 1::2] = T.cos(position * div_term)
    pe = pe.unsqueeze(0).transpose(0, 1)
    self.register_buffer('pe', pe)  # allows state-save

  def forward(self, x):
    x = x + self.pe[:x.size(0), :]
    return self.dropout(x)
# -----------------------------------------------------------
def main():
  # 0. get started
  print("\nBegin PyTorch Transformer seq-to-seq use demo ")
  T.manual_seed(1)
  np.random.seed(1)

  # 1. create Transformer network
  print("\nCreating batch-first Transformer network ")
  model = TransformerNet().to(device)
  model.eval()

  # 2. load trained model wts and biases
  print("\nLoading saved model weights and biases ")
  fn = ".\\Models\\transformer_seq_model_200_epochs.pt"
  model.load_state_dict(T.load(fn))

  # -----------------------------------------------------------
  # 3. use the trained model to make a prediction
  src = T.tensor([[1, 4,5,6,7,6,5,4, 2]],
    dtype=T.int64).to(device)
  print("\nsrc sequence: ")
  print(src)
  print("\ncorrect output: ")
  print("[[1, 5, 6, 7, 8, 7, 6, 5, 2]]")

  print("\nPredicted output: ")
  tgt_in = T.tensor([[1]], dtype=T.int64).to(device)  # SOS
  for i in range(20):  # max output 20 tokens
    n = tgt_in.size(1)
    t_mask = \
      T.nn.Transformer.generate_square_subsequent_mask(n)
    with T.no_grad():
      preds = model(src, tgt_in, tgt_mask=t_mask)
      # [bs, tgt_len, vocab=12]
    next_token = T.argmax(preds[-1][-1])  # last set of 12 values
    # print(next_token); input()
    next_token = next_token.reshape(1,1)
    tgt_in = T.cat((tgt_in, next_token), dim=1)
    print(tgt_in)
    if next_token[0][0].item() == 2:  # EOS
      break

  print("\nEnd PyTorch Transformer seq-to-seq use demo ")

if __name__ == "__main__":
  main()
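One part of the prediction loop that is easy to gloss over is the tgt_mask. The generate_square_subsequent_mask(n) function builds an n-by-n causal mask with 0.0 on and below the diagonal and -inf above it, so that target position i can only attend to positions 0 through i. Here is a tiny stand-alone sketch, separate from the demo, that just prints the mask for n = 3:

# stand-alone illustration of the causal target mask (not part of the demo)
import torch as T

m = T.nn.Transformer.generate_square_subsequent_mask(3)
print(m)
# expected output (approximately):
# tensor([[0., -inf, -inf],
#         [0.,   0., -inf],
#         [0.,   0.,   0.]])

Because tgt_in grows by one token on each pass through the prediction loop, the mask is regenerated with the new length n on every iteration.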
