When I was first learning PyTorch, I implemented a demo of the IMDB movie review sentiment analysis problem using an LSTM. I recently revisited that code to incorporate all the things I learned about PyTorch since that early example.
My overall approach is to preprocess the IMDB data by encoding each word as an integer ID, rather than encoding on the fly during training. IDs are sorted by frequency, so small ID numbers correspond to the most common words. This makes it easy to filter out rare words like "floozle". Preparing the raw movie review data is the most difficult part of creating the sentiment analysis system.
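The frequency-ranked ID idea can be sketched in plain Python. This is a minimal sketch, not my actual preprocessing code, and the tiny corpus of tokenized reviews is hypothetical:

```python
from collections import Counter

corpus = [["the", "movie", "was", "good"],
          ["the", "movie", "was", "bad"],
          ["a", "floozle", "movie"]]  # hypothetical tokenized reviews

# count word frequencies across all reviews
freqs = Counter(w for review in corpus for w in review)

# assign small IDs to common words: rank 1 = most frequent
vocab = {w: rank for rank, (w, _) in
         enumerate(freqs.most_common(), start=1)}

# rare words (frequency 1 here) get large IDs and are easy to filter
rare = [w for w in vocab if freqs[w] == 1]
```

Because the IDs are ranks, filtering the vocabulary down to the N most common words is just a comparison against N.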
I created a root directory named IMDB with subdirectories Data and Models. I downloaded the 50,000 movie reviews from https://ai.stanford.edu/~amaas/data/sentiment/ as aclImdb_v1.tar.gz to the root IMDB directory, then unzipped using the 7-Zip utility to get file aclImdb_v1.tar, and then I unzipped that file to get an aclImdb directory that contains all the movie reviews. I moved that directory and its contents into the Data directory.

Here I illustrate the data preprocessing for tiny reviews of 20 words or fewer. Notice there is a duplicate review.
The goal of my preprocessing is to create files imdb_train_50w.txt and imdb_test_50w.txt for training and testing respectively. These files contain only very short movie reviews (50 words or fewer) because working with the entire dataset of reviews is very difficult. This generated just 620 training items/reviews, which is too few to get good results. In a non-demo NLP scenario you need several thousand training items.
The words in the reviews are tokenized into integer values, such as "the" = 4 and "movie" = 20. I reserved ID 0 for the (PAD) token used to pad all reviews to exactly 50 words. Most punctuation is stripped out and all words are converted to lower case. Each line in the train and test files is one review, where padding is at the beginning and the class label to predict (0 = negative, 1 = positive) is the last value on the line. This preprocessing script is complicated and took me several days of coding and debugging. See the code below.
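The per-line encoding just described can be sketched as follows. The vocabulary ranks and the review are hypothetical; the offset of 3 matches the Keras-style scheme used in the demo (0 = padding, 2 = out-of-vocabulary):

```python
def encode_line(review_words, vocab, max_len, label, offset=3, oov=2):
    # map each word to its frequency rank plus the Keras-style offset;
    # unknown words get the out-of-vocabulary ID 2
    ids = [vocab[w] + offset if w in vocab else oov
           for w in review_words]
    # prepend 0s (the PAD ID) so every line has exactly max_len IDs
    padded = [0] * (max_len - len(ids)) + ids
    # the class label (0 = negative, 1 = positive) is the last value
    return " ".join(str(v) for v in padded + [label])

vocab = {"the": 1, "movie": 17, "was": 13, "good": 82}  # hypothetical ranks
line = encode_line(["the", "movie", "was", "good"], vocab, 8, 1)
# "0 0 0 0 4 20 16 85 1"
```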
The program to create and train a sentiment analysis model using a PyTorch LSTM also took several days of work. The model definition is:
import numpy as np
import torch as T
device = T.device('cpu')
class LSTM_Net(T.nn.Module):
  def __init__(self):
    # vocab_size = 129892
    super(LSTM_Net, self).__init__()
    self.embed = T.nn.Embedding(129892, 32)
    self.lstm = T.nn.LSTM(32, 75)
    self.drop = T.nn.Dropout(0.10)
    self.fc1 = T.nn.Linear(75, 10)
    self.fc2 = T.nn.Linear(10, 2)  # 0=neg, 1=pos

  def forward(self, x):
    # x = review/sentence. length = 50 (fixed w/ padding)
    z = self.embed(x)
    z = z.view(50, 1, 32)  # "seq batch input"
    lstm_oupt, (h_n, c_n) = self.lstm(z)
    z = lstm_oupt[-1]
    z = self.drop(z)
    z = T.tanh(self.fc1(z))
    z = self.fc2(z)  # CrossEntropyLoss will apply softmax
    return z
There are virtually unlimited design choices for an LSTM-based network. There are no good rules of thumb for design — it’s all trial and error guided by experience.
The make_data_files.py data preprocessing program determined that there are 129,892 distinct words/tokens in the entire training data. This is far too many words to get good results so in a non-demo scenario I’d filter the vocabulary down to just the 10 or 20 thousand most common words/tokens.
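Because the word IDs are frequency ranks, trimming the vocabulary amounts to a simple threshold on the ID value. A sketch, assuming the Keras-style scheme where IDs 0 through 3 are reserved and word IDs start at 4 (the cutoff of 100 is illustrative):

```python
def filter_vocab(encoded_ids, max_vocab, oov=2, n_reserved=4):
    # any word ID at or beyond the cutoff becomes the OOV token;
    # reserved IDs (0=PAD, 1=start, 2=OOV, 3=unused) pass through
    cutoff = max_vocab + n_reserved
    return [idx if idx < cutoff else oov for idx in encoded_ids]

# keep only the 100 most common words; 4827 is out of range
trimmed = filter_vocab([0, 0, 4, 20, 4827, 16], max_vocab=100)
# [0, 0, 4, 20, 2, 16]
```

With a trimmed vocabulary, the Embedding layer's first dimension shrinks from 129,892 to max_vocab + 4, which dramatically reduces the number of trainable weights.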
Each word ID in an input review is converted into an embedding vector of 32 values (in a non-demo scenario an embedding of 100 values is more common). The LSTM component converts these to 75 values. These 75 values are passed to two Linear layers that map down to 10 values, then down to 2 final output logits, one for each class (0 = negative, 1 = positive).
For simplicity, during training I used a batch size of 1, meaning I processed just one review at a time. In a non-demo scenario, I’d probably use a batch size of 16.
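The shuffled-index training loop generalizes to mini-batches by chunking the shuffled indices. A sketch of just the batching logic (the item count of 100 is illustrative):

```python
import random

def make_batches(n_items, bat_size, rng):
    # shuffle all item indices, then slice into consecutive chunks;
    # the last batch may be smaller than bat_size
    indices = list(range(n_items))
    rng.shuffle(indices)
    return [indices[i:i+bat_size]
            for i in range(0, n_items, bat_size)]

batches = make_batches(100, 16, random.Random(1))
# 7 batches: six of size 16, a final one of size 4
```

Note that the demo network's forward() hardcodes a batch dimension of 1 in z.view(50, 1, 32), so using a larger batch size would also require reshaping to (50, bat_size, 32) or constructing the LSTM with batch_first=True.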
After the model was trained, I fed a movie review of “the movie was a great waste of my time” to the model. I converted each word manually: “the” = 4, “movie” = 20, “was” = 16, etc., by using the vocab_dict dictionary in the make_data_files.py program. In a non-demo scenario, I would have programmatically determined the ID values for each word by using the vocab_file.txt file that was generated by the data preparation program.
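The programmatic lookup could be sketched like this, assuming vocab_file.txt has one "word rank" pair per line, as written by make_vocab(). The file contents here are hypothetical stand-ins:

```python
import io

def load_vocab(f):
    # each line is "word rank", e.g. "the 1"
    vocab = {}
    for line in f:
        word, rank = line.split()
        vocab[word] = int(rank)
    return vocab

def words_to_ids(words, vocab, offset=3, oov=2):
    # add the Keras-style offset; unknown words map to OOV = 2
    return [vocab[w] + offset if w in vocab else oov for w in words]

fake_file = io.StringIO("the 1\nmovie 17\nwas 13\n")  # stand-in for vocab_file.txt
vocab = load_vocab(fake_file)
ids = words_to_ids(["the", "movie", "was", "floozle"], vocab)
# [4, 20, 16, 2]
```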
The prediction result for the review was [0.9984, 0.0016] which maps to class 0, which is a negative review.
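The mapping from raw logits to the displayed pseudo-probabilities is just the softmax function, which can be sketched in plain Python (the logit values here are hypothetical, not the actual demo outputs):

```python
import math

def softmax(logits):
    # subtract the max for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

# hypothetical logits for a strongly negative review
probs = softmax([4.0, -2.4])
predicted_class = probs.index(max(probs))  # 0 = negative
```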
Whew! Natural language processing problems like movie review analysis are mysterious and very, very difficult. But very, very interesting.

Three mystery movies that I give positive sentiment reviews to. Left: “Murder on the Orient Express” (1974). Center: “Sherlock Holmes and the House of Fear” (1945). Right: “The Nice Guys” (2016).
Demo code for make_data_files.py:
# make_data_files.py
#
# input: source Stanford 50,000 data files reviews
# output: one combined train file, one combined test file
# output files are in index version, using the Keras dataset
# format where 0 = padding, 1 = 'start', 2 = OOV, 3 = unused
# 4 = most frequent word ('the'), 5 = next most frequent, etc.
# i'm skipping the start=1 because it makes no sense here.
# these data files will be loaded into memory then fed to
# a built-in Embedding layer (rather than custom embeddings)
# the reviews will be just those that have 50 words or less.
# short reviews will have 0s pre-pended. the class
# label (0 or 1) is the very last value.
import os
# allow the Windows cmd shell to deal with wacky characters
import sys
import codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout.buffer)
# -------------------------------------------------------------
def get_reviews(dir_path, num_reviews, punc_str):
  punc_table = {ord(char): None for char in punc_str}  # dict
  reviews = []  # list-of-lists of words
  ctr = 1
  for file in os.listdir(dir_path):
    if ctr > num_reviews: break
    curr_file = os.path.join(dir_path, file)
    f = open(curr_file, "r", encoding="utf8")
    for line in f:
      line = line.strip()
      if len(line) > 0:  # number characters
        # print(line)  # to show non-ASCII == errors
        line = line.translate(punc_table)  # remove punc
        line = line.lower()                # lower case
        line = " ".join(line.split())      # remove consecutive WS
        word_list = line.split(" ")        # list of words
        reviews.append(word_list)
    f.close()  # close curr file
    ctr += 1
  return reviews
# -------------------------------------------------------------
def make_vocab(all_reviews):
  word_freq_dict = {}  # key = word, value = frequency
  for i in range(len(all_reviews)):
    reviews = all_reviews[i]
    for review in reviews:
      for word in review:
        if word in word_freq_dict:
          word_freq_dict[word] += 1
        else:
          word_freq_dict[word] = 1

  kv_list = []  # list of word-freq tuples so can sort
  for (k,v) in word_freq_dict.items():
    kv_list.append((k,v))
  # list of tuples. index is 0-based rank, val is (word,freq)
  sorted_kv_list = sorted(kv_list, key=lambda x: x[1],
    reverse=True)  # sort by freq

  f = open(".\\vocab_file.txt", "w", encoding="utf8")
  vocab_dict = {}
  # key = word, value = 1-based rank
  # ('the' = 1, 'a' = 2, etc.)
  for i in range(len(sorted_kv_list)):
    w = sorted_kv_list[i][0]  # word is at [0]
    vocab_dict[w] = i+1       # 1-based as in Keras dataset
    f.write(w + " " + str(i+1) + "\n")  # word-space-index
  f.close()
  return vocab_dict
# -------------------------------------------------------------
def generate_file(reviews_lists, outpt_file, w_or_a,
 vocab_dict, max_review_len, label_char):
  # write first time, append later
  fout = open(outpt_file, w_or_a, encoding="utf8")
  offset = 3  # Keras offset: rank 1 ('the') becomes index 4
  for i in range(len(reviews_lists)):  # walk each review
    curr_review = reviews_lists[i]
    n_words = len(curr_review)
    if n_words > max_review_len:
      continue  # next i, continue without writing anything
    n_pad = max_review_len - n_words  # num of 0s to pre-pend
    for j in range(n_pad):  # write padding to get 50 values
      fout.write("0 ")
    for word in curr_review:
      # a word in test set might not have been in training set
      if word not in vocab_dict:
        fout.write("2 ")  # 2 is out-of-vocab index
      else:
        idx = vocab_dict[word] + offset
        fout.write("%d " % idx)
    fout.write(label_char + "\n")  # add label '0' or '1'
  fout.close()
# -------------------------------------------------------------
def main():
  remove_chars = "!\"#$%&()*+,-./:;<=>?@[\\]^_`{|}~"
  # leave ' for words like it's

  print("\nLoading all reviews into memory - be patient ")
  pos_train_reviews = get_reviews(".\\aclImdb\\train\\pos",
    12500, remove_chars)
  neg_train_reviews = get_reviews(".\\aclImdb\\train\\neg",
    12500, remove_chars)
  pos_test_reviews = get_reviews(".\\aclImdb\\test\\pos",
    12500, remove_chars)
  neg_test_reviews = get_reviews(".\\aclImdb\\test\\neg",
    12500, remove_chars)

  # mp = max(len(l) for l in pos_train_reviews)  # 2469
  # mn = max(len(l) for l in neg_train_reviews)  # 1520
  # mm = max(mp, mn)  # longest review is 2469
  # print(mp, mn)

  # -----------------------------------------------------------

  print("\nAnalyzing reviews and making vocabulary ")
  vocab_dict = make_vocab([pos_train_reviews,
    neg_train_reviews])  # key = word, value = word rank
  v_len = len(vocab_dict)
  # need this value, plus 4, for Embedding: 129888+4 = 129892
  print("\nVocab size = %d -- use this +4 for the \
Embedding layer " % v_len)

  max_review_len = 20  # use None for all reviews (any len)
  # if max_review_len == None or max_review_len > mm:
  #   max_review_len = mm

  print("\nGenerating training file len %d words or less " \
    % max_review_len)
  generate_file(pos_train_reviews, ".\\imdb_train_20w.txt",
    "w", vocab_dict, max_review_len, "1")
  generate_file(neg_train_reviews, ".\\imdb_train_20w.txt",
    "a", vocab_dict, max_review_len, "0")

  print("Generating test file with len %d words or less " \
    % max_review_len)
  generate_file(pos_test_reviews, ".\\imdb_test_20w.txt",
    "w", vocab_dict, max_review_len, "1")
  generate_file(neg_test_reviews, ".\\imdb_test_20w.txt",
    "a", vocab_dict, max_review_len, "0")

  # inspect a generated file
  # vocab_dict was used indirectly (offset)
  print("\nDisplaying encoded training file: \n")
  f = open(".\\imdb_train_20w.txt", "r", encoding="utf8")
  for line in f:
    print(line, end="")
  f.close()

  # -----------------------------------------------------------

  print("\nDisplaying decoded training file: \n")
  index_to_word = {}
  index_to_word[0] = "<PAD>"
  index_to_word[1] = "<ST>"
  index_to_word[2] = "<OOV>"
  for (k,v) in vocab_dict.items():
    index_to_word[v+3] = k
  f = open(".\\imdb_train_20w.txt", "r", encoding="utf8")
  for line in f:
    line = line.strip()
    indexes = line.split(" ")
    for i in range(len(indexes)-1):  # last is '0' or '1'
      idx = int(indexes[i])
      w = index_to_word[idx]
      print("%s " % w, end="")
    print("%s " % indexes[len(indexes)-1])
  f.close()

if __name__ == "__main__":
  main()
Demo code for imdb_lstm.py:
# imdb_lstm.py
# PyTorch 1.9.0-CPU Anaconda3-2020.02 Python 3.7.6
# Windows 10
import numpy as np
import torch as T
device = T.device('cpu')
# -----------------------------------------------------------
class LSTM_Net(T.nn.Module):
  def __init__(self):
    # vocab_size = 129892
    super(LSTM_Net, self).__init__()
    self.embed = T.nn.Embedding(129892, 32)
    self.lstm = T.nn.LSTM(32, 75)
    self.drop = T.nn.Dropout(0.10)
    self.fc1 = T.nn.Linear(75, 10)
    self.fc2 = T.nn.Linear(10, 2)  # 0=neg, 1=pos

  def forward(self, x):
    # x = review/sentence. length = 50 (fixed w/ padding)
    z = self.embed(x)
    z = z.view(50, 1, 32)  # "seq batch input"
    lstm_oupt, (h_n, c_n) = self.lstm(z)
    z = lstm_oupt[-1]
    z = self.drop(z)
    z = T.tanh(self.fc1(z))
    z = self.fc2(z)  # CrossEntropyLoss will apply softmax
    return z
# -----------------------------------------------------------
def accuracy(model, data_x, data_y):
  # data_x and data_y are lists of tensors
  model.eval()
  num_correct = 0; num_wrong = 0
  for i in range(len(data_x)):
    X = data_x[i]
    Y = data_y[i].reshape(1)
    with T.no_grad():
      oupt = model(X)
    idx = T.argmax(oupt.data)
    if idx == Y:  # predicted == target
      num_correct += 1
    else:
      num_wrong += 1
  acc = (num_correct * 100.0) / (num_correct + num_wrong)
  model.train()
  return acc
# -----------------------------------------------------------
def main():
  # 0. get started
  print("\nBegin PyTorch IMDB LSTM demo ")
  print("Using only reviews with 50 or less words ")
  T.manual_seed(1)
  np.random.seed(1)

  # 1. load data from file
  print("\nLoading preprocessed train and test data ")
  max_review_len = 50  # exact review length
  train_xy = np.loadtxt(".\\Data\\imdb_train_50w.txt",
    delimiter=" ", usecols=range(0,51), dtype=np.int64)
  train_x = train_xy[:,0:50]
  train_y = train_xy[:,50]
  test_xy = np.loadtxt(".\\Data\\imdb_test_50w.txt",
    delimiter=" ", usecols=range(0,51), dtype=np.int64)
  test_x = test_xy[:,0:50]
  test_y = test_xy[:,50]

  # 1b. convert to tensors
  train_x = T.tensor(train_x, dtype=T.int64).to(device)
  train_y = T.tensor(train_y, dtype=T.int64).to(device)
  test_x = T.tensor(test_x, dtype=T.int64).to(device)
  test_y = T.tensor(test_y, dtype=T.int64).to(device)
  N = len(train_x)
  print("Data loaded. Number train items = %d " % N)

  # -----------------------------------------------------------

  # 2. create network
  net = LSTM_Net().to(device)

  # 3. train model
  loss_func = T.nn.CrossEntropyLoss()  # does log-softmax()
  optimizer = T.optim.Adam(net.parameters(), lr=1.0e-3)
  max_epochs = 12
  log_interval = 2  # display progress
  print("\nStarting training with bat_size = 1")
  for epoch in range(0, max_epochs):
    net.train()  # set training mode
    indices = np.arange(N)
    np.random.shuffle(indices)
    tot_err = 0.0
    for i in range(N):  # one review at a time
      j = indices[i]
      X = train_x[j]
      Y = train_y[j].reshape(1)
      optimizer.zero_grad()
      oupt = net(X)
      loss_val = loss_func(oupt, Y)
      tot_err += loss_val.item()
      loss_val.backward()  # compute gradients
      optimizer.step()     # update weights
    if epoch % log_interval == 0:
      print("epoch = %4d |" % epoch, end="")
      print(" avg loss = %7.4f |" % (tot_err / N), end="")
      train_acc = accuracy(net, train_x, train_y)
      print(" accuracy = %7.2f%%" % train_acc)
      # test_acc = accuracy(net, test_x, test_y)
      # print(" test accuracy = %7.2f%%" % test_acc)
  print("Training complete")

  # -----------------------------------------------------------

  # 4. evaluate model
  test_acc = accuracy(net, test_x, test_y)
  print("\nAccuracy on test data = %7.2f%%" % test_acc)

  # 5. save model
  print("\nSaving trained model state")
  fn = ".\\Models\\imdb_model.pt"
  T.save(net.state_dict(), fn)
  # saved_model = LSTM_Net()
  # saved_model.load_state_dict(T.load(fn))
  # use saved_model to make prediction(s)

  # 6. use model
  print("\nFor \"the movie was a great waste of my time\"")
  print("0 = negative, 1 = positive ")
  review = np.array([4, 20, 16, 6, 86, 425, 7, 58, 64],
    dtype=np.int64)
  padding = np.zeros(41, dtype=np.int64)
  # pre-pend padding to match the training data format
  review = np.concatenate([padding, review])
  review = T.tensor(review, dtype=T.int64)
  net.eval()
  with T.no_grad():
    prediction = net(review)  # raw outputs
  print("\nlogits: ", end=""); print(prediction)
  probs = T.softmax(prediction, dim=1)  # pseudo-probabilities
  probs = probs.numpy()
  print("pseudo-probs: ", end="")
  print("%0.4f %0.4f " % (probs[0][0], probs[0][1]))
  print("\nEnd PyTorch IMDB LSTM sentiment demo")

if __name__ == "__main__":
  main()
