Regression (Employee Income) Using Keras 2.8 on Windows 11

One of my standard neural network examples is to predict employee income from sex, age, city, and job-type. Predicting a single numeric value is usually called a regression problem. (Note: “logistic regression” predicts a single numeric probability value between 0.0 and 1.0, but that value is then immediately used as a binary classification result.)

My data is synthetic and looks like:

 1   0.24   1 0 0   0.2950   0 0 1
-1   0.39   0 0 1   0.5120   0 1 0
 1   0.63   0 1 0   0.7580   1 0 0
-1   0.36   1 0 0   0.4450   0 1 0
 1   0.27   0 1 0   0.2860   0 0 1
. . .

There are 200 training items and 40 test items.

Column [0] is sex (M = -1, F = +1). Column [1] is age, normalized by dividing by 100. Columns [2,3,4] are the city, one-hot encoded (anaheim, boulder, concord). Column [5] is annual income, divided by $100,000, and is the value to predict. Columns [6,7,8] are the job-type, one-hot encoded (mgmt, supp, tech).
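To make the encoding concrete, here is a minimal sketch that turns one raw record into the nine normalized values in the layout described above. The helper names encode_employee and one_hot, and the raw-record argument format, are hypothetical and are not part of the demo program.

```python
# Hypothetical helpers (not from the demo): encode one raw employee
# record into the nine-value normalized layout described in the post.
CITIES = ["anaheim", "boulder", "concord"]
JOBS = ["mgmt", "supp", "tech"]

def one_hot(value, categories):
    # one-hot encode value against an ordered list of categories
    if value not in categories:
        raise ValueError("unknown category: " + str(value))
    return [1 if value == c else 0 for c in categories]

def encode_employee(sex, age, city, income, job):
    sex_code = -1 if sex == "M" else 1   # M = -1, F = +1
    row = [sex_code, age / 100.0]        # age divided by 100
    row += one_hot(city, CITIES)         # columns [2,3,4]
    row.append(income / 100_000.0)       # income divided by $100,000
    row += one_hot(job, JOBS)            # columns [6,7,8]
    return row

# a raw record that should match the first data line shown in the post
row = encode_employee("F", 24, "anaheim", 29_500.00, "tech")
print(row)
```

Column [5] of the result is the target. The other eight values are the predictors fed to the network.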

I designed an 8-(10-10)-1 neural network. I used glorot_uniform() weight initialization with zero-bias initialization. I used tanh() activation on the two hidden layers, and no activation (aka Identity activation) on the single output node.
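The architecture can be sketched in plain NumPy to show what the 8-(10-10)-1 shape means in terms of weights and biases. This is an illustration only, not how Keras computes things internally. It assumes the usual Glorot uniform rule of sampling from [-limit, +limit] with limit = sqrt(6 / (fan_in + fan_out)). The function names are hypothetical, and the weights are random and untrained, so the output value is meaningless.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, for reproducibility

def glorot_uniform(fan_in, fan_out):
    # uniform in [-limit, +limit], limit = sqrt(6 / (fan_in + fan_out))
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

# 8 inputs -> 10 hidden -> 10 hidden -> 1 output, biases start at zero
W1, b1 = glorot_uniform(8, 10), np.zeros(10)
W2, b2 = glorot_uniform(10, 10), np.zeros(10)
W3, b3 = glorot_uniform(10, 1), np.zeros(1)

def forward(x):
    h1 = np.tanh(x @ W1 + b1)   # hidden layer 1, tanh
    h2 = np.tanh(h1 @ W2 + b2)  # hidden layer 2, tanh
    return h2 @ W3 + b3         # output node, no activation (identity)

x = np.array([[-1, 0.34, 0, 0, 1, 0, 1, 0]], dtype=np.float64)
y = forward(x)
n_params = sum(a.size for a in (W1, b1, W2, b2, W3, b3))
print(y.shape, n_params)
```

The parameter count is (8 * 10 + 10) + (10 * 10 + 10) + (10 * 1 + 1) = 211 weights and biases.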

For training, I used Adam optimization with an initial learning rate of 0.01 along with a batch size of 10. I used mean squared error for the loss function.
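As a rough illustration of the loss and the batching arithmetic, the sketch below computes mean squared error by hand and counts weight updates per epoch. The predicted values are made up for the example, and mean_squared_error here is a hypothetical stand-in, not the Keras implementation.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # average of squared differences between targets and predictions
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    return float(np.mean((y_true - y_pred) ** 2))

# targets are the first four incomes in the sample data;
# the predictions are invented purely for illustration
targets = [0.2950, 0.5120, 0.7580, 0.4450]
preds = [0.3000, 0.5000, 0.7500, 0.4500]
loss = mean_squared_error(targets, preds)
print("loss = %0.7f" % loss)

# 200 training items with a batch size of 10
n_train, bat_size = 200, 10
updates_per_epoch = n_train // bat_size
print("updates per epoch =", updates_per_epoch)
```

With 200 training items and a batch size of 10, each epoch performs 20 weight updates.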

For regression problems you must define a custom accuracy() function. My accuracy() function counts an income prediction as correct if it’s within 10% of the true income. I implemented two accuracy() functions. The first version iterates through one data item at a time. This is slow but useful to examine results. The second version feeds all data to the model at the same time. This is faster but more opaque.
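The within-10% rule can be sketched in NumPy without a model, working directly on arrays of predictions and true values. The function name accuracy_within and the prediction values are hypothetical. The logic mirrors the all-at-once approach: a prediction counts as correct when its absolute error is less than pct_close times the absolute true value.

```python
import numpy as np

def accuracy_within(preds, actuals, pct_close):
    # fraction of predictions within pct_close of the true value
    preds = np.asarray(preds, dtype=np.float64).reshape(-1)
    actuals = np.asarray(actuals, dtype=np.float64).reshape(-1)
    max_deltas = np.abs(pct_close * actuals)  # max allowed differences
    abs_deltas = np.abs(preds - actuals)      # actual differences
    return float(np.mean(abs_deltas < max_deltas))

# true incomes from the sample data; predictions are made up
actuals = [0.2950, 0.5120, 0.7580, 0.4450]
preds = [0.3100, 0.4500, 0.7600, 0.5000]
acc = accuracy_within(preds, actuals, 0.10)
print("acc = %0.4f" % acc)
```

Here the first and third predictions fall inside the 10% band and the second and fourth fall outside it, so two of the four items count as correct.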

There’s a strong correlation between a person’s job and their income. Here are three people who have interesting jobs.

Left: According to the BBC, Alan Moore is a “writer, wizard, mall Santa, and Rasputin impersonator”. Impressive.

Center: According to the Food Network TV company, Richard Scheuerman is a “shredded cheese authority”. OK.

Right: The BBC broadcast an interview with Andrew Drinkwater, from the “Water Research Centre”. He was meant to have that job.

Demo code. If “lt”, “gt”, “lte”, or “gte” appears in the code, replace it with the corresponding Boolean operator symbol. My lame blog editor chokes on symbols. For the training and test data, see my post at https://jamesmccaffreyblog.com/2022/05/23/regression-employee-income-using-pytorch-1-10-on-windows-11/ where I did the same problem using PyTorch.

# employee_income_tfk.py
# predict income from sex, age, city, job_type
# Keras 2.8.0 in TensorFlow 2.8.0 ("_tfk")
# Anaconda3-2020.02  Python 3.7.6  Windows 10/11

import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'  # suppress CPU warn

import numpy as np
import tensorflow as tf
from tensorflow import keras as K

# -----------------------------------------------------------

class MyLogger(K.callbacks.Callback):
  def __init__(self, n, model, data_x, data_y):
    self.n = n   # print loss every n epochs
    self.model = model
    self.data_x = data_x  # needed to compute accuracy
    self.data_y = data_y
    
  def on_epoch_end(self, epoch, logs=None):
    if epoch % self.n == 0:
      logs = logs or {}  # avoid a mutable default argument
      curr_loss = logs.get('loss')  # avg loss over the epoch
      acc = accuracy_x(self.model, self.data_x,
        self.data_y, 0.10) 
      print("epoch = %4d  |  loss = %0.6f  |  acc = %0.4f" % \
        (epoch, curr_loss, acc))

# -----------------------------------------------------------

def accuracy(model, data_x, data_y, pct_close):
  # item-by-item -- slow -- for debugging
  n_correct = 0; n_wrong = 0
  n = len(data_x)
  for i in range(n):
    x = np.array([data_x[i]])  # [[ x ]]
    predicted = model.predict(x)  
    actual = data_y[i]
    if np.abs(predicted[0][0] - actual) < \
      np.abs(pct_close * actual):
      n_correct += 1
    else:
      n_wrong += 1
  return (n_correct * 1.0) / (n_correct + n_wrong)

# -----------------------------------------------------------

def accuracy_x(model, data_x, data_y, pct_close):
  n = len(data_x)
  oupt = model(data_x)
  oupt = tf.reshape(oupt, [-1])  # 1D
 
  max_deltas = tf.abs(pct_close * data_y)  # max allow deltas
  abs_deltas = tf.abs(oupt - data_y)   # actual differences
  results = abs_deltas < max_deltas    # [True, False, . .]

  n_correct = np.sum(results)
  acc = n_correct / n
  return acc

# -----------------------------------------------------------

def main():
  # 0. prepare
  print("\nBegin Employee predict income using Keras ")
  np.random.seed(1)
  tf.random.set_seed(1)

  # 1. load data
  # sex age   city    income   job_type
  # -1  0.27  0 1 0   0.7610   0 0 1
  # +1  0.19  0 0 1   0.6550   1 0 0

  print("\nLoading Employee data into memory ")
  train_file = ".\\Data\\employee_train.txt"  # 200 lines
  train_x = np.loadtxt(train_file, usecols=[0,1,2,3,4,6,7,8],
    delimiter="\t", comments="#", dtype=np.float32)
  train_y = np.loadtxt(train_file, usecols=5, delimiter="\t",
    comments="#", dtype=np.float32)

  test_file = ".\\Data\\employee_test.txt"  # 40 lines
  test_x = np.loadtxt(test_file, usecols=[0,1,2,3,4,6,7,8],
    delimiter="\t", comments="#", dtype=np.float32)
  test_y = np.loadtxt(test_file, usecols=5, delimiter="\t",
    comments="#", dtype=np.float32)

# -----------------------------------------------------------

  # 2. create network
  print("\nCreating 8-(10-10)-1 neural network ")
  model = K.models.Sequential()
  model.add(K.layers.Dense(units=10, input_dim=8,
    activation='tanh', kernel_initializer='glorot_uniform',
    bias_initializer='zeros'))  # hid1
  model.add(K.layers.Dense(units=10,
    activation='tanh', kernel_initializer='glorot_uniform',
    bias_initializer='zeros'))  # hid2
  model.add(K.layers.Dense(units=1,
    activation=None, kernel_initializer='glorot_uniform',
    bias_initializer='zeros'))    # output layer
  opt = K.optimizers.Adam(learning_rate=0.01)
  model.compile(loss='mean_squared_error',
    optimizer=opt, metrics=['mse'])

# -----------------------------------------------------------

  # 3. train model
  print("\nbat_size = 10 ")
  print("loss = mean_squared_error ")
  print("optimizer = Adam ")
  print("lrn_rate = 0.01 ")

  my_logger = MyLogger(100, model, train_x, train_y) 

  print("\nStarting training ")
  h = model.fit(train_x, train_y, batch_size=10,
    epochs=1000, verbose=0, callbacks=[my_logger])
  print("Done ")

# -----------------------------------------------------------

  # 4. evaluate model
  print("\nComputing model accuracy (within 10% of true) ")
  train_acc = accuracy(model, train_x, train_y, 0.10) 
  print("Accuracy on train data = %0.4f" % train_acc)
  test_acc = accuracy_x(model, test_x, test_y, 0.10) 
  print("Accuracy on test data = %0.4f" % test_acc)

  # 5. use model
  # np.set_printoptions(formatter={'float': '{: 0.6f}'.format})
  print("\nPredicting income for M 34 concord support: ")
  x = np.array([[-1, 0.34, 0,0,1,  0,1,0]], dtype=np.float32)
  pred_inc = model.predict(x)
  print("$%0.2f" % (pred_inc[0][0] * 100_000))  # un-normalized

# -----------------------------------------------------------

  # 6. save model
  print("\nSaving trained model ")
  # model.save_weights(".\\Models\\employee_model_wts.h5")
  # model.save(".\\Models\\employee_model.h5")

# -----------------------------------------------------------

  print("\nEnd Employee income demo")

if __name__=="__main__":
  main()