Neural Network Arctan Activation Function

Neural networks (NNs) are software systems that make predictions. NNs loosely model biological synapses and neurons. A NN requires what’s called a hidden node activation function to compute its output values. The two most common activation functions are the logistic sigmoid (sometimes abbreviated log-sig, log-sigmoid, or just sigmoid) and the hyperbolic tangent (usually abbreviated tanh) functions.

Both the log-sigmoid and tanh functions accept as input any value from negative infinity to positive infinity. The graphs of both functions resemble an S shape. The log-sigmoid returns a value between 0 and 1 (mimicking a neuron firing). The tanh function returns a value between -1 and +1, which isn’t as biologically plausible, but in practice the (-1, +1) range tends to work better.
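To make the two output ranges concrete, here is a minimal Python sketch. The function names log_sigmoid and hyper_tan are just illustrative, and the code makes no attempt to guard against overflow for very large negative inputs.

```python
import math

def log_sigmoid(x):
    # Logistic sigmoid: result lies between 0 and 1.
    # Note: math.exp(-x) overflows for very large negative x; not handled here.
    return 1.0 / (1.0 + math.exp(-x))

def hyper_tan(x):
    # Hyperbolic tangent: result lies between -1 and +1.
    return math.tanh(x)

# Print both activations for a few sample inputs.
for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x = {x:5.1f}  log-sigmoid = {log_sigmoid(x):.4f}  tanh = {hyper_tan(x):.4f}")
```

At x = 0 the log-sigmoid sits at the midpoint of its range (0.5) and tanh sits at the midpoint of its range (0).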

Other activation functions are, in principle, potentially superior to log-sigmoid and tanh. Briefly, functions that have a somewhat flatter S shape have better discriminatory power. The main reason log-sigmoid and tanh are used is that their calculus derivatives, which are needed by the most common training algorithm, back-propagation, are convenient to compute.

In pseudo-code, back-propagation resembles:

set NN weights to random values
loop
  compute output values
  compare computed output values to correct target values
  use activation derivative to update weights
    so outputs are closer to target
end-loop
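
The loop above can be sketched in Python roughly as follows. This is only an illustration under several assumptions that are not part of the pseudo-code: a single hidden layer of tanh nodes, one output node with no activation, squared error, a fixed learning rate, and a tiny made-up training set (the logical OR function). The name train and all of its parameters are hypothetical.

```python
import math
import random

def train(data, n_hidden=3, learn_rate=0.1, max_epochs=2000, seed=0):
    rnd = random.Random(seed)
    n_in = len(data[0][0])

    # set NN weights to random values
    w_ih = [[rnd.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
    b_h = [rnd.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    w_ho = [rnd.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b_o = rnd.uniform(-0.5, 0.5)

    def forward(x):
        hidden = [math.tanh(b_h[j] + sum(x[i] * w_ih[i][j] for i in range(n_in)))
                  for j in range(n_hidden)]
        out = b_o + sum(hidden[j] * w_ho[j] for j in range(n_hidden))
        return hidden, out

    def mean_squared_error():
        return sum((forward(x)[1] - t) ** 2 for x, t in data) / len(data)

    start_error = mean_squared_error()
    for _ in range(max_epochs):                      # loop
        for x, target in data:
            hidden, out = forward(x)                 # compute output values
            out_signal = out - target                # compare to correct target value
            for j in range(n_hidden):
                # use activation derivative: tanh derivative from the hidden output
                hid_signal = out_signal * w_ho[j] * (1.0 - hidden[j]) * (1.0 + hidden[j])
                # update weights so outputs are closer to target
                w_ho[j] -= learn_rate * out_signal * hidden[j]
                for i in range(n_in):
                    w_ih[i][j] -= learn_rate * hid_signal * x[i]
                b_h[j] -= learn_rate * hid_signal
            b_o -= learn_rate * out_signal
    return start_error, mean_squared_error()

# Made-up training data: logical OR of two inputs.
or_data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
before, after = train(or_data)
print(f"error before training = {before:.4f}, after training = {after:.4f}")
```

Notice that the hidden node outputs are computed first, and the tanh derivative is then calculated directly from those outputs.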

The derivative of most functions is expressed in terms of the input value, x. For example, if y = 3x^2, then the derivative is 6x. But the derivatives of log-sigmoid and tanh can, quite surprisingly, be defined in terms of the output, y. The derivative of y = log-sigmoid(x) is y * (1 - y). The derivative of y = tanh(x) is (1 - y) * (1 + y).

So what’s the point? Notice in the pseudo-code above that before you need the derivative of the activation function, you must compute the output values. This makes calculating the derivative of log-sigmoid or tanh very easy because the derivatives can, by an algebraic coincidence, be defined in terms of the output values.
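
One way to convince yourself that these output-based formulas are right is to compare them against a numerical estimate of the derivative. The sketch below does that in Python; the helper numeric_derivative is a hypothetical name and uses an ordinary central-difference approximation.

```python
import math

def log_sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def numeric_derivative(f, x, h=1e-6):
    # Central-difference approximation of the derivative of f at x.
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    y_sig = log_sigmoid(x)
    y_tanh = math.tanh(x)

    # Derivatives computed only from the already-known output values.
    sig_from_output = y_sig * (1.0 - y_sig)
    tanh_from_output = (1.0 - y_tanh) * (1.0 + y_tanh)

    # Derivatives estimated numerically from the input value.
    sig_numeric = numeric_derivative(log_sigmoid, x)
    tanh_numeric = numeric_derivative(math.tanh, x)

    print(f"x = {x:5.2f}  log-sigmoid: {sig_from_output:.6f} vs {sig_numeric:.6f}"
          f"  tanh: {tanh_from_output:.6f} vs {tanh_numeric:.6f}")
```

The two columns for each function should agree to several decimal places.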

I was investigating using the arctan function for hidden node activation. It returns a value between about -1.57 and +1.57 (that is, between -pi/2 and +pi/2) and is flatter than log-sigmoid or tanh. The derivative of arctan(x) is 1 / (1 + x^2). I coded up a demo using arctan activation and it worked well in some situations, not so well in others. The only extra implementation detail was that when computing the output value, I had to save the input value so it could be used later to compute the derivative.
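
The idea of saving the input can be sketched in Python as follows. This is not the demo code mentioned above, just one hypothetical way to arrange it; the class name ArctanNode and its members are made up for illustration.

```python
import math

class ArctanNode:
    """Hypothetical hidden node that uses arctan activation."""

    def __init__(self):
        self.saved_input = 0.0

    def activate(self, x):
        # Save the input (the pre-activation sum) because the derivative
        # of arctan is defined in terms of x, not the output.
        self.saved_input = x
        return math.atan(x)

    def derivative(self):
        # d/dx arctan(x) = 1 / (1 + x^2), using the saved input.
        return 1.0 / (1.0 + self.saved_input ** 2)

node = ArctanNode()
output = node.activate(1.0)
print(f"output = {output:.4f}, derivative = {node.derivative():.4f}")
```

With log-sigmoid or tanh the saved_input field would be unnecessary, because the derivative could be computed from the output alone.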