RIP Microsoft CNTK (2016-2019)

Microsoft CNTK was a code library for creating neural networks. CNTK was intended to be a competitor to the Google TensorFlow library, but both CNTK and TensorFlow were essentially killed off by the more developer-friendly PyTorch library.

I worked with CNTK a lot from 2016-2018 until I switched over to PyTorch. I was running a research program at my very large tech company that was designed to train engineers to add neural technologies to products and services. Even though TensorFlow dominated neural technologies in 2016, we used CNTK because 1.) I didn’t like TensorFlow, 2.) we had direct access to the CNTK development team, led by Frank S. and Sayan P., which gave us great supporting resources for a library in active development, and 3.) PyTorch hadn’t even been released yet, so we didn’t know about it. By early 2017, shortly after its release, PyTorch quickly became the library of choice for neural systems development.

By the way, “CNTK” originally stood for “computational network tool kit” (the original developers didn’t realize that toolkit is one word), but the always hilariously unhelpful marketing people changed the meaning to a backronym: the “Cognitive Toolkit”. I have a dream that one day before I die, I will see an example of a tech marketing department doing something intelligent. I am not optimistic about this dream.

Anyway, I was sitting on a painfully long flight from London to Seattle, mentally reviewing the CNTK days. I had my laptop with me and tried to run an old demo. But, alas, the last release of CNTK (version 2.7) worked only with Python 3.6, and my laptop had Python 3.11 installed, so I couldn’t install CNTK without a lot of effort.



Here is a photo of me during a CNTK training class in early 2018. I’m showing how to stream large data files by adding field names to the data and then defining a function to create a reader object. I had more hair in 2018 than I have now.


But I had fun looking over the source code for an old demo CNTK program. I used the classic Iris dataset, where the goal is to predict the species of an iris flower (0 = setosa, 1 = versicolor, 2 = virginica) from sepal length, sepal width, petal length, and petal width. The raw 150-item Iris data looks like:

5.1  3.5  1.4  0.2  setosa
4.9  3.0  1.4  0.2  setosa
. . .
7.0  3.2  4.7  1.4  versicolor
6.4  3.2  4.5  1.5  versicolor
. . .
5.8  2.7  5.1  1.9  virginica
7.1  3.0  5.9  2.1  virginica

An interesting feature of CNTK is that it supports optional streaming of very large data files that won’t fit into memory. To enable streaming, you add field names to the data like so:

|attribs 5.1 3.5 1.4 0.2 |species 1 0 0
|attribs 4.9 3.0 1.4 0.2 |species 1 0 0
. . .
|attribs 7.0 3.2 4.7 1.4 |species 0 1 0
|attribs 6.4 3.2 4.5 1.5 |species 0 1 0
. . .
|attribs 5.8 2.7 5.1 1.9 |species 0 0 1
|attribs 7.1 3.0 5.9 2.1 |species 0 0 1

The “|attribs” tag tells CNTK where the predictors are and the “|species” tag tells it where the one-hot encoded target values are. Because I already had data files with field names, I used that data; however, my demo used a simple load-all approach that ignores the field names. Put another way, CNTK supported basic NumPy-style data and also field-augmented data for streaming huge data files.
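Converting raw data to the field-augmented format is mechanical. Here is a minimal sketch in plain Python; the helper name to_ctf() and the species-to-one-hot mapping are my own, not part of CNTK:

```python
# sketch: convert a raw Iris row to CNTK's field-augmented format
# species order (setosa, versicolor, virginica) matches the demo
one_hot = {"setosa": "1 0 0",
           "versicolor": "0 1 0",
           "virginica": "0 0 1"}

def to_ctf(raw_line):
  # raw_line resembles "5.1  3.5  1.4  0.2  setosa"
  *vals, species = raw_line.split()
  return "|attribs " + " ".join(vals) + \
    " |species " + one_hot[species]

print(to_ctf("5.1  3.5  1.4  0.2  setosa"))
# |attribs 5.1 3.5 1.4 0.2 |species 1 0 0
```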

The old demo program begins with:

# iris_nn_cntk.py
# CNTK 2.4 - Anaconda 4.1.1 (Python 3.5, NumPy 1.11.1)

# data resembles:
# |attribs 5.1 3.5 1.4 0.2 |species 1 0 0
# . . .

import numpy as np
import cntk as C

def main():
  print("\nBegin Iris classification demo using CNTK  \n")
  input_dim = 4
  hidden_dim = 5
  output_dim = 3
  train_file = ".\\Data\\iris_train_cntk.txt"
  test_file = ".\\Data\\iris_test_cntk.txt"
. . .

These statements should be self-explanatory if you are familiar with PyTorch. The next statements in the demo are:

  train_x = np.loadtxt(train_file, delimiter=" ",
    usecols=[1,2,3,4], dtype=np.float32)
  train_y = np.loadtxt(train_file, delimiter=" ",
    usecols=[6,7,8], dtype=np.float32)
  test_x = np.loadtxt(test_file, delimiter=" ",
    usecols=[1,2,3,4], dtype=np.float32)
  test_y = np.loadtxt(test_file, delimiter=" ",
    usecols=[6,7,8], dtype=np.float32)

  X = C.ops.input_variable(input_dim, np.float32)
  Y = C.ops.input_variable(output_dim, np.float32)
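As an aside, the column-skipping behavior of loadtxt() is easy to verify without CNTK installed: when usecols is specified, NumPy never even tries to convert the skipped “|attribs” and “|species” tokens. The in-memory StringIO stand-in for a data file is my own:

```python
import io
import numpy as np

# stand-in for a field-augmented data file (single-space delimited)
data = io.StringIO(
  "|attribs 5.1 3.5 1.4 0.2 |species 1 0 0\n"
  "|attribs 7.0 3.2 4.7 1.4 |species 0 1 0\n")

x = np.loadtxt(data, delimiter=" ",
  usecols=[1,2,3,4], dtype=np.float32)  # predictors only
data.seek(0)  # rewind to read the targets
y = np.loadtxt(data, delimiter=" ",
  usecols=[6,7,8], dtype=np.float32)    # one-hot targets only

print(x.shape)  # (2, 4)
print(y.shape)  # (2, 3)
```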

The loadtxt() function strips away the field names. Another interesting quirk of CNTK is that you define an explicit input variable for the predictor values and another for the target values. The next statements define a neural network multi-class classifier:

  print("Creating a 4-5-3 tanh softmax NN for Iris data ") 
  with C.layers.default_options(init=\
    C.initializer.uniform(scale=0.01, seed=1)):
    h_layer = C.layers.Dense(hidden_dim, \
      activation=C.ops.tanh, name='hidLayer')(X)  
    o_layer = C.layers.Dense(output_dim, \
      activation=None, name='outLayer')(h_layer)
  nnet = o_layer
  model = C.ops.softmax(nnet)

These statements are a bit tricky. The Dense class is analogous to the PyTorch fully-connected Linear layer. The network uses tanh activation on the hidden layer but no explicit softmax() activation on the output layer because softmax() will be applied automatically during training. This means that the network emits logit values rather than pseudo-probability values. The nnet object is the real network, and the model object is a reference copy that applies explicit softmax() and so it can be used to make predictions where the outputs are pseudo-probabilities. These details took me a long time to figure out when I was new to CNTK.
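The logits-versus-pseudo-probabilities distinction is easy to see outside of CNTK. Here is a NumPy sketch of what the explicit softmax() applied by the model object does; the example logit values are made up:

```python
import numpy as np

def softmax(logits):
  # subtract the max logit for numerical stability, then normalize
  z = np.exp(logits - np.max(logits))
  return z / np.sum(z)

logits = np.array([1.5, 3.0, 0.5])  # what nnet emits
probs = softmax(logits)             # what model emits
print(np.round(probs, 4))  # non-negative, sums to 1.0
```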

The next statements prepare training:

  print("Creating a cross entropy batch=1 Trainer \n")
  tr_loss = C.cross_entropy_with_softmax(nnet, Y)
  tr_clas = C.classification_error(nnet, Y)
 
  learn_rate = 0.01 
  learner = C.sgd(nnet.parameters, learn_rate)
  trainer = C.Trainer(nnet, (tr_loss, tr_clas), [learner])
  
  max_iter = 2000  # maximum training iterations
  np.random.seed(1)
  N = len(train_x)
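A NumPy equivalent of the loss computation above, for a single training item, looks like this (the logit and target values are made up, not from the demo):

```python
import numpy as np

def cross_entropy_with_softmax(logits, one_hot_target):
  # softmax the logits into pseudo-probabilities, then take
  # the negative log-likelihood of the target class
  z = np.exp(logits - np.max(logits))
  probs = z / np.sum(z)
  return -np.sum(one_hot_target * np.log(probs))

logits = np.array([1.5, 3.0, 0.5])
target = np.array([0.0, 1.0, 0.0])  # correct class is index 1
print(cross_entropy_with_softmax(logits, target))
```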

I’m not sure why I used a batch size of 1 in my old demo. I must have been exploring the effect of different batch sizes. Notice that the name of the cross_entropy_with_softmax() function suggests it automatically applies softmax() to the logits, which it does. The statements that perform training are:

  print("Starting (online) training \n")
  for i in range(0, int(max_iter)):
    rnd_row = np.random.choice(N,1)
    trainer.train_minibatch({X:train_x[rnd_row], \
      Y:train_y[rnd_row]})
    if i % 200 == 0:
      mcee = trainer.previous_minibatch_loss_average
      macc = \
        (1.0 - trainer.previous_minibatch_evaluation_average) \
        * 100
      print("batch %4d: mean loss = %0.4f, accuracy curr item \
        = %0.2f%% " % (i,mcee, macc))
  print("\nTraining complete")

This code is very different from PyTorch training code. I use a primitive form of training by selecting a single random row of training data and passing it to the train_minibatch() method of the trainer object. The demo program concludes with:

. . .
  acc = (1.0 - trainer.test_minibatch({X:test_x, \
    Y:test_y })) * 100
  print("Classification accuracy on the 30 test \
    items = %0.2f%%" % acc)

  # mp = ".\\Models\\iris_nn.model"  # path to model file
  # model.save(mp, format=C.ModelFormat.CNTKv2)  # or ONNX

  np.set_printoptions(precision = 4)
  unknown = np.array([[6.4, 3.2, 4.5, 1.5]], \
    dtype=np.float32) 
  print("\nPredicting Iris species for input features: ")
  print(unknown[0])
 
  pred_prob = model.eval(unknown)  # simple form works
  print("Prediction probabilities are: ")
  print(pred_prob[0])
   
if __name__ == "__main__":
  main()

Unlike PyTorch, where you must define a custom accuracy() function, CNTK allows you to compute accuracy using the built-in test_minibatch() method. CNTK had a couple of ways to save a trained model (native and ONNX). To make a prediction, the demo uses the model copy of nnet so that the three output values are pseudo-probabilities like (0.2000, 0.7000, 0.1000) that correspond to the three possible targets (setosa, versicolor, virginica).
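The accuracy that test_minibatch() reports is conceptually just argmax matching between the model outputs and the one-hot targets. A NumPy sketch with made-up values:

```python
import numpy as np

def accuracy(pred_probs, one_hot_targets):
  # a prediction is correct when the largest pseudo-probability
  # lines up with the 1 in the one-hot target
  hits = np.argmax(pred_probs, axis=1) == \
    np.argmax(one_hot_targets, axis=1)
  return np.mean(hits) * 100

probs = np.array([[0.2, 0.7, 0.1],
                  [0.8, 0.1, 0.1],
                  [0.3, 0.3, 0.4]])
targets = np.array([[0, 1, 0],
                    [1, 0, 0],
                    [0, 1, 0]])  # third item is misclassified
print("%0.2f%%" % accuracy(probs, targets))  # 66.67%
```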

Because developing neural systems is so complex, a developer can realistically learn only one library. In the beginning, TensorFlow dominated because it was first-to-market. PyTorch and CNTK were developed and released at almost the exact same time, but only one could gain dominance. I subjectively rate the early versions of PyTorch and CNTK as nearly equal (both significantly superior to TensorFlow from an ease-of-learning point of view) but PyTorch quickly gained a massive market share and Microsoft pulled the plug on CNTK in 2019.

It was a nice stroll down neural memory lane on a flight from London to Seattle.



I was thinking about CNTK on a return trip from Croatia. Left: I was on a cruise on a small 18-cabin 36-person yacht named Lastavica (“swallow bird” in Croatian). My cabin is circled in red. The ship went from the city of Split, to several other interesting towns on the Adriatic/Dalmatian coast, ending at Dubrovnik. Right: My wife, who is an excellent photographer, took this picture. I was fascinated to observe that there were approximately 40 ships cruising the coast that all had nearly identical dimensions. I hypothesize that only one ship design is optimal for Croatia coast cruising, and this one design eliminated all others. The nearly-identical size allows multiple ships to dock together in very limited port space (where no large cruise ship could go) and lets passengers walk from ship to ship without any large gaps to jump over.

So, yes, while the other passengers were taking winery tours and buying souvenirs on shore, I was mentally analyzing ship architecture. I’m a non-social person by nature, but on the cruise I did meet four people whose company I really enjoyed — very rare for me. Smart, tall and striking-in-appearance Michelle R., her super-personable and perceptive daughter Sarah R., modern day hippie (and I mean this in a very complimentary way) Julie S., and avid golfer and outdoorswoman Sam G. All four of them, plus the two of us, had the same sense of humor, which for me is the one thing that matters the most. Our camaraderie was the best part of the trip.

