Data Clustering Using a Self-Organizing Map (SOM) from Scratch with C#

Data clustering is a classical machine learning technique that groups similar data items together. By far the most common clustering technique is k-means clustering. Three other more-or-less common clustering techniques are DBSCAN ("density-based spatial clustering of applications with noise"), Gaussian mixture model clustering, and spectral clustering.

A fifth clustering technique is self-organizing map (SOM) clustering. I put together an implementation of SOM clustering, from scratch, using the C# language.

Compared to other techniques, SOM clustering constructs clusters so that clusters with similar IDs are closer to each other than clusters with dissimilar IDs. For example, suppose you set k = 5, so that each data item falls into one of clusters 0, 1, 2, 3, 4. Data items in clusters 0 and 1 are relatively closer to each other than data items in clusters 0 and 3.

Self-organizing maps can be used for several purposes other than clustering. A standard, non-clustering SOM creates a square matrix, such as 4-by-4, called the map. Each cell of the SOM map holds a vector that represents a cluster. For a clustering SOM, even though you could create a square matrix map, it makes more sense to create a 1-by-k matrix/vector map.
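In code, the 1-by-k map can be held as an array-of-arrays-of-arrays. A minimal sketch, simplified from the demo program's constructor (the variable names here are illustrative):

```csharp
// sketch: a 1-by-k SOM map where each of the k cells holds
// a vector with dim values, initialized to random [0, 1)
int k = 3, dim = 4;
Random rnd = new Random(0);
double[][][] map = new double[1][][];  // 1 row
map[0] = new double[k][];              // k columns
for (int j = 0; j < k; ++j)
{
  map[0][j] = new double[dim];
  for (int d = 0; d < dim; ++d)
    map[0][j][d] = rnd.NextDouble();
}
```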

For my demo program, I used a small 12-item subset of the Penguin dataset. The data is:

0, 39.5, 17.4, 186, 3800
0, 40.3, 18.0, 195, 3250
0, 36.7, 19.3, 193, 3450
0, 38.9, 17.8, 181, 3625
1, 46.5, 17.9, 192, 3500
1, 45.4, 18.7, 188, 3525
1, 45.2, 17.8, 198, 3950
1, 46.1, 18.2, 178, 3250
2, 46.1, 13.2, 211, 4500
2, 48.7, 14.1, 210, 4450
2, 46.5, 13.5, 210, 4550
2, 45.4, 14.6, 211, 4800

Each line represents a penguin. The columns are species (0 = Adelie, 1 = Chinstrap, 2 = Gentoo), bill length, bill depth, flipper length, and body mass. The full Penguin dataset has 344 items, of which 333 have no missing values.

The demo does not use the species labels, but notice that the data is artificially organized so that it’s already clustered.

The key calling statements in my demo are:

// 1. load data
string fn = "..\\..\\..\\Data\\penguin_12.txt";
double[][] X = ClusterSOM.MatLoad(fn,
  new int[] { 1, 2, 3, 4 }, ',', "#"); // no labels

// 2. standardize data
double[][] stdX = ClusterSOM.MatStandard(X);

// 3. create ClusterSOM object and cluster
int k = 3;  // number clusters
double lrnRateMax = 0.50;
int stepsMax = 1000;
ClusterSOM som = new ClusterSOM(stdX, k, seed: 0);
som.Cluster(lrnRateMax, stepsMax);

// 4. get and display clustering result
int[] clustering = som.GetClustering();
ClusterSOM.VecShow(clustering, wid: 3);

The clustering result is:

  2  2  2  2  1  1  1  1  0  0  0  0

Data items in clusters 2 and 1 are close. Data items in clusters 1 and 0 are close. Data items in clusters 2 and 0 are relatively farther apart.

SOM clustering is an iterative process that requires a maximum learning rate (the learning rate is decreased during SOM map construction) and a maximum number of iteration steps.
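Both the learning rate and the neighborhood range decay linearly as training progresses. The decay scheme, extracted from the demo's Cluster() method, looks like:

```csharp
// sketch: linear decay of learning rate and neighborhood range
double lrnRateMax = 0.50;
int stepsMax = 1000;
int rangeMax = 1 + 3;  // nRows + nCols for a 1-by-3 map
for (int step = 0; step < stepsMax; ++step)
{
  double pctLeft = 1.0 - ((step * 1.0) / stepsMax);
  double currLrnRate = pctLeft * lrnRateMax;  // 0.50 down toward 0.0
  int currRange = (int)(pctLeft * rangeMax);  // 4 down to 0
  // ... pick a random item, find its closest node (the "BMU"),
  // and move all nodes within currRange toward the item
}
```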

The source data should be normalized/standardized so that columns with large magnitudes, like body mass, don’t overwhelm columns with small magnitudes, like bill length. The standardized data looks like:

-1.1657   0.3300  -0.8774  -0.1661
-0.9475   0.6163  -0.0943  -1.2103
-1.9291   1.2366  -0.2683  -0.8306
. . .

The penguin species labels are not used.

The demo 1-by-3 SOM map has three cells, one for each cluster. Each cell has a representative vector that you can think of as a central vector of the data items assigned to the cluster. For the demo data, the constructed SOM map is:

k = 0:    0.8296  -1.4315   1.2489   1.2322
k = 1:    0.5188   0.7038  -0.4929  -0.5406
k = 2:   -1.3109   0.6217  -0.6869  -0.6074

Each data item is assigned to the cluster that has the closest map node:

k = 0:    8 9 10 11
k = 1:    4 5 6 7
k = 2:    0 1 2 3

This means data items 8, 9, 10, 11 are assigned to cluster 0 because those items are closest to the map node 0 vector: (0.8296 -1.4315 1.2489 1.2322).
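Finding the closest map node is a straightforward argmin over Euclidean distances. A simplified one-row sketch of the demo's ClosestNode() method (the Closest name is illustrative):

```csharp
// sketch: return the index of the map node closest to item
static int Closest(double[] item, double[][] nodes)
{
  int best = 0;
  double smallest = double.MaxValue;
  for (int j = 0; j < nodes.Length; ++j)
  {
    double sum = 0.0;
    for (int d = 0; d < item.Length; ++d)
      sum += (item[d] - nodes[j][d]) * (item[d] - nodes[j][d]);
    double dist = Math.Sqrt(sum);
    if (dist < smallest) { smallest = dist; best = j; }
  }
  return best;
}
```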

The demo program iterates 1000 steps. This value was determined by preliminary trial and error: during map creation, the demo monitors the sum of the Euclidean distances (SED) between each data item and its closest map node. When the SED stops decreasing significantly (at about step 1000 for the demo data), that's a sign that the map build steps can stop.
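The SED monitoring can be sketched as follows (simplified from the progress-display code inside the demo's Cluster() method; the SED function name is illustrative):

```csharp
// sketch: sum of Euclidean distances between each data item
// and its closest map node; smaller means a better map
static double SED(double[][] data, double[][] nodes)
{
  double sum = 0.0;
  foreach (double[] item in data)
  {
    double best = double.MaxValue;
    foreach (double[] node in nodes)
    {
      double s = 0.0;
      for (int d = 0; d < item.Length; ++d)
        s += (item[d] - node[d]) * (item[d] - node[d]);
      double dist = Math.Sqrt(s);
      if (dist < best) best = dist;
    }
    sum += best;  // distance to closest node
  }
  return sum;
}
```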

The demo program computes the distance between each pair of map node vectors:

[0]  0.00  3.29  3.99
[1]  3.29  0.00  1.84
[2]  3.99  1.84  0.00

The largest distance is between cluster 0 and cluster 2 (3.99), as expected.

I don’t use SOM clustering very often. When I do use it, I typically use it as a check against k-means or DBSCAN clustering.



Most data clustering techniques are unsupervised / automatic. I've always been fascinated by old, coin-operated mechanical automata. You'd drop a penny or a nickel into the machine and watch the action. Many early coin-operated automata were rather morbid. Left: In the "St. Dennistoun Mortuary" automaton, circa 1900, the doors of the mortuary open to reveal bodies. The doctors examine the bodies while the two women outside dab their tears with a handkerchief. Right: In "American Execution", circa 1920, the condemned man in the electric chair has a metal cap lowered onto his head, the switch is thrown, and the condemned man's head lights up as he's electrocuted.


Demo program. The complete source code is presented below.

using System;
using System.Collections.Generic;
using System.IO;

namespace ClusterSOM
{
  class ClusterSOMProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("\nBegin self-organizing" +
        " map (SOM) clustering using C#");

      // 1. load data
      Console.WriteLine("\nLoading 12-item Penguin subset ");
      string fn = "..\\..\\..\\Data\\penguin_12.txt";
      double[][] X = ClusterSOM.MatLoad(fn,
        new int[] { 1, 2, 3, 4 }, ',', "#");
      Console.WriteLine("\nX: ");
      ClusterSOM.MatShow(X, 1, 8, true);

      //int[] y = ClusterSOM.VecLoad(fn, 0, "#");
      //Console.WriteLine("\ny labels: ");
      //ClusterSOM.VecShow(y, 3);

      // 2. standardize data
      double[][] stdX = ClusterSOM.MatStandard(X);
      Console.WriteLine("\nStandardized data: ");
      ClusterSOM.MatShow(stdX, 4, 9, true);

      // 3. create ClusterSOM object and cluster
      int k = 3;
      double lrnRateMax = 0.50;
      int stepsMax = 1000;
      Console.WriteLine("\nSetting num clusters k = " + k);
      Console.WriteLine("Setting  lrnRateMax = " +
        lrnRateMax.ToString("F2"));
      Console.WriteLine("Setting stepsMax = " + stepsMax);

      Console.WriteLine("\nComputing SOM clustering ");
      ClusterSOM som = new ClusterSOM(stdX, k, seed: 0);
      som.Cluster(lrnRateMax, stepsMax);
      Console.WriteLine("Done ");

      // 4. show the SOM
      Console.WriteLine("\nSOM map nodes: ");
      for (int kk = 0; kk < k; ++kk)
      {
        Console.Write("k = " + kk + ": ");
        ClusterSOM.VecShow(som.map[0][kk], 4, 9);
      }

      Console.WriteLine("\nSOM mapping: ");
      for (int kk = 0; kk < k; ++kk)
      {
        Console.Write("k = " + kk + ":    ");
        ClusterSOM.ListShow(som.mapping[0][kk]);
      }

      double[][] betweenDists = som.GetBetweenNodeDists();
      Console.WriteLine("\nBetween map node distances: ");
      ClusterSOM.MatShow(betweenDists, 2, 6, true);

      // 5. show clustering result
      Console.WriteLine("\nclustering: ");
      int[] clustering = som.GetClustering();
      ClusterSOM.VecShow(clustering, wid: 3);

      Console.WriteLine("\nEnd SOM clustering demo");
      Console.ReadLine();
    } // Main

    // ------------------------------------------------------

  } // Program

  public class ClusterSOM
  {
    public int k;  // number clusters
    public double[][] data;  // data to cluster
    public double[][][] map;  // [r][c][vec]
    public List<int>[][] mapping;  // [r][c](List indices)
    public Random rnd;  // to initialize map cells
 
    // ------------------------------------------------------

    public ClusterSOM(double[][] X, int k, int seed)
    {
      int nRows = 1; // for map
      int nCols = k; // for map
      this.k = k;
      //this.dim = X[0].Length;
      int dim = X[0].Length;

      this.rnd = new Random(seed);
      this.data = X;
      //this.n = X.Length;

      // map is 1-by-k matrix
      this.map =
        new double[nRows][][];  // [r][c][vec]
      for (int i = 0; i < nRows; ++i)
      {
        this.map[i] = new double[nCols][];
        for (int j = 0; j < nCols; ++j)
        {
          this.map[i][j] = new double[dim];
          for (int d = 0; d < dim; ++d)
            this.map[i][j][d] = this.rnd.NextDouble();
        }
      }

      this.mapping =
        new List<int>[nRows][]; // [r][c][lst]
      for (int i = 0; i < nRows; ++i)
      {
        this.mapping[i] = new List<int>[nCols];
        for (int j = 0; j < nCols; ++j)
          this.mapping[i][j] = new List<int>();
      }
      }
    } // ctor

    // ------------------------------------------------------

    public void Cluster(double lrnRateMax, int stepsMax)
    {
      int n = this.data.Length;
      int dim = this.data[0].Length;
      int nRows = 1; // for map
      int nCols = this.k; // for map
      int rangeMax = nRows + nCols;

      // compute the map
      for (int step = 0; step < stepsMax; ++step)
      {
        // show progress 5 + 1 times
        if (step % (int)(stepsMax / 5) == 0)
        {
          Console.Write("map build step = " +
            step.ToString().PadLeft(4));
          double sum = 0.0;  // sum squared dists
          for (int ix = 0; ix < n; ++ix)
          {
            int[] RC = ClosestNode(ix);
            int kk = RC[1];  // RC[0] is always 0
            double[] item = this.data[ix];
            double[] node = this.map[0][kk];
            double dist = EucDist(item, node);
            sum += dist;  // sum Euclidean distances
          }
          Console.WriteLine("  |  SED = " +
            sum.ToString("F4").PadLeft(9));
        }

        double pctLeft = 1.0 - ((step * 1.0) / stepsMax);
        int currRange = (int)(pctLeft * rangeMax);
        double currLrnRate = pctLeft * lrnRateMax;
        // Pick random data index
        int idx = this.rnd.Next(0, n);
        // Get (row,col) of closest map node -- 'bmu'
        int[] bmuRC = ClosestNode(idx);
        // Move each map mode closer to the bmu
        for (int i = 0; i < nRows; ++i)
        {
          for (int j = 0; j < nCols; ++j)
          {
            if (ManDist(bmuRC[0],
              bmuRC[1], i, j) <= currRange)
            {
              for (int d = 0; d < dim; ++d)
                this.map[i][j][d] = this.map[i][j][d] +
                  currLrnRate * (this.data[idx][d] -
                  this.map[i][j][d]);
            }
          } // j
        } // i
      } // step
      // map has been created

      // compute mapping
      for (int idx = 0; idx < n; ++idx)
      {
        // node map coords of node closest to data(idx)
        int[] rc = ClosestNode(idx);
        int r = rc[0]; int c = rc[1];
        this.mapping[r][c].Add(idx);
      }

      // results in this.mapping
      return;
    } // Cluster()

    // ------------------------------------------------------

    public int[] GetClustering()
    {
      // cluster ID for every data item
      int n = this.data.Length;
      int[] result = new int[n];  // ID for each item
      for (int kk = 0; kk < this.k; ++kk)
      {
        for (int i = 0; i < this.mapping[0][kk].Count; ++i)
        {
          int dataIdx = this.mapping[0][kk][i];
          result[dataIdx] = kk;
        }
      }
      return result;
    }

    public double[][] GetBetweenNodeDists()
    {
      // between map node distances
      double[][] result = new double[this.k][];
      for (int kk = 0; kk < this.k; ++kk)
        result[kk] = new double[this.k];
      for (int i = 0; i < this.k; ++i)
      {
        for (int j = i; j < this.k; ++j)
        {
          double dist =
            EucDist(this.map[0][i], this.map[0][j]);
          result[i][j] = dist;
          result[j][i] = dist;
        }
      }
      return result;
    }

    // ------------------------------------------------------
    // private helper functions
    // ------------------------------------------------------

    private static int ManDist(int r1, int c1,
      int r2, int c2)
    {
      // Manhattan distance between two SOM map cells
      return Math.Abs(r1 - r2) + Math.Abs(c1 - c2);
    }

    // ------------------------------------------------------

    private static double EucDist(double[] v1,
      double[] v2)
    {
      // Euclidean distance between two data items
      double sum = 0;
      for (int i = 0; i < v1.Length; ++i)
        sum += (v1[i] - v2[i]) * (v1[i] - v2[i]);
      return Math.Sqrt(sum);
    }

    // ------------------------------------------------------

    private int[] ClosestNode(int idx)
    {
      // r,c coords in map of node closest to data[idx]
      double smallDist = double.MaxValue;
      int[] result = new int[] { 0, 0 };  // (row, col)
      for (int i = 0; i < this.map.Length; ++i)
      {
        for (int j = 0; j < this.map[0].Length; ++j)
        {
          double dist = EucDist(this.data[idx],
            this.map[i][j]);
          if (dist < smallDist)
          {
            smallDist = dist;
            result[0] = i;
            result[1] = j;
          }
        }
      }
      return result;
    }

    // ------------------------------------------------------

    // misc. public utility functions for convenience
    // MatLoad, VecLoad, MatShow, VecShow, ListShow,
    // MatStandard

    public static double[][] MatLoad(string fn,
      int[] usecols, char sep, string comment)
    {
      // count number of non-comment lines
      int nRows = 0;
      string line = "";
      FileStream ifs = new FileStream(fn, FileMode.Open);
      StreamReader sr = new StreamReader(ifs);
      while ((line = sr.ReadLine()) != null)
        if (line.StartsWith(comment) == false)
          ++nRows;
      sr.Close(); ifs.Close();

      // make result matrix
      int nCols = usecols.Length;
      double[][] result = new double[nRows][];
      for (int r = 0; r < nRows; ++r)
        result[r] = new double[nCols];

      line = "";
      string[] tokens = null;
      ifs = new FileStream(fn, FileMode.Open);
      sr = new StreamReader(ifs);

      int i = 0;
      while ((line = sr.ReadLine()) != null)
      {
        if (line.StartsWith(comment) == true)
          continue;
        tokens = line.Split(sep);
        for (int j = 0; j < nCols; ++j)
        {
          int k = usecols[j];  // into tokens
          result[i][j] = double.Parse(tokens[k]);
        }
        ++i;
      }
      sr.Close(); ifs.Close();
      return result;
    }

    // ------------------------------------------------------

    public static int[] VecLoad(string fn, int usecol,
      string comment)
    {
      char dummySep = ',';
      double[][] tmp = MatLoad(fn, new int[] { usecol },
        dummySep, comment);
      int n = tmp.Length;
      int[] result = new int[n];
      for (int i = 0; i < n; ++i)
        result[i] = (int)tmp[i][0];
      return result;
    }

    // ------------------------------------------------------

    public static void MatShow(double[][] M, int dec,
      int wid, bool showIndices)
    {
      double small = 1.0 / Math.Pow(10, dec);
      for (int i = 0; i < M.Length; ++i)
      {
        if (showIndices == true)
        {
          int pad = M.Length.ToString().Length;
          Console.Write("[" + i.ToString().
            PadLeft(pad) + "]");
        }
        for (int j = 0; j < M[0].Length; ++j)
        {
          double v = M[i][j];
          if (Math.Abs(v) < small) v = 0.0;
          Console.Write(v.ToString("F" + dec).
            PadLeft(wid));
        }
        Console.WriteLine("");
      }
    }

    // ------------------------------------------------------

    public static void VecShow(int[] vec, int wid)
    {
      int n = vec.Length;
      for (int i = 0; i < n; ++i)
        Console.Write(vec[i].ToString().PadLeft(wid));
      Console.WriteLine("");
    }

    // ------------------------------------------------------

    public static void VecShow(double[] vec, int decimals,
      int wid)
    {
      int n = vec.Length;
      for (int i = 0; i < n; ++i)
        Console.Write(vec[i].ToString("F" + decimals).
          PadLeft(wid));
      Console.WriteLine("");
    }

    // ------------------------------------------------------

    public static void ListShow(List<int> lst)
    {
      int n = lst.Count;
      for (int i = 0; i < n; ++i)
      {
        Console.Write(lst[i] + " ");
      }
      Console.WriteLine("");
    }

    // ------------------------------------------------------

    public static double[][] MatStandard(double[][] data)
    //public static double[][] MatStandard(double[][] data,
    //  out double[] means, out double[] stds)
    {
      // scikit style z-score biased normalization
      int nRows = data.Length;
      int nCols = data[0].Length;

      // make result matrix
      double[][] result = new double[nRows][];
      for (int r = 0; r < nRows; ++r)
        result[r] = new double[nCols];

      // compute means
      double[] mns = new double[nCols];
      for (int j = 0; j < nCols; ++j)
      {
        double sum = 0.0;
        for (int i = 0; i < nRows; ++i)
          sum += data[i][j];
        mns[j] = sum / nRows;
      } // j

      // compute std devs
      double[] sds = new double[nCols];
      for (int j = 0; j < nCols; ++j)
      {
        double sum = 0.0;
        for (int i = 0; i < nRows; ++i)
          sum += (data[i][j] - mns[j]) *
            (data[i][j] - mns[j]);
        sds[j] = Math.Sqrt(sum / nRows);  // biased
      } // j

      // normalize
      for (int j = 0; j < nCols; ++j)
      {
        for (int i = 0; i < nRows; ++i)
          result[i][j] =
            (data[i][j] - mns[j]) / sds[j];
      } // j

      //means = mns;
      //stds = sds;

      return result;
    }

    // ------------------------------------------------------

  } // class ClusterSOM

} // ns