The Data Science Lab
Poisson Regression Using C#
Dr. James McCaffrey from Microsoft Research presents a complete end-to-end demo of Poisson regression, where the goal is to predict a count of things arriving, such as the number of telephone calls received in a 10-minute interval at a call center. When your source data is close to mathematically Poisson distributed, Poisson regression is simple and effective.
The goal of a machine learning regression problem is to predict a single numeric value. Poisson regression is a specific technique that can be used when the problem data is approximately Poisson distributed. The most common scenario where Poisson distributed data arises is when the variable to predict is a count of "things arriving," for example, the number of cars that arrive at a fast-food restaurant drive-through window, every Tuesday between 9:00 a.m. and 9:10 a.m.
This article presents a complete demo of Poisson regression using the C# language. There are many code libraries that contain implementations of Poisson regression, such as the Python scikit-learn library PoissonRegressor module. However, implementing Poisson regression from scratch using C# allows you to customize your prediction system to meet specific problem scenarios, and easily integrate your prediction system with other .NET systems.
[Click on image for larger view.] Figure 1: Poisson Regression in Action
A good way to see where this article is headed is to take a look at the screenshot in Figure 1. The demo program begins by loading a set of synthetic data into memory. The data looks like:
-0.8153, -0.6275, -0.3089, -0.2065, 7
0.3409, -0.1654, 0.1174, -0.7192, 1
0.7892, -0.8299, -0.9219, -0.6603, 1
-0.8033, -0.1578, 0.9158, 0.0663, 2
-0.3690, 0.3730, 0.6693, -0.9634, 0
. . .
The first four values on each line are predictor variables, and the last value is the target count to predict. The data was programmatically generated. For the cars-arriving example, you can imagine that the predictor variables are things like weather temperature (scaled between -1 and +1), humidity, and so on. There are 200 training items and 40 test items.
The demo creates and trains a Poisson regression model. The key output is:
learn rate = 0.0010
maxEpochs = 150
weight decay alpha = 0.000010
Start
epoch = 0 RMSE = 2.8405 acc = 0.2650
epoch = 30 RMSE = 1.0541 acc = 0.4000
epoch = 60 RMSE = 0.5313 acc = 0.7050
epoch = 90 RMSE = 0.3459 acc = 0.8700
epoch = 120 RMSE = 0.2940 acc = 0.9400
Done
Training a Poisson regression model is an iterative process. You can see that the RMSE (root mean squared error) decreases and the prediction accuracy increases, which indicate that training is working as expected.
A Poisson regression model results in one so-called constant (also known as the intercept) and one coefficient per predictor:
Model constant and coefficients:
1.9900
-0.9760 1.4856 -1.9587 2.4644
The demo evaluates the prediction accuracy of the trained model:
Computing model accuracy
Accuracy on train = 0.9750
Accuracy on test = 1.0000
The model scores 0.9750 accuracy on the training data (195 out of 200 correct), and 1.0000 accuracy on the test data (40 out of 40 correct).
The demo concludes by using the trained model to predict the arrival count for the first training data item, x = (-0.8153, -0.6275, -0.3089, -0.2065):
Predicting for x =
-0.8153 -0.6275 -0.3089 -0.2065
y = 7.0263
y = 7
The predicted count is y = 7.0263, and when rounded to an integer is y = 7, which matches the correct target count in the training data.
This article assumes you have intermediate or better programming skill but doesn't assume you know anything about Poisson regression. The demo is implemented using C# but you should be able to refactor the demo code to another C-family language if you wish. All normal error checking has been removed to keep the main ideas as clear as possible.
The source code for the demo program is too long to be presented in its entirety in this article. The complete code and data are available in the accompanying file download, and they're also available online.
Understanding Poisson Regression
Poisson regression is perhaps best understood by examining a concrete example. Suppose you have the trained Poisson regression model shown in Figure 1 with intercept/constant = b = 1.99 and coefficients = (c0, c1, c2, c3) = (-0.97, 1.48, -1.95, 2.46). To compute the y = predicted count for input (x0, x1, x2, x3) = (-0.81, -0.62, -0.30, -0.20), the calculations are:
y = exp((x0 * c0) + (x1 * c1) + (x2 * c2) + (x3 * c3) + b)
= exp((-0.81 * -0.97) +
(-0.62 * 1.48) +
(-0.30 * -1.95) +
(-0.20 * 2.46) + 1.99)
= exp(0.78 + -0.91 + 0.58 + -0.49 + 1.99)
= exp(1.95)
= 7.0263
Because the predicted y is an integer count, rounding the preliminary predicted y to the nearest integer gives a final predicted y = 7.
In words, a predicted y is the exp() function applied to the sum of the products of coefficients times inputs plus the intercept/constant.
Training a Poisson regression model is the process of finding the values of the intercept/constant and the coefficients so that predicted y count values (before rounding) closely match the known, correct y count values in the training data.
Understanding Poisson Distributed Data
Poisson distributed data is more subtle than it first appears. Theoretically, Poisson data is generated by a so-called mathematical Poisson process. In most machine learning prediction scenarios, you aren't overly concerned with the underlying process, but you should understand the vocabulary.
A mathematical Poisson process has a rate, which is usually given the symbol Greek letter lower case lambda (sort of like an upside down 'v'). For example, the theoretical rate for a cars-arriving example might be 3.5 arrivals per 10 minutes. This means that on average, 3.5 cars will arrive in any 10-minute interval, but the number of arrivals could be as low as 0, or significantly higher than 3.5. Because the process is governed by a rate, there is always a "per time interval" but the time interval is often not explicitly stated.
Now, if you have some data that is counts of things arriving, you will not know the lambda rate associated with the theoretical underlying generating Poisson process. In a machine learning scenario, where you are interested in prediction, you usually don't try to estimate the lambda rate -- you just want to predict a count based on predictor values.
The graph in Figure 2 shows the 200-item training data in blue, and a theoretical Poisson distribution in orange for 200 items where the lambda rate is set to 1.5 arrivals per time unit. When graphed, Poisson distributed data often resembles a bell-shaped curve that is skewed to the left.
[Click on image for larger view.] Figure 2: Poisson Distributions
There are two ways to generate synthetic Poisson distributed data, one difficult, one much easier. The difficult technique is to write a helper program that simulates an underlying Poisson process with some fixed lambda rate, and then emit Poison count data, and then reverse engineer the count data to synchronize with predictor data.
The easier approach, and the one used by the demo, is to set values for the intercept/constant and coefficients, and then work backwards. The synthetic Poisson data was generated using intercept/constant = 2.0, and coefficients = (-1.0, 1.5, -2.0, 2.5).
It's important to realize that the Poisson regression technique only works for data that is approximately Poisson distributed. No real-life data is exactly Poisson distributed. It's surprisingly difficult to determine how close real-life data is to being mathematically Poisson distributed, mostly because it's difficult to define how close is close enough.
Because the synthetic demo data was generated using a true mathematical Poisson distribution, the demo Poisson regression model is artificially accurate. The further away from truly Poisson distributed your data is, the less accurate a Poisson regression model will be.
The Demo Program
I used Visual Studio 2022 (Community Free Edition) for the demo program. I created a new C# console application and checked the "Place solution and project in the same directory" option. I specified .NET version 8.0. I named the project PoissonRegression. I checked the "Do not use top-level statements" option to avoid the strange program entry point shortcut syntax.
The demo has no significant .NET dependencies and any relatively recent version of Visual Studio with .NET (Core) or the older .NET Framework will work fine. You can also use the Visual Studio Code program if you like.
After the template code loaded into the editor, I right-clicked on file Program.cs in the Solution Explorer window and renamed the file to the more descriptive PoissonRegressionProgram.cs. I allowed Visual Studio to automatically rename class Program.
The overall program structure is presented in Listing 1. All the control logic is in the Main() method in the Program class. All of the Poisson regression functionality is in a PoissonRegressor class. The PoissonRegressor class exposes a constructor and four public methods: Train(), Predict(), Accuracy(), RootMSE(). The demo code also contains a Utils class that holds helper functions to load data from file into memory and display data.
Listing 1: Overall Program Structure
using System;
using System.IO;
using System.Collections.Generic;
namespace PoissonRegression
{
internal class PoissonRegressionProgram
{
static void Main(string[] args)
{
Console.WriteLine("Poisson regression using C# ");
// 1. load data from file to memory
// 2. create model
// 3. train model
// 4. evaluate model
// 5. use model
Console.WriteLine("End demo ");
Console.ReadLine();
}
} // Program
public class PoissonRegressor
{
public Random rnd;
public double intercept;
public double[] coeffs; // allocated in Train()
public double alpha; // wt decay
public PoissonRegressor(double alpha) { . . }
public double Accuracy(double[][] dataX,
double[] dataY) { . . }
public double RootMSE(double[][] dataX,
double[] dataY) { . . }
private void Shuffle(int[] sequence) { . . }
public void Train(double[][] trainX, double[] trainY,
double lrnRate, int maxEpochs) { . . }
public double Predict(double[] x) { . . }
}
public class Utils
{
public static double[] MatToVec(double[][] m) { . . }
public static double[][] MatLoad(string fn,
int[] usecols, char sep, string comment) { . . }
public static void VecShow(double[] vec,
int dec, int wid, bool newLine)
}
} // ns
The demo program starts by loading the 200-item training data into memory:
string trainFile = "..\\..\\..\\Data\\train_poisson_200.txt";
double[][] trainX = Utils.MatLoad(trainFile,
new int[] { 0, 1, 2, 3 }, ',', "#");
double[] trainY = Utils.MatToVec(Utils.MatLoad(trainFile,
new int[] { 4 }, ',', "#"));
The training X data is stored into an array-of-arrays style matrix of type double. The data is assumed to be in a directory named Data, which is located in the project root directory. The arguments to the MatLoad() function mean load zero-based columns 0, 1, 2, 3 where the data is comma-delimited, and lines beginning with "#" are comments to be ignored. The training target y data in column [4] is loaded into a matrix and then converted to a one-dimensional vector using the MatToVec() helper function.
The 40-item test data is loaded into memory using the same pattern that was used to load the training data:
string testFile = "..\\..\\..\\Data\\test_poisson_40.txt";
double[][] testX = Utils.MatLoad(testFile,
new int[] { 0, 1, 2, 3 }, ',', "#");
double[] testY = Utils.MatToVec(Utils.MatLoad(testFile,
new int[] { 4 }, ',', "#"));
The first three training items are displayed using 4 decimals with 8 columns width and a trailing newline like so:
Console.WriteLine("First three train X inputs: ");
for (int i = 0; i < 3; ++i)
Utils.VecShow(trainX[i], 4, 9, true);
Console.WriteLine("First three train y targets: ");
for (int i = 0; i < 3; ++i)
Console.WriteLine(trainY[i].ToString("F0"));
In a non-demo scenario, you might want to display all the training data to make sure it was correctly loaded into memory.
Creating and Training the Poisson Regression Model
The Poisson regression model is created like so:
Console.WriteLine("Creating and training Poisson model");
double alpha = 1.0e-5; // weight decay
PoissonRegressor model = new PoissonRegressor(alpha);
The alpha weight decay value passed to the PoissonRegressor constructor prevents the model coefficient values from becoming too large during training. The value of alpha must be determined by trial and error. A good strategy is to start with alpha = 0.0 and then only increase alpha if you observe model overfitting, where the model predicts the training data very well but predicts poorly on the test data.
The Poisson regression model is trained using these statements:
double lrnRate = 0.001;
int maxEpochs = 150;
model.Train(trainX, trainY, lrnRate, maxEpochs);
Console.WriteLine("Done");
The values of the lrnRate and maxEpochs parameters must be determined by trial and error. The key idea is to watch the root mean squared error and accuracy values to make sure error is getting smaller and accuracy is getting larger. After training has completed, the demo displays the model intercept/constant and the four coefficients:
Console.WriteLine("Model constant and coefficients: ");
Console.WriteLine(model.intercept.ToString("F4").PadLeft(9));
Utils.VecShow(model.coeffs, 4, 9, true);
In general, a good trained model will not have any extremely large or small intercept or coefficient values. Extreme model values usually indicate that the data is not close to Poisson distributed, or that model overfitting is occurring.
Evaluating and Using the Trained Poisson Regression Model
The demo program evaluates the trained model using these statements:
double accTrain = model.Accuracy(trainX, trainY);
Console.WriteLine("Accuracy on train = " + accTrain.ToString("F4"));
double accTest = model.Accuracy(testX, testY);
Console.WriteLine("Accuracy on test = " + accTest.ToString("F4"));
When evaluating the accuracy of most machine learning regression models, you must specify a closeness percentage for a predicted target value to be scored as correct. For example, if you were predicting a person's annual income, you might specify that a predicted income is correct if it's within 10% of the true income. But for Poisson regression, because the target data values are integer counts, a predicted count, after rounding to the nearest integer, is either correct or not.
The demo program defines a RootMSE() method that provides a more granular evaluation metric. The root mean squared error is useful during training, or when comparing a Poisson regression model with a different model, such as a neural network regression model. Some of my colleagues prefer using plain mean squared error but I prefer root mean squared error because the units for MSE are counts-squared (which doesn't have an obvious meaning) but the units for RMSE are just counts.
The trained model is used like so:
double[] x = trainX[0];
Console.WriteLine("Predicting for x = ");
Utils.VecShow(x, 4, 9, true);
double y = model.Predict(x);
Console.WriteLine("y = " + y.ToString("F4"));
int yInt = (int)Math.Round(y);
Console.WriteLine("y = " + yInt);
The input vector is the first training item, (-0.8153, -0.6275, -0.3089, -0.2065). The preliminary predicted count, as a type double variable is y = 7.0263. When rounded to an integer count, the predicted count is yInt = 7, which correctly matches the target count in the training data.
Wrapping Up
The key to whether Poisson regression works well or not is how close to true mathematical Poisson distributed the data is. There's no easy way to determine if your data is close enough to Poisson distributed because "close enough" varies from problem to problem.
In practice, if you have a problem where the goal is to predict a count, then Poisson regression is usually worth trying because it's a simple and easy technique. If you get poor results, then you can try more complex regression techniques such as kernel ridge regression (for small and medium size datasets) or neural network regression (for medium and large datasets).
The demo data has four predictor variables. In my experience, most machine learning Poisson regression problems have just one predictor variable. And in a classical statistics scenario where the goal is not prediction, data often doesn't have any predictor variables at all, and the count data is analyzed by itself.