Machine learning provides a scalable way to go from raw data to meaning. However, the input data often carries noise and uncertainties. From a physicist's perspective, every non-negligible uncertainty must be quantified before a quantitative measurement result can be stated. For a machine learning classifier, could we go one step further and teach the classifier which input variables are noisy, thereby minimizing their impact on the result? Yes: ML classifiers can exploit our knowledge about uncertainties in the inputs. This article shows an example.

To keep things simple, in the example, we have a two-class classifier built from a small neural network. The dataset has three input variables:

• The variable $$x$$ is very different for signal and background events, giving it a large discrimination power.
• The variable $$y$$ has a slight shift for signal compared to background, giving it a small discrimination power.
• The variable $$z$$ is completely independent of the class labels.
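A toy dataset with these three properties can be sketched as follows. The means and widths here are illustrative assumptions, not the article's actual parameters; only the qualitative behaviour (strong separation in $$x$$, slight shift in $$y$$, no separation in $$z$$) matches the text.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_dataset(n_events):
    """Draw n_events signal and n_events background events for the
    three toy variables (all means/widths are assumptions)."""
    # x: strongly separated between the classes
    x_sig = rng.normal(2.0, 1.0, n_events)
    x_bkg = rng.normal(-2.0, 1.0, n_events)
    # y: only slightly shifted for signal
    y_sig = rng.normal(0.5, 1.0, n_events)
    y_bkg = rng.normal(0.0, 1.0, n_events)
    # z: identical for both classes, pure noise
    z_sig = rng.normal(0.0, 1.0, n_events)
    z_bkg = rng.normal(0.0, 1.0, n_events)

    X = np.vstack([
        np.column_stack([x_sig, y_sig, z_sig]),
        np.column_stack([x_bkg, y_bkg, z_bkg]),
    ])
    labels = np.concatenate([np.ones(n_events), np.zeros(n_events)])
    return X, labels

X, labels = make_dataset(1000)
```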

How does noise affect the variables $$x, y, z$$? In this example, we work with a simulated dataset: the training data consists of simulated events from a known model. The simulation accounts for resolution effects, meaning that the distributions of $$x, y, z$$ have a finite width and are not sharp Dirac delta functions. In a real-world application, we would measure the width of each distribution and implement the measured resolution in the simulation. However, there might be additional effects that we cannot account for: after deployment, the model might operate on input data with different noise levels, which means the resolution assumed in the simulation was off.

For our toy model, let’s assume that the resolution of variable $$x$$ is allowed to fluctuate wildly while there is no change for $$y$$ and $$z$$. We can parametrize the effect and generate alternative datasets with a higher and a poorer resolution. The nominal and the varied distributions for all input variables look as follows.
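One way to parametrize the resolution variation is to scale the width of the distribution of $$x$$ while leaving $$y$$ and $$z$$ untouched. This is a sketch under assumed values: the Gaussian mean, the nominal width, and the 0.5/1.5 scale factors are all illustrative choices, not the article's actual numbers.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_x(n, width_scale, mean=2.0, width=1.0):
    """Draw the toy variable x with a scaled resolution.

    width_scale < 1 mimics a higher (better) resolution,
    width_scale > 1 a poorer one. Mean and width are assumptions.
    """
    return rng.normal(mean, width * width_scale, n)

x_better = draw_x(10_000, 0.5)   # higher resolution: narrower
x_nominal = draw_x(10_000, 1.0)  # nominal resolution
x_poorer = draw_x(10_000, 1.5)   # poorer resolution: wider
```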

To study the effect of the unknown resolution, we train two different models:

• One neural network, the naive model, is trained on the nominal dataset alone.
• The other neural network, the aware model, is trained on the nominal dataset combined with the higher- and poorer-resolution datasets.

Apart from the training data, the architecture and hyperparameters of the two networks are identical. The naive model will learn that variable $$x$$ is a reliable indicator of the class and rely heavily on it. The aware model, on the other hand, will see that variable $$x$$ is not so reliable and will probably base its decision on a combination of $$x$$ and $$y$$. The training datasets are summarized in this sketch.
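In code, the two training sets amount to a simple concatenation. The arrays below are random stand-ins for the three dataset variants (hypothetical shapes and content; in this article they come from the toy simulation):

```python
import numpy as np

# Stand-ins for the nominal, higher- and poorer-resolution datasets,
# each with the three input variables x, y, z.
X_nominal = np.random.default_rng(0).normal(size=(1000, 3))
X_better = np.random.default_rng(1).normal(size=(1000, 3))
X_poorer = np.random.default_rng(2).normal(size=(1000, 3))

# Naive model: nominal data only.
X_naive = X_nominal
# Aware model: nominal data plus both resolution variants.
X_aware = np.concatenate([X_nominal, X_better, X_poorer])
```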

Besides the input layer, the networks have two layers: a hidden layer with twelve nodes and ReLU activation, and an output layer with two nodes and softmax activation. The input variables are normalized to zero mean and unit width. Each network is trained for 15 epochs with stochastic gradient descent at a learning rate of $$10^{-3}$$, using the cross-entropy loss. The outputs are two variables that one-hot encode the two classes, signal and background. The networks are implemented in Keras with TensorFlow as the backend.
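A minimal sketch of this architecture in the Keras API could look as follows. Layer sizes, activations, optimizer, learning rate, and loss come from the description above; everything else (e.g. the batch size left at its default) is an assumption.

```python
import tensorflow as tf

# Two-layer network: 3 inputs (x, y, z) -> 12 ReLU nodes
# -> 2 softmax outputs (one-hot signal/background).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(3,)),
    tf.keras.layers.Dense(12, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),
])

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),
    loss="categorical_crossentropy",
)

# Training would then be, with standardized inputs X and
# one-hot labels y (hypothetical array names):
# model.fit(X, y, epochs=15)
```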

How do the two models compare? To assess the effect, we can have a look at the receiver operating characteristic (ROC) curve for each model. Additionally, we can see how a higher or poorer resolution affects each classifier. The results are summarized in the following plots.

We can see that both models have similar characteristics on the nominal dataset. The naive model performs slightly better than the aware model. If we look at the performance on the two datasets with higher and poorer resolution, we see that the aware network is far less affected by the uncertainty. The intuitive interpretation is that the aware model learned that variable $$x$$ is not a reliable input variable. In an environment where the resolution of a variable is unknown, it can be beneficial if the network learns about the uncertainty, achieving a better performance overall. For a real application, this requires deep insight into the processes creating the noise and how they are modelled in the simulated dataset.
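The ROC comparison boils down to a single number per model and dataset variant, the area under the ROC curve. A minimal, dependency-free way to compute it uses the Mann-Whitney interpretation: the AUC is the probability that a randomly chosen signal event scores above a randomly chosen background event. The score arrays below are hypothetical stand-ins for a classifier's outputs.

```python
import numpy as np

def roc_auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney U statistic:
    the fraction of signal/background pairs in which the signal
    event scores higher; ties count half."""
    sig = scores[labels == 1]
    bkg = scores[labels == 0]
    wins = (sig[:, None] > bkg[None, :]).sum()
    ties = (sig[:, None] == bkg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(sig) * len(bkg))

# Hypothetical classifier scores: signal tends to score higher.
labels = np.array([1, 1, 1, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.4, 0.6, 0.3, 0.1])
auc = roc_auc(labels, scores)
```

Evaluating this on the nominal and the shifted datasets for both models reproduces the comparison described above: the aware model's AUC degrades less when the resolution changes.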

The example was arguably contrived and simple. However, real-world applications include the search for the Higgs boson at the Large Hadron Collider at CERN. For a machine learning classifier used in Phys. Lett. B 805 (2020) 135426, I used similar techniques to teach the classifier about detector uncertainties.