Artificial Intelligence (AI) and various component tools such as machine learning (ML) are not intended to fully-automate threat mitigation and response, at least not in the current generation of technologies. Instead, AI and ML are beginning to provide a much greater degree of organization and prioritization for existing workflows. For example, the Threat Hunter ideally keys off a highly-targeted alert, perhaps one that is indicative of a specific threat actor or use of a particular tool. However, such targeted alerts are notoriously brittle, and this model fails for novel threat vectors.
In lieu of specially-crafted signatures for known or previously-seen attacks, or alerts that might fail to reliably fire in a particular network, how can we key an investigation into potentially malicious activity in which an attacker has already gained access, before it’s too late?
Nearly all “generic” or non-signature detections involve detecting a change in an observable field or a change in the level of activity on some observable field. This is called “changepoint detection” and is an area of probabilistic or statistical time series analysis. Some methods are more reliable than others. Some methods are “fooled” by regular changes in periodic activity – that is, they detect changes due to the normal ebb-and-flow of the work day or week.
In our last blog, we discussed methods on the highly-sophisticated end (neural networks) for learning nominal activity patterns in data. These and similar models require a lot of training data and can be challenging to deploy in production (we will have a follow-up on this subject.) Here we discuss how to apply much simpler statistical models that are relatively straightforward to deploy in production. However, as with most data science based models, the key is framing the problem and what might be called data logistics.
The following chart shows hourly samples of SMB file access from a real network. If you squint you can see a pattern of daily activity compressed near the x-axis. However, the spike one morning really stands out, as this was multiple orders of magnitude greater than the usual level of activity for this activity type for this network.
What goes through a threat hunter’s mind when they see a graph like this? It could represent something benign like a file backup, a policy change, or some other routine high-volume file server activity. Or it could indicate a hostile scenario such as a recon, ransomware, or a mass file delete. This triage process can be hectic and stressful until the cause of the unusual activity is run to ground. In order to determine threat or not-threat, the hunter will need to dig further, but first, the unusual activity must be flagged.
To reduce the burden on hunters and the security workload in general, JASK’s Trident product combines multiple behaviors, like change points, signals, and threat intelligence into a greatly reduced number of Smart Alerts that merit greater attention. But any analyst or hunter can benefit from a better understanding of how to better detect changes on the network.
One good initial step in a statistical analysis of data is to determine the distribution of data. It can be tempting to start off by calculating high-level or coarse statistical measures like mean (average) and variance (square of standard deviation). But these measures are only valid if the data really does fit a standard normal distribution, the oft-cited “bell curve.” A quick look at the above time series shows that this data does not quite meet this assumption. In fact, research has shown that network traffic does not generally follow a normal distribution, but tends to be heavy-tailed, like the data example here.
This graph is zoomed into the area around the bulk of the data, so we can see the shape of the distribution more clearly. There are a small number of outliers, in particular, the dramatic activity peak on Wednesday morning.
Going back to analyzing some of the descriptive statistics associated to these values we see that the histogram is known as a skewed (or heavy-tailed) distribution. Network counts for instance are skewed, and in this case mean is a poor indicator of data centrality. Similarly, variance is a poor estimate for how much spread is typical in the skewed data sets. A commonly-used indicator of change in simple anomaly detection applications is building a threshold rule, such as “three sigma,” which is three times the standard deviation above and below the average trend of values in the timeseries.
In practice when we use these type of threshold rules on skewed data we sometimes run into noisy situations, where the model raises too many outliers or not any at all. For example In the output below we have overlaid some useful descriptive statistics. In the example below we end up computing a standard deviation that is less than zero, which is hard to interpret for counts. Furthermore looking at the distribution of values in relation to the outlier, we see that the mean and standard deviation are concentrated towards the skewed part of the distribution.
An alternate approach is to use median instead of mean. As shown in the same figure, median is a much more reliable measure of data centrality. Median is also less prone to pull from small numbers of outliers and other extreme data. A common change point detection metric used based on median is median absolute deviation, or MAD, defined like:
In the example above we see if we use MAD for flagging change points or outlier the values we get a much less noisy threshold. As can be seen from the chart above, this would only trigger on the one unusual hour in the observed data. This might seem to be what we want, however, what if the data is collected or processed in 10 min or 1 min bins instead of hourly bins? How does this impact the distribution?
The standard deviation is often a poor measure of data “dispersion” or spread on skewed data sets commonly used in cyber security modeling. The same chart shows the alternate measure of dispersion based on median, or MAD. Again, this appears to fit our visual intuition. Also, using a threshold of 3 or 4 times MAD would ignore most of the data, but flag a small number of high-end data points in addition to the extreme outlier – which is way off the right end of of this zoomed in chart.
From a scalability perspective we have to answer one key question: are we concerned about performing these calculations on the vast amount of network data collected on real large-scale networks? The answer is “no” because streaming implementation of the median exist, based on a mini-heap computation, which scales as O(n log n), principally because as new data arrives, a sort must be performed so that the mid-point can be identified, or the P2 algorithm, which is more memory efficient than using heaps.
As mentioned already, barring a clear signature of a known attack tool or use of malicious code, which for a wide variety of reasons is becoming more rare, the threat hunter requires some starting-point for investigations. Changes in the network are one of those key behaviors, and using a metric that suits the heavy-tailed distribution of network activity is essential to keep false alerts in check while still being sensitive to moderate changes. Correlating these and other types of changes in context can be performed using an manual correlations and SIEM. More and more, through additional machine learning automation. This next level of contextual automation is where JASK is focusing, and we will discuss some of the ways this can be reliably done is subsequent posts in this series.