Keeping the “Science” in “Data Science”: Calibrating Algorithms for Threat Detection

As attack payloads and methods have become more easily adaptable and customizable to individual campaigns and targets (e.g. polymorphic malware, customized payloads, credential theft), threat detection systems have migrated from static, predefined rules (e.g. Snort signatures, regexes) to more data-driven detectors (UBA, honeynet salting, etc.). Although often more effective than older methods at detecting complex and subtle attacks, these newer techniques can be accompanied by a greater level of uncertainty. Fortunately, there are statistical methods we can use to improve performance and to clarify results for both product developers and users.

For many years, threat detection was based on signatures, which can be thought of as assertions about data, derived from previously seen attacks or from rule violations. We often think of hash matching or regex rules as the main form of signature detection, but in fact, many types of behavioral detection are signatures as well. For example, flagging outbound traffic volumes over a predefined threshold, or flagging any use of particular protocols on an internal network, are common behavioral detections, but they are still assertions about the data. Signatures in general can be circumvented by a knowledgeable attacker, particularly an insider, and behavior-based signatures are often brittle or require manual tuning by IT staff. (For example, just how much external volume is unusual?)
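To make the point concrete, both kinds of signature reduce to a boolean assertion on the data. The following sketch uses hypothetical patterns and thresholds (not any product's actual rules):

```python
import re

# Hypothetical content signature: a payload pattern from a previously seen attack.
SIGNATURE_RE = re.compile(r"cmd\.exe\s+/c", re.IGNORECASE)

# Hypothetical behavioral signature: a hand-tuned egress threshold.
# (Just how much external volume is unusual? Someone had to pick a number.)
EGRESS_THRESHOLD_BYTES = 500 * 1024 * 1024

def regex_signature_hit(payload: str) -> bool:
    """Classic signature: assert that the payload matches a known pattern."""
    return bool(SIGNATURE_RE.search(payload))

def egress_signature_hit(outbound_bytes: int) -> bool:
    """Behavioral signature: assert that outbound volume exceeds a fixed threshold."""
    return outbound_bytes > EGRESS_THRESHOLD_BYTES
```

Either function returns a hard yes/no, which is exactly what makes such rules brittle when an attacker knows the pattern or stays just under the threshold.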

Adaptive Detection

The advantage of data science-driven detection is the ability to adapt detection to each installation environment, and in some cases, to each system. However, developers and customers need to become comfortable with the inherently statistical nature of many of these methods. Unlike assertions, model-based deductions are not binary (threat or non-threat) but rather yield a probability (between 0 and 1) that a particular event or datum is a threat. Depending on the model, there may also be an associated confidence in the probability estimate. These details are often hidden by using thresholds. For instance, if the threat probability is greater than 0.95, the event is labeled a threat; otherwise (at or below 0.95), it is labeled a non-threat.
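That thresholding step can be sketched in a couple of lines, using the 0.95 operating point from the text (the cutoff value itself is a tunable choice, not a fixed rule):

```python
THREAT_THRESHOLD = 0.95  # illustrative operating point from the text

def classify(threat_probability: float) -> str:
    """Collapse a model's probability estimate into a binary verdict.

    The underlying detector produces a value in [0, 1]; the threshold
    hides that detail behind a threat / non-threat label.
    """
    return "threat" if threat_probability > THREAT_THRESHOLD else "non-threat"
```

Moving the threshold trades missed detections against false alerts, which is why choosing it deserves the metric-driven treatment discussed below.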

Of course, signature-based detections also produce false alerts, but given the use of thresholding and the desire not to miss true alerts, false alerts can be more common with probabilistic detection. One common method to modulate the balance of true and false alerts is to maintain whitelists or blacklists related to the detection method. For instance, a blacklist of known malicious domains can be maintained to help improve the true alert set, and a list of known safe domains can be maintained to help reduce the false alert set. Unfortunately, these lists are necessarily incomplete and rarely maintained over time. Also, as with other assertion-based methods, they are inherently reactive and do not accommodate unknown domains in advance.
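The list mechanism is simple to state in code: membership overrides the model's score entirely, which is precisely why a stale or incomplete list becomes a blind spot. The domains and lists below are hypothetical placeholders:

```python
# Hypothetical adjunct lists layered on top of a probabilistic detector.
DENYLIST = {"evil-c2.example"}    # known-malicious domains: always alert
ALLOWLIST = {"updates.example"}   # known-safe domains: always suppress

def verdict(domain: str, model_probability: float, threshold: float = 0.95) -> str:
    """List membership short-circuits the model; otherwise fall back to the threshold."""
    if domain in DENYLIST:
        return "threat"
    if domain in ALLOWLIST:
        return "non-threat"
    return "threat" if model_probability > threshold else "non-threat"
```

Any domain not on either list, including every domain the attacker registers tomorrow, falls through to the model, so the lists improve nothing for the unknown cases that matter most.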

Model Metrics

In order to discuss more data-driven methods to improve performance, let’s consider our metrics. A common one is the ROC curve, which plots the true positive rate (TPR) against the false positive rate (FPR) as the detection threshold is varied; the area under that curve (AUC) summarizes performance in a single number, where 0.5 is no better than random guessing and 1.0 is perfect separation.


Figure 1. AUC examples. Black is no better than random; blue is better than red.

Other metrics, such as F1 and precision-recall curves, can be used, and at JASK we normally calculate both AUC and F1. When AUC is available, though, it is the most useful metric for comparing models and for selecting an operating point, i.e. a trade-off between TPR and FPR.
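AUC has a useful probabilistic reading: it is the chance that a randomly chosen threat scores higher than a randomly chosen non-threat. That identity gives a short, dependency-free way to compute it (libraries such as scikit-learn provide the same via `roc_auc_score`; this sketch is just for illustration):

```python
def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity:
    the probability that a random positive outscores a random negative,
    counting ties as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A detector that perfectly separates the classes scores 1.0; one whose scores carry no information scores about 0.5, matching the black curve in Figure 1.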

Model Training and Calibration

When we develop a detection model, we will normally try to gather data from as many environments as possible for training. Then, as we move a model into production (not only for a new model, but for each new installation), we want to make sure the model is working as well as it can. However, instead of using an assertion-based tuning method, we want to continue to use a model-driven method. Depending on the model being used, we have several options.

The most straightforward method for model calibration is to adjust the training set based on local data characteristics and re-train the model. For batch-trained classifiers, this is a reliable approach. Models that are inherently streaming and maintain a latent or feature-space “state,” such as topic and community models, can be operated in a dynamic mode such that over time, as new data is ingested, the model will drift adaptively toward more relevant solutions. Active learning models facilitate direct feedback either from users (e.g. this was a good detection, that was not a good detection) or from an internal quality metric. Such “guided learning” or “reinforcement learning” models often learn more adaptive and robust representations of data, but require considerably more effort to develop.
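The streaming option can be illustrated with a deliberately tiny example: a baseline whose internal state drifts toward local data as it is ingested, so the definition of "unusual" adapts to each environment instead of being a hand-set constant. This is a toy sketch of the idea, not a production detector:

```python
class StreamingBaseline:
    """Toy streaming model: exponentially weighted estimates of the mean
    and typical deviation of a metric (e.g. outbound bytes per interval).
    The state drifts toward whatever the local environment looks like."""

    def __init__(self, alpha: float = 0.05):
        self.alpha = alpha   # drift rate: higher adapts faster, forgets faster
        self.mean = 0.0      # adapted estimate of "normal"
        self.dev = 0.0       # adapted estimate of typical deviation from normal
        self.seen = 0

    def update(self, x: float) -> None:
        if self.seen == 0:
            self.mean = x
        else:
            self.dev = (1 - self.alpha) * self.dev + self.alpha * abs(x - self.mean)
            self.mean = (1 - self.alpha) * self.mean + self.alpha * x
        self.seen += 1

    def is_anomalous(self, x: float, k: float = 4.0) -> bool:
        # Flag values far from the locally adapted baseline; require a
        # warm-up period before trusting the state at all.
        return self.seen > 10 and abs(x - self.mean) > k * self.dev
```

The contrast with the fixed egress threshold earlier is the point: here, no one has to answer "how much volume is unusual?" per installation, because the model answers it from the data.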

Example Calibration: Domain Generation Algorithm Detector

One detection model improvement that does not fit neatly into the above categories, but instead requires a model extension, is n-gram subtyping for DGA (Domain Generation Algorithm) detection. Attackers often use DGAs to create domains for command and control, beaconing, and other communication infrastructure. We developed an artificial neural network (ANN) using Long Short-Term Memory (LSTM) to detect algorithmically generated domain names. Figure 2 shows a few typical examples from non-threat and threat cases.

Figure 2. Examples of threat and non-threat domains based on a Domain Generation Algorithm (DGA)
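The LSTM itself is beyond the scope of a blog snippet, but the intuition it learns, that generated names have unusual character statistics compared with natural-language names, can be illustrated with a simple character-entropy score. This is a crude stand-in for the model, not the model itself:

```python
import math
from collections import Counter

def char_entropy(domain: str) -> float:
    """Shannon entropy (bits) of the character distribution in a domain label.

    Algorithmically generated labels like 'xq7vz9kfwq3h' spread probability
    mass over many characters and score high; natural names reuse a small
    set of letters and score lower. A rough heuristic only; the production
    approach described in the text is a trained LSTM.
    """
    label = domain.split(".")[0].lower()
    counts = Counter(label)
    n = len(label)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

A single statistic like this is easy for an attacker to game (and, as the next section shows, easy to fool with word-based domains), which is part of why a learned sequence model is used instead.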

We found during some customer installations that a customer’s external traffic might involve a particular set of domains that were, for some reason, flagged as DGA by our baseline detector. As mentioned above, this sort of behavior is relatively common for probabilistic models. Instead of gathering a list of the domains we happened to observe and whitelisting them in an adjunct process external to the model, we examined these domains (see Figure 3) and noted that many of them were characterized by dictionary-word n-grams. The most straightforward way to address this was to extend the model to check candidate domains for dictionary words in their n-grams. This made the model robust to potential false positives from all domains in this category, not just the domains in the observed set.

Figure 3. False Positive DGA examples, with some egregious examples highlighted. Many in this category are composed of dictionary word n-grams.
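One way such a dictionary-word check could look is a coverage score: how much of the domain label can be tiled by known words. Everything below, including the wordlist and the example domains, is a hypothetical sketch of the idea, not the extension actually shipped:

```python
# Stand-in wordlist; a real check would use a full dictionary.
DICTIONARY = {"bright", "cloud", "report", "data", "secure", "mail"}

def dictionary_coverage(domain: str, min_len: int = 4) -> float:
    """Fraction of the label's characters covered by dictionary-word substrings.

    High coverage suggests a human-composed name (e.g. word-concatenation
    domains), so a DGA model can downweight its score for such candidates.
    """
    label = domain.split(".")[0].lower()
    if not label:
        return 0.0
    covered = [False] * len(label)
    for i in range(len(label)):
        for j in range(i + min_len, len(label) + 1):
            if label[i:j] in DICTIONARY:
                for k in range(i, j):
                    covered[k] = True
    return sum(covered) / len(label)
```

Folding a feature like this into the model, rather than bolting a whitelist on afterward, is what makes the fix generalize to unseen domains in the same category.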

The following figure shows an example of the DGA detection algorithm running in production on the JASK system.

Figure 4. Screen Capture from DGA detection in JASK Trident.

Ongoing Monitoring and Metrics

In order to maintain ongoing situational awareness of detection performance across all JASK installations, we have instrumented the product to feed logging and metrics into the usual system performance dashboards. See Figure 5 below for an example of one such internal dashboard. This provides data scientists, engineering, and DevOps with views per installation, over time, aggregated in time (hours, days), and aggregated by category and by customer segment (financial, data center, etc.), as attacks often cluster along these axes. This allows us to get ahead of model trends during operation, not just during initial calibration.

Figure 5. Internal performance monitoring dashboard, showing detection rates by category over time.


We encourage data science teams to look for principled, model-driven ways to measure and control the performance of analytics, rather than ad hoc external methods such as lists or separate processing steps. Retraining, dynamic / streaming models, and guided / reinforcement learning are all viable options. A little effort up front can lead to improved and more robust performance over the lifecycle of an engagement.
