Almost every personal device today participates in a highly connected network and leaves a footprint of our digital behavior. Analytic modeling can help us identify adverse situations in streams of networked device data, such as fraud, system faults, or even human error. The main problem we focus on in this blog post is processing network logs to detect anomalies. For example, when analyzing HTTP login logs, an unusual spike in the number of logins could indicate a possible cyber-attack. One way to analyze large pipelines of this type of connected data is with time series models.
From a research and development perspective, time series anomaly detection at scale is a complex task. First, from a statistics perspective, the definition of an anomaly is itself vague and tends to vary across problem domains. Furthermore, many use cases in the cyber security industry require every model, including the time series model, to run in real time, which means being lightweight and efficient. From a data perspective, the individual series are often sparse, unevenly sampled, and too large to process on a single CPU.
In this blog post, we outline the work we have done building distributed streaming time series models for processing enterprise network traffic. Monitoring how these network patterns change over time can help solve tactical cyber security use cases and can also highlight unusual or suspicious connections. Ultimately, time series models can help establish a baseline for the network environment and surface things that are difficult for a human to pick out of the plethora of available information, such as a large amount of data being transferred from a single host.
We try to overcome the complications described above and propose a statistically intelligent hybrid streaming model for detecting anomalies in cybersecurity data. The model scales to Fortune 500 level requirements and supports a flexible design abstraction: new model instances that map to individual use cases are created with a simple template approach.
We describe a prototype use case to walk through the end-to-end design choices. In production, our model was deployed to predict anomalies on real-time streams of customer traffic. Prediction on streaming data not only makes tuning and training an interesting challenge but also requires the model to be lightweight and time efficient. Application layer data such as HTTP logs is sparse, which makes it difficult for models to learn and identify patterns. The algorithmic solution we devised to tackle these challenges is shown in Flowchart 1. In the following subsections, we present a breakdown of our modeling life-cycle: data collection, pre-processing, identifying our model, identifying model parameters, training, and testing.
The data our model was trained on was collected as HTTP counts per source IP from 5 separate customer cloud instances. The data started as PCAP data (examples of raw PCAPs for a variety of security use cases can be found here: https://github.com/jasklabs/blackhat2017) and was then transformed with BRO [5] for easy parsing of some key protocol types.
Once we had the BRO-extracted form of the PCAPs, we built basic transformations using standard ETL concepts, focused on converting logs to a generically typed feature vector per time point. We added a thin layer of abstraction here to support heterogeneous use cases in the same code base, so that we can build a time series out of either strings or numeric values.
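To make the ETL step concrete, here is a minimal sketch of turning parsed log records into one evenly sampled count series per source IP. It is an illustration rather than our production code: the function name `bucket_http_counts`, the `(timestamp, source_ip)` record shape, and the one-hour default bucket are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def bucket_http_counts(records, bucket_seconds=3600):
    """Turn (epoch_seconds, source_ip) records into per-host count series.

    Hypothetical helper: assumes the logs were already parsed into tuples.
    Returns {source_ip: [count_per_bucket, ...]}, zero-filled so that every
    series covers the same time range with one value per bucket.
    """
    counts = defaultdict(Counter)
    for timestamp, source_ip in records:
        counts[source_ip][int(timestamp) // bucket_seconds] += 1
    if not counts:
        return {}

    # Shared time range across the whole population of hosts.
    first = min(min(c) for c in counts.values())
    last = max(max(c) for c in counts.values())

    return {
        ip: [c.get(bucket, 0) for bucket in range(first, last + 1)]
        for ip, c in counts.items()
    }


records = [(0, "10.0.0.1"), (10, "10.0.0.1"), (3700, "10.0.0.2"), (7300, "10.0.0.1")]
series = bucket_http_counts(records)
```

Zero-filling the gaps is one simple way to deal with unevenly sampled logs, at the cost of making sparse hosts look like long runs of zeros, which the model then has to handle explicitly.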
Pruning these granular logs through the ETL let the model learn only from the essential data needed to isolate time series patterns per host/user.
A deviation from this “normal” pattern then helps it detect a possible anomaly.
After preprocessing the logs, we proceed with the iterative process of mathematical model design and parameter selection. This follows an experiment-and-evaluate approach: we essentially hold a bake-off between algorithmic concepts and pick the candidate algorithms that produce the best outcomes on the test datasets we isolated.
We experimented with Symbolic Aggregate Approximation [3], Cluster-Based Local Outlier Factor [4], Hybrid Sliding Window, and Autoregressive Integrated Moving Average (ARIMA). However, one of our production constraints was a model that is both time efficient and effective at detecting anomalies in real time. These two requirements proved a major drawback for most of the algorithms we evaluated: either the model was too slow, or it failed to detect anomalies well on streamed data. After an intensive comparison study, we concluded that Hybrid Sliding Window was the best fit for our requirements while remaining accurate enough for anomaly detection. Some nice libraries we would have liked to use, had we been able to relax some of the real-time constraints, are Twitter’s time series library and the R library called forecast.
The sliding window logic lets us traverse our real-time streaming data while continuing to listen for new logs to be added to the time series being traversed. The latest log we score uses the history of the host and of the population in the anomaly detection step. To carry out this test for an anomaly we incorporate three major conditions: (1) check the historical Median Absolute Deviation (MAD) against the present MAD value, (2) compare against the median of the already seen time series for all time series in the population, and (3) apply a threshold check to handle any sparsity in the window.
As seen in the literature [1, 2], absolute deviation around the median outperforms standard deviation around the mean for robust outlier detection. We therefore weighted Median Absolute Deviation (MAD) heavily in our algorithm as a key statistic. Before settling on MAD, we also ran a comparative study between MAD-based and percentile-based outlier detection with a fixed window length. An example of this comparative study for a window length of 263 is shown in Figure 1. It can be seen that MAD performs better than percentile-based outlier detection.
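A small example helps show why the median-based statistic is more robust. The sketch below compares a MAD-based score with a classic z-score on a toy window that contains one extreme value. The function names and the toy data are made up for illustration, and the constant 1.4826 is the commonly used scale factor that makes MAD comparable to the standard deviation for normally distributed data.

```python
import statistics

# Commonly used consistency constant (assumes roughly normal data).
MAD_SCALE = 1.4826


def median_absolute_deviation(values):
    """Median of the absolute deviations from the median."""
    values = list(values)
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


def robust_score(value, window):
    """Distance from the median in scaled-MAD units.

    Note: raises ZeroDivisionError when the MAD is 0 (for example, a
    constant window); a real detector needs a floor for that case.
    """
    med = statistics.median(window)
    return abs(value - med) / (MAD_SCALE * median_absolute_deviation(window))


def z_score(value, window):
    """Distance from the mean in standard-deviation units."""
    return abs(value - statistics.mean(window)) / statistics.pstdev(window)


# Toy window of hourly HTTP counts with one extreme value at the end.
window = [3, 4, 3, 5, 4, 3, 4, 250]
print(robust_score(250, window))
print(z_score(250, window))
```

On this window the MAD-based score for the value 250 is about 166, while the z-score is only about 2.6, because the outlier itself inflates both the mean and the standard deviation. With a typical cutoff of 3, the z-score would miss the spike and the MAD-based score would not.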
Figure 1. Percentile-based versus MAD-based outlier detection comparative study.
The pseudo-code for the Hybrid Sliding Window is shown in Algorithm 1 below.
Algorithm 1. Streaming time series anomaly detection with global statistics
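For readers who prefer code, the following is a rough sketch of how the three conditions described earlier could fit together. It is one possible reading of the approach, not a transcription of Algorithm 1: the class name, the way the population median is computed (median of per-host window medians), the one-sided global check, and the `min_mad` floor are all assumptions made for this example.

```python
from collections import deque
import statistics


class HybridSlidingWindowDetector:
    """Illustrative sketch only; names and exact rules are assumptions."""

    def __init__(self, window_length, mad_threshold, sparsity_threshold, min_mad=1.0):
        self.window_length = window_length            # points kept per host
        self.mad_threshold = mad_threshold            # allowed deviation in MAD units
        self.sparsity_threshold = sparsity_threshold  # min non-zero points to score
        self.min_mad = min_mad                        # assumed floor for MAD == 0
        self.windows = {}                             # host -> deque of recent values

    def _population_median(self):
        # Global statistic: median of the per-host window medians.
        medians = [statistics.median(w) for w in self.windows.values() if w]
        return statistics.median(medians) if medians else 0.0

    def _is_anomaly(self, window, value):
        # (3) Sparsity check: skip windows with too few non-zero points.
        non_zero = sum(1 for v in window if v != 0)
        if not window or non_zero < self.sparsity_threshold:
            return False

        # (1) Local check: deviation from the host's own history in MAD units.
        local_median = statistics.median(window)
        local_mad = statistics.median([abs(v - local_median) for v in window])
        scale = max(local_mad, self.min_mad)
        exceeds_local = abs(value - local_median) > self.mad_threshold * scale

        # (2) Global check: the value must also stand out from the population.
        exceeds_global = value > self._population_median()
        return exceeds_local and exceeds_global

    def update(self, host, value):
        """Score the newest value for a host, then slide its window forward."""
        window = self.windows.setdefault(host, deque(maxlen=self.window_length))
        anomalous = self._is_anomaly(window, value)
        window.append(value)
        return anomalous


detector = HybridSlidingWindowDetector(window_length=5, mad_threshold=3.0, sparsity_threshold=3)
for count in [4, 5, 4, 6, 5]:
    for host in ("a", "b", "c"):
        detector.update(host, count)
print(detector.update("a", 60))  # a spike far above host "a"'s own history
```

Recomputing the population median on every call is only acceptable in a sketch; at scale that statistic would need to be maintained incrementally or refreshed periodically.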
As seen from subsection 1.2 and Algorithm 1, our hybrid model’s performance depends heavily on three parameters: (1) the length of the sliding window, (2) the threshold used to normalize MAD, and (3) the threshold for dealing with a sparse window.
The three values mentioned above are computed offline, and the model is then initialized with these precomputed values. Intuitively, we have an idea of an approximate range for these values, but to arrive at concrete values we rigorously carry out automated tests. The goal of this phase is to score multiple independent datasets over a range of these parameter values and settle on the ones that give the best results in terms of anomaly detection. These best results are backed by a comparative study in the form of raw data and plots produced by the unit tests. One example of such a comparative study is shown in Figure 2, where we plot the number of anomalies fired against different window length values. It can clearly be seen that the length of the window has a significant impact on the number of anomalies detected.
With these parameters computed offline, we initialized our model with them so that it could detect anomalies on real-time streaming data.
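The kind of offline sweep described here can be sketched in a few lines. This is a simplified stand-in for our test harness: it uses only the local MAD check, and the function names, the `min_mad` floor, and the toy series are assumptions made for illustration.

```python
import statistics


def count_anomalies(series, window_length, mad_threshold=3.0, min_mad=1.0):
    """Count points that deviate from the preceding window's median by more
    than mad_threshold MADs (local check only; min_mad is an assumed floor)."""
    count = 0
    for i in range(window_length, len(series)):
        window = series[i - window_length:i]
        med = statistics.median(window)
        mad = max(statistics.median([abs(v - med) for v in window]), min_mad)
        if abs(series[i] - med) > mad_threshold * mad:
            count += 1
    return count


def sweep_window_lengths(datasets, window_lengths, mad_threshold=3.0):
    """Total anomalies fired across all datasets for each window length."""
    return {
        w: sum(count_anomalies(s, w, mad_threshold) for s in datasets)
        for w in window_lengths
    }


# Toy series with a level shift from 5 to 50 requests per bucket.
level_shift = [5] * 10 + [50] * 10
print(sweep_window_lengths([level_shift], window_lengths=[3, 7]))
```

On this toy level shift, the window of length 7 fires more anomalies than the window of length 3, because a longer window takes more points to adapt to the new level. That is the sort of sensitivity a plot of anomaly counts against window length is meant to expose.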
Figure 2. Varying window length versus number of anomalies.
We summarize the entire model’s life-cycle in Flowchart 1.
Flowchart 1. Life cycle of hybrid model for anomaly detection on streaming data.
Automated detection of anomalous behavior in cybersecurity data can help reduce overall alert volume for a large number of practical use cases. When the problem is detecting outliers in streaming time series, Median Absolute Deviation turns out to be a useful local baseline statistic for describing the history of an individual time series, in combination with the population’s global median and rare quantile values. We built a threshold-based model that dynamically adjusts to changing local behaviors in an individual time series along with global behaviors in the population. For a population of 25,000 individual time series over one month of data, we raised 50 anomalies in this fashion using the initial hyperparameter values we learned in our model iteration phase. There is more work to be done in making the parameter learning phase fully online and streaming, and we will talk about open source efforts around this topic in a future blog post.
[1] Christophe Leys, Christophe Ley, Olivier Klein, Philippe Bernard, Laurent Licata, Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median, Journal of Experimental Social Psychology, Volume 49, Issue 4, 2013, Pages 764-766, ISSN 0022-1031, https://doi.org/10.1016/j.jesp.2013.03.013.
[2] Leo H. Chiang, Randy J. Pell, Mary Beth Seasholtz, Exploring process data with the use of robust outlier detection algorithms, Journal of Process Control, Volume 13, Issue 5, 2003, Pages 437-449, ISSN 0959-1524, https://doi.org/10.1016/S0959-1524(02)00068-9.
[3] E. Keogh, J. Lin and A. Fu, “HOT SAX: efficiently finding the most unusual time series subsequence,” Fifth IEEE International Conference on Data Mining (ICDM’05), 2005, 8 pp., doi: 10.1109/ICDM.2005.79.
[4] He, Z., Xu, X., & Deng, S. (2003). Discovering cluster-based local outliers. Pattern Recognition Letters, 24(9), 1641-1650.
[5] BRO Intrusion Detection System: https://www.bro.org/