Time Series Anomaly Detection in Network Traffic: A Use Case for Deep Neural Networks


As the waves of the big data revolution cascade across industries, more and more forms of sensor data become valuable inputs to predictive analytics. This sensor data has an intrinsic temporal component, and this temporality lets us use a family of predictive analytics techniques called time series models [1]. In this blog post we explore the nature of time series modeling in the context of enterprise IT analytics, particularly for cybersecurity use cases.

Time series appear in many different industries and problem spaces, but in essence a time series is simply a data set whose values are indexed by time. In the research literature, a univariate time series usually refers to a data set with timestamps and a single value associated with each timestamp. Examples of univariate time series include the number of packets sent over time by a single host in a network, or the amount of electricity used, as recorded by a smart meter, for a single home over a year. Multivariate time series extend the original concept to the case where each timestamp has a vector or array of values associated with it. Examples of multivariate time series are the (P/E, price, volume) tuple for each time tick of a single stock, or the tuple of information for each netflow record within a single session (e.g. source and destination IP and port, packets and bytes sent and received, etc.).
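To make the distinction concrete, here is a minimal sketch of both shapes in plain Python. The start time, the sample values, and the field order of the flow tuple are made up for illustration and are not taken from any real capture.

```python
from datetime import datetime, timedelta

start = datetime(2017, 1, 1, 0, 0)  # arbitrary illustrative start time

# Univariate: one value (packets sent by a single host) per timestamp.
packet_counts = [120, 135, 128, 410]
univariate = [
    (start + timedelta(minutes=i), count)
    for i, count in enumerate(packet_counts)
]

# Multivariate: a vector of values per timestamp. The assumed field order is
# (packets_sent, bytes_sent, packets_received, bytes_received).
flow_vectors = [
    (12, 9000, 8, 4200),
    (15, 11000, 9, 5100),
    (11, 8700, 7, 3900),
]
multivariate = [
    (start + timedelta(minutes=i), vector)
    for i, vector in enumerate(flow_vectors)
]

for timestamp, value in univariate:
    print(timestamp.isoformat(), value)
```

The only structural difference is what sits next to each timestamp: a scalar in the univariate case, a fixed-length vector in the multivariate case.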


Time Series Models For Network Security

Time series data is particularly prevalent in any modeling scenario that depends on input from a modern IT infrastructure. Almost every component of the hardware and software used in enterprise networks has some sub-system that generates time series data. For cybersecurity models, univariate and multivariate time series form one of the cornerstone data structures, particularly for studying evolving patterns of behavior.

There is a multitude of use cases relevant to modeling problems in cybersecurity. To illustrate some of the common phenomena associated with this class of problems, we describe a few of the most common scenarios below.


Use Case 1: Detecting DDoS Attacks

With the growing prevalence of pay-for-play attack infrastructure, Distributed Denial of Service (DDoS) attack volume has hit all-time records, including the attack on Krebs last year using the Mirai botnet [1,2]. Denial of service attacks come in a couple of different varieties, including ‘Layer-4’ attacks and ‘Layer-7’ attacks, referencing the OSI 7-layer network model. Typically, application layer (Layer-7) attacks are more difficult to detect than lower layer attacks because they exploit some property of an API. In either case, though, we can use the data on overall flow, size/volume, and app layer traffic stats generated over time by our routers and perimeter infrastructure to build time series models for Layer-4 and Layer-7 inbound traffic patterns. A standard time series model is then overlaid on this data to detect change points in the normal traffic baseline of the key choke points and DMZ assets exposed to inbound network traffic. The goal of this model is to identify spikes in traffic patterns that are extreme deviations from the observed baseline, as in the figure below.
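As one possible illustration of "deviation from the observed baseline", the sketch below flags points that sit far outside a trailing-window mean. This is a simple rolling z-score, chosen here only as an example of a baseline detector; the function name, window length, threshold, and traffic values are all assumptions for illustration.

```python
from statistics import mean, stdev

def flag_spikes(series, window=12, threshold=4.0, min_sigma=1.0):
    """Return indices whose value deviates from the trailing-window
    baseline by more than `threshold` standard deviations."""
    flagged = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mu = mean(baseline)
        # Floor the spread so a perfectly flat baseline cannot divide by zero.
        sigma = max(stdev(baseline), min_sigma)
        if abs(series[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

# Hypothetical inbound requests per interval, with one obvious spike.
traffic = [100, 102, 98, 101, 99, 103, 100, 97, 102, 101, 99, 100,
           104, 950, 101]
print(flag_spikes(traffic))
```

Note that the `threshold` here is still a manually chosen constant, and that a spike contaminates the baseline of the points that follow it. Both limitations are part of the motivation for the learned models discussed later in this post.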

Use Case 2: Detecting Failed Login Spikes

Another common attack pattern, usually following a large leak of usernames or PII data onto the dark web, is called Account Takeover (ATO). For instance, after the leak of a large number of usernames for a financial institution, attacks can follow that target the login infrastructure of the banking applications. Typically attackers will script an automated test of username/password pairs from the list of stolen data, which produces a rapidly changing number of attempted logins per username on the application. There is potential for major financial gain even from a single successful login, so attackers are incentivized to target weak infrastructure in combination with the dark web's economy of stolen PII. This type of attack manifests as a time series problem, particularly in the application logs of the web service being targeted. A change point in the total number of failed logins related to a particular external subnet or other grouping is one primary indicator that an ATO attack is taking place. The typical pattern we look for in this case is intermittent spikes of activity spread out over time (see the figure below).
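Before any model can look for those spikes, the raw application log has to be turned into a time series per group. The sketch below shows one way this might be done, assuming each log event has already been parsed into a hypothetical `(epoch_seconds, source_ip, success)` tuple; the five-minute bucket and the /24 grouping are arbitrary illustrative choices.

```python
import ipaddress
from collections import Counter

def failed_logins_by_subnet(events, bucket_seconds=300, prefix=24):
    """Count failed logins per (time bucket, source subnet).

    events: iterable of (epoch_seconds, source_ip, success) tuples.
    """
    counts = Counter()
    for ts, src_ip, success in events:
        if success:
            continue
        # Align the timestamp to the start of its bucket.
        bucket = int(ts // bucket_seconds) * bucket_seconds
        # strict=False lets us pass a host address and get its network.
        subnet = ipaddress.ip_network(f"{src_ip}/{prefix}", strict=False)
        counts[(bucket, str(subnet))] += 1
    return counts

# Hypothetical parsed log events using documentation-range addresses.
events = [
    (0, "203.0.113.7", False),
    (10, "203.0.113.9", False),
    (20, "203.0.113.7", True),
    (30, "198.51.100.4", False),
    (310, "203.0.113.12", False),
]
for key, n in sorted(failed_logins_by_subnet(events).items()):
    print(key, n)
```

Each subnet then yields its own univariate series of failed-login counts, which is the input a change point or spike detector would consume.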


Use Case 3: Data Exfiltration

Finally, the last common use case for time series models is exfiltration of data. There are many sub-problems and behaviors to take into consideration here, depending on the particular security scenario. For instance, an enterprise may be dealing with a disgruntled insider who is actively dumping data from repos onto a physical USB disk or sending it out as attachments through Google Drive. Different paths of exfiltration require careful analysis of the protocols and methods involved. One rich area that lends itself to multivariate time series modeling is behavior involving DNS data*. In the example below we see that if we build the appropriate multivariate vector on each individual endpoint's DNS requests, we can predict multiple attack patterns with a single model. *See the JASK blog post here for more details on some of the insights into searching for key patterns related to DNS exfiltration [10].
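The post does not list which DNS features go into that vector, so the sketch below picks three plausible ones purely as an assumption: query count, mean query-name length, and mean character entropy of the leftmost label (long, high-entropy labels are a commonly cited sign of data being encoded into DNS names). The input tuple layout and function names are likewise hypothetical.

```python
import math
from collections import Counter, defaultdict

def shannon_entropy(text):
    """Character-level Shannon entropy of a string, in bits."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())

def dns_feature_vectors(queries, bucket_seconds=60):
    """Build one feature vector per (time bucket, endpoint).

    queries: iterable of (epoch_seconds, endpoint, query_name) tuples.
    Returns {(bucket, endpoint): (count, mean_length, mean_label_entropy)}.
    """
    grouped = defaultdict(list)
    for ts, endpoint, name in queries:
        bucket = int(ts // bucket_seconds) * bucket_seconds
        grouped[(bucket, endpoint)].append(name)

    features = {}
    for key, names in grouped.items():
        n = len(names)
        mean_length = sum(len(name) for name in names) / n
        mean_entropy = sum(
            shannon_entropy(name.split(".")[0]) for name in names
        ) / n
        features[key] = (n, mean_length, mean_entropy)
    return features

# Hypothetical queries from one endpoint within the same minute.
queries = [
    (0, "host-a", "www.example.com"),
    (5, "host-a", "mail.example.com"),
]
print(dns_feature_vectors(queries))
```

Sorting the resulting vectors by bucket for a given endpoint gives the multivariate time series that a single model can be trained on.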


Time Series Prediction Using Neural Nets

Neural networks have a long and interesting history as pattern recognition engines used in machine learning [4]. Over the last decade the advent of next-generation hardware for specific learning tasks (e.g. tensor processing units), along with breakthroughs in neural net training, has led us to the era of Deep Learning [6,7]. State-of-the-art libraries like TensorFlow and PyTorch provide high-level abstractions that make some of the most important techniques from Deep Learning available for solving business problems.

One of the most important aspects of leveraging time series output in security operations is building detections tuned to the highest-priority outcomes. With most of the toolsets and solutions designed for security operations center (SOC) workflows, the operator has to specify a manual threshold in order to detect time series outliers. Neural networks provide a nice solution, from an engineering standpoint, for cybersecurity models with temporal data: because they learn the baseline from the data itself, they help move detections past static thresholds.

In 1997 Hochreiter and Schmidhuber published the original paper that introduced the long short-term memory (LSTM) cell for neural net architectures [5]. Since then LSTMs have become one of the most flexible, best-of-breed solutions for a variety of classification problems in deep learning.

Traditional statistical/mathematical approaches to analyzing time series run over a specified time window. The length of this window must be determined in advance, and the results of these approaches are heavily influenced by that choice. Traditional machine learning algorithms require extensive feature engineering before a classifier can be trained. With any change in the input data, however, the dynamics of the features change as well, forcing a redesign of the feature vectors to maintain performance. If the features are not chosen appropriately during the feature extraction phase, there is a high chance of losing important information from the time series. LSTM, on the other hand, can learn long-term sequential patterns without the need for feature engineering; part of the magic here is the set of three gates (input, forget, and output) that control the cell's memory. Plain recurrent neural networks suffer from the vanishing gradient problem, in which the error signal shrinks as it is propagated back through many time steps and prevents the model from converging properly; LSTM was designed to overcome this. On account of these advantages, we turn to LSTM for modeling our time series.
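To show what the gates actually compute, here is a minimal NumPy sketch of a single step of the standard LSTM cell. It is not the TensorFlow implementation; the function names, the stacked weight layout (input, forget, output, candidate), and the random weights are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One forward step of a standard LSTM cell.

    x: (n_in,) input at this time step
    h_prev, c_prev: (hidden,) previous hidden state and cell state
    W: (4*hidden, n_in), U: (4*hidden, hidden), b: (4*hidden,)
    Rows are stacked in the order: input, forget, output, candidate.
    """
    hidden = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:hidden])               # input gate: how much new info to write
    f = sigmoid(z[hidden:2 * hidden])      # forget gate: how much old memory to keep
    o = sigmoid(z[2 * hidden:3 * hidden])  # output gate: how much memory to expose
    g = np.tanh(z[3 * hidden:])            # candidate values to write
    c = f * c_prev + i * g                 # additive cell-state update
    h = o * np.tanh(c)
    return h, c

# Run a short univariate sequence through a 3-unit cell with random weights.
rng = np.random.default_rng(0)
hidden, n_in = 3, 1
W = rng.normal(size=(4 * hidden, n_in))
U = rng.normal(size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)

h, c = np.zeros(hidden), np.zeros(hidden)
for value in [0.1, 0.4, 0.35, 0.9]:
    h, c = lstm_step(np.array([value]), h, c, W, U, b)
print(h.shape, c.shape)
```

The line `c = f * c_prev + i * g` is the key design choice: because the cell state is updated additively rather than being squashed through a nonlinearity at every step, the error signal can flow back across many time steps without shrinking as quickly as in a plain recurrent network.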


TensorFlow LSTM Model Layer-By-Layer

Using TensorFlow [13] we can build a template for processing arbitrary types of time series data. For a good introductory overview of TensorFlow and LSTM, check out some of the great books and blogs that have been published recently on the topic [9,11,12].

In our prototype example we build a simple architecture description of a neural network, specifying the number of layers and some related properties. We define our LSTM model to contain a visible layer with 3 neurons, followed by a hidden “dense” (densely connected) layer with two-dimensional output, and finally an activation layer. The model is trained as a regression problem, with mean squared error as the objective it minimizes. The final output is a single prediction.

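The input to the LSTM is higher-dimensional than the input to traditional machine learning models: each sample is itself a sequence, so the data is arranged as (samples, time steps, features) rather than (samples, features). A diagrammatic representation of our data is as shown: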
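Alongside the diagram, the same layout can be expressed in code. The sketch below slices a series into overlapping windows shaped (samples, time steps, features), the three-dimensional layout that LSTM layers expect. The helper name `make_windows`, the `look_back` parameter, and the choice of "next value of the first feature" as the target are assumptions for illustration, not details from the original prototype.

```python
import numpy as np

def make_windows(series, look_back):
    """Slice a time series into LSTM-style inputs and next-step targets.

    series: array of shape (n_steps,) or (n_steps, n_features)
    Returns X with shape (samples, look_back, n_features) and
    y with shape (samples,), the value of feature 0 right after each window.
    """
    series = np.asarray(series, dtype=float)
    if series.ndim == 1:
        series = series[:, None]  # treat univariate data as one feature
    n_samples = len(series) - look_back
    if n_samples <= 0:
        raise ValueError("series must be longer than look_back")
    X = np.stack([series[i:i + look_back] for i in range(n_samples)])
    y = series[look_back:, 0]
    return X, y

X, y = make_windows(np.arange(10), look_back=3)
print(X.shape, y.shape)
```

For a multivariate series the same function applies unchanged; only the last dimension of `X` grows to the number of features per timestamp.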


Algorithmic Scalability Notes

For univariate time series data, LSTM training scales linearly for a single time series (O(N), where N is the number of time steps). Training time is one of the drawbacks of LSTM networks, but because time series models are often embarrassingly parallel (each series can be trained independently), these problems are well suited to running on large GPU/TPU clusters.

To test whether our model overfit, we plotted training size against RMSE and saw that the error decreased as the training data increased (RMSE is a quick metric that is easy to use, but proper overfitting analysis requires a more detailed testing paradigm). This is the expected trend, since the model should predict better with more training data. The tests below were run on synthetic time series data on regular CPU cores.
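The procedure behind such a plot can be sketched as follows. To keep the example small and dependency-light, a least-squares autoregressive model stands in for the LSTM here, so the numbers it prints say nothing about LSTM performance. The synthetic noisy sine wave, the lag count, the training sizes, and the function names are all assumptions.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def learning_curve(X, y, train_sizes, n_test):
    """Fit on the first n rows for each n and score on a fixed held-out tail."""
    X_test, y_test = X[-n_test:], y[-n_test:]
    curve = []
    for n in train_sizes:
        # Stand-in model: linear least squares on lagged values (not an LSTM).
        coef, *_ = np.linalg.lstsq(X[:n], y[:n], rcond=None)
        curve.append((n, rmse(y_test, X_test @ coef)))
    return curve

# Synthetic series: a noisy sine wave.
rng = np.random.default_rng(0)
t = np.arange(400)
series = np.sin(0.1 * t) + rng.normal(scale=0.1, size=t.size)

lags = 5
X = np.stack([series[i:i + lags] for i in range(len(series) - lags)])
y = series[lags:]

# Training sizes stay well clear of the 100 held-out test rows.
curve = learning_curve(X, y, train_sizes=[10, 50, 200], n_test=100)
for n, err in curve:
    print(f"train size {n:>3}: test RMSE {err:.3f}")
```

Holding the test set fixed while only the training size varies is what makes the errors comparable across runs; comparing training error against this held-out error at each size is one step toward the more detailed overfitting analysis mentioned above.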


Part of the appeal of neural network methods for time series problems is that they let us move past traditional threshold-based detections and automate some key use cases. There is a lot of depth to this topic and the related engineering design. We have found Python and TensorFlow to be great tools for prototyping ideas and building operationalized solutions with low initial complexity. In the realm of cybersecurity, we can move many of the generic queries that end up being driven by fixed thresholds to a more dynamic paradigm driven by deep learning models. The benefit we see in choosing LSTM in these cases is that we get better data-driven detections while moving away from simple rule-based time series alerts.



