This is the second post in a multi-part series about employing “real” machine learning in network security. The remaining posts will follow with more details. See Part 1 for the introduction.
There are many definitions of machine learning. My first graduate class on the subject was called “computational statistics,” and I still prefer that term because when we “do machine learning” we are essentially doing the following:
There are a few key elements of this process representation, but it’s actually quite simple. The focus is on the data. Without data, you are doing some other kind of artificial intelligence, like a knowledge base, or formal logic, or something else. Machine learning relies on data, usually lots and lots of it. Data is typically represented, to the model, as a data matrix or design matrix, where each row represents a data element, xi, and each column is a feature. In many cases, features are just the raw, observed data attributes. But in some cases, the feature vector may be extended with associated data and derived fields.
The data may also include labels, which might be a privileged attribute from the data, or from another source. Generally, labels are what we are trying to predict from the data, like the slope of a line through trending data points, or the identity of people in an image. We denote the labels as a separate vector, y, with one entry yi for each data element xi.
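As a rough illustration of the design matrix and label vector described above, here is a minimal Python sketch. The connection records, field names, and the derived "bytes per second" feature are all invented for this example and are not taken from any real data set.

```python
# Hypothetical connection records; field names and values are invented for illustration.
records = [
    {"bytes_out": 1200, "bytes_in": 300, "duration_s": 2.0, "label": 0},
    {"bytes_out": 50000, "bytes_in": 100, "duration_s": 40.0, "label": 1},
    {"bytes_out": 900, "bytes_in": 450, "duration_s": 1.5, "label": 0},
]

RAW_FEATURES = ["bytes_out", "bytes_in", "duration_s"]


def to_row(record):
    """Build one feature vector x_i: raw attributes plus one derived field."""
    row = [float(record[name]) for name in RAW_FEATURES]
    # Derived feature (an assumption for this sketch): outbound bytes per second.
    row.append(record["bytes_out"] / record["duration_s"])
    return row


X = [to_row(r) for r in records]   # design matrix: one row per data element
y = [r["label"] for r in records]  # label vector, kept separate from X
```

Each row of X is one xi and each column is one feature; y holds the matching yi values in the same order.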
Two other key elements should be specified in advance: a model of our data and a loss function. The model reflects our assumptions about how the data is generated, structure in the data, etc. Sometimes we don’t know for sure, so we try a bunch of different models, sometimes after seeing how other models have fared. This can be dangerous because it can lead to spurious results, but that’s another topic entirely. Examples of models include linear regression, where we think the target (yi) is a linear function of the other attributes, or one of many other common models. Each such model has one or more (or many!) parameters that can be adjusted to see which parameter settings best fit the provided data.
The goal of machine learning, then, is to pick a model and perform parameter estimation given some data to yield a fitted model. This is also called model training, or finding p(y|x). To do that, we need a way to determine which parameter settings are the best fit to the data, and that is the role of the loss function. The loss function is rarely given much thought, and it is true that most models already provide a natural one, usually some quantity to be minimized or otherwise optimized. However, this entire process assumes the following:
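To make "model plus loss function plus parameter estimation" concrete, here is a small sketch of a one-feature linear regression fitted by least squares. The data points are invented, and squared error is assumed as the loss because it is the conventional choice for this model, not because the discussion above prescribes it.

```python
def predict(params, x):
    """Linear model: the target is assumed to be slope * x + intercept."""
    slope, intercept = params
    return slope * x + intercept


def squared_loss(params, xs, ys):
    """Mean squared distance between predictions and observed targets."""
    return sum((predict(params, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_least_squares(xs, ys):
    """Closed-form minimizer of squared_loss for a one-feature linear model."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Invented toy data, roughly following y = 2x + 1 with some noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

fitted = fit_least_squares(xs, ys)          # parameter estimation ("training")
fitted_loss = squared_loss(fitted, xs, ys)  # loss at the fitted parameters
guess_loss = squared_loss((2.0, 1.0), xs, ys)  # loss at a hand-picked guess
```

The fitted parameters give a lower loss than the hand-picked guess, which is all "best fit" means here: best with respect to this particular loss, under this particular model.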
This last one is a kicker, since loss functions are normally written as some sort of distance minimization. This means that if the model is just taken off the shelf (and this is normally the case), it is up to the data to contain sufficient information to provide separability and identifiability in the trained model. This is why machine learning is so data hungry: the model and loss-function side of the coin often does not contain sufficient information to constrain the parameter space search, especially toward real-world goals (incremental progress on toy data set problems is another issue, but in this discussion, we are interested in real-world machine learning applications).
Many of the amazing applications of applied machine learning in the past couple of years have been in the areas of image recognition and language translation. One of my personal favorites is the story of how Google transformed the Google Translate engine from a system based on pair-wise trained models to a general-purpose translator based on deep learning. There are a couple of amazing aspects to this story. One is that the translator appeared to learn an internal abstraction of some language constructs, enabling effective translation between two languages for which an explicit phrase translation had not been trained. The other is that Google transitioned this technology from R&D to product in under a year (!!). So where are all the other machine learning successes we should be seeing? Like maybe in security?
Image recognition and language translation, in addition to being two very large use cases, also have a common key quality: label consistency. That is, if an image contains a representation of a “cat,” that will always be a “cat.” This sounds almost silly at first, but it is critical. There are many potential use cases that do not have this quality – like news stories, for instance. Anything involving sentiment and trends is susceptible to what is referred to as “concept drift,” and if it is not explicitly accommodated in the model then the performance of the model over time will degrade – users of news recommenders from a few years ago will recognize this effect.
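A toy sketch of concept drift, under heavily simplified assumptions: a one-dimensional feature, a threshold classifier, and a label rule that shifts between training time and later use. The numbers are invented; the only point is that a model fitted to the old concept silently loses accuracy when the meaning of the label changes.

```python
def fit_threshold(xs, ys):
    """Pick the cut point with the fewest training errors for the rule x > t."""
    candidates = sorted(set(xs))
    return min(
        candidates,
        key=lambda t: sum((x > t) != bool(y) for x, y in zip(xs, ys)),
    )


def accuracy(threshold, xs, ys):
    """Fraction of points where the rule x > threshold matches the label."""
    return sum((x > threshold) == bool(y) for x, y in zip(xs, ys)) / len(xs)


xs = list(range(1, 11))
old_labels = [int(x > 5) for x in xs]  # the concept at training time
new_labels = [int(x > 8) for x in xs]  # same inputs, but the concept has drifted

t = fit_threshold(xs, old_labels)
acc_before = accuracy(t, xs, old_labels)  # evaluated against the old concept
acc_after = accuracy(t, xs, new_labels)   # evaluated after the drift
```

Nothing about the inputs changed between the two evaluations; only the relationship between input and label did, which is why drift has to be accommodated explicitly rather than detected from the features alone.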
The following figure shows testing examples from a deep learning classifier trained to recognize single objects in images:
Data sets like this do not have a problem with concept drift, or related ones like transfer learning, which is where a representative data sample from one source is similar to, but different from, data from another source. More technically, the data is sampled from different probability spaces, although one (sincerely!) hopes that there is some conditional dependence that can be modeled – assumptions, again.
If you have ever tried to analyze security data, you already know the punchline: our data rarely has labels, and if we can find labeled data, there is very little of it. On top of that, what counts as “unusual” or “potentially malicious” or whatever on one network is often not the same on another network (with different routing, policies, etc.). And the last nail in the coffin is that what we are really trying to detect, actual threat activity, is by definition constantly changing, as attackers and other miscreants continually evolve attacks to avoid the very detection we are trying to do.
These are among the primary reasons that some analysts believe machine learning is the wrong tool for detection, along with the related fact that these problems can lead to model failure in practice. It is also not trivial that many types of detections are designed merely as proxies for threat activity that cannot be directly or reliably detected. Although possibly well intended, such proxies can make the problem worse by confusing end users and clouding specificity metrics.
So, end of story? Give up and go home?
Happily, we are not constrained to simple input -> classification model -> output use cases. It is true that many introductory courses, MOOCs, blogs, etc., focus on such models, but that’s because they are the simplest possible machine learning models, and in many cases the most well-studied. But security does not contain the simplest possible use cases of anything (IMO).
One often unsung area of machine learning is the unsupervised sort. An analyst can do this without any labels or with minimal labels, although in the latter case, labels would be used to guide the analyst, not the model. Semi-supervised learning is another possibility, but it generally requires the labels to broadly cover the classes in the data, with the unlabeled data just providing more examples to learn from. It doesn’t work if some data classes are labeled preferentially over others. Over the years, I have not found this to work better than other approaches for security use cases, but as always, your mileage may vary.
In unsupervised learning, we are just trying to find p(x). The choice of model and loss function is particularly important here, and there may be some room for using non-standard loss functions. Unsupervised learning methods are used a lot during initial data exploration, because without labels, one is trying to identify “natural” structure within the data itself. However, this is usually an underspecified problem, so some assumptions from the problem domain must be used.
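As a minimal sketch of "finding p(x)" without labels, the example below fits a single Gaussian to some baseline observations and flags points that fall in a low-density region. The single-Gaussian model, the invented login counts, and the three-standard-deviation cutoff are all assumptions made for illustration; they stand in for the domain assumptions mentioned above.

```python
import math
import statistics


def fit_gaussian(xs):
    """Estimate p(x) as a single Gaussian; this model choice is the assumption."""
    return statistics.fmean(xs), statistics.pstdev(xs)


def log_density(x, mean, std):
    """Log of the Gaussian density at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


# Invented baseline, e.g. daily login counts for one account.
baseline = [48, 52, 50, 47, 53, 51, 49, 50]
mean, std = fit_gaussian(baseline)


def is_unusual(x, z_cutoff=3.0):
    """Flag x if its density is below that of a point z_cutoff stds from the mean.

    The cutoff is a hypothetical choice, not a recommended setting.
    """
    return log_density(x, mean, std) < log_density(mean + z_cutoff * std, mean, std)
```

With this baseline, is_unusual(51) is False and is_unusual(90) is True. A different model for p(x), or a different cutoff, would draw the line elsewhere, which is the sense in which the problem is underspecified.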
This leads us to another topic: transfer learning. We really don’t want to have to re-train models every time we upgrade an operating system or application, incorporate an acquired company’s network, or install our product on a new customer’s network. This is the bane of security analytics models. I don’t think Waymo cars pull off the road to re-train once they enter a new neighborhood, or if a pedestrian runs out in the middle of the street. And all such exception cases don’t need to be pre-coded and caught.
Such systems rely on a systems approach, rather than a single-algorithm approach. Instead of diving in to analyze each data feed at the source, data is organized and processed in steps, including quality and enrichment processing prior to more advanced information processing. This is essential in noisy and conditionally-sampled spaces like automated driving and computer networks.
Some might object to including an “Assumptions” box explicitly in the figure above, but I included this for a reason. There are always assumptions in any statistical model. Machine learning and data science efforts tend to fail when model assumptions are violated or ignored. But all is not lost for applications and problem domains that do not easily fit typical assumptions like stationarity (no drift), consistency (no transfer learning issues), and so on. The solution is to build systems that inject constraints in a reliable and consistent (and verifiable!) manner, from data ingest through model training and inference. Unfortunately, many data science curricula start with groomed data sets and isolated problems and therefore do not prepare analysts and developers for tackling this current phase of applied machine learning. Fortunately, there is a rich engineering history for signal processing and other steps related to managing these sorts of problems.
One reference I recommend for how to include real-world knowledge in a principled way into machine learning models is Judea Pearl’s Causality. There is an intro-level book that just came out by the same author called The Book of Why: The New Science of Cause and Effect. Pearl was instrumental in developing the Bayesian probabilistic networks that are used in innumerable machine learning systems, and this newer work is his proposal for how to move artificial intelligence to the next level by incorporating domain knowledge in a principled way, rather than through ad-hoc code and business rules, as is often done.
In the next and final post in this series, I will discuss how JASK has designed a next-generation system to incorporate domain knowledge with machine learning to provide detection in a security context – a multi-step design that does not rely on a single algorithm or a “golden” data source. Instead, much like an autonomous vehicle or “smart” weapon (another type of autonomous vehicle), the JASK system was designed from the start as a model-driven signal processing platform that can use a variety of different data sources and focuses on enabling the analyst.