Threat Hunting Part 3: Going Hunting with Machine Learning

Busy with proofs of concept at the end of the quarter, I’ve been on the prowl for lazy hunting ideas. Every security person’s dream is to have interesting data come to them, but is this possible? Apache Spark’s MLlib seemed like a good place to start the hunt.

I wanted to combine Apache Spark’s MLlib with inbound and outbound data to bubble up anomalous traffic talking to suspicious countries.* At JASK we leverage machine learning to produce behavior-based signals, and sometimes I decide to go hunting by stacking our raw data instead of using output from something like Spark+MLlib. These notebooks are a great way of pitting one behavior-based model against another approach, checking false positive rates, and figuring out whether your model is ready for prime time.

*Disclaimer: I’m a Security Engineer, not one of the ML experts on the team, but I know where to find help!

Here’s how to do it:

Step 1: Import Apache’s Machine Learning Libraries with KMeans

Step 2: Formulate what data you want to run KMeans against.

Here I’m querying the sum of the outbound bytes.

Step 3: Convert the dataset to type RDD[Array[Double]].

The query result above needs to be of type Double for KMeans, so we map it to Double. The reason for this is that num_en is a SchemaRDD. When you collect() on it, you get an Array[org.apache.spark.sql.Row], so num_en.collect()(0) gives you the first Row of the Array rather than a plain number.

The technical reason behind the Double requirement is that a dense vector is backed by a double array representing its entry values, while a sparse vector is backed by two parallel arrays: indices and values. For example, a vector (1.0, 0.0, 3.0) can be represented in dense format as [1.0, 0.0, 3.0] or in sparse format as (3, [0, 2], [1.0, 3.0]), where 3 is the size of the vector.
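To make the dense and sparse layouts concrete, here is a small plain-Python sketch. It is not the Spark Vectors API, and the helper names to_sparse and to_dense are made up for illustration; it only shows how the same example vector moves between the two formats.

```python
# Illustration only: plain Python, not Spark's Vectors API.
# to_sparse / to_dense are hypothetical helper names.

def to_sparse(dense):
    """Return (size, indices, values), keeping only the non-zero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def to_dense(size, indices, values):
    """Rebuild the full list of doubles from the sparse triple."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


dense = [1.0, 0.0, 3.0]
sparse = to_sparse(dense)
print(sparse)             # (3, [0, 2], [1.0, 3.0])
print(to_dense(*sparse))  # [1.0, 0.0, 3.0]
```

Either way, every entry ends up as a double, which is why the query result has to be mapped to Double before clustering.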

Step 4: Define the number of classes.

Step 5: Define the number of iterations.
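If you want a feel for what the number of classes and the number of iterations actually control, here is a tiny one-dimensional k-means in plain Python. This is a sketch of the general algorithm, not Spark's KMeans, and the function name, the starting-center choice, and the byte totals are all made up for illustration.

```python
# Illustration only: a tiny one-dimensional k-means (Lloyd's algorithm)
# in plain Python. This is not Spark's KMeans; names and data are made up.

def kmeans_1d(points, k, max_iterations):
    """Cluster a list of floats into k groups; return the sorted centers."""
    # Deterministic start: spread the initial centers across the sorted data.
    ordered = sorted(points)
    step = max(1, len(ordered) // k)
    centers = [ordered[min(i * step, len(ordered) - 1)] for i in range(k)]

    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[nearest].append(p)

        # Update step: move each center to the mean of its points.
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # converged before hitting the cap
            break
        centers = new_centers
    return sorted(centers)


# Hypothetical outbound byte totals: five ordinary hosts and one outlier.
outbound_bytes = [1200.0, 1500.0, 1100.0, 1300.0, 1400.0, 250000.0]
num_clusters = 2      # Step 4: number of classes
num_iterations = 20   # Step 5: upper bound on iterations
print(kmeans_1d(outbound_bytes, num_clusters, num_iterations))
# [1300.0, 250000.0]
```

With two classes the outlier ends up alone in its own cluster, which is exactly the kind of thing I want bubbled up. The iteration count is only a cap; the loop stops early once the centers stop moving.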

Step 6: Evaluate clustering by computing within set sum of squared errors.

This is where we evaluate the model and determine if it accurately represents the data.  I’m merely implementing this to get some anomalous traffic back as a test, and I am not a KMeans expert.

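The goal is to minimize the squared Euclidean distance between each point and the center of the cluster it is assigned to. Summed over all points, that quadratic error is the Within Set Sum of Squared Errors (WSSSE); lower values mean tighter clusters.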
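For the curious, the error metric is easy to compute by hand. The sketch below is plain Python with made-up points and centers, not the MLlib call itself (the MLlib guide's example gets the same quantity from the model's computeCost method).

```python
# Illustration only: compute the within-set sum of squared errors by hand.
# Plain Python with made-up data, not Spark's implementation.

def wssse(points, centers):
    """Sum, over all points, of the squared distance to the nearest center."""
    total = 0.0
    for p in points:
        total += min((p - c) ** 2 for c in centers)
    return total


points = [1.0, 2.0, 3.0, 10.0, 12.0]

# One center has to cover everything, so the error is large (roughly 101.2).
print(wssse(points, [5.6]))
# Two well-placed centers fit the two groups much more tightly.
print(wssse(points, [2.0, 11.0]))  # 4.0
```

Comparing the error across different numbers of classes is one rough way to judge whether the model represents the data.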

Step 7: Show the results.

Step 8: Save and load the model.

Step 9: Have your coworker look over your shoulder and tell you that Spark 2.0 has deprecated Apache’s MLlib!

BAAAAH!!! Defeated…except I wrote and implemented this a couple of months ago and I’m only now getting around to writing the blog post (better late than never!). The above still works. Strictly speaking, what Spark 2.0 did was put the RDD-based API (the spark.mllib package this notebook uses) into maintenance mode in favor of the DataFrame-based API in spark.ml: it still gets bug fixes and is still available for use, but no new features will be added to it. I’m not gonna let a good notebook go to waste because an ML library is being phased out. Spark may be moving on from the RDD-based MLlib, but that doesn’t change the fact that anomalies can be a good place to start a hunt, and K-Means is a helpful way to start that investigation.

Since the MLlib API I used sounds like it’s on its way out, I’m going to shift from querying anomalies with MLlib to leveraging the machine learning anomaly detection within JASK Trident. This way I’ll know that by the time our user community reads this blog post, the underlying detection method will still be around.

To kick off this series of notebook paragraphs, I have a “signal” table. If you aren’t a Trident customer, you’ll have to use your own anomaly detection, such as the example Apache MLlib implementation shown above. If you are a Trident customer, you are in luck, because we expose our anomaly detection to users via the signals table.

Step 1: Here we are querying the Trident signal table for anomaly detection results. I’m storing the results of the query in an array so that I can automate the analysis later.

The return type of the above query is a collection of arrays. To feed a single IP at a time into our secondary analysis, we need to pull just the ip.address out of each row of the result.

Step 2: To solve the Array of Arrays problem, we map the first item of each inner array into an Array of strings with the following easy paragraph.

Below you can see our result is now Array[String]. We are getting closer.
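If you are following along outside of the notebook, the same flattening idea looks like this in plain Python. The rows and addresses are made up (documentation-range IPs) and simply stand in for the query result.

```python
# Illustration only: the rows and addresses below are made up
# (documentation-range IPs), standing in for the query result.

rows = [
    ["192.0.2.10", "anomaly"],
    ["198.51.100.7", "anomaly"],
    ["203.0.113.45", "anomaly"],
]

# Keep only the first field of every row, giving a flat list of strings.
ip_addresses = [row[0] for row in rows]
print(ip_addresses)  # ['192.0.2.10', '198.51.100.7', '203.0.113.45']
```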

Step 3: Now that I have a clean list of suspicious IP addresses, I can throw that list through any type of secondary analysis and find out what these IPs were talking about. What destination ports were they talking over? What websites were they sending data to or receiving data from? Heck, now that I have a list of suspicious IPs, I can perform an auto-analysis of them, write them out to my firewall, and put a block rule in for each one. I could throw all of those questions and actions into a paragraph and then understand WHY my anomalies were being generated.

Here’s a quick look at the countries involved in my anomalies:
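As a rough idea of what one of those secondary-analysis paragraphs could do, here is a plain-Python sketch that counts destination ports per suspicious IP. The connection records, the field names src_ip and dst_port, and the function name are all assumptions for illustration, not the Trident schema.

```python
from collections import Counter

# Illustration only: hypothetical connection records and a hypothetical
# suspicious-IP list; the field names are assumptions, not a real schema.
suspicious_ips = ["192.0.2.10", "198.51.100.7"]

connections = [
    {"src_ip": "192.0.2.10", "dst_port": 80},
    {"src_ip": "192.0.2.10", "dst_port": 80},
    {"src_ip": "192.0.2.10", "dst_port": 443},
    {"src_ip": "198.51.100.7", "dst_port": 53},
    {"src_ip": "203.0.113.45", "dst_port": 22},  # not on the list, ignored
]


def ports_by_ip(ips, records):
    """Count the destination ports seen for each IP on the suspicious list."""
    wanted = set(ips)
    counts = {ip: Counter() for ip in ips}
    for rec in records:
        if rec["src_ip"] in wanted:
            counts[rec["src_ip"]][rec["dst_port"]] += 1
    return counts


for ip, ports in ports_by_ip(suspicious_ips, connections).items():
    print(ip, ports.most_common())
```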

My first question will be simple: How much traffic was being sent or received between the two hosts?
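In plain Python the idea behind that first question is just a sum per direction. The flow records and the field names src_ip, dst_ip, and bytes below are made up for illustration; your own tables will look different.

```python
from collections import defaultdict

# Illustration only: hypothetical flow records; the field names
# (src_ip, dst_ip, bytes) are assumptions, not a real schema.
flows = [
    {"src_ip": "192.0.2.10", "dst_ip": "198.51.100.7", "bytes": 1500},
    {"src_ip": "198.51.100.7", "dst_ip": "192.0.2.10", "bytes": 40000},
    {"src_ip": "192.0.2.10", "dst_ip": "198.51.100.7", "bytes": 500},
    {"src_ip": "192.0.2.10", "dst_ip": "203.0.113.45", "bytes": 900},
]


def bytes_between(host_a, host_b, records):
    """Total bytes sent in each direction between two hosts."""
    totals = defaultdict(int)
    for rec in records:
        pair = (rec["src_ip"], rec["dst_ip"])
        if pair in ((host_a, host_b), (host_b, host_a)):
            totals[pair] += rec["bytes"]
    return dict(totals)


print(bytes_between("192.0.2.10", "198.51.100.7", flows))
```

Keeping the two directions separate matters here: a host that receives far more than it sends tells a different story than one that is pushing data out.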

My second question follows from the finding that a set of anomalies clustered around destination port 80: what website URLs were being requested?

What I found from the above query was no results for one of the hosts!!

That led me to ask more about how many bytes were transferred:

Other paragraphs could ask similar questions based on the type of anomaly that was generated. What country was the traffic destined for? What was the HOST that was requested? Was this related to data exfiltration? What were the DNS queries? What were the DNS answers? I now have a notebook for auto-analysis to determine whether all those anomalies on your network were real threats or just users binge-watching World Cup Skiing (which happens to be what I am currently watching). Once you are comfortable with your auto-analysis, you are free to export it to a file or write it to HDFS and use it elsewhere, closing the feedback loop back inside the product for any future detections.

