
Dynamic Asset Discovery

 

1. The Problem

  • A large number of data breaches occur as a result of weak or inefficient perimeter protection. With the ever-increasing diversity among the devices being connected to a network and the constantly-evolving threats in the cyber world, recognizing and understanding devices on the network has become of the utmost importance.
  • With organizations growing rapidly and policies such as Bring Your Own Device (BYOD) in place, security teams can’t keep up with the data these devices generate on their networks, which increases the number of blind spots. To illuminate every corner of the network, automating asset discovery could prove beneficial. One way to automate asset discovery without human dependency is to classify assets using Machine Learning. With this objective in mind, in this post we’ll walk through how to apply Machine Learning to dynamically discover assets on a customer network.

 

2. Our Approach

  • We look at the Bro logs to identify the assets in the network. The most difficult part of working with network logs is converting them into an understandable data structure for further processing. We turn to graphs to visualize our network data. An example of such a visualization is shown below:

 

Why is a graph structure well suited to this problem domain?

Graphs are easy to build algorithmically and do not require any advanced artificial intelligence (AI). They’re also simple enough for non-data scientists to get the hang of. Besides being easy to understand, a graph is easy to label through classification. A labeled graph is useful for (1) identifying alarming assets and (2) understanding these assets. Labeling helps assign meaning to the large volume of data available, and a neat summary of the graph can be presented for “dashboard bling”. Once users are aware of the possible assets at a topological level, clustering can refine the picture further. Graphs also make it easy to trace the cause of a specific effect, since a simple graph traversal is often enough.

How is the graph built?

The current implementation creates a directed graph from the conn.log, in which the nodes are IP addresses and each edge points from the source IP to the destination IP of a connection. Note that each unique IP in the conn.log is represented by exactly one node in the graph.
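As a rough illustration of the graph construction described above, here is a minimal sketch using only the Python standard library. The sample rows are hypothetical and stand in for already-parsed conn.log records; `id.orig_h` and `id.resp_h` are the source and destination address fields of conn.log. Storing a connection count on each edge is an assumption added for illustration, not something the original implementation is known to do.

```python
from collections import defaultdict

# Hypothetical, already-parsed conn.log rows; only the address fields are used.
SAMPLE_CONN_ROWS = [
    {"id.orig_h": "192.168.1.10", "id.resp_h": "10.0.0.5"},
    {"id.orig_h": "192.168.1.10", "id.resp_h": "224.0.0.251"},
    {"id.orig_h": "192.168.1.11", "id.resp_h": "10.0.0.5"},
    {"id.orig_h": "192.168.1.10", "id.resp_h": "10.0.0.5"},  # repeated connection
]


def build_graph(rows):
    """Return ({source_ip: {dest_ip: connection_count}}, set_of_all_nodes)."""
    edges = defaultdict(lambda: defaultdict(int))
    nodes = set()
    for row in rows:
        src, dst = row["id.orig_h"], row["id.resp_h"]
        nodes.update((src, dst))   # each unique IP becomes exactly one node
        edges[src][dst] += 1       # directed edge: source -> destination
    return edges, nodes


edges, nodes = build_graph(SAMPLE_CONN_ROWS)
```

Repeated connections between the same pair of IPs collapse into a single directed edge, so the node count stays equal to the number of unique IPs in the log.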

How are features engineered?

The traffic flow log files are our primary dataset. From them we extract training features along three dimensions: network topology, graph features and traffic flow features. Network topology features capture the connections among different IPs and describe the topology around each IP. For example, L2M is the count of connections from a private IP to a broadcast or multicast IP. Graph features provide network information such as in-degree, out-degree, PageRank and so on. Traffic flow features are based on the flow of traffic in the network; they include the packet size, number of packets, number of services and so on for each IP.
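To make a few of these features concrete, the sketch below computes in-degree, out-degree and the L2M count for each IP from a list of (source, destination) pairs. It is an illustration under stated assumptions: the function name `ip_features` and the sample pairs are hypothetical, degrees are counted over distinct edges, and only the limited broadcast address 255.255.255.255 is treated as broadcast, because subnet-directed broadcasts cannot be recognized without knowing the netmask.

```python
import ipaddress
from collections import defaultdict

BROADCAST = ipaddress.ip_address("255.255.255.255")  # limited broadcast only


def ip_features(pairs):
    """pairs: iterable of (source_ip, dest_ip) strings, one per connection."""
    feats = defaultdict(lambda: {"in_degree": 0, "out_degree": 0, "L2M": 0})
    seen_edges = set()
    for src, dst in pairs:
        s, d = ipaddress.ip_address(src), ipaddress.ip_address(dst)
        # L2M: connections from a private IP to a broadcast or multicast IP.
        if s.is_private and (d.is_multicast or d == BROADCAST):
            feats[src]["L2M"] += 1
        # Degrees count distinct neighbours, so each directed edge counts once.
        if (src, dst) not in seen_edges:
            seen_edges.add((src, dst))
            feats[src]["out_degree"] += 1
            feats[dst]["in_degree"] += 1
    return dict(feats)


pairs = [
    ("192.168.1.10", "10.0.0.5"),
    ("192.168.1.10", "224.0.0.251"),
    ("192.168.1.11", "10.0.0.5"),
    ("192.168.1.10", "255.255.255.255"),
]
features = ip_features(pairs)
```

Traffic flow features such as packet size or service count would be accumulated per IP in the same loop from the corresponding log fields.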

Data Labeling

For the proof of concept, we manually label our data into these five categories: (1) Server (Windows Server, *nix fs/app servers), (2) Endpoint (workstation, laptop), (3) Mobile (tablet, phone), (4) IoT (firestick, nest, echo), and (5) Routers (access point).

One of the biggest challenges in Machine Learning is not algorithm selection but collecting sufficient training data, and this is also one of the biggest challenges we face when building these graphs. So far we have only 56 labeled data points out of a total of ~20K points. Even these 56 labeled points suffer from class imbalance (e.g. there are many more nodes labeled “endpoint” than “IoT”). This makes it hard to train traditional Machine Learning algorithms, never mind Deep Learning. To overcome this problem, we apply two data preprocessing methods: data oversampling and label propagation (LP).

 

Synthetic Generation

Data oversampling is applied to obtain more samples. The algorithm used in our experiment is SMOTE: Synthetic Minority Over-Sampling Technique [1]. Instead of over-sampling with replacement, SMOTE adds samples by generating synthetic points that lie between existing minority-class samples and their nearest neighbors.
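The core idea of SMOTE can be sketched in a few lines of standard-library Python. This is a simplified illustration of the basic technique from [1], not the implementation used in the experiment: the function name `smote`, its parameters and the sample points are all hypothetical, and the SVM-based variant described later in this post is not reproduced here.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of the base point within the minority class.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical two-dimensional feature vectors for a minority class.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6)
```

Because every synthetic point is a convex combination of two real minority samples, the new points stay inside the region the minority class already occupies rather than duplicating existing rows.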

 

Label Propagation

We turned to LP [2] to generate labels for our entire dataset from the 56 labels gathered so far. We use two versions of LP: (1) unsupervised LP for clustering or forming communities, and (2) LP for semi-supervised labelling of unlabeled data from known labeled data.

  • Unsupervised Label Propagation for cluster formation: This algorithm helps us form communities from unlabeled data. These clusters/communities can be used for pre-testing the data: they give us a general idea of what our data looks like and what to expect from it. They could also serve as a post-processing accuracy check, verifying whether our ML algorithm placed known-labelled data into the correct clusters.
  • Label Propagation for semi-supervised data labelling: This algorithm helps us propagate the known labels to the rest of the unlabelled dataset. After running 10-fold cross-validation on our dataset, we produced labels for the whole dataset based on the 56 labels we already know. The figure below shows the clusters formed. For ease of display, the IPs (representing nodes of the clusters) are mapped to unique integers and the labels are mapped to different colors, so each cluster is identified by the color of its nodes.
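The semi-supervised variant can be illustrated with a deliberately simplified sketch: each unlabeled node repeatedly adopts the majority label among its already-labeled neighbours, while the known labels stay fixed. This majority-vote scheme is a stand-in chosen for brevity and is not the probabilistic formulation of label propagation; the function name, the tiny graph and the two seed labels are all hypothetical.

```python
from collections import Counter


def propagate_labels(adjacency, seeds, max_rounds=20):
    """adjacency: {node: set(neighbours)} (undirected); seeds: {node: label}."""
    labels = dict(seeds)
    for _ in range(max_rounds):
        updates = {}
        for node, neighbours in adjacency.items():
            if node in seeds:  # known labels stay clamped
                continue
            votes = Counter(labels[n] for n in neighbours if n in labels)
            if votes:
                # Majority vote; ties are broken by label name for determinism.
                best = min(votes, key=lambda lab: (-votes[lab], lab))
                if labels.get(node) != best:
                    updates[node] = best
        if not updates:  # converged: no label changed in this round
            break
        labels.update(updates)
    return labels


adjacency = {
    "10.0.0.5": {"10.0.0.6", "10.0.0.7"},
    "10.0.0.6": {"10.0.0.5", "10.0.0.7"},
    "10.0.0.7": {"10.0.0.5", "10.0.0.6", "192.168.1.10"},
    "192.168.1.10": {"10.0.0.7", "192.168.1.11", "192.168.1.12"},
    "192.168.1.11": {"192.168.1.10", "192.168.1.12"},
    "192.168.1.12": {"192.168.1.10", "192.168.1.11"},
}
seeds = {"10.0.0.5": "Server", "192.168.1.12": "Endpoint"}
labels = propagate_labels(adjacency, seeds)
```

In this toy graph the two seed labels spread through their own tightly connected groups, and the single link between the groups is outvoted on both sides.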

Figure. Clusters formed by label propagation

To verify the accuracy of the labels being assigned, we applied a label spreading technique with an 80:20 training:testing split on the 56 known labels and found the accuracy metrics to be:

Label Spreading Model: 10 Labeled & 46 Unlabeled Points (56 total)

Models Bake-Off

Before selecting a model, we carried out a bake-off between different Machine Learning algorithms on our dataset, which consists of ~20K data points of which 56 are manually labeled.

In order to find the best model, we created a template to compare the performance of logistic regression, linear discriminant analysis, k-nearest neighbors, classification and regression trees (CART), naive Bayes, support vector machines, random forest and XGBoost. Each algorithm is evaluated with 10-fold cross-validation. According to the results, CART, random forest and XGBoost are currently the best models for dynamic asset discovery.
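A comparison template of this kind boils down to running every candidate through the same k-fold loop and collecting a score per model. The sketch below shows that loop in standard-library Python. To stay self-contained it uses two toy stand-in models (a 1-nearest-neighbour classifier and a majority-class baseline) rather than the algorithms listed above, and the data is made up; all names are hypothetical.

```python
import math
import random
from collections import Counter


def k_fold_accuracy(fit_predict, X, y, k=10, seed=0):
    """Mean accuracy of fit_predict(train_X, train_y, test_X) over k folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds covering all rows
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        preds = fit_predict(
            [X[i] for i in train], [y[i] for i in train], [X[i] for i in fold]
        )
        hits = sum(p == y[i] for p, i in zip(preds, fold))
        scores.append(hits / len(fold))
    return sum(scores) / len(scores)


def nearest_neighbour(train_X, train_y, test_X):
    """Predict the label of the closest training point."""
    return [
        train_y[min(range(len(train_X)), key=lambda j: math.dist(train_X[j], x))]
        for x in test_X
    ]


def majority_class(train_X, train_y, test_X):
    """Baseline: always predict the most frequent training label."""
    most_common = Counter(train_y).most_common(1)[0][0]
    return [most_common] * len(test_X)


# Toy, well-separated data standing in for the labeled feature vectors.
X = [(i * 0.01, 0.0) for i in range(10)] + [(5.0 + i * 0.01, 5.0) for i in range(10)]
y = ["Server"] * 10 + ["Endpoint"] * 10

models = [("1-NN", nearest_neighbour), ("majority", majority_class)]
results = {name: k_fold_accuracy(model, X, y) for name, model in models}
```

Using the same shuffled folds for every model keeps the comparison fair, since each candidate is scored on identical train/test partitions.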

Algorithm comparison result

 

Random Forest

As seen from the model bake-off, Random Forest showed promising results, so we polished this model further for in-depth analysis.

 

In our problem, there are 56 labeled samples. The dataset is partitioned randomly into a training set (66.7%) and a testing set (33.3%), and Random Forest is applied to classify the assets.

Classification results:

Experiment result after oversampling

In our case, an SVM classifier is used to find the support vectors, and synthetic samples are generated based on those support vectors. After SMOTE, the sample size increased from 56 to 188. The data distribution is shown below, with the classes encoded as (‘Server’, 0), (‘IoT’, 1), (‘Endpoint’, 2).

 

Data distribution

The new dataset is then fed to the Random Forest model. It is partitioned randomly into a training set (66.7%) and a testing set (33.3%). The results are shown as follows:

Classification result after oversampling

Confusion matrix after SMOTE sampling
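For readers who want to produce this kind of summary themselves, the sketch below shows how a confusion matrix and per-class precision and recall are derived from true and predicted labels. The labels are made up purely for illustration and are not the results of this experiment; the function names are hypothetical.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in classes] for t in classes]


def precision_recall(matrix, classes):
    """Return {class: (precision, recall)} computed from the matrix."""
    report = {}
    for i, name in enumerate(classes):
        tp = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column sum
        actual = sum(matrix[i])                    # row sum
        report[name] = (
            tp / predicted if predicted else 0.0,
            tp / actual if actual else 0.0,
        )
    return report


classes = ["Server", "IoT", "Endpoint"]
# Made-up labels for illustration only; these are not the post's results.
y_true = ["Server", "Server", "IoT", "IoT", "Endpoint", "Endpoint"]
y_pred = ["Server", "IoT", "IoT", "IoT", "Endpoint", "Server"]

matrix = confusion_matrix(y_true, y_pred, classes)
report = precision_recall(matrix, classes)
```

Reading the matrix row by row shows which true class is being confused with which prediction, which matters more than overall accuracy when classes are imbalanced.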

  • Future Work: Once we have successfully classified the assets in the network, this classification can be extended to compute risk scores for these assets. These risk scores could help us understand how critical an asset is to the organization or how sensitive it is.
  • Having achieved this primary level of asset discovery, we can use the information to uncover more granular details, such as connections with other assets, and to support maintenance, behavior monitoring, automated responses, etc.
  • It can give security analysts more information and speed up the security operations process. For example, if we find that the current IP belongs to an IoT device, we can investigate further to figure out whether it is a speaker, a firestick, a TV, etc. Moreover, we are planning to gather more data. Once the dataset is large enough, deep learning models such as LSTM, RNN, CNN and attention models can be applied.

In a nutshell, our approach can automatically identify the asset type for each IP in the network. Today’s security analysts are overwhelmed by large volumes of data from different sources; a Machine Learning-based asset identification method can ease this burden and make security detection more accurate and efficient.

 

References

[1] Chawla, Nitesh V., Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. “SMOTE: Synthetic Minority Over-sampling Technique.” Journal of Artificial Intelligence Research 16 (2002): 321-357.

[2] Zhu, Xiaojin, and Zoubin Ghahramani. “Learning from labeled and unlabeled data with label propagation.” (2002).

 
