1. The Problem
2. Our Approach
Why is a graph structure well-suited to this problem domain?
Graphs are easy to build algorithmically and require no advanced artificial intelligence (AI), and they are simple enough for non-data scientists to grasp. Beyond being easy to understand, a graph structure is also easy to label through classification, and the labeled graph is useful for (1) identifying alarming assets and (2) understanding those assets. Labels assign meaning to the surplus of available data, and a concise summary of the graph can be presented for “dashboard bling”. Once users are aware of the possible assets at a topological level, clustering helps further: with a graph at our disposal, tracing the cause of a specific effect reduces to simple graph traversal.
How is the graph built?
The current implementation builds a graph from the conn.log in which the nodes are IP addresses and a directed edge points from the source IP to the destination IP. Each unique IP in the conn.log is represented by exactly one node in the graph.
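As an illustration, here is a minimal sketch of this construction using only the Python standard library. It assumes a tab-separated Zeek conn.log in which `id.orig_h` and `id.resp_h` are the third and fifth columns; the function name and the exact column positions are our own assumptions, not the production code:

```python
import csv
from collections import defaultdict

def build_graph(conn_log_path):
    """Build a directed graph from a tab-separated conn.log: one node per
    unique IP, one directed edge per (source IP -> destination IP) pair."""
    edges = defaultdict(set)   # adjacency: src IP -> set of dst IPs
    nodes = set()
    with open(conn_log_path) as f:
        for row in csv.reader(f, delimiter="\t"):
            if not row or row[0].startswith("#"):   # skip Zeek header lines
                continue
            src, dst = row[2], row[4]   # assumed id.orig_h / id.resp_h columns
            nodes.update((src, dst))
            edges[src].add(dst)
    return nodes, edges
```

Because repeated connections between the same pair collapse into one edge here, edge multiplicity would need a counter if flow counts matter downstream.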
How is feature engineering done?
The traffic flow log files are our primary dataset. We extract information from the traffic flow to generate training features along three dimensions: network topology, graph features, and traffic flow features. Network topology features capture the connections among different IPs and describe the topology around each IP; for example, L2M is the count of connections from a private IP to a broadcast or multicast IP. Graph features give us network information such as in-degree, out-degree, PageRank, and so on. Traffic flow features are based on the flow of traffic in the network, collecting the packet size, packet count, service count, and so on for each IP.
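A rough sketch of how the degree and topology counts described above might be computed from an edge list. The helper names `degree_features` and `l2m_count` are hypothetical, and the broadcast check is deliberately crude (it only matches `.255` suffixes):

```python
import ipaddress
from collections import Counter

def degree_features(edges):
    """edges: iterable of (src, dst) IP pairs from the connection graph.
    Returns {ip: (out_degree, in_degree)} for every IP seen."""
    out_deg, in_deg = Counter(), Counter()
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    return {ip: (out_deg[ip], in_deg[ip]) for ip in set(out_deg) | set(in_deg)}

def l2m_count(edges):
    """L2M sketch: per source IP, count connections from a private IP to a
    multicast or (crudely detected) broadcast destination."""
    counts = Counter()
    for src, dst in edges:
        s, d = ipaddress.ip_address(src), ipaddress.ip_address(dst)
        if s.is_private and (d.is_multicast or dst.endswith(".255")):
            counts[src] += 1
    return counts
```

PageRank and the flow-level aggregates (packet size, packet count, service count per IP) would be computed similarly over the same edge and flow records.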
For the proof of concept, we manually label our data into these five categories: (1) Server (Windows Server, *nix fs/app servers), (2) Endpoint (workstation, laptop), (3) Mobile (tablet, phone), (4) IoT (firestick, nest, echo), and (5) Routers (access point).
One of the biggest challenges in Machine Learning is not algorithm selection but collecting sufficient training data, and it is also one of the biggest challenges we face when building these graphs. So far we have only 56 labeled data points out of ~20K total. Even these 56 labeled points suffer from class imbalance (e.g. there are many more nodes labeled “endpoint” than “IoT”). This poses a challenge even for traditional Machine Learning algorithms, never mind Deep Learning. To overcome this problem we apply two data preprocessing methods: data oversampling and label propagation (LP).
Data oversampling is applied in order to get more samples. The algorithm used in our experiment is SMOTE: Synthetic Minority Over-sampling Technique (Chawla et al., 2002). Instead of over-sampling with replacement, SMOTE adds more samples by generating synthetic points.
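The core SMOTE idea can be sketched in a few lines of plain Python: each synthetic point is drawn uniformly on the line segment between a minority sample and its nearest minority-class neighbour. This is a toy sketch, not our production pipeline; for real use a library implementation (e.g. imbalanced-learn's) is preferable:

```python
import math
import random

def nearest_neighbor(x, others):
    """Nearest minority-class neighbour by Euclidean distance."""
    return min(others, key=lambda o: math.dist(x, o))

def smote(minority, n_new, seed=0):
    """Generate n_new synthetic minority points: each lies on the segment
    between a random minority sample and its nearest minority neighbour."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        nb = nearest_neighbor(x, [m for m in minority if m is not x])
        gap = rng.random()                       # uniform in [0, 1]
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic
```

The full algorithm picks a random neighbour among the k nearest rather than only the single nearest, but the interpolation step is the same.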
We turned to LP to help us generate labels for our entire dataset from the 56 labels gathered so far. We have come up with two versions of LP: (1) unsupervised LP for clustering or forming communities, and (2) LP for semi-supervised labeling of unlabeled data from known labeled data (Zhu &amp; Ghahramani, 2002).
Figure. Clusters formed by label propagation
To verify the accuracy of the labels being assigned, we carried out a label spreading technique with an 80:20 training:testing split on the 56 known labels and found the accuracy metrics to be:
Label Spreading Model: 10 Labeled & 46 Unlabeled Points (56 total)
Before selecting a model, we carried out a bake-off between different Machine Learning algorithms on our dataset which consists of ~20K data points of which 56 are manually labeled.
In order to find the best model, we created a template to compare the performance of logistic regression, linear discriminant analysis, k-nearest neighbors, classification and regression trees (CART), Naive Bayes, support vector machines, Random Forest, and XGBoost. A 10-fold cross-validation procedure is used to evaluate each algorithm. According to the results, CART, Random Forest, and XGBoost are currently the best models for detecting dynamic assets.
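The bake-off template can be sketched with scikit-learn's 10-fold cross-validation. Synthetic data stands in for our features here, and XGBoost is omitted since it lives outside scikit-learn; the remaining model choices mirror the list above:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the conn.log-derived feature matrix.
X, y = make_classification(n_samples=200, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

models = {
    "LR":   LogisticRegression(max_iter=1000),
    "LDA":  LinearDiscriminantAnalysis(),
    "KNN":  KNeighborsClassifier(),
    "CART": DecisionTreeClassifier(random_state=0),
    "NB":   GaussianNB(),
    "SVM":  SVC(),
    "RF":   RandomForestClassifier(random_state=0),
}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
results = {name: cross_val_score(m, X, y, cv=cv).mean()
           for name, m in models.items()}
```

`results` then holds one mean 10-fold accuracy per algorithm, which is the comparison plotted in the figure below.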
Algorithm comparison result
As seen from the model bake-off, Random Forest showed promising results, so we further polished this model for in-depth analysis.
In our problem, there are 56 labeled samples. The dataset is partitioned randomly into a training set and a testing set in a 66.7%/33.3% proportion, and a Random Forest is applied to classify the assets.
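A sketch of this train/test protocol with scikit-learn, where synthetic data again stands in for the 56 labeled samples:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 56 labeled samples across 3 of the classes.
X, y = make_classification(n_samples=56, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

# 66.7% / 33.3% stratified split, as in the experiment.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3,
                                          stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```

Stratifying the split matters here: with so few samples, a random split can otherwise leave a minority class entirely out of the test set.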
Experiment result after oversampling
In our case, an SVM classifier is used to find the support vectors, and synthetic samples are generated around those support vectors (the SVM variant of SMOTE). After SMOTE, the sample size increases from 56 to 188. The data distribution is shown below: (‘Server’, 0), (‘IoT’, 1), (‘Endpoint’, 2).
The new dataset is then fed to the Random Forest model. The dataset is partitioned randomly into a training set and a testing set in a 66.7%/33.3% proportion. The results are as follows:
Classification result after oversampling
Confusion matrix after SMOTE sampling
In a nutshell, our approach can automatically identify the asset type for each IP in the network. Today’s security analysts are overwhelmed by a large volume of data from different sources; a Machine Learning based asset identification method can relieve them of this burden and make security detection more accurate and efficient.
Chawla, N. V., Bowyer, K. W., Hall, L. O., &amp; Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357.
Zhu, X., &amp; Ghahramani, Z. (2002). Learning from labeled and unlabeled data with label propagation.