
Profiling VIP Accounts in User-Centric Data Streams Part 2: Real-Time Modeling


In this post, we continue our discussion of use cases involving account takeover and credential access in enterprise data sets. In the first part of this series, we defined a VIP account as any account with privileged or root-level access to systems or services. These VIP accounts are important to monitor for changes in behavior, particularly because they have critical access to key parts of the enterprise. As a follow-up to our first post, this blog describes a real-time approach for automatically profiling VIP accounts and detecting when they are potentially being misused.

Real-time user data in Kafka

User data is constantly changing and arrives in high volumes. With this in mind, we modeled our problem as a Kafka Streams application. Kafka is a scalable, efficient distributed streaming platform that serves records reliably. The architecture also scales with the user population by adding nodes, which is critical in environments with hundreds of thousands of users, where per-user processing can otherwise become a bottleneck. We use the Kafka API to maintain a sliding window of statistics per individual account from LDAP data (such as Active Directory event logs) and derive a baseline for all VIPs simultaneously (see Fig 1. for a simple reference architecture).
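The windowed per-account aggregation can be sketched in a few lines of plain Python (the production version uses the Kafka Streams API; the account and device names below are hypothetical, and events are assumed to arrive roughly in timestamp order):

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

WINDOW = timedelta(days=7)  # one of the sliding-window sizes we experimented with

class AccountWindow:
    """Maintains a sliding window of (timestamp, device) access events per account."""
    def __init__(self, window=WINDOW):
        self.window = window
        self.events = defaultdict(deque)  # account -> deque of (timestamp, device)

    def add(self, ts, account, device):
        q = self.events[account]
        q.append((ts, device))
        # Evict events that have fallen out of the sliding window.
        while q and ts - q[0][0] > self.window:
            q.popleft()

    def device_counts(self, account):
        """Per-device access counts for this account within the current window."""
        counts = defaultdict(int)
        for _, device in self.events[account]:
            counts[device] += 1
        return dict(counts)

w = AccountWindow()
t0 = datetime(2019, 1, 1)
w.add(t0, "svc_admin", "dc01")
w.add(t0 + timedelta(days=2), "svc_admin", "dc01")
w.add(t0 + timedelta(days=9), "svc_admin", "fs02")  # pushes the first event out of the window
print(w.device_counts("svc_admin"))  # → {'dc01': 1, 'fs02': 1}
```

In the Kafka application, the same logic runs per partition, with the stream keyed by account name so that each account's window stays on one node.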

Fig 1. Kafka Streams App https://kafka.apache.org/intro


Modeling VIP Accounts

In our model, a graph is built and continuously updated to detect changes in account access patterns. The technical details are sketched below. The core algorithms are based on ideas from a number of well-known research sources and combine methods originally outlined in this blog.

Build the graph of account-to-device access. We build the graph with a simple rule: in the record stream, if an account accesses a device, an edge is drawn between the account and the device. The number of times the account accesses that device is attached to the edge as a weight. This graph is then used to derive the popularity of each account using the PageRank algorithm.
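The graph-building rule can be sketched as follows; the access records here are hypothetical, and a `Counter` keyed by (account, device) pairs stands in for the weighted edge set:

```python
from collections import Counter

# Stream of (account, device) access records, e.g. parsed from AD event logs.
records = [
    ("alice", "dc01"), ("alice", "dc01"), ("alice", "fs02"),
    ("bob", "dc01"), ("carol", "fs02"),
]

# Weighted bipartite edges: (account, device) -> number of observed accesses.
edges = Counter(records)

print(edges[("alice", "dc01")])  # → 2
print(edges[("bob", "dc01")])    # → 1
```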

Compute a popularity metric. In our model we create “popularity scores” using PageRank. In this particular example, a higher PageRank score means a higher likelihood that the account is a VIP (i.e., the account is popular in the graph).

Detect High-Risk Changes in VIP Account Behavior. For each VIP account, we compute an access pattern matrix and assign it an individual risk score. The risk posed by the matrix of account history is computed using singular value decomposition (SVD) via the following steps:

  • For each day, the account's access pattern is grouped by server.
  • A matrix of counts is built representing the account's typical weekly access to Windows resources. The heat map of this matrix for one account is shown below: each row represents a different server, and each column a different date.


Figure 2 shows access patterns for the past eight days. From the heat map, we can see that the access pattern for account 2's eighth day is significantly different from the previous seven days. This is easy to detect algorithmically by comparing each day's “SVD score”: essentially, compute a score for each matrix and compare it to the score of the previous matrix.
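One way to compute such an “SVD score”, sketched here with NumPy and hypothetical access counts, is to take the dominant left singular vector of the baseline matrix as the account's typical access profile and score each new day by how much of its activity falls outside that profile:

```python
import numpy as np

# Hypothetical counts: rows = servers, columns = days (7 baseline days).
baseline = np.array([
    [5, 6, 5, 7, 6, 5, 6],   # server A: accessed heavily every day
    [1, 0, 1, 1, 0, 1, 1],   # server B: accessed occasionally
    [0, 0, 0, 0, 0, 0, 0],   # server C: normally untouched
], dtype=float)
new_day = np.array([0, 1, 9], dtype=float)  # sudden heavy access to server C

# The dominant left singular vector summarizes the typical access profile.
U, s, Vt = np.linalg.svd(baseline, full_matrices=False)
profile = U[:, 0]

def svd_score(day):
    """Norm of the day's activity that lies outside the dominant access profile."""
    residual = day - profile * (profile @ day)
    return np.linalg.norm(residual)

typical = max(svd_score(baseline[:, j]) for j in range(baseline.shape[1]))
print(svd_score(new_day), typical)  # the anomalous day scores far higher
```

Function and variable names here are illustrative; the scoring rule (residual against the dominant singular vector) is one common choice among several for SVD-based anomaly detection.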

Fig 2. Working-week access pattern for a VIP with PageRank in the top 0.01 percent of OS Query data


Relevant Data Sets

The data we used to prototype our model contained, at a minimum, fields for account name, system name, and timestamp. The data sources in our Kafka application included OS Query, McAfee EPO, Cylance, and Active Directory. For every stream, we partitioned the data by account name and defined sliding windows of 5, 7, and 10 days over each partition.


Our goal was to describe an intuitive way to build a streaming application to determine when VIP accounts have a change in behavior. We also wanted to create a streaming framework to accelerate the time to detect account theft and privilege abuse. This gives us a way to model complex interactions of hundreds of thousands of users and break down the problem to a few key accounts that should be monitored.


Summary of PageRank: PageRank is a graph algorithm that addresses the general problem of ranking the relative importance of a node in a graph. The algorithm was created by Larry Page and Sergey Brin and was first described in the paper that formed the original foundation for Google's search engine: “The PageRank Citation Ranking: Bringing Order to the Web.” Wikipedia describes the original inception of the algorithm in terms of ranking world wide web pages, stating “PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.” For this project, we borrowed this type of metric and applied it to a graph modeled on user and system account access information. We used a variety of data sets to build various types of graphs and applied different metrics to derive, in a data-driven way, context about how important a specific account is relative to an entire enterprise of interactions across systems and humans.
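A minimal power-iteration sketch of this metric, in plain Python on a hypothetical account-to-device graph (treated as undirected and weighted by access counts, which is one reasonable modeling choice among several), looks like:

```python
# Weighted edges from the account-to-device access graph (hypothetical counts).
edges = {
    ("alice", "dc01"): 40, ("bob", "dc01"): 2, ("carol", "dc01"): 3,
    ("alice", "fs02"): 1,  ("dave", "fs02"): 1,
}

# Build an undirected weighted adjacency structure so rank flows both ways.
nodes = sorted({n for e in edges for n in e})
out_weight = {n: 0.0 for n in nodes}
adj = {n: [] for n in nodes}
for (a, d), w in edges.items():
    adj[a].append((d, w)); adj[d].append((a, w))
    out_weight[a] += w;    out_weight[d] += w

def pagerank(damping=0.85, iters=50):
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {}
        for n in nodes:
            # Each neighbor passes on rank proportional to the edge weight.
            incoming = sum(rank[m] * w / out_weight[m] for m, w in adj[n])
            new[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = new
    return rank

ranks = pagerank()
# The heavily accessed domain controller, and the account hammering it, rank highest.
print(sorted(ranks, key=ranks.get, reverse=True))
```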


Summary of SVD: Singular value decomposition (SVD) is a classic matrix factorization method in machine learning, often used for data pre-processing. In our project, SVD takes a rectangular matrix of row-oriented activity for each account (defined as P, a k × d matrix, where k is the number of devices the account accesses and d represents time). The singular value decomposition is:

P = UΣVᵀ

where U contains the left singular vectors, V contains the right singular vectors, and Σ is diagonal and holds the singular values, arranged in descending order. The effect of SVD is to reduce the input activity matrix to a set of constituent parts, in order of importance as indicated by the magnitude of the singular values. Unlike other factorization methods, SVD has minimal requirements for pre-processing and regularity of the data; for instance, in most cases the input does not need to be pre-normalized or de-trended prior to factorization.
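A small NumPy example, with a hypothetical 2 × 3 activity matrix, illustrates the decomposition and the descending order of the singular values:

```python
import numpy as np

# P: k x d activity matrix (k = devices accessed, d = days), hypothetical counts.
P = np.array([[3., 4., 3.],
              [1., 0., 1.]])

U, s, Vt = np.linalg.svd(P, full_matrices=False)

# Singular values come back in descending order,
# and U * diag(s) * Vt reconstructs P exactly.
assert all(s[i] >= s[i + 1] for i in range(len(s) - 1))
assert np.allclose(U @ np.diag(s) @ Vt, P)
```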
