How we deal with sparse data at SentinelOne

Aug 3, 2020

At SentinelOne, we work with a lot of sparse data. Our static engine uses many features about files on the client’s computer. We scan those files and extract a numeric array to feed into our models. That data tends to be sparse. Luckily, numpy and scipy have many great features for working with sparse data, but sometimes, we need a little extra.

Spartans

We started writing more and more functions and methods, which we eventually packaged together to create Spartans (SPARse Tools for ANalysis). Spartans gives you a rich set of functions for working with your sparse matrices. It also has a (beta) FeatureMatrix: a sparse matrix object that carries feature names and works with out-of-the-box algorithms from sklearn and similar libraries.

import spartans
import pandas as pd
import numpy as np
from scipy import sparse

Our Main Goal in Mind

At some point, we understood that our models performed really well, but we wanted to understand our features a little better. We wanted a real human understanding of our features and not just their mathematical effect. We started by trying to create a simple correlation plot. In pandas this would be easy to achieve:

df = pd.read_csv('data/data.csv',index_col=0)

cor = df.corr()
cor

Alas, sparse matrices give us a harder time. Calling pd.DataFrame.corr or np.corrcoef on a sparse matrix will result in an error.

Basic Building Block

numpy does give us an efficient way to compute the mean: np.mean, which luckily works with sparse matrices.

m = np.array([[1, -2, 0, 50, 1],
              [0, 0, 0, 100, 0],
              [1, 0, 0, 80, 0],
              [1, 4, 0, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 4, 0, 0, 0],
              [0, 0, 0, -50, 0]])
c = sparse.csr_matrix(m)
c

Out[118]:

<7x5 sparse matrix of type '<class 'numpy.int64'>'
	with 11 stored elements in Compressed Sparse Row format>

In [119]:

np.mean(m, axis=0)

Out[119]:

array([ 0.4286,  0.8571,  0.    , 25.7143,  0.1429])

In [120]:

np.mean(c, axis=0)

Out[120]:

matrix([[ 0.4286,  0.8571,  0.    , 25.7143,  0.1429]])

University Pays Off

numpy cannot compute the variance of a sparse matrix out of the box. But we know this little trick:

Var(X) = E[X^2] - (E[X])^2

It may look complex written as a formula, but it’s actually pretty easy in code.

X2  = c.power(2)
EX  = np.mean(c,axis=0)
EX2 = np.mean(X2, axis=0)
E2X = np.power(EX,2)
V = EX2 - E2X

We incorporated all of this into Spartans, and now we just call spartans.variance(c, axis=0). And yes, as in numpy, we can compute everything along either of the two axes.
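If you want to try the trick without installing anything, here is a minimal sketch that uses only numpy and scipy. The helper name sparse_variance is hypothetical and is not the Spartans implementation. It computes the population variance (dividing by n), which is also what np.var does by default.

```python
import numpy as np
from scipy import sparse


def sparse_variance(x, axis=0):
    """Population variance of a scipy sparse matrix along `axis`.

    Hypothetical helper for illustration, not part of Spartans.
    Uses Var(X) = E[X^2] - (E[X])^2, so the matrix is never densified.
    """
    x = sparse.csr_matrix(x, dtype=np.float64)
    ex = np.asarray(x.mean(axis=axis)).ravel()            # E[X]
    ex2 = np.asarray(x.power(2).mean(axis=axis)).ravel()  # E[X^2]
    return ex2 - ex ** 2


m = np.array([[1, -2, 0, 50, 1],
              [0, 0, 0, 100, 0],
              [1, 0, 0, 80, 0],
              [1, 4, 0, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 4, 0, 0, 0],
              [0, 0, 0, -50, 0]])
c = sparse.csr_matrix(m)

variance_per_column = sparse_variance(c, axis=0)
variance_per_row = sparse_variance(c, axis=1)
```

One caveat: subtracting two large, nearly equal numbers loses floating-point precision, so this formula can be inaccurate when a feature's mean is very large compared to its spread.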

Correlation Matrix

Our objective was to compute the correlation matrix. For this, we first need the covariance matrix. spartans provides methods for both.

In [64]:

spartans.cov(c)

Out[64]:

matrix([[   0.2449,   -0.0816,    0.    ,    7.551 ,    0.0816],
        [  -0.0816,    4.4082,    0.    ,  -36.3265,   -0.4082],
        [   0.    ,    0.    ,    0.    ,    0.    ,    0.    ],
        [   7.551 ,  -36.3265,    0.    , 2395.9184,    3.4694],
        [   0.0816,   -0.4082,    0.    ,    3.4694,    0.1224]])

From here, it is easy to get the correlation matrix: divide each covariance entry by the product of the standard deviations of its row feature and its column feature.

import matplotlib.pyplot as plt

# Plot the absolute correlation between every pair of features
fig, ax = plt.subplots(figsize=(10,10))
plt.imshow(np.abs(spartans.corr(c)), vmin=0,vmax=1, cmap='Purples')
plt.colorbar()
ax.grid()

Have you noticed the grey row and column? The correlation of feature 2 with every other feature is np.nan. If you look at the data again, you’ll see that it is a constant feature with a variance of 0, so its correlation is a 0/0 division.
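The covariance and correlation steps, including the nan row and column for the constant feature, can be reproduced with plain numpy and scipy. This is a sketch under assumptions: sparse_cov and sparse_corr are hypothetical names, not the Spartans functions, and the covariance is the population one (dividing by n), which is consistent with the numbers printed above.

```python
import numpy as np
from scipy import sparse


def sparse_cov(x):
    """Population covariance between the columns of a sparse matrix.

    Hypothetical helper; uses Cov(X, Y) = E[XY] - E[X]E[Y].
    """
    x = sparse.csr_matrix(x, dtype=np.float64)
    n_rows = x.shape[0]
    mean = np.asarray(x.mean(axis=0))        # shape (1, n_cols)
    exy = (x.T @ x).toarray() / n_rows       # E[XY], dense n_cols x n_cols
    return exy - mean.T @ mean


def sparse_corr(x):
    """Correlation matrix: cov[i, j] / (std[i] * std[j])."""
    cov = sparse_cov(x)
    # Clip tiny negative variances caused by floating-point round-off
    std = np.sqrt(np.clip(np.diag(cov), 0, None))
    # Constant columns have std == 0, so 0/0 gives nan; silence the warning
    with np.errstate(divide="ignore", invalid="ignore"):
        return cov / np.outer(std, std)


m = np.array([[1, -2, 0, 50, 1],
              [0, 0, 0, 100, 0],
              [1, 0, 0, 80, 0],
              [1, 4, 0, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 4, 0, 0, 0],
              [0, 0, 0, -50, 0]])
c = sparse.csr_matrix(m)

cov = sparse_cov(c)
corr = sparse_corr(c)
```

Note that the result is a dense n_cols x n_cols array, so this approach fits when the number of features is moderate even if the number of rows is large.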

Indexing

spartans provides easy methods for indexing. A common problem with sparse matrices is that sometimes entire columns are zero. We’ll get rid of those using spartans.non_zero_index(c, axis=0, as_bool=False), which gives us array([0, 1, 3, 4]). Then we’ll slice accordingly:

In [78]:

C = c[:,spartans.non_zero_index(c, axis=0, as_bool=False)].todense()
C

Out[78]:

matrix([[  1,  -2,  50,   1],
        [  0,   0, 100,   0],
        [  1,   0,  80,   0],
        [  1,   4,   0,   0],
        [  0,   0,   0,   0],
        [  0,   4,   0,   0],
        [  0,   0, -50,   0]], dtype=int64)
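For comparison, the same all-zero-column filter can be sketched with scipy alone. This is not how Spartans does it internally (that is not shown here); it relies on the getnnz method of scipy sparse matrices and assumes the matrix holds no explicitly stored zeros.

```python
import numpy as np
from scipy import sparse

m = np.array([[1, -2, 0, 50, 1],
              [0, 0, 0, 100, 0],
              [1, 0, 0, 80, 0],
              [1, 4, 0, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 4, 0, 0, 0],
              [0, 0, 0, -50, 0]])
c = sparse.csr_matrix(m)

# Number of stored entries per column. If explicit zeros may be stored,
# call c.eliminate_zeros() first so they are not counted.
nnz_per_column = c.getnnz(axis=0)

# Integer positions of the columns that contain at least one value
keep = np.flatnonzero(nnz_per_column > 0)

# Slice the sparse matrix without densifying it
trimmed = c[:, keep]
```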

We may also see columns that are non-zero but constant, or almost constant, with only a few non-zero values. For example, the last column has only one non-zero value.

In [79]:

spartans.non_constant_index(c,axis=0, threshold=1,method='nnz') # Faster

Out[79]:

array([ True,  True, False,  True, False])

In [80]:

spartans.non_constant_index(c,axis=0,threshold=1/7,method='variance')

Out[80]:

array([ True,  True, False,  True, False])
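To make the two criteria concrete, here is a rough scipy-only equivalent. It assumes that a column is kept when its count of non-zero values, or its variance, is strictly greater than the threshold. That reading is consistent with the outputs above, but it is an assumption about how spartans.non_constant_index treats its threshold.

```python
import numpy as np
from scipy import sparse

m = np.array([[1, -2, 0, 50, 1],
              [0, 0, 0, 100, 0],
              [1, 0, 0, 80, 0],
              [1, 4, 0, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 4, 0, 0, 0],
              [0, 0, 0, -50, 0]])
c = sparse.csr_matrix(m, dtype=np.float64)

# Criterion 1: more than `threshold` non-zero values in the column.
# This only counts stored entries, so it is cheap.
nnz_mask = c.getnnz(axis=0) > 1

# Criterion 2: population variance above `threshold`,
# using Var(X) = E[X^2] - (E[X])^2.
ex = np.asarray(c.mean(axis=0)).ravel()
ex2 = np.asarray(c.power(2).mean(axis=0)).ravel()
var_mask = (ex2 - ex ** 2) > 1 / 7
```

The count-based check never touches the values, which is one plausible reason it is marked as the faster method above.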

Masking

One of the biggest problems when working with sparse matrices is null values. Many algorithms confuse the native zeros of the matrix with missing values. In spartans we have masking. Masking creates an accompanying boolean sparse matrix that marks the missing values. We can then use the mask in later computations, or stack it together with the original matrix if we believe this information is valuable.

In [58]:

c_with_mask.todense()

Out[58]:

matrix([[  1.,  -2.,   0.,  50.,   1.],
        [  0.,   0.,   0., 100.,   0.],
        [  1.,   0.,   0.,  80.,  nan],
        [  1.,   4.,   0.,   0.,   0.],
        [  0.,   0.,   0.,   0.,   0.],
        [  0.,   4.,   0.,   0.,   0.],
        [  0.,   0.,   0., -50.,   0.]], dtype=float32)

In [60]:

msk = spartans.make_nan_mask(c_with_mask)
print(repr(msk))
print(msk)
print(msk.todense())

<7x5 sparse matrix of type '<class 'numpy.bool_'>'
	with 1 stored elements in Compressed Sparse Column format>
  (2, 4)	True
[[False False False False False]
 [False False False False False]
 [False False False False  True]
 [False False False False False]
 [False False False False False]
 [False False False False False]
 [False False False False False]]
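The idea can be sketched with scipy alone. The code below assumes the input is simply the earlier matrix with a NaN placed at position (2, 4); the names c_nan, mask, filled and stacked are illustrative, and this is not the Spartans implementation. Only the stored values are inspected, so nothing is densified.

```python
import numpy as np
from scipy import sparse

dense = np.array([[1, -2, 0, 50, 1],
                  [0, 0, 0, 100, 0],
                  [1, 0, 0, 80, np.nan],
                  [1, 4, 0, 0, 0],
                  [0, 0, 0, 0, 0],
                  [0, 4, 0, 0, 0],
                  [0, 0, 0, -50, 0]], dtype=np.float32)
# NaN is not equal to zero, so it is kept as a stored entry
c_nan = sparse.csr_matrix(dense)

# Find the stored entries that are NaN, using coordinate format
coo = c_nan.tocoo()
is_nan = np.isnan(coo.data)

# Boolean sparse matrix that is True only where a value is missing
mask = sparse.csc_matrix(
    (np.ones(is_nan.sum(), dtype=bool), (coo.row[is_nan], coo.col[is_nan])),
    shape=c_nan.shape,
)

# Replace NaN with 0 and append the mask as extra indicator columns
filled = c_nan.copy()
filled.data = np.nan_to_num(filled.data, nan=0.0)
filled.eliminate_zeros()
stacked = sparse.hstack([filled, mask.astype(filled.dtype)]).tocsr()
```

After stacking, a model sees a zero in the original column plus a 1 in the matching indicator column, so a missing value is no longer indistinguishable from a real zero.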

Conclusion

spartans is a great open-source package. It bridges the gap between the operations you are used to with dense arrays and sparse matrices. Its main functionality covers everyday mathematical tasks such as computing the variance, finding covariances, and creating a correlation matrix. Moreover, spartans lets you index your matrix by several conditions that may be of interest when preparing a dataset for machine learning tasks. It also enables you to differentiate between “real zeros” in your data and missing data.

Try Spartans: https://github.com/Sentinel-One/Spartans

This post was written by Dean Langsam, Data Scientist at SentinelOne.

You can follow Dean on his personal blog, www.deanla.com, or even better, join him! We’re hiring.