How SentinelOne optimizes Big Data at scale

  • When querying the data in presto, it retrieves the partition mapping from the Hive metastore, and lists the files in each partition. As mentioned above, S3 listing is eventually consistent, so in real-time we sometimes missed a few files in the list response. Listing in HDFS is deterministic and immediately returns all files.
  • S3 list API response is limited to 1000 entries. When listing a directory of large amounts of files, Hive executes several API requests, which cost time.
  • Set these parameters to optimize compacted files output:
  • Create two external tables — a source table with raw data and a destination table of compacted data — based on this template:
  • Add partitions to the Hive store, for the source and destination tables to point to the correct location:
  • Finally, run the insert magical command, that will do the compaction:
  • Each big partition had 200 GB on disk, which in memory is in fact 2T of raw, uncompressed data every day. Compaction is done on uncompressed data, so holding it in memory through the entire process of compaction required huge clusters.
  • Running a few clusters with hundreds of nodes is quite expensive. First, we ran with spots to reduce costs, but as our cluster grew, it became hard to get a lot of big nodes for several hours. At peak, one cluster ran for 3 hours to get one big partition compacted. When we moved to on-demand machines, costs increased dramtaically.
  • Presto has no built-in fault-tolerance mechanism which is very disruptive when running on spots. If even one spot failed, the whole operation failed, and we had to run it all over again. This caused delays in switching to compacted data, which resulted in slower queries.
  • When small files are written to our S3 bucket from our event stream (1), we use AWS event notifications from S3 to SQS on object creation events (2).
  • Our service, the Compaction Manager, reads messages from SQS (3) and inserts S3 paths to the database (4).
  • Compaction Manager aggregates files ready to be compacted by internal logic (5), and assigns tasks to worker processes (6).
  • Workers compact files by internal logic and write big files as output (8).
  • The workers update the Compaction Manager on success or failure (7).
  • We control EVERYTHING. We own the logic of compaction, the size of output files and handle retries.
  • Our compaction is done continuously, allowing us to have fine-grained control over the amount of workers we trigger. Due to seasonality of the data, resources are utilized effectively, and our worker cluster is autoscaled over time according to the load.
  • Our new approach is fault-tolerant. Failure is not a deal breaker any more; the Manager can easily retry the failed batch without restarting the whole process.
  • Continuous compaction means that late files are handled as regular files, without special treatment.
  • We wrote the entire flow as a continuous compaction process that happens all the time, and thus requires less computation power and is much more robust to failures. We choose the batches of files to compact, so we control memory requirements (as opposed to Presto, where we load all data with a simple select query). We can use spots instead of on-demand machines and reduce costs dramatically.
  • This approach introduced new opportunities to implement internal logic for compacted files. We choose what files to compact and when. Thus, we can aggregate files by specific criteria, improving queries directly.



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
SentinelOne Tech

SentinelOne Tech

This is the tech blog of Sentinelone, a leading cybersecurity company. Follow us to learn about the awesome tech we build here.