Architectural Chokepoints

I have been thinking a bit on scalability lately – and I thought it might be an interesting exercise to examine a couple of the obvious places in a SIEM solution where scalability problems can be exposed. In a previous post I talked about scalability and EPS. The fact is there are multiple areas in a SIEM solution where the system may not scale and anyone thinking of a SIEM procurement should be thinking of scalability as a multi-dimensional beast.

First, all the logs you care about need to be dependably collected. Collection is where many vendors build EPS benchmarks – but generally the number of events per second is based on a small normalized packet. Event size varies widely depending on source so understand your typical log size, and calculate accordingly. The general mitigation strategies for collection are faster collection hardware (collection is usually a CPU intensive task), distributed collection architecture, and log filtering.

One thing to think off — log generation is often quite “bursty” in nature. You will, for instance, get a slew of logs generated on Monday mornings when staff arrive to work and start logging onto system resources. You should evaluate what happens if the system gets overloaded – do the events get lost, does the system crash?

As a mitigation strategy, Event filtering is sometimes pooh-poohed , however the reality is that 90% of traffic generated by most devices consists of completely useless (from a security perspective) status information. Volume varies widely depending on audit settings as well. A company generating 600,000 events per day on a windows network can easily generate 10-fold as much by increasing their audit settings slightly. . If you need the audit levels high, filtering is the easiest way to ease pressure on the entire down-stream log system.

Collection is a multi-step process also. Simply receiving an event is too simplistic a view. Resources are expended running policy and rules against the event stream. The more processing, the more system resources consumed. The data must be committed to the event store at some point so it needs to get written to disk. It is highly advisable to look at these as 3 separate activities and validate that the solution can handle your volume successfully.

A note on log storage for those who are considering buying an appliance with a fixed amount of onboard storage – be sure it is enough, and be sure to check out how easy it is to move off, retrieve and process records that have been moved to offline storage media. If your event volume eats up your disk you will likely be doing a lot of the moving off, moving back on activity. Also, some of the compliance standards like PCI require that logs must be stored online a certain amount of time. Here at Prism we solved that problem by allowing events to be stored anywhere on the file system, but most appliances do not afford you that luxury.

Now let’s flip our attention over to the analytics and reporting activities. This is yet another important aspect of scalability that is often ignored. If a system can process 10 million events per minute but takes 10 hours to run a simple query you probably are going to have upset users and a non-viable solution. And what happens to the collection throughput above when a bunch of people are running reports? Often a single user running ad-hoc reports is just fine, put a couple on and you are in trouble.

A remediation strategy here is to look for a solution that can offload the reporting and analytics to another machine so as not to impact the aggregation, correlation and storage steps. If you don’t have that capability absolutely press the vendor for performance metrics if reports and collection are done on the same hardware.

– Steve Lafferty