diff --git a/docs/configuration.md b/docs/configuration.md index 02c2d18..aa91a1a 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -83,4 +83,26 @@ redeploy the updated service using the command: - /opt/sc4s/default/compliance_meta_by_source.csv:/opt/syslog-ng/etc/context-local/compliance_meta_by_source.csv - /opt/sc4s/default/compliance_meta_by_source.conf:/opt/syslog-ng/etc/context-local/compliance_meta_by_source.conf `` +## Data Durability - Local Disk Buffer Configuration +SC4S provides capability to minimize the number of lost events if the connection to all the Splunk Indexers goes down. This capability utilizes the disk buffering feature of Syslog-ng. SC4S receives a response from the Splunk HTTP Event Collector (HEC) when a message is received successfully. If a confirmation message from the HEC endpoint is not received (or a “server busy” reply, such as a “503” is sent), the load balancer will try the next HEC endpoint in the pool. If all pool members are exhausted (such as would occur if there were a full network outage to the HEC endpoints), events will queue to the local disk buffer on the SC4S Linux host. SC4S will continue attempting to send the failed events while it buffers all new incoming events to disk. If the disk space allocated to disk buffering fills up then SC4S will stop accepting new events and subsequent events will be lost. Once SC4S gets confirmation that events are again being received by one or more indexers, events will then stream from the buffer using FIFO queueing. The number of events in the disk buffer will reduce as long as the incoming event volume is less than the maximum SC4S (with the disk buffer in the path) can handle. When all events have been emptied from the disk buffer, SC4S will resume streaming events directly to Splunk. + +For more detail on the Syslog-ng behavior the documentation can be found here: https://www.syslog-ng.com/technical-documents/doc/syslog-ng-open-source-edition/3.22/administration-guide/55#TOPIC-1209280 + +SC4S has disk buffering enabled by default and it is strongly recommended that you keep it on, however this feature does have a performance cost. +Without disk buffering enabled SC4S can handle up to 345K EPS (800 bytes/event avg) +With “Normal” disk buffering enabled SC4S can handle up to 60K EPS (800 bytes/event avg) -- This is still a lot of data! + +To guard against data loss it is important to configure the appropriate type and amount of storage for SC4S disk buffering. To estimate the storage allocation its best to start with your estimated maximum events per second that each SC4S server will experience. Based on the maximum throughput of SC4S with disk buffering enabled, the conservative estimate for maximum events per second is 60K (however, you should use the maximum rate in your environment for this calculation, not the max rate SC4S can handle). Next is your average estimated event size based on your data sources. It is common industry practice to estimate log events as 800 bytes on average. And the final input to the sizing estimation would be the maximum length of connectivity downtime you want disk buffering to be able to handle. This measure is very much dependent on your risk tolerance. For example, to protect against a full day of lost connectivity from SC4S to all your indexers at maximum throughput the calculation would look like the following... + +60,000 EPS * 86400 seconds * 800 bytes = 3.77186 TB of storage + +To configure storage allocation for the SC4S disk buffering, do the following... +Edit the file /opt/sc4s/default/env_file +Add the SC4S_DEST_SPLUNK_HEC_DISKBUFF_DISKBUFSIZE variable to the file and set the value to the number of bytes based on your estimation (e.g. 4147200000000 in the example above) +Splunk does not recommend reducing the disk allocation below 500 GB +Restart SC4S + +Given that in a connectivity outage to the Indexers events will be saved and read from disk until the buffer is emptied, it is ideal to use the fastest type of storage available. For this reason, NVMe storage is recommended for SC4S disk buffering. + +It is best to design your deployment so that the disk buffer will drain after connectivity is restored to the Splunk Indexers (while incoming data continues at the same general rate). Since "your mileage may vary" with different combinations of data load, instance type, and disk subsystem performance, it is good practice to provision a box that performs twice as well as is required for your max EPS. This headroom will allow for rapid recovery after a connectivity outage.