Update troubleshooting docs

* Update troubleshooting docs * Refine gettingstarted runtime docs * Update default receive buffer to 16 MB (from 2 MB)
CSVD · Mar 23, 2020 · 1639208 · 1639208
1 parent ddab7c5
commit 1639208
Showing 8 changed files with 134 additions and 35 deletions.
diff --git a/docs/configuration.md b/docs/configuration.md
@@ -41,7 +41,7 @@ separately from that of the alternates below.
 | Variable | Values        | Description |
 |----------|---------------|-------------|
 | SC4S_DEST_GLOBAL_ALTERNATES | Comma or space-separated list of syslog-ng destinations | Send all sources to alternate destinations |
-| SC4S_DEST_<SOURCE>\_ALTERNATES | Comma or space-separated list of syslog-ng destiinations  | Send specific sources to alternate syslog-ng destinations, e.g. SC4S_DEST_CISCO_ASA_ALTERNATES |
+| SC4S_DEST_\<SOURCE\>_ALTERNATES | Comma or space-separated list of syslog-ng destinations  | Send specific sources to alternate syslog-ng destinations, e.g. SC4S_DEST_CISCO_ASA_ALTERNATES |
 
 ## SC4S Disk Buffer Configuration
 
@@ -102,7 +102,7 @@ and/or move them to an archival system to avoid exhaustion of disk space.
 | SC4S_SOURCE_TCP_MAX_CONNECTIONS | 2000 | Max number of TCP Connections |
 | SC4S_SOURCE_TCP_IW_SIZE | 20000000 | Initial Window size |
 | SC4S_SOURCE_TCP_FETCH_LIMIT | 2000 | Number of events to fetch from server buffer at once |
-| SC4S_SOURCE_UDP_SO_RCVBUFF | 425984 | UDP server buffer size in bytes |
+| SC4S_SOURCE_UDP_SO_RCVBUFF | 1703936 | UDP server buffer size in bytes. Make sure that the host OS kernel is configured [similarly](gettingstarted/index.md#Prerequisites). |
 | SC4S_SOURCE_STORE_RAWMSG | undefined or "no" | Store unprocessed "on the wire" raw message in the RAWMSG macro for use with the "fallback" sourcetype.  Do _not_ set this in production; substantial memory and disk overhead will result. Use for log path/filter development only. |
 
 ## Syslog Source TLS Certificate Configuration

diff --git a/docs/gettingstarted/docker-swarm-general.md b/docs/gettingstarted/docker-swarm-general.md
@@ -80,9 +80,10 @@ of SC4S for local configurations and context overrides. _Do not_ change the dire
 the files that are laid down; change (or add) only individual files if desired.  SC4S depends on the directory layout
 to read the local configurations properly.  See the notes below for which files will be preserved on restarts.
 
-    * In the `local/config` directory, there are example log path files (`lp-example.*`) and a filter (`example.conf`) in the
-appropriate subdirectories.  These should _not_ be used directly, but copied as examples for your own log path development.
-They _will_ get overwritten at each SC4S start.    
+    * In the `local/config/` directory there are four subdirectories that allow you to provide support for device types
+that are not provided out of the box in SC4S.  To get you started, there is an example log path template (`lp-example.conf.tmpl`)
+and a filter (`example.conf`) in the `log_paths` and `filters` subdirectories, respectively.  These should _not_ be used directly,
+but copied as templates for your own log path development.  They _will_ get overwritten at each SC4S start. 
 
     * In the `local/context` directory, if you change the "non-example" version of a file (e.g. `splunk_index.csv`) the changes
 will be preserved on a restart.  However, the "example" files _themselves_ (e.g. `splunk_index.csv.example`) will be updated

diff --git a/docs/gettingstarted/docker-swarm-rhel7.md b/docs/gettingstarted/docker-swarm-rhel7.md
@@ -88,9 +88,10 @@ of SC4S for local configurations and context overrides. _Do not_ change the dire
 the files that are laid down; change (or add) only individual files if desired.  SC4S depends on the directory layout
 to read the local configurations properly.  See the notes below for which files will be preserved on restarts.
 
-    * In the `local/config` directory, there are example log path files (`lp-example.*`) and a filter (`example.conf`) in the
-appropriate subdirectories.  These should _not_ be used directly, but copied as examples for your own log path development.
-They _will_ get overwritten at each SC4S start.    
+    * In the `local/config/` directory there are four subdirectories that allow you to provide support for device types
+that are not provided out of the box in SC4S.  To get you started, there is an example log path template (`lp-example.conf.tmpl`)
+and a filter (`example.conf`) in the `log_paths` and `filters` subdirectories, respectively.  These should _not_ be used directly,
+but copied as templates for your own log path development.  They _will_ get overwritten at each SC4S start.  
 
     * In the `local/context` directory, if you change the "non-example" version of a file (e.g. `splunk_index.csv`) the changes
 will be preserved on a restart.  However, the "example" files _themselves_ (e.g. `splunk_index.csv.example`) will be updated

diff --git a/docs/gettingstarted/docker-systemd-general.md b/docs/gettingstarted/docker-systemd-general.md
@@ -86,9 +86,10 @@ of SC4S for local configurations and context overrides. _Do not_ change the dire
 the files that are laid down; change (or add) only individual files if desired.  SC4S depends on the directory layout
 to read the local configurations properly.  See the notes below for which files will be preserved on restarts.
 
-    * In the `local/config` directory, there are example log path files (`lp-example.*`) and a filter (`example.conf`) in the
-appropriate subdirectories.  These should _not_ be used directly, but copied as examples for your own log path development.
-They _will_ get overwritten at each SC4S start.    
+    * In the `local/config/` directory there are four subdirectories that allow you to provide support for device types
+that are not provided out of the box in SC4S.  To get you started, there is an example log path template (`lp-example.conf.tmpl`)
+and a filter (`example.conf`) in the `log_paths` and `filters` subdirectories, respectively.  These should _not_ be used directly,
+but copied as templates for your own log path development.  They _will_ get overwritten at each SC4S start.  
 
     * In the `local/context` directory, if you change the "non-example" version of a file (e.g. `splunk_index.csv`) the changes
 will be preserved on a restart.  However, the "example" files _themselves_ (e.g. `splunk_index.csv.example`) will be updated

diff --git a/docs/gettingstarted/index.md b/docs/gettingstarted/index.md
@@ -23,7 +23,7 @@ instance in the same VLAN as the source device.
 environment.
 * Avoid TCP except where the source is unable to contain the event to a single UDP packet.
 * Avoid TLS except where the event may cross an untrusted network.
-* Plan for appropriately sized hardware (see)[performance.md]
+* Plan for [appropriately sized hardware](../performance.md)
 
 
 ## Implementation
@@ -61,6 +61,13 @@ session.  Alternatively, a list of HEC endpoint URLs can be configured in SC4S (
 recommended that SC4S traffic be sent to HEC endpoints configured directly on the indexers rather than an intermediate tier of HWFs. For deployments with 10 or fewer indexers where HEC is used exclusively for syslog, the recommendation is to use the native load balancing. In all other scenarios the recommendation is to use an external load balancer. If utilizing the native load balancing, be sure to update the configuration when the number and/or names of the indexers change.
 - Create a HEC token that will be used by SC4S and ensure the token has access to place events in main, em_metrics, and all indexes used as
 event destinations.
+
+    * NOTE: It is recommended that the "Selected Indexes" on the token configuration page be left blank so that the token has access to
+_all_ indexes, including the `lastChanceIndex`.  If this list is populated, extreme care must be taken to keep it up to date, as an attempt to
+send data to an index not in this list will result in a `400` error from the HEC endpoint. Furthermore, the `lastChanceIndex` will _not_ be
+consulted in the event the index specified in the event is not configured on Splunk.  Keep in mind just _one_ bad message will "taint" the
+whole batch (by default 1000 events) and prevent the entire batch from being sent to Splunk.
+
 - Refer to [Splunk Cloud](http://docs.splunk.com/Documentation/Splunk/7.3.1/Data/UsetheHTTPEventCollector#Configure_HTTP_Event_Collector_on_managed_Splunk_Cloud)
 or [Splunk Enterprise](http://dev.splunk.com/view/event-collector/SP-CAAAE6Q) for specific HEC configuration instructions based on your
 Splunk type.
@@ -71,13 +78,33 @@ Splunk type.
 
 * Linux host with Docker (CE 19.x or greater with Docker Swarm) or Podman enabled, depending on runtime choice (below).
 * A network load balancer (NLB) configured for round robin. Note: Special consideration may be required when more advanced products are used. The optimal configuration of the load balancer will round robin each http POST request (not each connection).
+* The host linux OS receive buffer size should be tuned to match the sc4s default to avoid dropping events (packets) at the network level.
+The default receive buffer for sc4s is set to 16 MB for UDP traffic, which should be OK for most environments.  To set the host OS kernel to
+match this, edit `/etc/sysctl.conf` using the following whole-byte values corresponding to 16 MB:
+
+```bash
+net.core.rmem_default = 1703936
+net.core.rmem_max = 1703936
+```
+and apply to the kernel:
+```bash
+sysctl -p
+```
+* Ensure the kernel is not dropping packets by periodically monitoring the buffer with the command
+`netstat -su | grep "receive errors"`.
+* NOTE: Failure to account for high-volume traffic (especially UDP) by tuning the kernel will result in message loss, which can be _very_
+unpredictable and difficult to detect. See this helpful discussion in the syslog-ng
+[Professional Edition](https://www.syslog-ng.com/technical-documents/doc/syslog-ng-premium-edition/7.0.10/collecting-log-messages-from-udp-sources)
+documentation regarding tuning syslog-ng in particular (via the [SC4S_SOURCE_UDP_SO_RCVBUFF](../configuration.md#Syslog Source Configuration)
+environment variable in sc4s) as well as overall host kernel tuning.  The default value for kernel receive buffers in most distros is 2 MB,
+which has proven inadequate for many.
 
 #### Select a Container Runtime and SC4S Configuration
 
 | Container and Orchestration | Notes |
 |-----------------------------|-------|
-| [Podman + systemd](podman-systemd-general.md) | First choice for RedHat 7.x/8.x and CentOS, second choice for Debian and Ubuntu (packages provided via PPA) |
-| [Docker CE + systemd](docker-systemd-general.md) | First choice for Debian and Ubuntu; second choice for CentOS for those with limited existing Docker experience |
+| [Podman + systemd](podman-systemd-general.md) | First choice for RedHat 8.x and CentOS, second choice for Debian and Ubuntu (packages provided via PPA). |
+| [Docker CE + systemd](docker-systemd-general.md) | First choice for RHEL/CentOS 7.x, Debian and Ubuntu |
 | [Docker CE + Swarm](docker-swarm-general.md) | Option for Debian, Ubuntu, CentOS, and Desktop Docker desiring Docker Compose or Swarm orchestration |
 | [Docker CE + Swarm RHEL 7.7](docker-swarm-rhel7.md) | Option for RedHat 7.7 desiring Docker Compose or Swarm orchestration |
 | [Bring your own Environment](byoe-rhel7.md) | Option for RedHat 7.7 (CentOS 7) with SC4S configuration without containers |
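
The receive-buffer prerequisite added to `docs/gettingstarted/index.md` in this change can be spot-checked on a host with a short script. The sketch below is illustrative only: `bytes_to_mib` and `sc4s_rcvbuf` are hypothetical names, and it assumes a Linux host that exposes `/proc/sys/net/core/rmem_max`. The helper converts bytes to MiB so the configured value can be compared against the sizing described in the text (1703936 bytes works out to 1.625 MiB).

```bash
#!/usr/bin/env bash
# Convert a byte count to MiB (three decimals); awk does the floating-point division.
bytes_to_mib() {
    awk -v b="$1" 'BEGIN { printf "%.3f\n", b / 1048576 }'
}

# Value SC4S requests for its UDP receive buffer; 1703936 is the
# SC4S_SOURCE_UDP_SO_RCVBUFF default shown in this change.
sc4s_rcvbuf=1703936

# Compare the kernel ceiling with the SC4S value (skipped where /proc is unavailable).
if [ -r /proc/sys/net/core/rmem_max ]; then
    kernel_max=$(cat /proc/sys/net/core/rmem_max)
    echo "kernel rmem_max: ${kernel_max} bytes ($(bytes_to_mib "$kernel_max") MiB)"
    if [ "$kernel_max" -lt "$sc4s_rcvbuf" ]; then
        echo "WARNING: rmem_max is below the SC4S receive buffer (${sc4s_rcvbuf} bytes)"
    fi
fi
```

Running it before and after `sysctl -p` shows whether the new kernel values took effect; the `netstat -su | grep "receive errors"` check from the docs remains the way to see whether packets are actually being dropped.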

diff --git a/docs/gettingstarted/podman-systemd-general.md b/docs/gettingstarted/podman-systemd-general.md
@@ -1,4 +1,9 @@
 
+# WARNING:  Do _not_ use Podman with RHEL/CentOS 7.x or earlier!
+
+There have been cases where UDP packet loss is noted when Podman is used with RHEL/CentOS 7.x versions.  Stay tuned; the cause is
+currently unknown.
+
 # Install podman
 
 Refer to [Installation](https://podman.io/getting-started/installation)
@@ -68,9 +73,10 @@ of SC4S for local configurations and context overrides. _Do not_ change the dire
 the files that are laid down; change (or add) only individual files if desired.  SC4S depends on the directory layout
 to read the local configurations properly.  See the notes below for which files will be preserved on restarts.
 
-    * In the `local/config` directory, there are example log path files (`lp-example.*`) and a filter (`example.conf`) in the
-appropriate subdirectories.  These should _not_ be used directly, but copied as examples for your own log path development.
-They _will_ get overwritten at each SC4S start.    
+    * In the `local/config/` directory there are four subdirectories that allow you to provide support for device types
+that are not provided out of the box in SC4S.  To get you started, there is an example log path template (`lp-example.conf.tmpl`)
+and a filter (`example.conf`) in the `log_paths` and `filters` subdirectories, respectively.  These should _not_ be used directly,
+but copied as templates for your own log path development.  They _will_ get overwritten at each SC4S start.
 
     * In the `local/context` directory, if you change the "non-example" version of a file (e.g. `splunk_index.csv`) the changes
 will be preserved on a restart.  However, the "example" files _themselves_ (e.g. `splunk_index.csv.example`) will be updated

diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md
@@ -2,41 +2,104 @@
 
 ## General
 
+Prior to production deployment, it is easier to gauge proper operation outside of the systemd startup environment.  systemctl/systemd
+make it difficult to see the error output of problematic services, so rather than "fight it" there, it's best to confirm proper
+operation directly on the CLI.
+
 To test the container outside of the systemd startup environment, you can run the following to test the syntax
-of the container.  These commands assume the local mounted directory is set up as shown in the gettingstarted
-examples (and omits the disk buffer mount):
+of the container.  These commands assume the local mounted directories are set up as shown in the gettingstarted
+examples:
 
-```
-/usr/bin/docker run --env-file=/opt/sc4s/env_file -v "/opt/sc4s/local:/opt/syslog-ng/etc/conf.d/local:z" --name SC4S_preflight --rm splunk/scs:latest -s
+```bash
+/usr/bin/podman run -p 514:514 -p 514:514/udp -p 6514:6514 -p 5000-5020:5000-5020 -p 5000-5020:5000-5020/udp \
+    --env-file=/opt/sc4s/env_file \
+    -v splunk-sc4s-var:/opt/syslog-ng/var \
+    -v /opt/sc4s/local:/opt/syslog-ng/etc/conf.d/local:z \
+    -v /opt/sc4s/archive:/opt/syslog-ng/var/archive:z \
+    --name SC4S_preflight \
+    --rm splunk/scs:latest -s
 ```
 
 and you can run
 
-```
-/usr/bin/docker run --env-file=/opt/sc4s/env_file -v "/opt/sc4s/local:/opt/syslog-ng/etc/conf.d/local:z" --name SC4S --rm splunk/scs:latest
+```bash
+/usr/bin/podman run -p 514:514 -p 514:514/udp -p 6514:6514 -p 5000-5020:5000-5020 -p 5000-5020:5000-5020/udp \
+    --env-file=/opt/sc4s/env_file \
+    -v splunk-sc4s-var:/opt/syslog-ng/var \
+    -v /opt/sc4s/local:/opt/syslog-ng/etc/conf.d/local:z \
+    -v /opt/sc4s/archive:/opt/syslog-ng/var/archive:z \
+    --name SC4S \
+    --rm splunk/scs:latest
 ```
 
-to test the final image.  These commands can help with container errors that are hidden in the systemd process.  If you
-are using podman, substitute "podman" for "docker" for the container runtime command above.
+to test the final image.  If you are using docker, substitute "docker" for "podman" in the container runtime commands above.
 
 ### Verification of TLS Server
 
-To verify the correct configuration of the TLS server use the following command. Replace the IP, FQDN, and port as appropriate
+To verify the correct configuration of the TLS server use the following command. Use `podman` or `docker` and replace the IP, FQDN,
+and port as appropriate:
+
+```bash
+<podman|docker> run -ti drwetter/testssl.sh --severity MEDIUM --ip 127.0.0.1 selfsigned.example.com:6510
+```
+
+## Validating HEC/token issues (AKA "No data in Splunk")
+
+The first things to check are the container logs themselves, where stdout from the underlying syslog-ng is written by default.  To do this,
+run:
 
-* Docker
+```bash
+/usr/bin/podman logs SC4S
 ```
-docker run -ti drwetter/testssl.sh --severity MEDIUM --ip 127.0.0.1 selfsigned.example.com:6510
+
+and note the output.  You may see entries similar to these:
+```
+Mar 16 19:00:06 b817af4e89da syslog-ng[1]: Server returned with a 4XX (client errors) status code, which means we are not authorized or the URL is not found.; url='https://splunk-instance.com:8088/services/collector/event', status_code='400', driver='d_hec#0', location='/opt/syslog-ng/etc/conf.d/destinations/splunk_hec.conf:2:5'
+Mar 16 19:00:06 b817af4e89da syslog-ng[1]: Server disconnected while preparing messages for sending, trying again; driver='d_hec#0', location='/opt/syslog-ng/etc/conf.d/destinations/splunk_hec.conf:2:5', worker_index='4', time_reopen='10', batch_size='1000'
 ```
+This is an indication that the standard `d_hec` destination in syslog-ng (which is the route to Splunk) is being rejected by the HEC endpoint.
+A `400` error (not 404) is normally caused by an index that has not been created on the Splunk side, and is a common occurrence in new
+installations.  This can present a serious problem, as just _one_ bad index will "taint" the entire batch (in this case, 1000 events) and
+prevent _any_ of them from being sent to Splunk.  _It is imperative that the container logs be free of these kinds of errors in production._
+
+### Enabling the Alternate Debug Destination
+
+To help debug why the `400` errors are occurring, it is helpful to enable an alternate destination for syslog traffic that will write
+the contents of the full JSON payload that is intended to be sent to Splunk via HEC.  This destination will contain each event, repackaged
+as a `curl` command that can be run directly on the command line to see what the response from the HEC endpoint is.  To do this, set
+`SC4S_DEST_GLOBAL_ALTERNATES=d_hec_debug` in the `env_file` and restart sc4s.  When set, all data destined for Splunk will also be written to
+`/opt/sc4s/archive/debug`, and will be further categorized in subdirectories by sourcetype.  Here are the things to check:
 
-* Podman
+* In `/opt/sc4s/archive/debug`, you will see directories for each sourcetype that sc4s has collected. If you recognize any that you
+don't expect, check to see that the index is created in Splunk, or that a `lastChanceIndex` is created and enabled.  This is the
+cause for almost _all_ `400` errors.
+* If you continue to the individual log entries in these directories, you will see entries of the form
+```bash
+curl -k -u "sc4s HEC debug:a778f63a-5dff-4e3c-a72c-a03183659e94" "https://splunk.smg.aws:8088/services/collector/event" -d '{"time":"1584556114.271","sourcetype":"sc4s:events","source":"SC4S:s_internal","index":"main","host":"e3563b0ea5d8","fields":{"sc4s_syslog_severity":"notice","sc4s_syslog_facility":"syslog","sc4s_log_host":"e3563b0ea5d8","sc4s_fromhostip":"127.0.0.1"},"event":"syslog-ng starting up; version='3.25.1'"}'
 ```
-podman run -ti drwetter/testssl.sh --severity MEDIUM --ip 127.0.0.1 selfsigned.example.com:6510
+* These commands, with minimal modifications (e.g. multiple URLs specified or elements that need shell escapes) can be run directly on the
+command line to determine what, exactly, the HEC endpoint is returning.  This can be used to refine the index or other parameter to correct the
+problem.
+
+## "Exec" into the container
+
+You can confirm how the templating process created the actual syslog-ng config files that are in use by "exec'ing in" to the container
+and navigating the syslog-ng config filesystem directly.  To do this, run
+```bash
+/usr/bin/podman exec -it SC4S /bin/bash
 ```
+and navigate to `/opt/syslog-ng/etc/` to see the actual config files in use.  If you are adept with container operations and syslog-ng
+itself, you can also modify files directly and reload syslog-ng with the command `kill -1 1` in the container.  This is an advanced topic
+and further help can be obtained via the GitHub issue tracker and Slack channels.
 
-## Syslog-ng Metrics 
+## Run the container with a null entrypoint (Advanced!)
 
-## Syslog-NG Events
+You can run the container without the usual entrypoint shell script by executing this command (modified to suit your environment):
 
-## Container Events
+```bash
+/usr/bin/podman run -p 514:514 -p 514:514/udp -p 5000-5020:5000-5020 -p 5000-5020:5000-5020/udp --entrypoint=tail --env-file=/opt/sc4s/env_file -v /opt/sc4s/local:/opt/syslog-ng/etc/conf.d/local:z --name SC4S --rm splunk/scs:latest -f /dev/null
+```
+From there, you can "exec" into the container (above) and run the `/entrypoint.sh` script by hand (or a subset of it, such as everything
+but syslog-ng) and have complete control over the templating and underlying syslog-ng process.  Again, this is an advanced topic but can be
+very useful for low-level troubleshooting.
 
-# Monitoring
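
The HEC validation steps added to `docs/troubleshooting.md` come down to looking for `status_code='400'` entries in the container logs. A minimal sketch of that check follows; the function names `count_hec_400` and `hec_400_drivers` are hypothetical helpers, not part of SC4S, and they only assume the log line format shown in the sample output in this change.

```bash
#!/usr/bin/env bash
# Count log lines that report an HTTP 400 from the HEC endpoint.
# Reads log text on stdin and always prints a number (0 when nothing matches).
count_hec_400() {
    grep -c "status_code='400'" || true
}

# Print the distinct destination drivers named on those lines (e.g. driver='d_hec#0').
hec_400_drivers() {
    grep "status_code='400'" | grep -o "driver='[^']*'" | sort -u
}
```

One way to use it would be `/usr/bin/podman logs SC4S 2>&1 | count_hec_400`. A non-zero count suggests that at least one batch was rejected and that the index configuration on the Splunk side deserves a look.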