Update troubleshooting docs
* Update troubleshooting docs
* Refine gettingstarted runtime docs
* Update default receive buffer to 16 MB (from 2 MB)
Mark Bonsack committed Mar 23, 2020
1 parent ddab7c5 commit 1639208
Showing 8 changed files with 134 additions and 35 deletions.
4 changes: 2 additions & 2 deletions docs/configuration.md
@@ -41,7 +41,7 @@ separately from that of the alternates below.
| Variable | Values | Description |
|----------|---------------|-------------|
| SC4S_DEST_GLOBAL_ALTERNATES | Comma or space-separated list of syslog-ng destinations | Send all sources to alternate destinations |
| SC4S_DEST_<SOURCE>\_ALTERNATES | Comma or space-separated list of syslog-ng destinations | Send specific sources to alternate syslog-ng destinations, e.g. SC4S_DEST_CISCO_ASA_ALTERNATES |
| SC4S_DEST_\<SOURCE\>_ALTERNATES | Comma or space-separated list of syslog-ng destinations | Send specific sources to alternate syslog-ng destinations, e.g. SC4S_DEST_CISCO_ASA_ALTERNATES |

## SC4S Disk Buffer Configuration

@@ -102,7 +102,7 @@ and/or move them to an archival system to avoid exhaustion of disk space.
| SC4S_SOURCE_TCP_MAX_CONNECTIONS | 2000 | Max number of TCP Connections |
| SC4S_SOURCE_TCP_IW_SIZE | 20000000 | Initial Window size |
| SC4S_SOURCE_TCP_FETCH_LIMIT | 2000 | Number of events to fetch from server buffer at once |
| SC4S_SOURCE_UDP_SO_RCVBUFF | 425984 | UDP server buffer size in bytes |
| SC4S_SOURCE_UDP_SO_RCVBUFF | 1703936 | UDP server buffer size in bytes. Make sure that the host OS kernel is configured [similarly](gettingstarted/index.md#Prerequisites). |
| SC4S_SOURCE_STORE_RAWMSG | undefined or "no" | Store unprocessed "on the wire" raw message in the RAWMSG macro for use with the "fallback" sourcetype. Do _not_ set this in production; substantial memory and disk overhead will result. Use for log path/filter development only. |

## Syslog Source TLS Certificate Configuration
7 changes: 4 additions & 3 deletions docs/gettingstarted/docker-swarm-general.md
@@ -80,9 +80,10 @@ of SC4S for local configurations and context overrides. _Do not_ change the dire
the files that are laid down; change (or add) only individual files if desired. SC4S depends on the directory layout
to read the local configurations properly. See the notes below for which files will be preserved on restarts.

* In the `local/config` directory, there are example log path files (`lp-example.*`) and a filter (`example.conf`) in the
appropriate subdirectories. These should _not_ be used directly, but copied as examples for your own log path development.
They _will_ get overwritten at each SC4S start.
* In the `local/config/` directory there are four subdirectories that allow you to provide support for device types
that are not provided out of the box in SC4S. To get you started, there is an example log path template (`lp-example.conf.tmpl`)
and a filter (`example.conf`) in the `log_paths` and `filters` subdirectories, respectively. These should _not_ be used directly,
but copied as templates for your own log path development. They _will_ get overwritten at each SC4S start.

* In the `local/context` directory, if you change the "non-example" version of a file (e.g. `splunk_index.csv`) the changes
will be preserved on a restart. However, the "example" files _themselves_ (e.g. `splunk_index.csv.example`) will be updated
7 changes: 4 additions & 3 deletions docs/gettingstarted/docker-swarm-rhel7.md
@@ -88,9 +88,10 @@ of SC4S for local configurations and context overrides. _Do not_ change the dire
the files that are laid down; change (or add) only individual files if desired. SC4S depends on the directory layout
to read the local configurations properly. See the notes below for which files will be preserved on restarts.

* In the `local/config` directory, there are example log path files (`lp-example.*`) and a filter (`example.conf`) in the
appropriate subdirectories. These should _not_ be used directly, but copied as examples for your own log path development.
They _will_ get overwritten at each SC4S start.
* In the `local/config/` directory there are four subdirectories that allow you to provide support for device types
that are not provided out of the box in SC4S. To get you started, there is an example log path template (`lp-example.conf.tmpl`)
and a filter (`example.conf`) in the `log_paths` and `filters` subdirectories, respectively. These should _not_ be used directly,
but copied as templates for your own log path development. They _will_ get overwritten at each SC4S start.

* In the `local/context` directory, if you change the "non-example" version of a file (e.g. `splunk_index.csv`) the changes
will be preserved on a restart. However, the "example" files _themselves_ (e.g. `splunk_index.csv.example`) will be updated
7 changes: 4 additions & 3 deletions docs/gettingstarted/docker-systemd-general.md
@@ -86,9 +86,10 @@ of SC4S for local configurations and context overrides. _Do not_ change the dire
the files that are laid down; change (or add) only individual files if desired. SC4S depends on the directory layout
to read the local configurations properly. See the notes below for which files will be preserved on restarts.

* In the `local/config` directory, there are example log path files (`lp-example.*`) and a filter (`example.conf`) in the
appropriate subdirectories. These should _not_ be used directly, but copied as examples for your own log path development.
They _will_ get overwritten at each SC4S start.
* In the `local/config/` directory there are four subdirectories that allow you to provide support for device types
that are not provided out of the box in SC4S. To get you started, there is an example log path template (`lp-example.conf.tmpl`)
and a filter (`example.conf`) in the `log_paths` and `filters` subdirectories, respectively. These should _not_ be used directly,
but copied as templates for your own log path development. They _will_ get overwritten at each SC4S start.

* In the `local/context` directory, if you change the "non-example" version of a file (e.g. `splunk_index.csv`) the changes
will be preserved on a restart. However, the "example" files _themselves_ (e.g. `splunk_index.csv.example`) will be updated
33 changes: 30 additions & 3 deletions docs/gettingstarted/index.md
@@ -23,7 +23,7 @@ instance in the same VLAN as the source device.
environment.
* Avoid TCP except where the source is unable to contain the event to a single UDP packet.
* Avoid TLS except where the event may cross an untrusted network.
* Plan for appropriately sized hardware (see)[performance.md]
* Plan for [appropriately sized hardware](../performance.md)


## Implementation
@@ -61,6 +61,13 @@ session. Alternatively, a list of HEC endpoint URLs can be configured in SC4S (
recommended that SC4S traffic be sent to HEC endpoints configured directly on the indexers rather than an intermediate tier of HWFs. For deployments with 10 or fewer indexers where HEC is used exclusively for syslog, the recommendation is to use the native load balancing. In all other scenarios the recommendation is to use an external load balancer. If utilizing the native load balancing, be sure to update the configuration when the number and/or names of the indexers change.
- Create a HEC token that will be used by SC4S and ensure the token has access to place events in main, em_metrics, and all indexes used as
event destinations.

* NOTE: It is recommended that the "Selected Indexes" on the token configuration page be left blank so that the token has access to
_all_ indexes, including the `lastChanceIndex`. If this list is populated, extreme care must be taken to keep it up to date, as an attempt to
send data to an index not in this list will result in a `400` error from the HEC endpoint. Furthermore, the `lastChanceIndex` will _not_ be
consulted in the event the index specified in the event is not configured on Splunk. Keep in mind just _one_ bad message will "taint" the
whole batch (by default 1000 events) and prevent the entire batch from being sent to Splunk.

- Refer to [Splunk Cloud](http://docs.splunk.com/Documentation/Splunk/7.3.1/Data/UsetheHTTPEventCollector#Configure_HTTP_Event_Collector_on_managed_Splunk_Cloud)
or [Splunk Enterprise](http://dev.splunk.com/view/event-collector/SP-CAAAE6Q) for specific HEC configuration instructions based on your
Splunk type.
@@ -71,13 +78,33 @@ Splunk type.

* Linux host with Docker (CE 19.x or greater with Docker Swarm) or Podman enabled, depending on runtime choice (below).
* A network load balancer (NLB) configured for round robin. Note: Special consideration may be required when more advanced products are used. The optimal configuration of the load balancer will round robin each http POST request (not each connection).
* The host Linux OS receive buffer size should be tuned to match the sc4s default to avoid dropping events (packets) at the network level.
The default receive buffer for sc4s is set to 16 MB for UDP traffic, which should be adequate for most environments. To set the host OS kernel to
match this, edit `/etc/sysctl.conf` using the following whole-byte values corresponding to 16 MB:

```bash
net.core.rmem_default = 1703936
net.core.rmem_max = 1703936
```
and apply to the kernel:
```bash
sysctl -p
```
* Ensure the kernel is not dropping packets by periodically monitoring the buffer with the command
`netstat -su | grep "receive errors"`.
* NOTE: Failure to account for high-volume traffic (especially UDP) by tuning the kernel will result in message loss, which can be _very_
unpredictable and difficult to detect. See this helpful discussion in the syslog-ng
[Professional Edition](https://www.syslog-ng.com/technical-documents/doc/syslog-ng-premium-edition/7.0.10/collecting-log-messages-from-udp-sources)
documentation regarding tuning syslog-ng in particular (via the [SC4S_SOURCE_UDP_SO_RCVBUFF](../configuration.md#Syslog Source Configuration)
environment variable in sc4s) as well as overall host kernel tuning. The default value for kernel receive buffers in most distros is 2 MB,
which has proven inadequate for many environments.
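
The host check described above could be scripted roughly as follows. This is only a sketch: the function name `rcvbuf_status` is hypothetical and not part of SC4S, and the target is taken as a parameter because the byte value quoted above (1703936) works out to roughly 1.6 MiB, whereas a full 16 MiB would be 16777216 bytes. Use whichever figure matches your `SC4S_SOURCE_UDP_SO_RCVBUFF` setting.

```bash
# Hypothetical helper (not part of SC4S): compare a kernel receive-buffer
# value with the size SC4S is configured to request. Both are in bytes.
rcvbuf_status() {
    current="$1"
    target="$2"
    if [ "$current" -ge "$target" ]; then
        echo "ok"
    else
        echo "too small: ${current} < ${target}"
    fi
}

# Target in bytes; defaults to the value quoted in the text above.
target="${SC4S_SOURCE_UDP_SO_RCVBUFF:-1703936}"

# /proc/sys/net/core/* is the Linux view of the net.core.* sysctl settings.
for key in rmem_default rmem_max; do
    file="/proc/sys/net/core/${key}"
    if [ -r "$file" ]; then
        echo "net.core.${key}: $(rcvbuf_status "$(cat "$file")" "$target")"
    fi
done
```

Running this before and after `sysctl -p` gives a quick confirmation that the new values took effect.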

#### Select a Container Runtime and SC4S Configuration

| Container and Orchestration | Notes |
|-----------------------------|-------|
| [Podman + systemd](podman-systemd-general.md) | First choice for RedHat 7.x/8.x and CentOS, second choice for Debian and Ubuntu (packages provided via PPA) |
| [Docker CE + systemd](docker-systemd-general.md) | First choice for Debian and Ubuntu; second choice for CentOS for those with limited existing Docker experience |
| [Podman + systemd](podman-systemd-general.md) | First choice for RedHat 8.x and CentOS, second choice for Debian and Ubuntu (packages provided via PPA). |
| [Docker CE + systemd](docker-systemd-general.md) | First choice for RHEL/CentOS 7.x, Debian and Ubuntu |
| [Docker CE + Swarm](docker-swarm-general.md) | Option for Debian, Ubuntu, CentOS, and Desktop Docker desiring Docker Compose or Swarm orchestration |
| [Docker CE + Swarm RHEL 7.7](docker-swarm-rhel7.md) | Option for RedHat 7.7 desiring Docker Compose or Swarm orchestration |
| [Bring your own Environment](byoe-rhel7.md) | Option for RedHat 7.7 (CentOS 7) with SC4S configuration without containers |
12 changes: 9 additions & 3 deletions docs/gettingstarted/podman-systemd-general.md
@@ -1,4 +1,9 @@

# WARNING: Do _not_ use Podman with RHEL/CentOS 7.x or earlier!

There have been cases where UDP packet loss is noted when Podman is used with RHEL/CentOS 7.x versions. Stay tuned; the cause is
currently unknown.

# Install podman

Refer to [Installation](https://podman.io/getting-started/installation)
@@ -68,9 +73,10 @@ of SC4S for local configurations and context overrides. _Do not_ change the dire
the files that are laid down; change (or add) only individual files if desired. SC4S depends on the directory layout
to read the local configurations properly. See the notes below for which files will be preserved on restarts.

* In the `local/config` directory, there are example log path files (`lp-example.*`) and a filter (`example.conf`) in the
appropriate subdirectories. These should _not_ be used directly, but copied as examples for your own log path development.
They _will_ get overwritten at each SC4S start.
* In the `local/config/` directory there are four subdirectories that allow you to provide support for device types
that are not provided out of the box in SC4S. To get you started, there is an example log path template (`lp-example.conf.tmpl`)
and a filter (`example.conf`) in the `log_paths` and `filters` subdirectories, respectively. These should _not_ be used directly,
but copied as templates for your own log path development. They _will_ get overwritten at each SC4S start.

* In the `local/context` directory, if you change the "non-example" version of a file (e.g. `splunk_index.csv`) the changes
will be preserved on a restart. However, the "example" files _themselves_ (e.g. `splunk_index.csv.example`) will be updated
97 changes: 80 additions & 17 deletions docs/troubleshooting.md
@@ -2,41 +2,104 @@

## General

Prior to production deployment, it is easier to gauge proper operation outside of the systemd startup environment. systemctl/systemd
make it difficult to see the error output of problematic services, so rather than "fight it" there, it's best to confirm proper
operation directly on the CLI.

To test the container outside of the systemd startup environment, you can run the following to test the syntax
of the container. These commands assume the local mounted directory is set up as shown in the gettingstarted
examples (and omits the disk buffer mount):
of the container. These commands assume the local mounted directories are set up as shown in the gettingstarted
examples:

```
/usr/bin/docker run --env-file=/opt/sc4s/env_file -v "/opt/sc4s/local:/opt/syslog-ng/etc/conf.d/local:z" --name SC4S_preflight --rm splunk/scs:latest -s
```bash
/usr/bin/podman run -p 514:514 -p 514:514/udp -p 6514:6514 -p 5000-5020:5000-5020 -p 5000-5020:5000-5020/udp \
--env-file=/opt/sc4s/env_file \
-v splunk-sc4s-var:/opt/syslog-ng/var \
-v /opt/sc4s/local:/opt/syslog-ng/etc/conf.d/local:z \
-v /opt/sc4s/archive:/opt/syslog-ng/var/archive:z \
--name SC4S_preflight \
--rm splunk/scs:latest -s
```

and you can run

```
/usr/bin/docker run --env-file=/opt/sc4s/env_file -v "/opt/sc4s/local:/opt/syslog-ng/etc/conf.d/local:z" --name SC4S --rm splunk/scs:latest
```bash
/usr/bin/podman run -p 514:514 -p 514:514/udp -p 6514:6514 -p 5000-5020:5000-5020 -p 5000-5020:5000-5020/udp \
--env-file=/opt/sc4s/env_file \
-v splunk-sc4s-var:/opt/syslog-ng/var \
-v /opt/sc4s/local:/opt/syslog-ng/etc/conf.d/local:z \
-v /opt/sc4s/archive:/opt/syslog-ng/var/archive:z \
--name SC4S \
--rm splunk/scs:latest
```

to test the final image. These commands can help with container errors that are hidden in the systemd process. If you
are using podman, substitute "podman" for "docker" for the container runtime command above.
to test the final image. If you are using Docker, substitute "docker" for "podman" in the container runtime commands above.

### Verification of TLS Server

To verify the correct configuration of the TLS server use the following command. Replace the IP, FQDN, and port as appropriate
To verify the correct configuration of the TLS server use the following command. Use `podman` or `docker` and replace the IP, FQDN,
and port as appropriate:

```bash
<podman|docker> run -ti drwetter/testssl.sh --severity MEDIUM --ip 127.0.0.1 selfsigned.example.com:6510
```

## Validating HEC/token issues (AKA "No data in Splunk")

The first thing to check is the container logs themselves, where stdout from the underlying syslog-ng is written by default. To do this,
run:

* Docker
```bash
/usr/bin/podman logs SC4S
```
docker run -ti drwetter/testssl.sh --severity MEDIUM --ip 127.0.0.1 selfsigned.example.com:6510

and note the output. You may see entries similar to these:
```
Mar 16 19:00:06 b817af4e89da syslog-ng[1]: Server returned with a 4XX (client errors) status code, which means we are not authorized or the URL is not found.; url='https://splunk-instance.com:8088/services/collector/event', status_code='400', driver='d_hec#0', location='/opt/syslog-ng/etc/conf.d/destinations/splunk_hec.conf:2:5'
Mar 16 19:00:06 b817af4e89da syslog-ng[1]: Server disconnected while preparing messages for sending, trying again; driver='d_hec#0', location='/opt/syslog-ng/etc/conf.d/destinations/splunk_hec.conf:2:5', worker_index='4', time_reopen='10', batch_size='1000'
```
This is an indication that the standard `d_hec` destination in syslog-ng (which is the route to Splunk) is being rejected by the HEC endpoint.
A `400` error (not 404) is normally caused by an index that has not been created on the Splunk side, and is a common occurrence in new
installations. This can present a serious problem, as just _one_ bad index will "taint" the entire batch (in this case, 1000 events) and
prevent _any_ of them from being sent to Splunk. _It is imperative that the container logs be free of these kinds of errors in production._
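
One way to watch for these rejections is to count matching lines in the container logs. The sketch below assumes the log format shown in the sample entries above (a `status_code='4XX'` field); the function name `count_hec_client_errors` is hypothetical.

```bash
# Hypothetical helper: count syslog-ng log lines that report a 4XX status
# code from the HEC destination. Reads log text on stdin, prints the count.
count_hec_client_errors() {
    # grep -c prints 0 but exits non-zero when nothing matches;
    # "|| true" keeps the function's exit status clean in that case.
    grep -c "status_code='4[0-9][0-9]'" || true
}

# Typical use (assumes a running container named SC4S):
#   podman logs SC4S 2>&1 | count_hec_client_errors
```

A result of `0` is what you want to see in production; anything higher means at least one batch was rejected.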

### Enabling the Alternate Debug Destination

To help debug why the `400` errors are occurring, it is helpful to enable an alternate destination for syslog traffic that will write
the contents of the full JSON payload that is intended to be sent to Splunk via HEC. This destination will contain each event, repackaged
as a `curl` command that can be run directly on the command line to see what the response from the HEC endpoint is. To do this, set
`SC4S_DEST_GLOBAL_ALTERNATES=d_hec_debug` in the `env_file` and restart sc4s. When set, all data destined for Splunk will also be written to
`/opt/sc4s/archive/debug`, and will be further categorized in subdirectories by sourcetype. Here are the things to check:

* Podman
* In `/opt/sc4s/archive/debug`, you will see directories for each sourcetype that sc4s has collected. If you see any that you
don't expect, check to see that the index is created in Splunk, or that a `lastChanceIndex` is created and enabled. This is the
cause of almost _all_ `400` errors.
* If you continue to the individual log entries in these directories, you will see entries of the form
```bash
curl -k -u "sc4s HEC debug:a778f63a-5dff-4e3c-a72c-a03183659e94" "https://splunk.smg.aws:8088/services/collector/event" -d '{"time":"1584556114.271","sourcetype":"sc4s:events","source":"SC4S:s_internal","index":"main","host":"e3563b0ea5d8","fields":{"sc4s_syslog_severity":"notice","sc4s_syslog_facility":"syslog","sc4s_log_host":"e3563b0ea5d8","sc4s_fromhostip":"127.0.0.1"},"event":"syslog-ng starting up; version='3.25.1'"}'
```
podman run -ti drwetter/testssl.sh --severity MEDIUM --ip 127.0.0.1 selfsigned.example.com:6510
* These commands, with minimal modifications (e.g. multiple URLs specified or elements that need shell escapes) can be run directly on the
command line to determine what, exactly, the HEC endpoint is returning. This can be used to refine the index or other parameter to correct the
problem.
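
To get a quick overview of which sourcetypes have landed in the debug directory, something like the following may help. It is a sketch only: the function name `list_debug_sourcetypes` is hypothetical, and it assumes nothing about the layout beyond one subdirectory per sourcetype, as described above.

```bash
# Hypothetical helper: print each sourcetype subdirectory found under the
# debug directory, followed by the number of files it contains.
list_debug_sourcetypes() {
    debug_dir="$1"
    for dir in "$debug_dir"/*/; do
        # Skip the unexpanded pattern when the directory is empty or missing.
        [ -d "$dir" ] || continue
        count=$(find "$dir" -type f | wc -l | tr -d ' ')
        printf '%s %s\n' "$(basename "$dir")" "$count"
    done
}
```

Call it with the debug directory on the host as its only argument; an unexpected sourcetype name in the output is a hint that the corresponding index may be missing in Splunk.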

## "Exec" into the container

You can confirm how the templating process created the actual syslog-ng config files that are in use by "exec'ing in" to the container
and navigating the syslog-ng config filesystem directly. To do this, run
```bash
/usr/bin/podman exec -it SC4S /bin/bash
```
and navigate to `/opt/syslog-ng/etc/` to see the actual config files in use. If you are adept with container operations and syslog-ng
itself, you can also modify files directly and reload syslog-ng with the command `kill -1 1` in the container. This is an advanced topic
and further help can be obtained via the GitHub issue tracker and Slack channels.

## Syslog-ng Metrics
## Run the container with a null entrypoint (Advanced!)

## Syslog-NG Events
You can run the container without the usual entrypoint shell script by executing this command (modified to suit your environment):

## Container Events
```bash
/usr/bin/podman run -p 514:514 -p 514:514/udp -p 5000-5020:5000-5020 -p 5000-5020:5000-5020/udp \
    --entrypoint=tail \
    --env-file=/opt/sc4s/env_file \
    -v /opt/sc4s/local:/opt/syslog-ng/etc/conf.d/local:z \
    --name SC4S \
    --rm splunk/scs:latest -f /dev/null
```
From there, you can "exec" into the container (above) and run the `/entrypoint.sh` script by hand (or a subset of it, such as everything
but syslog-ng) and have complete control over the templating and underlying syslog-ng process. Again, this is an advanced topic but can be
very useful for low-level troubleshooting.

# Monitoring