using vector to send JSON-formatted nginx logs to ClickHouse

for one particular case i need to log, and later analyze, the POST bodies of HTTP requests passing through an nginx reverse proxy. ClickHouse will store the logs and make the analysis easier.

nginx configuration:

in /etc/nginx/conf.d/pQd-logformat.conf i’m defining a log format that produces ndjson (newline-delimited JSON):

log_format json_combined escape=json '{'
        '"ts":"$time_iso8601",'
        '"remote_addr":"$remote_addr",'
        '"request":"$request",'
        '"status": "$status",'
        '"body_bytes_sent":"$body_bytes_sent",'
        '"request_body":"$request_body",'
        '"request_time":"$request_time",'
        '"http_user_agent":"$http_user_agent"'
'}';

in the vhost definition, kept in /etc/nginx/sites-enabled/my.vhost.com, i’m referring to that format:

server {
        server_name my.vhost.com;
        access_log  /var/log/nginx/my.vhost.com-access.log json_combined;
        root /var/www/html;
        index index.html index.htm index.nginx-debian.html;

        location / {
                proxy_pass https://original.address/;
                proxy_connect_timeout 10s;
                proxy_read_timeout 300s;
        }

}

after a restart, nginx writes lines like these to /var/log/nginx/my.vhost.com-access.log:

{"ts":"2023-06-15T04:09:36+00:00","remote_addr":"10.12.40.5","request":"POST /something HTTP/1.1","status": "400","body_bytes_sent":"138","request_body":"some payload","request_time":"0.809","http_user_agent":""}
{"ts":"2023-06-15T04:09:42+00:00","remote_addr":"10.12.40.5","request":"POST /somethingelse HTTP/1.1","status": "400","body_bytes_sent":"138","request_body":"another payload","request_time":"0.798","http_user_agent":""}
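a quick way to double-check that every line really is valid JSON with the expected keys is a few lines of python. this is only a sketch and not part of the original setup; `check_ndjson` and `EXPECTED_KEYS` are names made up for illustration:

```python
import json

# keys defined in the json_combined log_format above
EXPECTED_KEYS = {
    "ts", "remote_addr", "request", "status",
    "body_bytes_sent", "request_body", "request_time", "http_user_agent",
}

def check_ndjson(lines):
    """Return (line_number, problem) pairs for lines that are not valid log records."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are ignored
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, "invalid JSON: " + exc.msg))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        missing = EXPECTED_KEYS - set(record)
        if missing:
            problems.append((number, "missing keys: " + ", ".join(sorted(missing))))
    return problems

# the two lines shown above
sample = [
    '{"ts":"2023-06-15T04:09:36+00:00","remote_addr":"10.12.40.5","request":"POST /something HTTP/1.1","status": "400","body_bytes_sent":"138","request_body":"some payload","request_time":"0.809","http_user_agent":""}',
    '{"ts":"2023-06-15T04:09:42+00:00","remote_addr":"10.12.40.5","request":"POST /somethingelse HTTP/1.1","status": "400","body_bytes_sent":"138","request_body":"another payload","request_time":"0.798","http_user_agent":""}',
]

print(check_ndjson(sample))  # an empty list means no problems were found
```

to check the real file, pass an open file object instead of `sample`, e.g. `check_ndjson(open("/var/log/nginx/my.vhost.com-access.log"))`.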

clickhouse setup:

CREATE DATABASE nginx;
CREATE TABLE nginx.log
(
    `ts` DateTime,
    `remote_addr` LowCardinality(String),
    `request` LowCardinality(String),
    `status` LowCardinality(String),
    `body_bytes_sent` UInt32,
    `request_body` String,
    `request_time` Float32,
    `http_user_agent` LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY ts
SETTINGS index_granularity = 8192;

CREATE USER nginx_log_importer IDENTIFIED WITH plaintext_password BY 'somepass';
GRANT INSERT ON nginx.log TO nginx_log_importer;

vector, which i’ll use to ship logs from the nginx log files to ClickHouse, talks to ClickHouse’s HTTP interface on TCP port 8123. i’m making sure the firewalls allow traffic from the server running nginx+vector to the one running ClickHouse.
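to verify the firewall rules from the nginx machine, something like the python sketch below can help. it relies on the `/ping` endpoint of ClickHouse’s HTTP interface, which answers `Ok.`; the function names are made up for illustration, and `addressOfClickhouseServer` is the same placeholder as in the vector config:

```python
import urllib.request

def ping_url(endpoint):
    """Build the URL of the /ping health check from the base endpoint."""
    return endpoint.rstrip("/") + "/ping"

def clickhouse_reachable(endpoint, timeout=5):
    """Return True if the HTTP interface answers /ping with 'Ok.'."""
    try:
        with urllib.request.urlopen(ping_url(endpoint), timeout=timeout) as response:
            return response.read().strip() == b"Ok."
    except OSError:
        # covers refused connections, timeouts and DNS failures
        return False

# usage, run on the machine with nginx and vector:
# print(clickhouse_reachable("http://addressOfClickhouseServer:8123"))
```

a `False` here most likely points at a firewall or listen-address problem rather than at vector itself.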

lastly, the setup of vector, running on the machine with nginx, kept in /etc/vector/vector.toml:

[sources.nginx_logs]
type = "file"
include = [ "/var/log/nginx/my.vhost.com-access.log" ]
read_from = "end"

[transforms.process]
type = "remap"
inputs = ["nginx_logs"]
source = '''
.=parse_json!(.message)
'''

[sinks.print]
type = "console"
inputs = ["process"]
encoding.codec = "json"

[sinks.clickhouse]
type = "clickhouse"
inputs = ["process"]
endpoint = "http://addressOfClickhouseServer:8123"
database = "nginx"
table = "log"
auth.strategy = "basic"
auth.user = "nginx_log_importer"
auth.password = "somepass"
compression = "none"
date_time_best_effort = true
skip_unknown_fields = true
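the remap transform is the least obvious part of this config, so here is a rough python emulation of what `.=parse_json!(.message)` does. this is not vector code, just a sketch: the file source wraps every log line in an event with fields such as `message`, `file`, `host`, `source_type` and `timestamp`, and assigning to `.` replaces the whole event with the parsed line. the metadata values and the shortened log line below are made up:

```python
import json

def remap(event):
    """Emulate `. = parse_json!(.message)`: the parsed line replaces the whole event."""
    # a line that is not valid JSON raises here; in vector the `!` variant
    # aborts processing of that event instead of passing it on
    return json.loads(event["message"])

# roughly what the file source emits for one log line (metadata values are made up)
event = {
    "file": "/var/log/nginx/my.vhost.com-access.log",
    "host": "nginx-host",
    "source_type": "file",
    "timestamp": "2023-06-15T04:09:37Z",
    "message": '{"ts":"2023-06-15T04:09:36+00:00","status": "400","request_body":"some payload"}',
}

record = remap(event)
print(sorted(record))  # ['request_body', 'status', 'ts']
```

after the transform only the keys from the nginx log format remain, which is why they can match the column names of nginx.log one to one.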

i’m running vector manually to see if all is working:

vector --config /etc/vector/vector.toml

and i’m making http requests to my.vhost.com to generate some traffic. logs nicely get into ClickHouse’s new table:

select * from nginx.log order by ts\G
Row 20:
───────
ts:              2023-06-15 04:09:36
remote_addr:     10.12.40.5
request:         POST /something HTTP/1.1
status:          400
body_bytes_sent: 138
request_body:    some payload
request_time:    0.809
http_user_agent:

Row 21:
───────
ts:              2023-06-15 04:09:42
remote_addr:     10.12.40.5
request:         POST /somethingelse HTTP/1.1
status:          400
body_bytes_sent: 138
request_body:    another payload
request_time:    0.798
http_user_agent:
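the test traffic mentioned above can come from any HTTP client. as a minimal python sketch (the helper name `build_post` is made up, and plain http is assumed because the vhost above has no TLS configured):

```python
import urllib.request

def build_post(url, payload):
    """Prepare a POST request whose body should end up in the request_body column."""
    return urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )

request = build_post("http://my.vhost.com/something", "some payload")

# to actually send it:
# try:
#     urllib.request.urlopen(request, timeout=10)
# except urllib.error.HTTPError as error:
#     print(error.code)  # the upstream may answer with an error status, like the 400 above
```

even when the upstream rejects the request, nginx still logs the body, so the row should show up in nginx.log.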

to start vector as a background service:

chown -R vector:vector /var/lib/vector/
systemctl enable --now vector

once i see that all works fine, i also remove the [sinks.print] section from /etc/vector/vector.toml and restart vector.

above is based on:

  • https://clickhouse.com/docs/en/integrations/vector
  • https://medium.com/datadenys/using-vector-to-feed-nginx-logs-to-clickhouse-in-real-time-197745d9e88b
  • https://vector.dev/docs/reference/configuration/sinks/clickhouse/
