using vector.dev to get haproxy, nginx logs into ClickHouse

at work we have few reverse proxies exposing some APIs. ClickHouse turned out to be great tool for analyzing logs, running ad-hoc research or gathering stats cyclically. below – configs that allow me to get HTTP requests passing via those proxies, including POST bodies, into ClickHouse. i’m using vector.dev to fetch, transform and ship logs.

case 1 – reverse proxy on nginx:

/etc/nginx/conf.d/logformat.conf – defining log format well formatted Newline Delimited JSON

log_format json_combined escape=json '{'          
        '"ts":"$time_iso8601",'                   
        '"remote_addr":"$remote_addr",'           
        '"request":"$request",'                   
        '"status": "$status",'                    
        '"body_bytes_sent":"$body_bytes_sent",'   
        '"request_body":"$request_body",'         
        '"request_time":"$request_time",'         
        '"http_user_agent":"$http_user_agent"'    
'}';

vhost that proxies traffic to the backend – /etc/nginx/sites-enabled/vhost

server {                                                                                                                 
        listen 80 default_server;                                                                                   
        server_name some.name.com;                                                                           
        access_log  /var/log/nginx/some.name.com-access.log json_combined;                               
        root /var/www/html;                                                                                              
        location / {                                                                                                     
                proxy_pass http://127.0.0.1:1234/;                                                               
                proxy_connect_timeout 10s;                                                                               
                proxy_read_timeout 300s;                                                                                 
        }                                                                                                                
}                                                                                                                                                                                                                                                 

single line from from /var/log/nginx/some.name.com-access.log, pretty printed:

{
    "ts": "2025-11-21T16:29:41+00:00",
    "remote_addr": "10.16.10.100",
    "request": "POST /something...  HTTP/1.1",
    "status": "200",
    "body_bytes_sent": "539041",
    "request_body": "........post body......",
    "request_time": "11.257",
    "http_user_agent": "some.agent"
}

vector.dev config to get this into ClickHouse:

sources:                                                                             
  nginx_logs:                                                                        
    type: file                                                                       
    include:                                                                         
      - /var/log/nginx/some.name.com-access.log 
    read_from: end                                                                   
transforms:                                                                          
  process:                                                                           
    type: remap                                                                      
    inputs:                                                                          
      - nginx_logs                                                                   
    source: |                                                                        
      .=parse_json!(.message)                                                        
sinks:                                                                               
  clickhouse:                                                                        
    date_time_best_effort: true                                                      
    auth:                                                                            
      strategy: basic                                                                
      user: clickhouse_user                                                
      password: ................                                            
    compression: none                                                                
    type: clickhouse                                                                 
    inputs:                                                                          
      - process                                                                      
    endpoint: http://address.of.clickhouse.server:8123                                           
    database: mylogs
    table: nginx_log
    skip_unknown_fields: true                                                        

table structure in clickhouse – mylogs.nginx_log:

CREATE TABLE mylogs.nginx_log
(                                           
    `ts` DateTime,                          
    `remote_addr` LowCardinality(String),   
    `request` LowCardinality(String),       
    `status` LowCardinality(String),        
    `body_bytes_sent` UInt32,               
    `request_body` String,                  
    `request_time` Float32,                 
    `http_user_agent` LowCardinality(String)
)                                           
ENGINE = MergeTree                          
ORDER BY ts                                 
TTL ts + toIntervalDay(90)                  
SETTINGS index_granularity = 8192           

case 2 – doing the same with haprxy

part of haproxy.cfg defining log format:

log-format "{\"timestamp\": \"[%t]\", \"backend_name\": \"%b\", \"backend_ip_port\": \"%si:%sp\", \"remote_addr\": \"%ci\", \"request\": \"%r\", \"status\": \"%ST\", \"body_bytes_sent\": \"%B\", \"request_time\":\"%TR\", \"total_session_time\": \"%Tt\", \"server_response_time\": \"%Tr\", \"http_user_agent\": \"%[capture.req.hdr(1)]\", \"request_body\": \"%[capture.req.hdr(0),json(utf8s)]\"}"

/etc/rsyslog.d/49-haproxy.conf – to route logs to local file:

$AddUnixListenSocket /var/lib/haproxy/dev/log                                             
:programname, startswith, "haproxy" {                                                     
  /var/log/haproxy.log                                                                    
  stop                                                                                    
}                                                                                                                                                                                   

sample log line in /var/log/haproxy.log:

2025-11-21T16:48:01.651481+00:00 serverName haproxy[487893]: {"timestamp": "[21/Nov/2025:16:48:00.441]", "backend_name": "bl_workers", "backend_ip_port": "10.7.37.11:443", "remote_addr": "172.16.1.11", "request": "POST /.... HTTP/1.1", "status": "200", "body_bytes_sent": "345", "request_time":"0", "total_session_time": "1209", "server_response_time": "1156", "http_user_agent": "....", "request_body": "................."}

config for vector.dev – we need to extract from syslog format actual JSON message, then transform timestamp from common log format to ISO8601:

sources:
  haproxy_logs:
    type: file
    include:
      -  /var/log/haproxy.log
    read_from: end
transforms:
  strip_and_parse_json:
    type: remap
    inputs:
      - haproxy_logs
    source: |
      json, err = parse_regex(.message, r'^(?:\S+\s+){3}(?P<message>.*)$')
      if err == null {
        json = parse_json!(json.message)
        json.ts = parse_timestamp!(.timestamp, format: "%d/%b/%Y:%H:%M:%S.%3f")
        json.request_time = parse_int!(json.server_response_time)/1000.0
        . = json
      }
sinks:
  console:
    type: console
    inputs:
      - strip_and_parse_json
    encoding:
      codec: text
  clickhouse:
    date_time_best_effort: true
    auth:
      strategy: basic
      user: clickhouse_user
      password: .............
    compression: none
    type: clickhouse
    inputs:
      - strip_and_parse_json
    endpoint: http://address.of.clickhouse.server:8123
    database: mylogs
    table: haproxy_log
    skip_unknown_fields: true

clickhouse schema:

CREATE TABLE mylogs.haproxy_log
(                                            
    `ts` DateTime,                           
    `remote_addr` LowCardinality(String),    
    `request` LowCardinality(String),        
    `status` LowCardinality(String),         
    `body_bytes_sent` UInt32,                
    `request_body` String,                   
    `request_time` Float32,                  
    `http_user_agent` LowCardinality(String),
    `backend_ip_port` LowCardinality(String) 
)                                            
ENGINE = MergeTree                           
ORDER BY ts                                  
TTL ts + toIntervalDay(90)                   
SETTINGS index_granularity = 8192            

when trying to get the vector.dev config working – it was helpful to run ngrep and see directly in terminal what vector tried to send to ClickHouse. i was using this syntax:

ngrep -W byline "POST" tcp and host address.of.clickhouse.server -c 250

Leave a Reply

Your email address will not be published. Required fields are marked *

(Spamcheck Enabled)