{"id":3822,"date":"2025-11-21T17:50:14","date_gmt":"2025-11-21T16:50:14","guid":{"rendered":"https:\/\/kudzia.eu\/b\/?p=3822"},"modified":"2025-11-21T17:51:07","modified_gmt":"2025-11-21T16:51:07","slug":"using-vector-dev-to-get-haproxy-nginx-logs-into-clickhouse","status":"publish","type":"post","link":"https:\/\/kudzia.eu\/b\/2025\/11\/using-vector-dev-to-get-haproxy-nginx-logs-into-clickhouse\/","title":{"rendered":"using vector.dev to get haproxy, nginx logs into ClickHouse"},"content":{"rendered":"\n<p>at work we have few reverse proxies exposing some APIs. ClickHouse turned out to be great tool for analyzing logs, running ad-hoc research or gathering stats cyclically. below &#8211; configs that allow me to get HTTP requests passing via those proxies, including POST bodies, into ClickHouse. i&#8217;m using <a href=\"https:\/\/vector.dev\/\">vector.dev<\/a> to fetch, transform and ship logs.<\/p>\n\n\n\n<p>case 1 &#8211; reverse proxy on nginx:<\/p>\n\n\n\n<p>\/etc\/nginx\/conf.d\/logformat.conf &#8211; defining log format well formatted Newline Delimited JSON<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>log_format json_combined escape=json '{'          \n        '\"ts\":\"$time_iso8601\",'                   \n        '\"remote_addr\":\"$remote_addr\",'           \n        '\"request\":\"$request\",'                   \n        '\"status\": \"$status\",'                    \n        '\"body_bytes_sent\":\"$body_bytes_sent\",'   \n        '\"request_body\":\"$request_body\",'         \n        '\"request_time\":\"$request_time\",'         \n        '\"http_user_agent\":\"$http_user_agent\"'    \n'}';<\/code><\/pre>\n\n\n\n<p>vhost that proxies traffic to the backend &#8211; \/etc\/nginx\/sites-enabled\/vhost<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>server {                                                                                                                 \n        listen 80 default_server;                                                                                   \n        server_name some.name.com;                                                                           \n        access_log  \/var\/log\/nginx\/some.name.com-access.log json_combined;                               \n        root \/var\/www\/html;                                                                                              \n        location \/ {                                                                                                     \n                proxy_pass http:\/\/127.0.0.1:1234\/;                                                               \n                proxy_connect_timeout 10s;                                                                               \n                proxy_read_timeout 300s;                                                                                 \n        }                                                                                                                \n}                                                                                                                                                                                                                                                 <\/code><\/pre>\n\n\n\n<p>single line from from \/var\/log\/nginx\/some.name.com-access.log, pretty printed:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>{\n    \"ts\": \"2025-11-21T16:29:41+00:00\",\n    \"remote_addr\": \"10.16.10.100\",\n    \"request\": \"POST \/something...  HTTP\/1.1\",\n    \"status\": \"200\",\n    \"body_bytes_sent\": \"539041\",\n    \"request_body\": \"........post body......\",\n    \"request_time\": \"11.257\",\n    \"http_user_agent\": \"some.agent\"\n}<\/code><\/pre>\n\n\n\n<p>vector.dev config to get this into ClickHouse:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>sources:                                                                             \n  nginx_logs:                                                                        \n    type: file                                                                       \n    include:                                                                         \n      - \/var\/log\/nginx\/some.name.com-access.log \n    read_from: end                                                                   \ntransforms:                                                                          \n  process:                                                                           \n    type: remap                                                                      \n    inputs:                                                                          \n      - nginx_logs                                                                   \n    source: |                                                                        \n      .=parse_json!(.message)                                                        \nsinks:                                                                               \n  clickhouse:                                                                        \n    date_time_best_effort: true                                                      \n    auth:                                                                            \n      strategy: basic                                                                \n      user: clickhouse_user                                                \n      password: ................                                            \n    compression: none                                                                \n    type: clickhouse                                                                 \n    inputs:                                                                          \n      - process                                                                      \n    endpoint: http:\/\/address.of.clickhouse.server:8123                                           \n    database: mylogs\n    table: nginx_log\n    skip_unknown_fields: true                                                        \n<\/code><\/pre>\n\n\n\n<p>table structure in clickhouse &#8211; mylogs.nginx_log:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>CREATE TABLE mylogs.nginx_log\n(                                           \n    `ts` DateTime,                          \n    `remote_addr` LowCardinality(String),   \n    `request` LowCardinality(String),       \n    `status` LowCardinality(String),        \n    `body_bytes_sent` UInt32,               \n    `request_body` String,                  \n    `request_time` Float32,                 \n    `http_user_agent` LowCardinality(String)\n)                                           \nENGINE = MergeTree                          \nORDER BY ts                                 \nTTL ts + toIntervalDay(90)                  \nSETTINGS index_granularity = 8192           <\/code><\/pre>\n\n\n\n<p>case 2 &#8211; doing the same with haprxy<\/p>\n\n\n\n<p>part of haproxy.cfg defining log format:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>log-format \"{\\\"timestamp\\\": \\\"&#91;%t]\\\", \\\"backend_name\\\": \\\"%b\\\", \\\"backend_ip_port\\\": \\\"%si:%sp\\\", \\\"remote_addr\\\": \\\"%ci\\\", \\\"request\\\": \\\"%r\\\", \\\"status\\\": \\\"%ST\\\", \\\"body_bytes_sent\\\": \\\"%B\\\", \\\"request_time\\\":\\\"%TR\\\", \\\"total_session_time\\\": \\\"%Tt\\\", \\\"server_response_time\\\": \\\"%Tr\\\", \\\"http_user_agent\\\": \\\"%&#91;capture.req.hdr(1)]\\\", \\\"request_body\\\": \\\"%&#91;capture.req.hdr(0),json(utf8s)]\\\"}\"<\/code><\/pre>\n\n\n\n<p>\/etc\/rsyslog.d\/49-haproxy.conf &#8211; to route logs to local file:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$AddUnixListenSocket \/var\/lib\/haproxy\/dev\/log                                             \n:programname, startswith, \"haproxy\" {                                                     \n  \/var\/log\/haproxy.log                                                                    \n  stop                                                                                    \n}                                                                                                                                                                                   <\/code><\/pre>\n\n\n\n<p>sample log line in \/var\/log\/haproxy.log:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>2025-11-21T16:48:01.651481+00:00 serverName haproxy&#91;487893]: {\"timestamp\": \"&#91;21\/Nov\/2025:16:48:00.441]\", \"backend_name\": \"bl_workers\", \"backend_ip_port\": \"10.7.37.11:443\", \"remote_addr\": \"172.16.1.11\", \"request\": \"POST \/.... HTTP\/1.1\", \"status\": \"200\", \"body_bytes_sent\": \"345\", \"request_time\":\"0\", \"total_session_time\": \"1209\", \"server_response_time\": \"1156\", \"http_user_agent\": \"....\", \"request_body\": \".................\"}<\/code><\/pre>\n\n\n\n<p>config for vector.dev &#8211; we need to extract from syslog format actual JSON message, then transform timestamp from common log format to ISO8601:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>sources:\n  haproxy_logs:\n    type: file\n    include:\n      -  \/var\/log\/haproxy.log\n    read_from: end\ntransforms:\n  strip_and_parse_json:\n    type: remap\n    inputs:\n      - haproxy_logs\n    source: |\n      json, err = parse_regex(.message, r'^(?:\\S+\\s+){3}(?P&lt;message>.*)$')\n      if err == null {\n        json = parse_json!(json.message)\n        json.ts = parse_timestamp!(.timestamp, format: \"%d\/%b\/%Y:%H:%M:%S.%3f\")\n        json.request_time = parse_int!(json.server_response_time)\/1000.0\n        . = json\n      }\nsinks:\n  console:\n    type: console\n    inputs:\n      - strip_and_parse_json\n    encoding:\n      codec: text\n  clickhouse:\n    date_time_best_effort: true\n    auth:\n      strategy: basic\n      user: clickhouse_user\n      password: .............\n    compression: none\n    type: clickhouse\n    inputs:\n      - strip_and_parse_json\n    endpoint: http:\/\/address.of.clickhouse.server:8123\n    database: mylogs\n    table: haproxy_log\n    skip_unknown_fields: true<\/code><\/pre>\n\n\n\n<p><\/p>\n\n\n\n<p>clickhouse schema:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>CREATE TABLE mylogs.haproxy_log\n(                                            \n    `ts` DateTime,                           \n    `remote_addr` LowCardinality(String),    \n    `request` LowCardinality(String),        \n    `status` LowCardinality(String),         \n    `body_bytes_sent` UInt32,                \n    `request_body` String,                   \n    `request_time` Float32,                  \n    `http_user_agent` LowCardinality(String),\n    `backend_ip_port` LowCardinality(String) \n)                                            \nENGINE = MergeTree                           \nORDER BY ts                                  \nTTL ts + toIntervalDay(90)                   \nSETTINGS index_granularity = 8192            <\/code><\/pre>\n\n\n\n<p>when trying to get the vector.dev config working &#8211; it was helpful to run ngrep and see directly in terminal what vector tried to send to ClickHouse. i was using this syntax:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>ngrep -W byline \"POST\" tcp and host address.of.clickhouse.server -c 250<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>at work we have few reverse proxies exposing some APIs. ClickHouse turned out to be great tool for analyzing logs, running ad-hoc research or gathering stats cyclically. below &#8211; configs that allow me to get HTTP requests passing via those proxies, including POST bodies, into ClickHouse. i&#8217;m using vector.dev to fetch, transform and ship logs. [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[17],"tags":[112,136,63,135],"class_list":["post-3822","post","type-post","status-publish","format-standard","hentry","category-tech","tag-clickhouse","tag-haproxy","tag-nginx","tag-vector-dev"],"_links":{"self":[{"href":"https:\/\/kudzia.eu\/b\/wp-json\/wp\/v2\/posts\/3822","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/kudzia.eu\/b\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/kudzia.eu\/b\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/kudzia.eu\/b\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/kudzia.eu\/b\/wp-json\/wp\/v2\/comments?post=3822"}],"version-history":[{"count":4,"href":"https:\/\/kudzia.eu\/b\/wp-json\/wp\/v2\/posts\/3822\/revisions"}],"predecessor-version":[{"id":3826,"href":"https:\/\/kudzia.eu\/b\/wp-json\/wp\/v2\/posts\/3822\/revisions\/3826"}],"wp:attachment":[{"href":"https:\/\/kudzia.eu\/b\/wp-json\/wp\/v2\/media?parent=3822"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/kudzia.eu\/b\/wp-json\/wp\/v2\/categories?post=3822"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/kudzia.eu\/b\/wp-json\/wp\/v2\/tags?post=3822"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}