{"id":3513,"date":"2023-06-05T18:12:00","date_gmt":"2023-06-05T17:12:00","guid":{"rendered":"https:\/\/kudzia.eu\/b\/?p=3513"},"modified":"2023-06-05T15:19:42","modified_gmt":"2023-06-05T14:19:42","slug":"using-clickhouse-local-to-analyze-archived-log-files","status":"publish","type":"post","link":"https:\/\/kudzia.eu\/b\/2023\/06\/using-clickhouse-local-to-analyze-archived-log-files\/","title":{"rendered":"using clickhouse-local to analyze archived log files"},"content":{"rendered":"\n<p>at work we&#8217;re hoarding log files. it&#8217;s a low-cost, low-tech solution: btrfs, some python script archiving \/var\/log\/*.log, *.1 from hundreds of servers. we have a peace of mind that whatever it is &#8211; as long as it&#8217;s logging to that folder &#8211; we&#8217;ll have an archive of it.<\/p>\n\n\n\n<p>till now, whenever there was a need of a witch-hunt, we&#8217;ve been using a ad-hoc written bash script that was unpacking log files, grepping them to find needed info. to speed things up &#8211; it was often parallelized work. and there were plenty of trial &amp; error attempts to get the needed info.<\/p>\n\n\n\n<p>my colleague found out that clickhouse-local is quite well suited for this work. here are few queries:<\/p>\n\n\n\n<p>instead of grepping<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>SELECT * FROM file('\/some\/path\/stdout.log_20230604_*.gz','LineAsString','c text') WHERE c LIKE '%something%';<\/code><\/pre>\n\n\n\n<p>it takes ~2.5x more time to execute than grep, but gives much more flexibility<\/p>\n\n\n\n<p>and things get even better with more structured log files:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>SELECT * FROM file('\/some\/path\/stdout.json.log_20230604_*.gz','JSONEachRow') WHERE logger LIKE '%GetEntity%';\r<\/code><\/pre>\n\n\n\n<p>extracting information from standard apache2 access log file with re2 and ClickHouse&#8217;s file + regex query:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>SELECT * FROM file('\/some\/path\/apache2\/access.log_20230603_000019.zst','Regexp','remote_host text,remote_log_name text,user text, ts text,verb text, request text,proto text, status UInt16, length UInt64,referer text, user_agent text') WHERE 1 SETTINGS format_regexp='(.*?) (.*?) (.*?) \\&#91;(.*? .*?)\\] \\\"(.*?) (.*?) (.*?)\\\" (.*?) (.*?) (.*?) \\\"(.*?)\\\"',format_regexp_skip_unmatched=1;\r\n<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>at work we&#8217;re hoarding log files. it&#8217;s a low-cost, low-tech solution: btrfs, some python script archiving \/var\/log\/*.log, *.1 from hundreds of servers. we have a peace of mind that whatever it is &#8211; as long as it&#8217;s logging to that folder &#8211; we&#8217;ll have an archive of it. till now, whenever there was a need [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[51],"tags":[112],"class_list":["post-3513","post","type-post","status-publish","format-standard","hentry","category-unimportant","tag-clickhouse"],"_links":{"self":[{"href":"https:\/\/kudzia.eu\/b\/wp-json\/wp\/v2\/posts\/3513","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/kudzia.eu\/b\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/kudzia.eu\/b\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/kudzia.eu\/b\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/kudzia.eu\/b\/wp-json\/wp\/v2\/comments?post=3513"}],"version-history":[{"count":2,"href":"https:\/\/kudzia.eu\/b\/wp-json\/wp\/v2\/posts\/3513\/revisions"}],"predecessor-version":[{"id":3515,"href":"https:\/\/kudzia.eu\/b\/wp-json\/wp\/v2\/posts\/3513\/revisions\/3515"}],"wp:attachment":[{"href":"https:\/\/kudzia.eu\/b\/wp-json\/wp\/v2\/media?parent=3513"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/kudzia.eu\/b\/wp-json\/wp\/v2\/categories?post=3513"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/kudzia.eu\/b\/wp-json\/wp\/v2\/tags?post=3513"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}