using clickhouse-local to analyze archived log files

at work we’re hoarding log files. it’s a low-cost, low-tech solution: btrfs, plus a python script archiving /var/log/*.log and *.1 from hundreds of servers. it gives us peace of mind that whatever it is – as long as it’s logging to that folder – we’ll have an archive of it.

till now, whenever a witch-hunt was needed, we’ve been using ad-hoc bash scripts that unpacked the log files and grepped them for the needed info. to speed things up, the work was often parallelized, and there were plenty of trial & error attempts before the right info turned up.

my colleague found out that clickhouse-local is quite well suited for this kind of work. here are a few queries:

instead of grepping:

SELECT * FROM file('/some/path/stdout.log_20230604_*.gz','LineAsString','c text') WHERE c LIKE '%something%';

it takes ~2.5x longer to execute than grep, but gives much more flexibility.
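for example – a sketch reusing the path and search string from above – the `_file` virtual column of the file() table function lets you count hits per archived file, something grep alone needs extra scripting for:

```sql
-- count matching lines per archived file using the _file virtual column
SELECT _file, count() AS hits
FROM file('/some/path/stdout.log_20230604_*.gz', 'LineAsString', 'c text')
WHERE c LIKE '%something%'
GROUP BY _file
ORDER BY hits DESC;
```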

and things get even better with more structured log files:

SELECT * FROM file('/some/path/stdout.json.log_20230604_*.gz','JSONEachRow') WHERE logger LIKE '%GetEntity%';
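once the fields are parsed, ordinary SQL aggregation works too – a sketch, assuming the JSON lines carry a `logger` field as in the query above:

```sql
-- top loggers by line count across the day's files
SELECT logger, count() AS lines
FROM file('/some/path/stdout.json.log_20230604_*.gz', 'JSONEachRow')
GROUP BY logger
ORDER BY lines DESC
LIMIT 10;
```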

extracting information from a standard apache2 access log file with ClickHouse’s file() table function and the Regexp input format (re2 syntax):

SELECT * FROM file('/some/path/apache2/access.log_20230603_000019.zst','Regexp','remote_host text,remote_log_name text,user text, ts text,verb text, request text,proto text, status UInt16, length UInt64,referer text, user_agent text') WHERE 1 SETTINGS format_regexp='(.*?) (.*?) (.*?) \[(.*? .*?)\] \"(.*?) (.*?) (.*?)\" (.*?) (.*?) (.*?) \"(.*?)\"',format_regexp_skip_unmatched=1;
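and since the columns are typed (status UInt16, length UInt64), summarizing the access log becomes a one-liner – a sketch reusing the same schema and regex:

```sql
-- requests and bytes served per status code
SELECT status, count() AS requests, sum(length) AS bytes
FROM file('/some/path/apache2/access.log_20230603_000019.zst','Regexp','remote_host text,remote_log_name text,user text, ts text,verb text, request text,proto text, status UInt16, length UInt64,referer text, user_agent text')
GROUP BY status
ORDER BY requests DESC
SETTINGS format_regexp='(.*?) (.*?) (.*?) \[(.*? .*?)\] \"(.*?) (.*?) (.*?)\" (.*?) (.*?) (.*?) \"(.*?)\"', format_regexp_skip_unmatched=1;
```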
