at work we’re hoarding log files. it’s a low-cost, low-tech solution: btrfs, plus a python script archiving /var/log/*.log and *.1 files from hundreds of servers. it gives us peace of mind that whatever it is – as long as it logs to that folder – we’ll have an archive of it.
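the collector side can be sketched in a few lines of python – a simplified sketch, not our actual script (paths, hostnames, rotation and retention handling are all left out):

```python
import glob
import gzip
import shutil
from pathlib import Path

def archive_logs(src_glob, dest_dir):
    """copy every file matching src_glob into dest_dir, gzip-compressed.
    a minimal stand-in for the real archiver (which also handles
    per-host folders, rotated *.1 files and retention)."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for src in glob.glob(src_glob):
        dst = dest / (Path(src).name + ".gz")
        with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)
```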
till now, whenever there was a need for a witch-hunt, we’d write an ad-hoc bash script that unpacked the log files and grepped them for the needed info. to speed things up, the work was often parallelized, and it usually took plenty of trial & error attempts to get at the needed info.
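roughly, that old workflow boiled down to something like this (a python sketch of the bash approach; the real thing was shell scripts, and all names here are made up):

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

def grep_gz(path, needle):
    """decompress one .gz log file and return lines containing needle."""
    hits = []
    with gzip.open(path, "rt", errors="replace") as f:
        for line in f:
            if needle in line:
                hits.append((path, line.rstrip("\n")))
    return hits

def parallel_grep(paths, needle, workers=8):
    """run grep_gz over many archives concurrently and flatten the hits."""
    with ThreadPoolExecutor(workers) as pool:
        results = pool.map(lambda p: grep_gz(p, needle), paths)
    return [hit for sub in results for hit in sub]
```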
my colleague found out that clickhouse-local is quite well suited for this kind of work. here are a few queries:
instead of grepping:
SELECT * FROM file('/some/path/stdout.log_20230604_*.gz','LineAsString','c text') WHERE c LIKE '%something%';
it takes ~2.5x longer to execute than grep, but gives much more flexibility.
and things get even better with more structured log files:
SELECT * FROM file('/some/path/stdout.json.log_20230604_*.gz','JSONEachRow') WHERE logger LIKE '%GetEntity%';
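for a sense of what that query is doing, the same filter in plain python over gzipped json-lines looks like this (a sketch; the `logger` field name comes from the query above):

```python
import gzip
import json

def filter_json_logs(paths, field, needle):
    """yield parsed json-lines records where record[field] contains needle,
    skipping lines that aren't valid json (partial writes, noise)."""
    for path in paths:
        with gzip.open(path, "rt", errors="replace") as f:
            for line in f:
                try:
                    rec = json.loads(line)
                except json.JSONDecodeError:
                    continue
                if needle in str(rec.get(field, "")):
                    yield rec
```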
extracting fields from a standard apache2 access log file with ClickHouse’s file() table function, the Regexp input format and an re2 regex:
SELECT *
FROM file('/some/path/apache2/access.log_20230603_000019.zst', 'Regexp',
    'remote_host text, remote_log_name text, user text, ts text, verb text, request text, proto text, status UInt16, length UInt64, referer text, user_agent text')
WHERE 1
SETTINGS
    format_regexp = '(.*?) (.*?) (.*?) \[(.*? .*?)\] \"(.*?) (.*?) (.*?)\" (.*?) (.*?) (.*?) \"(.*?)\"',
    format_regexp_skip_unmatched = 1;
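the format_regexp pattern is plain re2, so it can be sanity-checked with python’s `re` module before pointing clickhouse at real files (the sample line below is made up):

```python
import re

# the same pattern as in format_regexp above (python's re accepts it too)
LOG_RE = re.compile(
    r'(.*?) (.*?) (.*?) \[(.*? .*?)\] \"(.*?) (.*?) (.*?)\" (.*?) (.*?) (.*?) \"(.*?)\"'
)

FIELDS = ("remote_host", "remote_log_name", "user", "ts", "verb",
          "request", "proto", "status", "length", "referer", "user_agent")

def parse_access_line(line):
    """return a dict of fields, or None when the line doesn't match
    (the python counterpart of format_regexp_skip_unmatched=1)."""
    m = LOG_RE.fullmatch(line.rstrip("\n"))
    return dict(zip(FIELDS, m.groups())) if m else None

# made-up sample line in combined log format
sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
          '"http://example.com/start.html" "Mozilla/4.08"')
rec = parse_access_line(sample)
# note: with this pattern the referer group keeps its surrounding quotes
```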