最佳实践

 

基础设施 CPU 数据收集分析

更新时间 2023-09-06

随着业务稳定性和重要性的逐步提升,需要收集的基础设施监控项愈发精细,导致监控数据几何级暴增。为解决暴增数据的存储问题,监控数据的随时间变化的规律,ChronusDB 分布式时序数据库应运而生。

本小节主要介绍 ChronusDB 如何存储基础设施的 CPU 监控数据,以及如何对其进行分析。

数据库建模

  1. 建立监控数据库。

    $ echo "CREATE DATABASE monitor" | curl 'http://ChronusDB用户名:ChronusDB密码@高可用IP:8123/' --data-binary @-
    
  2. 建立CPU监控表,表中主要包含时间、云服务器名、CPU 型号等标签信息和 CPU 常见监控指标。

    $ echo "CREATE TABLE monitor.cpu_tags_metrics
     (
       time DateTime DEFAULT now(),
       hostname LowCardinality(String),
       region LowCardinality(String),
       datacenter LowCardinality(String),
       rack LowCardinality(String),
       os LowCardinality(String),
       arch LowCardinality(String),
       team LowCardinality(String),
       service LowCardinality(String),
       service_version LowCardinality(String),
       service_environment LowCardinality(String),
       usage_user Nullable(Float64),
       usage_system Nullable(Float64),
       usage_idle Nullable(Float64),
       usage_nice Nullable(Float64),
       usage_iowait Nullable(Float64),
       usage_irq Nullable(Float64),
       usage_softirq Nullable(Float64),
       usage_steal Nullable(Float64),
       usage_guest Nullable(Float64),
       usage_guest_nice Nullable(Float64)
     )
     ENGINE = MergeTree()
     PARTITION BY toYYYYMM(time)
      ORDER BY (hostname, region, datacenter, rack, os, arch, team, service, service_version, service_environment)" | curl 'http://ChronusDB用户名:ChronusDB密码@高可用IP:8123/' --data-binary @-
    
  3. 建立分布式表,以下示例用逻辑一致分布方式。

      $ echo "CREATE TABLE monitor.  
        cpu_tags_metrics_logical_distributed
      (
        time DateTime DEFAULT now(),
        hostname LowCardinality(String),
        region LowCardinality(String),
        datacenter LowCardinality(String),
        rack LowCardinality(String),
        os LowCardinality(String),
        arch LowCardinality(String),
        team LowCardinality(String),
        service LowCardinality(String),
        service_version LowCardinality(String),
        service_environment LowCardinality(String),
        usage_user Nullable(Float64),
        usage_system Nullable(Float64),
        usage_idle Nullable(Float64),
        usage_nice Nullable(Float64),
        usage_iowait Nullable(Float64),
        usage_irq Nullable(Float64),
        usage_softirq Nullable(Float64),
        usage_steal Nullable(Float64),
        usage_guest Nullable(Float64),
        usage_guest_nice Nullable(Float64)
      )
      ENGINE = Distributed(logical_consistency_cluster, monitor, cpu_tags_metrics, rand())"| curl 'http://ChronusDB用户名:ChronusDB密码@高可用IP:8123/' --data-binary @-
    

查询模型

  1. 指定五个云服务器,查询八小时内所有CPU指标的最大值。

     $ echo "SELECT
       toStartOfHour(time) AS hour,
       max(usage_user) AS max_usage_user,
       max(usage_system) AS max_usage_system,
       max(usage_idle) AS max_usage_idle,
       max(usage_nice) AS max_usage_nice,
       max(usage_iowait) AS max_usage_iowait,
       max(usage_irq) AS max_usage_irq,
       max(usage_softirq) AS max_usage_softirq,
       max(usage_steal) AS max_usage_steal,
       max(usage_guest) AS max_usage_guest,
       max(usage_guest_nice) AS max_usage_guest_nice
     FROM monitor.   cpu_tags_metrics_logical_distributed
     WHERE (hostname IN ('host_9', 'host_5', 'host_1', 'host_7', 'host_2')) AND (time >= '2019-01-01 00:17:45') AND (time < '2019-01-01 08:17:45')
     GROUP BY hour
     ORDER BY hour ASC" | curl 'http://<ChronusDB用户名>:<ChronusDB密码>@<高可用IP>:8123/'  --data-binary @-
    
  2. 查询12小时内,所有云服务器 5 个 CPU 指标的平均值。

    $ echo "SELECT
      toStartOfHour(time) AS hour,
      avg(usage_user) AS mean_usage_user,
      avg(usage_system) AS mean_usage_system,
      avg(usage_idle) AS mean_usage_idle,
      avg(usage_nice) AS mean_usage_nice,
      avg(usage_iowait) AS mean_usage_iowait
    FROM monitor.cpu_tags_metrics_logical_distributed
    WHERE (time >= '1970-01-01 00:54:10') AND (time < '1970-01-01 12:54:10')
    GROUP BY
       hour,
       hostname
    ORDER BY
       hour ASC,
       hostname ASC" | curl 'http://<ChronusDB用户名>:<ChronusDB密码>@<高可用IP>:8123/' --data-binary @-
    
  3. 查询某时刻前五分钟 CPU usage_user 的最大值。

    $ echo "SELECT
      toStartOfMinute(time) AS minute,
       max(usage_user)
    FROM monitor.cpu_tags_metrics_logical_distributed
    WHERE time < '2019-01-01 01:16:22'
    GROUP BY minute
    ORDER BY minute DESC
    LIMIT 5" | curl 'http://<ChronusDB用户名>:<ChronusDB密码>@<高可用IP>:8123/' --data-binary @-
    
  4. 查询一定时段内,所有云服务器 CPU 过载情况。

     $ echo "SELECT *
     FROM monitor.cpu_tags_metrics_logical_distributed
     PREWHERE (usage_user > 90.) AND (time >= '2019-01-01 00:16:22') AND (time < '2019-01-01 12:16:22')" | curl 'http://<ChronusDB用户名>:<ChronusDB密码>@<高可用IP>:8123/' --data-binary @-
    
  5. 查询一定时段内,某些云服务器 CPU 过载情况。

     $ echo "SELECT *
     FROM monitor.cpu_tags_metrics_logical_distributed
     PREWHERE (usage_user > 90.) AND (time >= '2019-01-01 00:08:59') AND (time < '2019-01-01 12:08:59') AND (hostname IN ('host_9', 'host_5', 'host_1', 'host_7', 'host_2'))" | curl 'http://<ChronusDB用户名>:<ChronusDB密码>@<高可用IP>:8123/' --data-binary @-
    
  6. 查询每台云服务器最近的监控记录。

     $ echo "SELECT DISTINCT
        hostname,
        usage_user,
        usage_system,
        usage_idle,
        usage_nice,
        usage_iowait,
        usage_irq,
        usage_softirq,
        usage_steal,
        usage_guest,
        usage_guest_nice
     FROM monitor. cpu_tags_metrics_logical_distributed
     ORDER BY
        hostname ASC,
        time DESC" | curl 'http://<ChronusDB用户名>:<ChronusDB密码>@<高可用IP>:8123/' --data-binary @-
    
  7. 查询某台云服务器 CPU 在指定时段内,每分钟 usage_user 的最大值。

     $ echo "SELECT
        toStartOfMinute(time) AS minute,
        max(usage_user) AS max_usage_user
     FROM monitor.cpu_tags_metrics_logical_distributed
     WHERE (hostname = 'host_9') AND (time >= '2019-01-01 01:09:26') AND (time < '2019-01-01 01:09:27')
     GROUP BY minute
     ORDER BY minute ASC" | curl 'http://<ChronusDB用户名>:<ChronusDB密码>@<高可用IP>:8123/' --data-binary @-
    

性能调试

测试数据量 10 亿行。

写入性能测试结果

每次插入 100 万行,写入时间结果。

bulk rows per bulk time sum rows sum time partition disk used
1 Millon 8.96519 s 1 Billon 8965.19 s 33GB

查询性能测试结果

逻辑一致分布式查询耗时结果。

查询模型 查询耗时
指定五个云服务器,查询八小时内所有CPU指标的最大值 0.06s
12小时内,所有云服务器 5 个 CPU 指标的平均值 0.071s
获取某时刻前五分钟 CPU usage_user 的最大值 0.017s
一定时段内,所有云服务器 CPU 过载情况 9.315s
一定时段内,某些云服务器 CPU 过载情况 0.06s
每台云服务器最近的监控记录 36s
某台云服务器 CPU 在指定时段内,每分钟 usage_user 的最大值 0.12s
这篇文档解决了您的问题吗?
0
0