Table of Contents
- Setting up Prometheus and Grafana to monitor Longhorn
- Integrate Longhorn metrics into the Rancher monitoring system
- Longhorn Monitoring Metrics
- Support Kubelet Volume Metrics
- Longhorn Alert Rule Examples
Setting up Prometheus and Grafana to monitor Longhorn
Overview
Longhorn natively exposes metrics in Prometheus text format on a REST endpoint http://LONGHORN_MANAGER_IP:PORT/metrics. See Longhorn's Metrics for a description of all available metrics. You can use any collecting tool such as Prometheus, Graphite, or Telegraf to scrape these metrics, then visualize the collected data with a tool such as Grafana.
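The endpoint serves the plain Prometheus text exposition format, which is easy to inspect by hand. As a sketch of what a scrape looks like and how it can be read, the following parses two made-up sample lines (the label values and numbers are illustrative, not from a real cluster; a real consumer would use a Prometheus client library rather than hand-parsing):

```python
# Minimal parser for the Prometheus text exposition format, applied to a
# sample of what the Longhorn metrics endpoint returns. The sample values
# below are illustrative, not taken from a real cluster.
sample_scrape = """\
# HELP longhorn_volume_capacity_bytes Configured size in bytes for this volume
# TYPE longhorn_volume_capacity_bytes gauge
longhorn_volume_capacity_bytes{node="worker-2",volume="testvol"} 6.442450944e+09
longhorn_volume_actual_size_bytes{node="worker-2",volume="testvol"} 1.1917312e+08
"""

def parse_metrics(text):
    """Return a list of (name, labels_dict, value) tuples, skipping comments.

    A sketch only: it does not handle commas or escapes inside label values.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        left, value = line.rsplit(" ", 1)
        if "{" in left:
            name, raw = left.split("{", 1)
            labels = dict(
                (kv.split("=", 1)[0], kv.split("=", 1)[1].strip('"'))
                for kv in raw.rstrip("}").split(",")
            )
        else:
            name, labels = left, {}
        samples.append((name, labels, float(value)))
    return samples

for name, labels, value in parse_metrics(sample_scrape):
    print(name, labels, value)
```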
This document presents an example setup for monitoring Longhorn. The monitoring system uses Prometheus for collecting data and alerting, and Grafana for visualizing/dashboarding the collected data. At a high level, the monitoring system contains:
- A Prometheus server which scrapes and stores time-series data from the Longhorn metrics endpoint. The Prometheus server is also responsible for generating alerts based on configured rules and the collected data; it then sends the alerts to the Alertmanager.
- An Alertmanager which manages those alerts, including silencing, inhibition, aggregation, and sending out notifications via methods such as email, on-call notification systems, and chat platforms.
- Grafana, which queries the Prometheus server for data and draws dashboards for visualization.
The following figure describes the detailed architecture of the monitoring system.

There are 2 unmentioned components in the above figure:
- The Longhorn backend service is the service pointing to the set of Longhorn manager pods. Longhorn's metrics are exposed in the Longhorn manager pods at the endpoint http://LONGHORN_MANAGER_IP:PORT/metrics.
- The Prometheus Operator makes running Prometheus on top of Kubernetes very easy. The operator watches 3 custom resources: ServiceMonitor, Prometheus, and Alertmanager. When users create those custom resources, the Prometheus Operator deploys and manages the Prometheus server and Alertmanager with the user-specified configurations.
Installation
Follow these instructions to install all components into the monitoring namespace. To install them into a different namespace, change the field namespace: OTHER_NAMESPACE.
Create the monitoring namespace
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
Install the Prometheus Operator
Deploy the Prometheus Operator together with its required ClusterRole, ClusterRoleBinding, and ServiceAccount.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.3
  name: prometheus-operator
  namespace: monitoring
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-operator
subjects:
- kind: ServiceAccount
  name: prometheus-operator
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.3
  name: prometheus-operator
  namespace: monitoring
rules:
- apiGroups:
  - apiextensions.k8s.io
  resources:
  - customresourcedefinitions
  verbs:
  - create
- apiGroups:
  - apiextensions.k8s.io
  resourceNames:
  - alertmanagers.monitoring.coreos.com
  - podmonitors.monitoring.coreos.com
  - prometheuses.monitoring.coreos.com
  - prometheusrules.monitoring.coreos.com
  - servicemonitors.monitoring.coreos.com
  - thanosrulers.monitoring.coreos.com
  resources:
  - customresourcedefinitions
  verbs:
  - get
  - update
- apiGroups:
  - monitoring.coreos.com
  resources:
  - alertmanagers
  - alertmanagers/finalizers
  - prometheuses
  - prometheuses/finalizers
  - thanosrulers
  - thanosrulers/finalizers
  - servicemonitors
  - podmonitors
  - prometheusrules
  verbs:
  - '*'
- apiGroups:
  - apps
  resources:
  - statefulsets
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - configmaps
  - secrets
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - list
  - delete
- apiGroups:
  - ""
  resources:
  - services
  - services/finalizers
  - endpoints
  verbs:
  - get
  - create
  - update
  - delete
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - namespaces
  verbs:
  - get
  - list
  - watch
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.3
  name: prometheus-operator
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: controller
      app.kubernetes.io/name: prometheus-operator
  template:
    metadata:
      labels:
        app.kubernetes.io/component: controller
        app.kubernetes.io/name: prometheus-operator
        app.kubernetes.io/version: v0.38.3
    spec:
      containers:
      - args:
        - --kubelet-service=kube-system/kubelet
        - --logtostderr=true
        - --config-reloader-image=jimmidyson/configmap-reload:v0.3.0
        - --prometheus-config-reloader=quay.io/prometheus-operator/prometheus-config-reloader:v0.38.3
        image: quay.io/prometheus-operator/prometheus-operator:v0.38.3
        name: prometheus-operator
        ports:
        - containerPort: 8080
          name: http
        resources:
          limits:
            cpu: 200m
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 100Mi
        securityContext:
          allowPrivilegeEscalation: false
      nodeSelector:
        beta.kubernetes.io/os: linux
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534
      serviceAccountName: prometheus-operator
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.3
  name: prometheus-operator
  namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.3
  name: prometheus-operator
  namespace: monitoring
spec:
  clusterIP: None
  ports:
  - name: http
    port: 8080
    targetPort: http
  selector:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
Install the Longhorn ServiceMonitor
The Longhorn ServiceMonitor has a label selector app: longhorn-manager to select the Longhorn backend service. Later on, the Prometheus CRD can include the Longhorn ServiceMonitor so that the Prometheus server can discover all Longhorn manager pods and their endpoints.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: longhorn-prometheus-servicemonitor
  namespace: monitoring
  labels:
    name: longhorn-prometheus-servicemonitor
spec:
  selector:
    matchLabels:
      app: longhorn-manager
  namespaceSelector:
    matchNames:
    - longhorn-system
  endpoints:
  - port: manager
Install and configure the Prometheus Alertmanager
Create a highly available Alertmanager deployment with 3 instances:
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: longhorn
  namespace: monitoring
spec:
  replicas: 3
The Alertmanager instances will not start unless a valid configuration is provided. See here for more explanation of the Alertmanager configuration. The code below gives an example configuration:
global:
  resolve_timeout: 5m
route:
  group_by: [alertname]
  receiver: email_and_slack
receivers:
- name: email_and_slack
  email_configs:
  - to:
    from:
    smarthost:
    # SMTP authentication information.
    auth_username:
    auth_identity:
    auth_password:
    headers:
      subject: 'Longhorn-Alert'
    text: |-
      {{ range .Alerts }}
      *Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
      *Description:* {{ .Annotations.description }}
      *Details:*
      {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
      {{ end }}
      {{ end }}
  slack_configs:
  - api_url:
    channel:
    text: |-
      {{ range .Alerts }}
      *Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
      *Description:* {{ .Annotations.description }}
      *Details:*
      {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
      {{ end }}
      {{ end }}
Save the above Alertmanager configuration in a file called alertmanager.yaml and create a secret from it using kubectl.
Alertmanager instances require the secret resource naming to follow the format alertmanager-{ALERTMANAGER_NAME}. In the previous step, the name of the Alertmanager is longhorn, so the secret name must be alertmanager-longhorn.
$ kubectl create secret generic alertmanager-longhorn --from-file=alertmanager.yaml -n monitoring
To be able to view the web UI of the Alertmanager, expose it through a Service. A simple way to do this is to use a Service of type NodePort:
apiVersion: v1
kind: Service
metadata:
  name: alertmanager-longhorn
  namespace: monitoring
spec:
  type: NodePort
  ports:
  - name: web
    nodePort: 30903
    port: 9093
    protocol: TCP
    targetPort: web
  selector:
    alertmanager: longhorn
After creating the above service, you can access the web UI of the Alertmanager via a node's IP and the port 30903.
Use the above NodePort service for quick verification only, because it doesn't communicate over a TLS connection. You may want to change the service type to ClusterIP and set up an Ingress controller to expose the web UI of the Alertmanager over a TLS connection.
Install and configure the Prometheus server
Create a PrometheusRule custom resource that defines the alert conditions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: longhorn
    role: alert-rules
  name: prometheus-longhorn-rules
  namespace: monitoring
spec:
  groups:
  - name: longhorn.rules
    rules:
    - alert: LonghornVolumeUsageCritical
      annotations:
        description: Longhorn volume {{$labels.volume}} on {{$labels.node}} is at {{$value}}% used for more than 5 minutes.
        summary: Longhorn volume capacity is over 90% used.
      expr: 100 * (longhorn_volume_actual_size_bytes / longhorn_volume_capacity_bytes) > 90
      for: 5m
      labels:
        issue: Longhorn volume {{$labels.volume}} usage on {{$labels.node}} is critical.
        severity: critical
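The alert fires when actual usage stays above 90% of capacity for 5 minutes. To see how the expression evaluates, the arithmetic can be checked against the example metric values from the metrics reference below (a sketch; these are the documented example numbers, not live data):

```python
# Evaluate the LonghornVolumeUsageCritical expression
#   100 * (longhorn_volume_actual_size_bytes / longhorn_volume_capacity_bytes) > 90
# against the example samples from the metrics reference
# (1.1917312e+08 bytes used of 6.442450944e+09 bytes capacity).
actual_size_bytes = 1.1917312e+08
capacity_bytes = 6.442450944e+09

usage_percent = 100 * (actual_size_bytes / capacity_bytes)
firing = usage_percent > 90

print(f"usage: {usage_percent:.2f}%  firing: {firing}")
```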
For more information on how to define alert rules, see https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/#alerting-rules
If RBAC authorization is activated, create a ClusterRole and ClusterRoleBinding for the Prometheus pods:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
  namespace: monitoring
rules:
- apiGroups: [""]
  resources:
  - nodes
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources:
  - configmaps
  verbs: ["get"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring
Create the Prometheus custom resource. Notice that we select the Longhorn ServiceMonitor and the Longhorn rules in the spec.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 2
  serviceAccountName: prometheus
  alerting:
    alertmanagers:
    - namespace: monitoring
      name: alertmanager-longhorn
      port: web
  serviceMonitorSelector:
    matchLabels:
      name: longhorn-prometheus-servicemonitor
  ruleSelector:
    matchLabels:
      prometheus: longhorn
      role: alert-rules
To be able to view the web UI of the Prometheus server, expose it through a Service. A simple way to do this is to use a Service of type NodePort:
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitoring
spec:
  type: NodePort
  ports:
  - name: web
    nodePort: 30904
    port: 9090
    protocol: TCP
    targetPort: web
  selector:
    prometheus: prometheus
After creating the above service, you can access the web UI of the Prometheus server via a node's IP and the port 30904.
At this point, you should be able to see all Longhorn manager targets as well as the Longhorn rules in the targets and rules sections of the Prometheus server UI.
Use the above NodePort service for quick verification only, because it doesn't communicate over a TLS connection. You may want to change the service type to ClusterIP and set up an Ingress controller to expose the web UI of the Prometheus server over a TLS connection.
Install Grafana
Create the Grafana datasource config:
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
data:
  prometheus.yaml: |-
    {
        "apiVersion": 1,
        "datasources": [
            {
               "access": "proxy",
               "editable": true,
               "name": "prometheus",
               "orgId": 1,
               "type": "prometheus",
               "url": "http://prometheus:9090",
               "version": 1
            }
        ]
    }
Create the Grafana deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
  labels:
    app: grafana
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      name: grafana
      labels:
        app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana:7.1.5
        ports:
        - name: grafana
          containerPort: 3000
        resources:
          limits:
            memory: "500Mi"
            cpu: "300m"
          requests:
            memory: "500Mi"
            cpu: "200m"
        volumeMounts:
        - mountPath: /var/lib/grafana
          name: grafana-storage
        - mountPath: /etc/grafana/provisioning/datasources
          name: grafana-datasources
          readOnly: false
      volumes:
      - name: grafana-storage
        emptyDir: {}
      - name: grafana-datasources
        configMap:
          defaultMode: 420
          name: grafana-datasources
Expose Grafana on NodePort 32000:
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: monitoring
spec:
  selector:
    app: grafana
  type: NodePort
  ports:
  - port: 3000
    targetPort: 3000
    nodePort: 32000
Use the above NodePort service for quick verification only, because it doesn't communicate over a TLS connection. You may want to change the service type to ClusterIP and set up an Ingress controller to expose Grafana over a TLS connection.
Access the Grafana dashboard using any node IP on port 32000. The default credentials are:
User: admin
Pass: admin
Install the Longhorn dashboard
Once inside Grafana, import the prebuilt Longhorn dashboard: https://grafana.com/grafana/dashboards/13032
See https://grafana.com/docs/grafana/latest/reference/export_import/ for instructions on how to import a Grafana dashboard.
If successful, you should see the following dashboard:

Integrate Longhorn metrics into the Rancher monitoring system
About the Rancher monitoring system
Using Rancher, you can monitor the state and processes of your cluster nodes, Kubernetes components, and software deployments through integration with Prometheus, a leading open-source monitoring solution.
See https://rancher.com/docs/rancher/v2.x/en/monitoring-alerting/ for instructions on how to deploy/enable the Rancher monitoring system.
Add Longhorn metrics to the Rancher monitoring system
If you use Rancher to manage your Kubernetes cluster and have already enabled Rancher monitoring, you can add Longhorn metrics to Rancher monitoring by simply deploying the following ServiceMonitor:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: longhorn-prometheus-servicemonitor
  namespace: longhorn-system
  labels:
    name: longhorn-prometheus-servicemonitor
spec:
  selector:
    matchLabels:
      app: longhorn-manager
  namespaceSelector:
    matchNames:
    - longhorn-system
  endpoints:
  - port: manager
Once the ServiceMonitor is created, Rancher will automatically discover all Longhorn metrics.
You can then set up Grafana dashboards for visualization.
Longhorn Monitoring Metrics
Volume
| Metric name | Description | Example |
| --- | --- | --- |
| longhorn_volume_actual_size_bytes | The actual space used by each replica of the volume on the corresponding node | longhorn_volume_actual_size_bytes{node="worker-2",volume="testvol"} 1.1917312e+08 |
| longhorn_volume_capacity_bytes | Configured size in bytes for this volume | longhorn_volume_capacity_bytes{node="worker-2",volume="testvol"} 6.442450944e+09 |
| longhorn_volume_state | State of this volume: 1=creating, 2=attached, 3=detached, 4=attaching, 5=detaching, 6=deleting | longhorn_volume_state{node="worker-2",volume="testvol"} 2 |
| longhorn_volume_robustness | Robustness of this volume: 0=unknown, 1=healthy, 2=degraded, 3=faulted | longhorn_volume_robustness{node="worker-2",volume="testvol"} 1 |
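The state and robustness gauges are numeric codes that map back to the names listed above. A small decoding sketch, applied to the example samples from the table (state 2, robustness 1):

```python
# Decode the numeric longhorn_volume_state / longhorn_volume_robustness gauges
# into the names given in the metrics table.
VOLUME_STATE = {1: "creating", 2: "attached", 3: "detached",
                4: "attaching", 5: "detaching", 6: "deleting"}
VOLUME_ROBUSTNESS = {0: "unknown", 1: "healthy", 2: "degraded", 3: "faulted"}

print(VOLUME_STATE[2])       # attached
print(VOLUME_ROBUSTNESS[1])  # healthy
```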
Node
| Metric name | Description | Example |
| --- | --- | --- |
| longhorn_node_status | Status of this node: 1=true, 0=false | longhorn_node_status{condition="Ready",condition_reason="",node="worker-2"} 1 |
| longhorn_node_count_total | Total number of nodes in the Longhorn system | longhorn_node_count_total 4 |
| longhorn_node_cpu_capacity_millicpu | The maximum allocatable CPU on this node | longhorn_node_cpu_capacity_millicpu{node="worker-2"} 2000 |
| longhorn_node_cpu_usage_millicpu | The CPU usage on this node | longhorn_node_cpu_usage_millicpu{node="worker-2"} 186 |
| longhorn_node_memory_capacity_bytes | The maximum allocatable memory on this node | longhorn_node_memory_capacity_bytes{node="worker-2"} 4.031229952e+09 |
| longhorn_node_memory_usage_bytes | The memory usage on this node | longhorn_node_memory_usage_bytes{node="worker-2"} 1.833582592e+09 |
| longhorn_node_storage_capacity_bytes | The storage capacity of this node | longhorn_node_storage_capacity_bytes{node="worker-3"} 8.3987283968e+10 |
| longhorn_node_storage_usage_bytes | The used storage of this node | longhorn_node_storage_usage_bytes{node="worker-3"} 9.060941824e+09 |
| longhorn_node_storage_reservation_bytes | The reserved storage for other applications and system on this node | longhorn_node_storage_reservation_bytes{node="worker-3"} 2.519618519e+10 |
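The three storage gauges describe total, used, and reserved space. Roughly, the space left for Longhorn replicas is capacity minus reservation minus usage; whether Longhorn's scheduler computes schedulable space exactly this way is not stated here, so the sketch below only illustrates how the example values from the table relate:

```python
# Example node storage samples from the table above (node worker-3).
capacity = 8.3987283968e+10    # longhorn_node_storage_capacity_bytes
usage = 9.060941824e+09        # longhorn_node_storage_usage_bytes
reservation = 2.519618519e+10  # longhorn_node_storage_reservation_bytes

# Rough remaining space once the reserved portion is excluded (illustrative).
available = capacity - reservation - usage
print(f"available: {available / 1e9:.1f} GB")
```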
Disk
| Metric name | Description | Example |
| --- | --- | --- |
| longhorn_disk_capacity_bytes | The storage capacity of this disk | longhorn_disk_capacity_bytes{disk="default-disk-8b28ee3134628183",node="worker-3"} 8.3987283968e+10 |
| longhorn_disk_usage_bytes | The used storage of this disk | longhorn_disk_usage_bytes{disk="default-disk-8b28ee3134628183",node="worker-3"} 9.060941824e+09 |
| longhorn_disk_reservation_bytes | The reserved storage for other applications and system on this disk | longhorn_disk_reservation_bytes{disk="default-disk-8b28ee3134628183",node="worker-3"} 2.519618519e+10 |
Instance Manager
| Metric name | Description | Example |
| --- | --- | --- |
| longhorn_instance_manager_cpu_usage_millicpu | The CPU usage of this Longhorn instance manager | longhorn_instance_manager_cpu_usage_millicpu{instance_manager="instance-manager-e-2189ed13",instance_manager_type="engine",node="worker-2"} 80 |
| longhorn_instance_manager_cpu_requests_millicpu | Requested CPU resources in Kubernetes of this Longhorn instance manager | longhorn_instance_manager_cpu_requests_millicpu{instance_manager="instance-manager-e-2189ed13",instance_manager_type="engine",node="worker-2"} 250 |
| longhorn_instance_manager_memory_usage_bytes | The memory usage of this Longhorn instance manager | longhorn_instance_manager_memory_usage_bytes{instance_manager="instance-manager-e-2189ed13",instance_manager_type="engine",node="worker-2"} 2.4072192e+07 |
| longhorn_instance_manager_memory_requests_bytes | Requested memory in Kubernetes of this Longhorn instance manager | longhorn_instance_manager_memory_requests_bytes{instance_manager="instance-manager-e-2189ed13",instance_manager_type="engine",node="worker-2"} 0 |
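The usage and request gauges are the inputs of the instance-manager CPU alert shown in the alert rule examples below, which compares usage against the Kubernetes CPU request as a percentage. With the example values from the table:

```python
# Example instance-manager samples from the table above.
cpu_usage_millicpu = 80       # longhorn_instance_manager_cpu_usage_millicpu
cpu_requests_millicpu = 250   # longhorn_instance_manager_cpu_requests_millicpu

# Ratio used by the instance-manager CPU alert rule (fires above 300%).
ratio_percent = cpu_usage_millicpu / cpu_requests_millicpu * 100
print(f"{ratio_percent:.0f}%")  # 32%
```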
Manager
| Metric name | Description | Example |
| --- | --- | --- |
| longhorn_manager_cpu_usage_millicpu | The CPU usage of this Longhorn manager | longhorn_manager_cpu_usage_millicpu{manager="longhorn-manager-5rx2n",node="worker-2"} 27 |
| longhorn_manager_memory_usage_bytes | The memory usage of this Longhorn manager | longhorn_manager_memory_usage_bytes{manager="longhorn-manager-5rx2n",node="worker-2"} 2.6144768e+07 |
Support Kubelet Volume Metrics
About Kubelet Volume Metrics
The kubelet exposes the following metrics:
- kubelet_volume_stats_capacity_bytes
- kubelet_volume_stats_available_bytes
- kubelet_volume_stats_used_bytes
- kubelet_volume_stats_inodes
- kubelet_volume_stats_inodes_free
- kubelet_volume_stats_inodes_used
These metrics measure information related to the PVC's filesystem inside a Longhorn block device.
They are different from the longhorn_volume_* metrics, which measure information specific to the Longhorn block device itself.
You can set up a monitoring system that scrapes the kubelet metrics endpoint to obtain a PVC's status and set up alerts for abnormal events, such as a PVC being about to run out of storage space.
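Such a "PVC is almost full" check boils down to comparing two of the gauges above. A sketch with made-up values (the 90% threshold and the sizes below are illustrative assumptions, not values from this document):

```python
# Sketch of a "PVC is almost out of space" condition built on the
# kubelet_volume_stats_* gauges. The values below are made up.
capacity_bytes = 10 * 1024**3    # kubelet_volume_stats_capacity_bytes (10 GiB)
available_bytes = 512 * 1024**2  # kubelet_volume_stats_available_bytes (512 MiB)

used_fraction = 1 - available_bytes / capacity_bytes
almost_full = used_fraction > 0.9  # illustrative 90% threshold

print(f"used: {used_fraction:.0%}  almost_full: {almost_full}")
```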
A popular monitoring setup is prometheus-operator/kube-prometheus-stack, which scrapes the kubelet_volume_stats_* metrics and provides dashboards and alert rules for them.
Longhorn CSI plugin support
In v1.1.0, the Longhorn CSI plugin supports the NodeGetVolumeStats RPC according to the CSI spec.
This allows the kubelet to query the Longhorn CSI plugin for a PVC's status.
The kubelet then exposes that information in the kubelet_volume_stats_* metrics.
Longhorn Alert Rule Examples
We provide a couple of example Longhorn alert rules below for your reference. See here for a list of all available Longhorn metrics, and build your own alert rules.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: longhorn
    role: alert-rules
  name: prometheus-longhorn-rules
  namespace: monitoring
spec:
  groups:
  - name: longhorn.rules
    rules:
    - alert: LonghornVolumeActualSpaceUsedWarning
      annotations:
        description: The actual space used by Longhorn volume {{$labels.volume}} on {{$labels.node}} is at {{$value}}% capacity for more than 5 minutes.
        summary: The actual used space of Longhorn volume is over 90% of the capacity.
      expr: (longhorn_volume_actual_size_bytes / longhorn_volume_capacity_bytes) * 100 > 90
      for: 5m
      labels:
        issue: The actual used space of Longhorn volume {{$labels.volume}} on {{$labels.node}} is high.
        severity: warning
    - alert: LonghornVolumeStatusCritical
      annotations:
        description: Longhorn volume {{$labels.volume}} on {{$labels.node}} is Faulted for more than 5 minutes.
        summary: Longhorn volume {{$labels.volume}} is Faulted
      expr: longhorn_volume_robustness == 3
      for: 5m
      labels:
        issue: Longhorn volume {{$labels.volume}} is Faulted.
        severity: critical
    - alert: LonghornVolumeStatusWarning
      annotations:
        description: Longhorn volume {{$labels.volume}} on {{$labels.node}} is Degraded for more than 5 minutes.
        summary: Longhorn volume {{$labels.volume}} is Degraded
      expr: longhorn_volume_robustness == 2
      for: 5m
      labels:
        issue: Longhorn volume {{$labels.volume}} is Degraded.
        severity: warning
    - alert: LonghornNodeStorageWarning
      annotations:
        description: The used storage of node {{$labels.node}} is at {{$value}}% capacity for more than 5 minutes.
        summary: The used storage of node is over 70% of the capacity.
      expr: (longhorn_node_storage_usage_bytes / longhorn_node_storage_capacity_bytes) * 100 > 70
      for: 5m
      labels:
        issue: The used storage of node {{$labels.node}} is high.
        severity: warning
    - alert: LonghornDiskStorageWarning
      annotations:
        description: The used storage of disk {{$labels.disk}} on node {{$labels.node}} is at {{$value}}% capacity for more than 5 minutes.
        summary: The used storage of disk is over 70% of the capacity.
      expr: (longhorn_disk_usage_bytes / longhorn_disk_capacity_bytes) * 100 > 70
      for: 5m
      labels:
        issue: The used storage of disk {{$labels.disk}} on node {{$labels.node}} is high.
        severity: warning
    - alert: LonghornNodeDown
      annotations:
        description: There are {{$value}} Longhorn nodes which have been offline for more than 5 minutes.
        summary: Longhorn nodes are offline
      expr: longhorn_node_count_total - (count(longhorn_node_status{condition="Ready"} == 1) OR on() vector(0))
      for: 5m
      labels:
        issue: There are {{$value}} Longhorn nodes offline
        severity: critical
    - alert: LonghornInstanceManagerCPUUsageWarning
      annotations:
        description: Longhorn instance manager {{$labels.instance_manager}} on {{$labels.node}} has CPU usage / CPU request at {{$value}}% for more than 5 minutes.
        summary: Longhorn instance manager {{$labels.instance_manager}} on {{$labels.node}} has CPU usage / CPU request over 300%.
      expr: (longhorn_instance_manager_cpu_usage_millicpu / longhorn_instance_manager_cpu_requests_millicpu) * 100 > 300
      for: 5m
      labels:
        issue: Longhorn instance manager {{$labels.instance_manager}} on {{$labels.node}} consumes 3 times the CPU request.
        severity: warning
    - alert: LonghornNodeCPUUsageWarning
      annotations:
        description: Longhorn node {{$labels.node}} has CPU usage / CPU capacity at {{$value}}% for more than 5 minutes.
        summary: Longhorn node {{$labels.node}} experiences high CPU pressure for more than 5m.
      expr: (longhorn_node_cpu_usage_millicpu / longhorn_node_cpu_capacity_millicpu) * 100 > 90
      for: 5m
      labels:
        issue: Longhorn node {{$labels.node}} experiences high CPU pressure.
        severity: warning
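The LonghornNodeDown expression subtracts the count of Ready nodes from the node total. In plain arithmetic, using the documented example of 4 nodes and made-up per-node Ready gauges:

```python
# Sketch of the LonghornNodeDown expression:
#   longhorn_node_count_total - count(longhorn_node_status{condition="Ready"} == 1)
node_count_total = 4            # example value from the metrics reference
ready_statuses = [1, 1, 0, 1]   # per-node Ready gauges (made up)

offline = node_count_total - sum(1 for s in ready_statuses if s == 1)
print(f"offline nodes: {offline}")  # 1
```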
See more information on how to define alert rules at https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/#alerting-rules