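A Complete Guide to Code Instrumentation and Monitoring Systems
1. Monitoring System Overview
Business monitoring = code instrumentation + data collection + storage and computation + visualization and alerting
| Code instrumentation | Data collection | Storage and computation | Visualization and alerting |
|---|---|---|---|
| Application instrumentation | Prometheus | Time-series database | Grafana |
| Log instrumentation | Log agent | Elasticsearch | Alertmanager |
| Trace instrumentation | OpenTelemetry | Data lake | Alert channels |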
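2. The Code Instrumentation Stack in Detail
2.1 Types of Instrumentation
| Type | Purpose | Technology | Frequency |
|---|---|---|---|
| Metrics instrumentation | Monitor system state | Micrometer/Prometheus | Real time |
| Log instrumentation | Troubleshooting | SLF4J + Logback | On demand |
| Trace instrumentation | Performance analysis | OpenTelemetry/SkyWalking | Sampled |
| Event instrumentation | Behavior analysis | Message queue + big data | Real time |
2.2 Metrics Instrumentation: Full Implementation
2.2.1 Dependency Configuration
<!-- pom.xml -->
<!-- Versions are omitted on the assumption that the project inherits Spring Boot's
     dependency management, which already pins compatible Micrometer versions.
     Pinning a different version by hand (for example 1.10.5) can cause conflicts. -->
<dependencies>
  <!-- Micrometer core -->
  <dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-core</artifactId>
  </dependency>
  <!-- Prometheus registry -->
  <dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
  </dependency>
  <!-- Spring Boot auto-configuration support -->
  <dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
  </dependency>
</dependencies>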
2.2.2 Global Monitoring Configuration Class
import io.micrometer.core.instrument.MeterRegistry;
import java.lang.management.ManagementFactory;
import org.springframework.boot.actuate.autoconfigure.metrics.MeterRegistryCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class MonitoringConfig {
    /**
     * Common tags applied to every meter in the registry
     */
    @Bean
    public MeterRegistryCustomizer<MeterRegistry> metricsCommonTags() {
        return registry -> registry.config()
            .commonTags(
                "application", "order-service",
                "environment", envOrDefault("ENV", "dev"),
                "cluster", envOrDefault("CLUSTER", "default"),
                "instance", ManagementFactory.getRuntimeMXBean().getName()
            );
    }

    private static String envOrDefault(String name, String fallback) {
        String value = System.getenv(name);
        return value != null ? value : fallback;
    }

    // Prometheus endpoint: no servlet has to be registered by hand.
    // With spring-boot-starter-actuator and micrometer-registry-prometheus on the
    // classpath, the metrics are served at /actuator/prometheus once the endpoint
    // is exposed, e.g. management.endpoints.web.exposure.include=prometheus.
    // A hand-registered simpleclient MetricsServlet would not see Micrometer's meters.

    // JVM and system metrics (memory, GC, CPU, Logback) are bound automatically by
    // the actuator auto-configuration. Binding JvmGcMetrics and the others a second
    // time in a @PostConstruct method is unnecessary and can double-count GC events.
}
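2.3 Business Metrics Instrumentation in Practice
2.3.1 Complete Instrumentation Example for the Order Service
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.DistributionSummary;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
@Slf4j
public class OrderService {
    private final MeterRegistry meterRegistry;
    private final OrderRepository orderRepository;
    private final ProductRepository productRepository;
    // 1. Counters: total and successful calls. Failures are counted per error
    //    type in recordFailure(), because Counter.increment() cannot take tags.
    private final Counter orderCreateSuccessCounter;
    private final Counter orderCreateCounter;
    // 2. Distribution summary: distribution of a value (here, the order amount)
    private final DistributionSummary orderAmountSummary;
    // 3. Timer: method execution time
    private final Timer orderCreateTimer;
    // 4. Gauges: instantaneous values (here, stock per product)
    private final Map<Long, AtomicInteger> inventoryGauges = new ConcurrentHashMap<>();

    public OrderService(MeterRegistry meterRegistry,
                        OrderRepository orderRepository,
                        ProductRepository productRepository) {
        this.meterRegistry = meterRegistry;
        this.orderRepository = orderRepository;
        this.productRepository = productRepository;
        // Pre-register the meters that have a fixed tag set
        this.orderCreateSuccessCounter = Counter.builder("order.create.success")
            .description("Total number of successfully created orders")
            .tag("service", "order-service")
            .register(meterRegistry);
        this.orderCreateCounter = Counter.builder("order.create.total")
            .description("Total number of order creation attempts (success and failure)")
            .tag("service", "order-service")
            .register(meterRegistry);
        this.orderAmountSummary = DistributionSummary.builder("order.amount")
            .description("Distribution of order amounts")
            .baseUnit("CNY")
            .scale(1.0) // amounts are recorded in yuan
            .publishPercentiles(0.5, 0.95, 0.99) // 50th, 95th, 99th percentiles
            .register(meterRegistry);
        this.orderCreateTimer = Timer.builder("order.create.duration")
            .description("Order creation latency")
            .publishPercentiles(0.5, 0.95, 0.99)
            // serviceLevelObjectives replaces the deprecated sla(...)
            .serviceLevelObjectives(Duration.ofMillis(100), Duration.ofMillis(500),
                Duration.ofMillis(1000), Duration.ofMillis(3000))
            .register(meterRegistry);
    }

    /**
     * Create an order: complete instrumentation example
     */
    @Transactional
    public Order createOrder(CreateOrderRequest request) {
        // Start timing
        Timer.Sample sample = Timer.start(meterRegistry);
        orderCreateCounter.increment();
        try {
            // 1. Validate the request
            validateRequest(request);
            // 2. Check stock (also updates the inventory gauge)
            checkInventory(request.getProductId(), request.getQuantity());
            // 3. Create the order
            Order order = orderRepository.save(buildOrder(request));
            // 4. Deduct stock
            reduceInventory(request.getProductId(), request.getQuantity());
            // 5. Record the order amount
            orderAmountSummary.record(order.getTotalAmount().doubleValue());
            // 6. Count the success
            orderCreateSuccessCounter.increment();
            log.info("Order created, order number: {}", order.getOrderNo());
            return order;
        } catch (BusinessException e) {
            // Business exception: record the failure reason
            recordFailure("business", String.valueOf(e.getErrorCode()));
            throw e;
        } catch (Exception e) {
            // System exception
            recordFailure("system", e.getClass().getSimpleName());
            throw new OrderCreateException("Order creation failed", e);
        } finally {
            // Stop timing; stop() returns the elapsed time in nanoseconds
            long nanos = sample.stop(orderCreateTimer);
            log.debug("Order creation finished in {} ms", TimeUnit.NANOSECONDS.toMillis(nanos));
        }
    }

    /**
     * Count a failure. Both failure kinds use the same tag keys, because the
     * Prometheus registry expects one meter name to keep a consistent set of tag keys.
     */
    private void recordFailure(String errorType, String errorCode) {
        meterRegistry.counter("order.create.failure",
            "service", "order-service",
            "error_type", errorType,
            "error_code", errorCode).increment();
    }

    /**
     * Stock check with inventory monitoring
     */
    private void checkInventory(Long productId, Integer quantity) {
        Product product = productRepository.findById(productId)
            .orElseThrow(() -> new ProductNotFoundException(productId));
        // Record the current stock level (gauge)
        registerInventoryGauge(product);
        if (product.getStock() < quantity) {
            // Increment the counter that drives the out-of-stock alert.
            // Tagging by product is only safe while the catalog is small and bounded.
            meterRegistry.counter("inventory.insufficient",
                "product_id", String.valueOf(productId)).increment();
            throw new InsufficientStockException(
                String.format("Product %s is out of stock, remaining: %d",
                    product.getName(), product.getStock()));
        }
    }

    /**
     * Register an inventory gauge per product on first use, then update its value
     */
    private void registerInventoryGauge(Product product) {
        inventoryGauges.computeIfAbsent(product.getId(), id -> {
            AtomicInteger gauge = new AtomicInteger(product.getStock());
            Gauge.builder("product.inventory.current", gauge, AtomicInteger::get)
                .description("Current stock level of the product")
                .tag("product_id", String.valueOf(product.getId()))
                .tag("product_name", product.getName())
                .tag("category", product.getCategory())
                .register(meterRegistry);
            return gauge;
        }).set(product.getStock());
    }
}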
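The alert rules and dashboards later in this guide refer to names such as order_create_success_total, while the Java code registers order.create.success. The sketch below illustrates that mapping under simplifying assumptions: it only approximates the Prometheus naming convention Micrometer applies (dots become underscores, counters end in _total, timers end in _seconds), and the class and enum names are hypothetical.

```java
/** Simplified approximation of how Micrometer meter names appear in Prometheus. */
final class PrometheusNaming {
    enum MeterKind { COUNTER, TIMER, GAUGE }

    static String toPrometheusName(String meterName, MeterKind kind) {
        // Characters that are not valid in a Prometheus metric name become underscores.
        String name = meterName.replaceAll("[^a-zA-Z0-9_:]", "_");
        switch (kind) {
            case COUNTER:
                // Counters carry a _total suffix, added only if it is missing.
                return name.endsWith("_total") ? name : name + "_total";
            case TIMER:
                // Timers are exported in seconds.
                return name.endsWith("_seconds") ? name : name + "_seconds";
            default:
                return name;
        }
    }
}
```

Under these assumptions, order.create.total stays order_create_total rather than gaining a second suffix, and http.requests.duration becomes http_requests_duration_seconds, the base name of the _count, _sum and _bucket series.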
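2.4 Unified Instrumentation with AOP
2.4.1 Custom Monitoring Annotations
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

/**
 * Method monitoring annotation
 */
@Target(ElementType.METHOD)
@Retention(RetentionPolicy.RUNTIME)
public @interface MonitorMethod {
    /** Metric name */
    String name() default "";
    /** Metric description */
    String description() default "";
    /** Whether to record the arguments */
    boolean recordParams() default false;
    /** Whether to record the return value */
    boolean recordResult() default false;
    /** Whether to log exceptions */
    boolean recordException() default true;
    /** Custom tags as alternating key/value pairs, so the length must be even */
    String[] tags() default {};
}

/**
 * Business metric annotation
 */
@Target({ElementType.METHOD, ElementType.TYPE})
@Retention(RetentionPolicy.RUNTIME)
public @interface BusinessMetric {
    /** Business type */
    String businessType();
    /** Business subtype */
    String subType() default "";
    /** Whether to record QPS */
    boolean qps() default true;
    /** Whether to record latency */
    boolean duration() default true;
    /** Whether to record the error rate */
    boolean errorRate() default true;
}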
2.4.2 AOP Aspect Implementation
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import lombok.extern.slf4j.Slf4j;
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.aspectj.lang.reflect.MethodSignature;
import org.springframework.stereotype.Component;

@Aspect
@Component
@Slf4j
public class MonitoringAspect {
    private final MeterRegistry meterRegistry;
    private final ObjectMapper objectMapper;

    public MonitoringAspect(MeterRegistry meterRegistry, ObjectMapper objectMapper) {
        this.meterRegistry = meterRegistry;
        this.objectMapper = objectMapper;
    }

    /**
     * Method monitoring advice
     */
    @Around("@annotation(monitorMethod)")
    public Object monitorMethod(ProceedingJoinPoint joinPoint,
                                MonitorMethod monitorMethod) throws Throwable {
        String methodName = getMethodName(joinPoint);
        String metricName = monitorMethod.name().isEmpty() ?
            methodName : monitorMethod.name();
        String[] tags = monitorMethod.tags();
        // Start timing. A local variable is enough; a shared ThreadLocal would be
        // overwritten by nested monitored calls on the same thread.
        Timer.Sample sample = Timer.start(meterRegistry);
        try {
            // Count the call
            Counter.builder(metricName + ".call.total")
                .description(monitorMethod.description())
                .tags(tags)
                .register(meterRegistry)
                .increment();
            // Invoke the target method
            Object result = joinPoint.proceed();
            // Count the success
            meterRegistry.counter(metricName + ".call.success", tags).increment();
            // Record the return value if requested
            if (monitorMethod.recordResult() && result != null) {
                recordResultMetric(metricName, result);
            }
            return result;
        } catch (Throwable t) {
            // Count the failure
            Counter.builder(metricName + ".call.failure")
                .tags(tags)
                .tag("exception", t.getClass().getSimpleName())
                .register(meterRegistry)
                .increment();
            if (monitorMethod.recordException()) {
                log.error("Method {} failed", methodName, t);
            }
            throw t;
        } finally {
            // Stop timing
            sample.stop(Timer.builder(metricName + ".duration")
                .tags(tags)
                .publishPercentiles(0.5, 0.95, 0.99)
                .register(meterRegistry));
            // Record the arguments if requested
            if (monitorMethod.recordParams()) {
                recordParamsMetric(metricName, joinPoint.getArgs());
            }
        }
    }

    /**
     * Business metric advice
     */
    @Around("@annotation(businessMetric)")
    public Object businessMetric(ProceedingJoinPoint joinPoint,
                                 BusinessMetric businessMetric) throws Throwable {
        String type = businessMetric.businessType();
        String subType = businessMetric.subType();
        // Count requests; QPS is derived from this counter with rate() in PromQL
        if (businessMetric.qps()) {
            meterRegistry.counter("business.qps",
                "business_type", type, "sub_type", subType).increment();
        }
        Timer.Sample sample = businessMetric.duration() ? Timer.start(meterRegistry) : null;
        try {
            Object result = joinPoint.proceed();
            // Count the success
            meterRegistry.counter("business.success",
                "business_type", type, "sub_type", subType).increment();
            return result;
        } catch (Throwable t) {
            // Count the failure
            meterRegistry.counter("business.failure",
                "business_type", type, "sub_type", subType,
                "exception", t.getClass().getSimpleName()).increment();
            throw t;
        } finally {
            if (sample != null) {
                sample.stop(meterRegistry.timer("business.duration",
                    "business_type", type, "sub_type", subType));
            }
        }
    }

    /**
     * Controller monitoring advice
     */
    @Around("@within(org.springframework.web.bind.annotation.RestController) || " +
            "@within(org.springframework.stereotype.Controller)")
    public Object controllerMonitor(ProceedingJoinPoint joinPoint) throws Throwable {
        MethodSignature signature = (MethodSignature) joinPoint.getSignature();
        String controllerName = joinPoint.getTarget().getClass().getSimpleName();
        String methodName = signature.getMethod().getName();
        // HTTP request metrics. Counter has no tag(...) method, so the tags are
        // passed to meterRegistry.counter(name, tags...) as key/value pairs.
        meterRegistry.counter("http.requests.total",
            "controller", controllerName, "method", methodName).increment();
        Timer.Sample sample = Timer.start(meterRegistry);
        boolean success = false;
        try {
            Object result = joinPoint.proceed();
            success = true;
            meterRegistry.counter("http.requests.success",
                "controller", controllerName, "method", methodName).increment();
            return result;
        } catch (Throwable t) {
            meterRegistry.counter("http.requests.error",
                "controller", controllerName, "method", methodName,
                "exception", t.getClass().getSimpleName()).increment();
            throw t;
        } finally {
            sample.stop(Timer.builder("http.requests.duration")
                .tag("controller", controllerName)
                .tag("method", methodName)
                .publishPercentiles(0.5, 0.95, 0.99)
                // Needed for the http_requests_duration_seconds_bucket series
                // that histogram_quantile() queries rely on
                .publishPercentileHistogram()
                .register(meterRegistry));
            // Count completed requests
            if (success) {
                meterRegistry.counter("http.requests.complete",
                    "controller", controllerName, "method", methodName).increment();
            }
        }
    }

    // Helper methods
    private String getMethodName(ProceedingJoinPoint joinPoint) {
        return joinPoint.getSignature().getDeclaringTypeName() + "." +
            joinPoint.getSignature().getName();
    }

    private void recordResultMetric(String metricName, Object result) {
        try {
            String resultJson = objectMapper.writeValueAsString(result);
            // Could also be sent to a message queue instead of the log
            log.debug("Method {} returned: {}", metricName, resultJson);
        } catch (JsonProcessingException e) {
            log.warn("Failed to record return value", e);
        }
    }

    private void recordParamsMetric(String metricName, Object[] args) {
        if (args != null && args.length > 0) {
            try {
                String paramsJson = objectMapper.writeValueAsString(args);
                log.debug("Method {} arguments: {}", metricName, paramsJson);
            } catch (JsonProcessingException e) {
                log.warn("Failed to record arguments", e);
            }
        }
    }
}
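The around-advice pattern used by the aspect can be tried without Spring or Micrometer. The following is a minimal sketch under that assumption: CallMonitor and its methods are hypothetical names, plain LongAdder maps stand in for the meter registry, and only the total/success/failure/duration bookkeeping is reproduced.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

/** Plain-Java stand-in for an @Around advice: count, time, and rethrow. */
final class CallMonitor {
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();
    private final Map<String, LongAdder> totalNanos = new ConcurrentHashMap<>();

    <T> T around(String metricName, Supplier<T> body) {
        long start = System.nanoTime();
        increment(metricName + ".call.total");
        try {
            T result = body.get();
            increment(metricName + ".call.success");
            return result;
        } catch (RuntimeException e) {
            increment(metricName + ".call.failure");
            throw e; // the caller still sees the original exception
        } finally {
            // Runs for both outcomes, like sample.stop(...) in the aspect.
            totalNanos.computeIfAbsent(metricName, k -> new LongAdder())
                .add(System.nanoTime() - start);
        }
    }

    long count(String name) {
        LongAdder adder = counters.get(name);
        return adder == null ? 0 : adder.sum();
    }

    private void increment(String name) {
        counters.computeIfAbsent(name, k -> new LongAdder()).increment();
    }
}
```

For example, new CallMonitor().around("demo", () -> 42) returns 42 and leaves demo.call.total and demo.call.success at 1. Because the timing is held in a local variable, nested calls on the same thread do not interfere with each other.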
2.5 Usage Example
@RestController
@RequestMapping("/orders")
@Slf4j
public class OrderController {
    private final OrderService orderService;
    private final MeterRegistry meterRegistry;

    public OrderController(OrderService orderService, MeterRegistry meterRegistry) {
        this.orderService = orderService;
        this.meterRegistry = meterRegistry;
    }

    /**
     * Create an order
     */
    @PostMapping
    @BusinessMetric(businessType = "order", subType = "create",
                    qps = true, duration = true, errorRate = true)
    @MonitorMethod(name = "order.create.api",
                   description = "Order creation API",
                   recordParams = true,
                   recordException = true,
                   tags = {"api", "order"}) // one key/value pair: api=order
    public ResponseEntity<ApiResponse<OrderDTO>> createOrder(
            @Valid @RequestBody CreateOrderRequest request) {
        // Business logic
        Order order = orderService.createOrder(request);
        // Return the result
        return ResponseEntity.ok(ApiResponse.success(
            OrderDTO.fromEntity(order)));
    }

    /**
     * Query an order
     */
    @GetMapping("/{orderNo}")
    @BusinessMetric(businessType = "order", subType = "query")
    public ResponseEntity<ApiResponse<OrderDTO>> getOrder(
            @PathVariable String orderNo,
            @RequestHeader(value = "X-User-Id") Long userId) {
        // Record the user action. The user ID is deliberately not used as a tag:
        // it is a high-cardinality value (see section 4.3).
        meterRegistry.counter("user.behavior.query",
            "action", "query_order").increment();
        Order order = orderService.findByOrderNo(orderNo);
        // Check permission
        if (!order.getUserId().equals(userId)) {
            throw new AccessDeniedException("Not allowed to access this order");
        }
        return ResponseEntity.ok(ApiResponse.success(
            OrderDTO.fromEntity(order)));
    }
}
3. Monitoring System Configuration
3.1 Prometheus Configuration
# prometheus.yml
global:
  scrape_interval: 15s      # scrape interval
  evaluation_interval: 15s  # rule evaluation interval
scrape_configs:
  # Application metrics (with Micrometer this endpoint also carries the JVM
  # and business metrics registered in the same MeterRegistry)
  - job_name: 'order-service'
    metrics_path: '/actuator/prometheus'
    scrape_interval: 10s
    static_configs:
      - targets:
          - 'order-service-1:8080'
          - 'order-service-2:8080'
        labels:
          service: 'order-service'
          env: 'production'
  # JVM metrics (only needed if they are exposed on a separate port)
  - job_name: 'jvm'
    static_configs:
      - targets:
          - 'order-service-1:8081'
          - 'order-service-2:8081'
  # Custom business metrics (only needed if they are exposed on a separate port)
  - job_name: 'business-metrics'
    static_configs:
      - targets:
          - 'order-service-1:9090'  # custom metrics port
  # Infrastructure monitoring
  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - 'node-exporter:9100'
  # Database monitoring
  - job_name: 'mysql-exporter'
    static_configs:
      - targets:
          - 'mysql-exporter:9104'
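Each scrape of a target returns plain text with one sample per line, in the form name{label="value",...} value. To make that concrete, here is a minimal sketch that renders a single sample line. ExpositionLine is a hypothetical helper written for illustration, not part of Prometheus or Micrometer, and it leaves out the # HELP and # TYPE lines as well as timestamps.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Renders one sample in the Prometheus text exposition style. */
final class ExpositionLine {
    static String render(String name, Map<String, String> labels, double value) {
        // Sort labels so the output is stable regardless of map ordering.
        String labelPart = new TreeMap<>(labels).entrySet().stream()
            .map(e -> e.getKey() + "=\"" + escape(e.getValue()) + "\"")
            .collect(Collectors.joining(","));
        return name + (labelPart.isEmpty() ? "" : "{" + labelPart + "}") + " " + value;
    }

    /** Label values escape backslash, double quote, and newline. */
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }
}
```

The labels added under static_configs (service, env) are attached by Prometheus at scrape time, so they appear on the stored series even though the application never emits them.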
3.2 Alert Rule Configuration
# alert-rules.yml
groups:
  - name: business-alerts
    rules:
      # Order success rate alert
      - alert: OrderSuccessRateLow
        expr: |
          # Success rate = successes / (successes + failures).
          # Each side is summed first: the failure series carry extra labels,
          # so adding the raw rate() vectors would match nothing.
          (
            sum(rate(order_create_success_total[5m]))
            /
            (sum(rate(order_create_success_total[5m])) + sum(rate(order_create_failure_total[5m])))
          ) * 100 < 95
        for: 2m
        labels:
          severity: warning
          service: order-service
        annotations:
          summary: "Order success rate is low"
          description: |
            The order success rate of order-service is {{ $value | printf "%.2f" }}%,
            below the 95% threshold. Failed orders in the last 5 minutes:
            {{ with query "sum(increase(order_create_failure_total[5m]))" }}{{ . | first | value | humanize }}{{ end }}
      # Payment success rate alert
      - alert: PaymentSuccessRateLow
        expr: |
          (
            sum(rate(payment_success_total[5m]))
            /
            (sum(rate(payment_success_total[5m])) + sum(rate(payment_failure_total[5m])))
          ) * 100 < 98
        for: 1m
        labels:
          severity: critical
          service: payment-service
        annotations:
          summary: "Payment success rate is low"
          description: "Payment success rate is below 98%; check the payment channels immediately"
      # Low stock warning
      - alert: ProductInventoryLow
        expr: |
          product_inventory_current < 10
        for: 0m
        labels:
          severity: warning
          service: product-service
        annotations:
          summary: "Product stock is low"
          description: |
            Product {{ $labels.product_name }} (ID: {{ $labels.product_id }})
            has only {{ $value }} left in stock; please restock soon
      # API response time alert
      - alert: APIResponseTimeHigh
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_requests_duration_seconds_bucket[5m])) by (le, controller, method)
          ) > 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "API response time is too high"
          description: |
            The 95th percentile response time of {{ $labels.controller }}.{{ $labels.method }}
            exceeds 1 second, current value: {{ $value | printf "%.3f" }}s
      # Business exception spike alert
      - alert: BusinessExceptionSpike
        expr: |
          sum(rate(order_create_failure_total{error_type="business"}[2m]))
          >
          5 * sum(rate(order_create_failure_total{error_type="business"}[10m]))
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Spike in business exceptions"
          description: "The 2-minute rate of order creation business exceptions is more than 5x the 10-minute rate"
      # System capacity warning
      - alert: SystemCapacityWarning
        expr: |
          # CPU usage
          (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)) * 100 > 80
          or
          # Memory usage
          (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
          or
          # JVM old generation usage
          jvm_memory_used_bytes{area="heap", id="G1 Old Gen"} /
          jvm_memory_max_bytes{area="heap", id="G1 Old Gen"} * 100 > 85
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "System resource usage is too high"
          description: |
            {{ $labels.instance }} resource usage is too high.
            Value of the condition that fired (CPU, memory, or JVM old generation):
            {{ $value | printf "%.1f" }}%
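The success-rate expressions above combine two ideas: rate() turns an ever-growing counter into a per-second rate, and the ratio of summed rates gives a percentage that is compared with a threshold. The sketch below works through that arithmetic under simplifying assumptions: it uses only two samples per counter and ignores the counter resets and extrapolation that the real rate() function handles. SuccessRate is a hypothetical class name.

```java
/** Simplified arithmetic behind a success-rate alert expression. */
final class SuccessRate {
    /** Per-second increase of a counter between two samples taken `seconds` apart. */
    static double rate(double earlier, double later, double seconds) {
        return (later - earlier) / seconds;
    }

    /** successes / (successes + failures) * 100, or NaN when there was no traffic. */
    static double successPercent(double successRate, double failureRate) {
        double total = successRate + failureRate;
        return total == 0 ? Double.NaN : successRate / total * 100.0;
    }

    /** The alert condition; NaN compares false, so no traffic means no alert. */
    static boolean shouldAlert(double percent, double thresholdPercent) {
        return percent < thresholdPercent;
    }
}
```

As a worked example, if the success counter grows from 1000 to 1285 and the failure counter from 10 to 25 over 300 seconds, the rates are 0.95/s and 0.05/s and the success rate is 95%, which does not satisfy "< 95", so the alert stays silent.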
3.3 Alertmanager Configuration
# alertmanager.yml
global:
  smtp_smarthost: 'smtp.qq.com:587'
  smtp_from: 'monitor@yourcompany.com'
  smtp_auth_username: 'monitor@yourcompany.com'
  smtp_auth_password: 'password'
  slack_api_url: 'https://hooks.slack.com/services/xxx'
route:
  group_by: ['alertname', 'severity', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'slack-notifications'
  routes:
    # Critical alerts trigger a phone call, then matching continues
    - match:
        severity: critical
      receiver: 'phone-call'
      continue: true
    # Business alerts go to WeCom (WeChat Work)
    - match:
        service: order-service
      receiver: 'wechat-work'
    # Infrastructure alerts go to email
    - match:
        severity: warning
      receiver: 'email'
receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#monitoring-alerts'
        title: '{{ .GroupLabels.alertname }}'
        # Double quotes so that \n is a real line break
        text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"
  - name: 'phone-call'
    webhook_configs:
      - url: 'http://phone-call-service/alert'
        send_resolved: true
  - name: 'wechat-work'
    wechat_configs:
      - corp_id: 'your-corp-id'
        agent_id: '1000002'
        api_secret: 'your-secret'
        to_user: '@all'
        message: "{{ range .Alerts }}Alert: {{ .Annotations.summary }}\nDescription: {{ .Annotations.description }}\n{{ end }}"
  - name: 'email'
    email_configs:
      - to: 'devops@yourcompany.com'
        # The subject is set through headers and the body through text (or html)
        headers:
          Subject: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert name: {{ .Labels.alertname }}
          Severity: {{ .Labels.severity }}
          Service: {{ .Labels.service }}
          Instance: {{ .Labels.instance }}
          Summary: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Started at: {{ .StartsAt }}
          {{ end }}
inhibit_rules:
  # While a critical alert is firing, suppress warning alerts
  # that share the same service and instance
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['service', 'instance']
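The effect of continue: true in the routing tree is easy to misread, so here is a small sketch of the matching order. It is a simplified model written for illustration: it only handles one level of child routes with exact label matches, and AlertRouter, Route, and exampleRoutes are hypothetical names that mirror the configuration above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Simplified model of sibling route matching with the `continue` flag. */
final class AlertRouter {
    static final class Route {
        final Map<String, String> match;
        final String receiver;
        final boolean continueMatching;

        Route(Map<String, String> match, String receiver, boolean continueMatching) {
            this.match = match;
            this.receiver = receiver;
            this.continueMatching = continueMatching;
        }

        boolean matches(Map<String, String> labels) {
            return match.entrySet().stream()
                .allMatch(e -> e.getValue().equals(labels.get(e.getKey())));
        }
    }

    /** Child routes are tried in order; matching stops at the first hit without `continue`. */
    static List<String> receiversFor(Map<String, String> labels,
                                     List<Route> routes, String defaultReceiver) {
        List<String> receivers = new ArrayList<>();
        for (Route route : routes) {
            if (route.matches(labels)) {
                receivers.add(route.receiver);
                if (!route.continueMatching) {
                    break;
                }
            }
        }
        // No child matched: the parent route's receiver handles the alert.
        if (receivers.isEmpty()) {
            receivers.add(defaultReceiver);
        }
        return receivers;
    }

    /** The three child routes from the configuration above. */
    static List<Route> exampleRoutes() {
        return List.of(
            new Route(Map.of("severity", "critical"), "phone-call", true),
            new Route(Map.of("service", "order-service"), "wechat-work", false),
            new Route(Map.of("severity", "warning"), "email", false));
    }
}
```

In this model a critical alert from order-service reaches both phone-call and wechat-work, while a warning from order-service stops at wechat-work and never reaches email, because that route does not set continue.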
3.4 Grafana Dashboard Configuration
{
  "dashboard": {
    "title": "Order Service Monitoring Dashboard",
    "panels": [
      {
        "title": "Order success rate",
        "type": "stat",
        "targets": [{
          "expr": "sum(rate(order_create_success_total[5m])) / sum(rate(order_create_total[5m])) * 100",
          "legendFormat": "Success rate"
        }],
        "thresholds": {
          "steps": [
            {"color": "red", "value": null},
            {"color": "yellow", "value": 95},
            {"color": "green", "value": 98}
          ]
        }
      },
      {
        "title": "Order creation QPS",
        "type": "graph",
        "targets": [{
          "expr": "sum(rate(order_create_total[1m]))",
          "legendFormat": "Total QPS"
        }, {
          "expr": "sum(rate(order_create_success_total[1m]))",
          "legendFormat": "Success QPS"
        }, {
          "expr": "sum(rate(order_create_failure_total[1m]))",
          "legendFormat": "Failure QPS"
        }]
      },
      {
        "title": "API response time percentiles",
        "type": "timeseries",
        "targets": [{
          "expr": "histogram_quantile(0.99, sum(rate(http_requests_duration_seconds_bucket[5m])) by (le))",
          "legendFormat": "P99"
        }, {
          "expr": "histogram_quantile(0.95, sum(rate(http_requests_duration_seconds_bucket[5m])) by (le))",
          "legendFormat": "P95"
        }, {
          "expr": "histogram_quantile(0.50, sum(rate(http_requests_duration_seconds_bucket[5m])) by (le))",
          "legendFormat": "P50"
        }]
      },
      {
        "title": "Error type distribution",
        "type": "piechart",
        "targets": [{
          "expr": "sum(rate(order_create_failure_total[5m])) by (error_type, error_code)",
          "legendFormat": "{{error_type}} - {{error_code}}"
        }]
      }
    ],
    "refresh": "10s",
    "time": {
      "from": "now-1h",
      "to": "now"
    }
  }
}
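The P50/P95/P99 queries in the dashboard depend on histogram_quantile(), which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The following sketch shows that idea in simplified form. HistogramQuantile is a hypothetical class, the input is assumed to be sorted upper bounds whose last entry is +Inf together with cumulative counts, and several special cases of the real PromQL function are left out.

```java
/** Simplified linear-interpolation quantile over cumulative histogram buckets. */
final class HistogramQuantile {
    /**
     * @param q                quantile in [0, 1]
     * @param upperBounds      sorted bucket upper bounds (the `le` label); last must be +Inf
     * @param cumulativeCounts observations with value <= the matching upper bound
     */
    static double quantile(double q, double[] upperBounds, double[] cumulativeCounts) {
        int last = upperBounds.length - 1;
        double total = cumulativeCounts[last];
        if (last < 1 || total == 0) {
            return Double.NaN;
        }
        double rank = q * total;
        // Find the first bucket whose cumulative count reaches the rank.
        int i = 0;
        while (i < last && cumulativeCounts[i] < rank) {
            i++;
        }
        if (i == last) {
            // The rank falls into the +Inf bucket: report the highest finite bound.
            return upperBounds[last - 1];
        }
        double start = i == 0 ? 0.0 : upperBounds[i - 1];
        double previous = i == 0 ? 0.0 : cumulativeCounts[i - 1];
        double inBucket = cumulativeCounts[i] - previous;
        if (inBucket == 0) {
            return start;
        }
        // Assume observations are spread evenly inside the bucket.
        return start + (upperBounds[i] - start) * (rank - previous) / inBucket;
    }
}
```

With bounds 0.1, 0.5, 1.0, +Inf and cumulative counts 50, 90, 100, 100, the 95th percentile lands halfway through the 0.5 to 1.0 bucket, giving 0.75 seconds. The estimate can never be more precise than the bucket layout, which is why bucket boundaries should bracket the alert thresholds.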
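4. Best Practices and Caveats
4.1 Instrumentation Design Principles
- Define the monitoring goal: every instrumentation point needs a clear purpose
- Avoid over-instrumentation: instrument only critical business paths and core metrics
- Tag design: tag values must be enumerable; avoid high-cardinality tags such as user IDs
- Metric naming: follow one naming convention, <metric_type>.<service>.<metric_name>
- Documentation: maintain instrumentation docs that explain each metric's meaning and alert threshold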
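4.2 Performance Considerations
// Performance optimization examples
public class OptimizedMonitoring {
    // 1. Use pre-created meters instead of building them on every call
    private static final Counter PRECREATED_COUNTER =
        Counter.builder("precreated.counter").register(Metrics.globalRegistry);
    private static final Timer BATCH_TIMER =
        Timer.builder("batch.process.duration").register(Metrics.globalRegistry);

    // 2. Update metrics in batches
    public void batchOperation(List<Order> orders) {
        Timer.Sample sample = Timer.start();
        // Process the whole batch
        processBatch(orders);
        sample.stop(BATCH_TIMER);
        // Increment once by the batch size instead of once per record
        PRECREATED_COUNTER.increment(orders.size());
    }

    // 3. Record metrics asynchronously (requires @EnableAsync and a Spring-managed bean).
    //    Worth doing only when producing the measurement is itself expensive;
    //    recording into a meter is cheap.
    @Async
    public void asyncRecordMetric(String metricName) {
        Metrics.timer(metricName).record(() -> {
            // Expensive operation
            heavyCalculation();
        });
    }
}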
4.3 Common Pitfalls and Solutions
- Tag cardinality explosion
    // Wrong: user ID as a tag
    Counter.builder("user.action")
        .tag("user_id", userId) // every user creates a new time series!
        .register(registry);
    // Right: use a user group
    Counter.builder("user.action")
        .tag("user_group", getUserGroup(userId)) // enumerable groups
        .register(registry);
- Memory leaks
    // Dynamically created gauges must be cleaned up by hand
    private final Map<String, AtomicInteger> gauges = new ConcurrentHashMap<>();
    public void removeGauge(String key) {
        AtomicInteger gauge = gauges.remove(key);
        if (gauge != null) {
            // Remove it from the registry
            Metrics.globalRegistry.remove(new Meter.Id(
                "dynamic.gauge", Tags.of("key", key), null, null, Meter.Type.GAUGE));
        }
    }
- Sampling rate control
    // Sample methods that are called at high frequency
    public void highFrequencyMethod() {
        // Time only 1% of the calls
        if (ThreadLocalRandom.current().nextDouble() < 0.01) {
            Timer.Sample sample = Timer.start();
            try {
                doWork();
            } finally {
                sample.stop(highFrequencyTimer);
            }
        } else {
            doWork();
        }
    }
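The cardinality pitfall can be put into numbers: in the worst case a metric produces one time series per combination of tag values, so the count is the product of the distinct values of each tag. The sketch below estimates that product and shows one possible way to fold user IDs into a bounded set of groups. TagCardinality and groupOf are hypothetical names, and hash bucketing is only one assumption for what getUserGroup could do; grouping by membership tier or region would serve the same purpose.

```java
import java.util.Map;

/** Worst-case series estimate and a bounded replacement for a user ID tag. */
final class TagCardinality {
    /** Product of the distinct-value counts of all tags on one metric. */
    static long worstCaseSeries(Map<String, Integer> distinctValuesPerTag) {
        long series = 1;
        for (int distinct : distinctValuesPerTag.values()) {
            series = Math.multiplyExact(series, (long) distinct);
        }
        return series;
    }

    /** Maps any user ID onto one of `buckets` stable group names. */
    static String groupOf(long userId, int buckets) {
        if (buckets <= 0) {
            throw new IllegalArgumentException("buckets must be positive");
        }
        return "bucket-" + Math.floorMod(Long.hashCode(userId), buckets);
    }
}
```

For example, tags with 20 controllers, 5 methods and 10 exception types give at most 1,000 series, while adding a user_id tag with a million users multiplies that to a billion. Replacing it with 16 groups caps the total at 16,000.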
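4.4 Alert Severity Levels
Alert levels:
P0 (fatal): affects core business, must be handled immediately
  - Payment success rate < 90%
  - Primary database node down
  - Response time > 10 seconds
P1 (severe): affects user experience, must be handled the same day
  - Order success rate < 95%
  - Error rate of key APIs > 5%
  - Memory usage > 90%
P2 (warning): potential risk, needs attention
  - Disk usage > 80%
  - Growth in business exceptions
  - Stock below the safety line
P3 (notice): informational notifications
  - Scheduled job completed
  - System startup/shutdown
  - Configuration change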
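4.5 Monitoring Data Lifecycle Management
-- Data retention policy
-- Raw data: kept for 30 days (detailed analysis)
-- 1-hour aggregates: kept for 90 days (trend analysis)
-- 1-day aggregates: kept for 1 year (long-term trends)
-- Business metrics: kept permanently (business analysis)
-- Example cleanup script (PostgreSQL-style interval syntax)
DELETE FROM metrics
WHERE timestamp < NOW() - INTERVAL '30 days'
  AND granularity = 'raw';
DELETE FROM metrics
WHERE timestamp < NOW() - INTERVAL '90 days'
  AND granularity = 'hourly';
DELETE FROM metrics
WHERE timestamp < NOW() - INTERVAL '1 year'
  AND granularity = 'daily';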
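The retention tiers assume that raw samples are rolled up into hourly aggregates before the raw rows are deleted. A minimal sketch of such a roll-up follows, assuming samples arrive as parallel arrays of epoch-second timestamps and values and that the average is the aggregate of interest. Downsampler is a hypothetical class; a production job would typically also keep min, max, and count.

```java
import java.util.Map;
import java.util.TreeMap;

/** Rolls raw samples up into per-hour averages keyed by the hour's start time. */
final class Downsampler {
    static Map<Long, Double> hourlyAverages(long[] epochSeconds, double[] values) {
        // hour start -> {sum, count}
        Map<Long, double[]> accumulators = new TreeMap<>();
        for (int i = 0; i < epochSeconds.length; i++) {
            long hourStart = Math.floorDiv(epochSeconds[i], 3600L) * 3600L;
            double[] sumAndCount = accumulators.computeIfAbsent(hourStart, k -> new double[2]);
            sumAndCount[0] += values[i];
            sumAndCount[1] += 1;
        }
        Map<Long, Double> averages = new TreeMap<>();
        accumulators.forEach((hour, sc) -> averages.put(hour, sc[0] / sc[1]));
        return averages;
    }
}
```

Averaging suits gauges such as stock levels. For counters it is usually better to store the increase per hour, since the average of a cumulative value has no direct meaning.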
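5. Summary
5.1 Key Success Factors
- Business-driven: metrics must serve business goals
- Actionable: alerts must include clear handling guidance
- Sustainable: the monitoring system itself must be monitored
- Cost-aware: balance monitoring coverage against storage cost
- Continuously improved: review and tune the monitoring strategy regularly
5.2 Recommended Rollout Path
Phase 1: infrastructure monitoring (CPU, memory, disk)
Phase 2: application performance monitoring (response time, error rate)
Phase 3: business metrics monitoring (success rate, business volume)
Phase 4: user experience monitoring (end-to-end performance, user behavior)
Phase 5: intelligent alerting and prediction
With the code instrumentation and monitoring configuration described above, you can build monitoring that spans the code level through the business level, detect and resolve problems early, and keep the system stable as the business grows.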