Monitoring

What to watch in production.

Metrics

The SDK does not expose Prometheus or OpenTelemetry metrics directly in v0.1; for now it emits tracing data only. To collect metrics, wrap the Axum router with your own metrics middleware:

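use axum::{routing::get, Router};
use axum_prometheus::PrometheusMetricLayer;

// `pair()` returns the Tower layer plus a handle that renders the metrics as text.
let (metrics_layer, metric_handle) = PrometheusMetricLayer::pair();

let app = Router::new()
    .nest("/v1/auth", routes::auth::router())
    // A layer only wraps routes registered before it, so /v1/auth is measured...
    .layer(metrics_layer)
    // ...while /metrics, added afterwards, is not.
    .route("/metrics", get(|| async move { metric_handle.render() }));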

This gives you the basics: per-endpoint latency, request count, and error rate.

Key metrics to alert on

| Signal | Why it matters |
| --- | --- |
| POST /v1/auth/login p95 > 500ms | Argon2id is CPU-heavy; spikes suggest Argon2 contention. |
| POST /v1/auth/login 401 rate > 10% sustained | Credential-stuffing campaign. |
| POST /v1/auth/mfa/challenge 429 rate | Brute-force attempt against a specific user. |
| POST /v1/auth/password/forgot rate vs baseline | Phishing-style reset spam. |
| POST /v1/auth/refresh 401 rate > 1% | Sessions expiring before refresh; check access_expiry_secs vs your client retry policy. |
| 5xx rate on any endpoint > 0.1% | Database / cache / email transport failure. |
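
To make "sustained" concrete, here is a minimal sketch of how the login 401-rate signal could be evaluated over consecutive scrape windows. The `Window` type, the function name, and the window length are assumptions for illustration only; they are not part of the SDK, and in practice this logic would normally live in your alerting system rather than in application code.

```rust
/// Request counts for one scrape window (for example, one minute).
/// Hypothetical type for illustration; not part of the SDK.
#[derive(Clone, Copy)]
pub struct Window {
    pub total: u64,
    pub unauthorized: u64, // responses with status 401
}

/// Returns true when the 401 ratio exceeded `threshold` in each of the
/// last `sustain` windows. A window with no traffic counts as "not exceeded",
/// and fewer than `sustain` windows never fires.
pub fn sustained_401_alert(windows: &[Window], threshold: f64, sustain: usize) -> bool {
    if sustain == 0 || windows.len() < sustain {
        return false;
    }
    windows[windows.len() - sustain..]
        .iter()
        .all(|w| w.total > 0 && (w.unauthorized as f64 / w.total as f64) > threshold)
}
```

Calling `sustained_401_alert(&windows, 0.10, 5)` with one-minute windows would fire only after five consecutive minutes above 10%, which keeps a single noisy minute from paging anyone.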

Logs

The SDK emits structured logs via tracing. Recommended filter for production:

RUST_LOG=info,identsphere=debug,sqlx=warn

Key log targets:

  • identsphere_axum::routes::* — per-handler debug logs.
  • identsphere_axum::middleware::auth_middleware — JWT validation + session-cache hits.
  • identsphere_core::services::audit — audit-pipeline diagnostics.
  • IdentSphere::invite — invite-email send failures.

Audit-table monitoring

Query audit_logs for suspicious patterns:

-- Failed-login bursts per IP in the last hour
SELECT ip_address, COUNT(*) AS attempts
FROM IdentSphere.audit_logs
WHERE action = 'auth.login.failed'
  AND created_at > now() - interval '1 hour'
GROUP BY ip_address
ORDER BY attempts DESC
LIMIT 20;

-- Successful logins from a never-before-seen IP for each user
SELECT actor_id, ip_address, MIN(created_at) AS first_seen
FROM IdentSphere.audit_logs
WHERE action = 'auth.login' AND status = 'success'
GROUP BY actor_id, ip_address
HAVING MIN(created_at) > now() - interval '1 day';

-- MFA disables
SELECT *
FROM IdentSphere.audit_logs
WHERE action = 'auth.mfa.disabled'
  AND created_at > now() - interval '1 day';

Send these to a SIEM (Splunk, ELK, Datadog) for real-time alerting.
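
If audit rows are also consumed in-process (for example by a small alerting worker), the failed-login burst query can be approximated in plain Rust. This is a sketch under assumptions: `AuditRow` is a hypothetical stand-in that models only the columns the query touches, and timestamps are simplified to Unix seconds.

```rust
use std::collections::HashMap;

/// Minimal stand-in for a row of `audit_logs`; only the columns used by the
/// failed-login query are modelled. Hypothetical type, not the SDK's.
pub struct AuditRow {
    pub action: String,
    pub ip_address: String,
    pub created_at: u64, // Unix seconds
}

/// Counts `auth.login.failed` rows per IP that are newer than
/// `now - window_secs` and returns the `limit` busiest IPs, highest first.
pub fn failed_login_bursts(
    rows: &[AuditRow],
    now: u64,
    window_secs: u64,
    limit: usize,
) -> Vec<(String, usize)> {
    let cutoff = now.saturating_sub(window_secs);
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for row in rows {
        if row.action == "auth.login.failed" && row.created_at > cutoff {
            *counts.entry(row.ip_address.as_str()).or_insert(0) += 1;
        }
    }
    let mut ranked: Vec<(String, usize)> = counts
        .into_iter()
        .map(|(ip, n)| (ip.to_string(), n))
        .collect();
    // Sort by count descending, then by IP so ties come out in a stable order.
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    ranked.truncate(limit);
    ranked
}
```

Passing `window_secs = 3600` and `limit = 20` mirrors the `interval '1 hour'` and `LIMIT 20` of the SQL version.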

Health endpoints

The SDK doesn't ship a /health endpoint by default. Add your own that runs SELECT 1 (or an equivalent ping) against the DB and pings the session cache:

use axum::{extract::State, http::StatusCode};

// Returns 200 only when both the database and the session cache respond.
async fn health(State(state): State<AppState>) -> StatusCode {
    let db_ok = state.db.ping().await.is_ok();
    let cache_ok = state.session_cache.get("health").await.is_ok();
    if db_ok && cache_ok { StatusCode::OK } else { StatusCode::SERVICE_UNAVAILABLE }
}

Wire it into your Kubernetes / load-balancer health checks.

Tracing

OpenTelemetry export is available through tracing-opentelemetry:

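use opentelemetry::trace::TracerProvider; // only needed if you call `.tracer(..)`, see note
use tracing_subscriber::layer::SubscriberExt;
use tracing_subscriber::util::SubscriberInitExt; // brings `.init()` into scope

// Build an OTLP/gRPC pipeline that exports spans in batches on the Tokio runtime.
// Note (version-dependent): some opentelemetry-otlp releases return a tracer provider
// here rather than a tracer; in that case pass `provider.tracer("...")` to `with_tracer`.
let tracer = opentelemetry_otlp::new_pipeline()
    .tracing()
    .with_exporter(opentelemetry_otlp::new_exporter().tonic())
    .install_batch(opentelemetry_sdk::runtime::Tokio)?;

// Log to stdout and export spans to the collector at the same time.
tracing_subscriber::registry()
    .with(tracing_subscriber::fmt::layer())
    .with(tracing_opentelemetry::layer().with_tracer(tracer))
    .init();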

Now every Axum request produces a span with its route, status, and latency.

Backup verification

A backup you haven't restored is a backup that doesn't exist. Quarterly:

  1. Spin up a fresh Postgres instance.
  2. Restore the most recent backup.
  3. Run identsphere migrate status — it should report "up to date."
  4. Run a SELECT against IdentSphere.users to confirm the row count matches production.
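
The row-count check in step 4 is easy to automate once the counts have been collected. The sketch below assumes you already have per-table counts from both sides as maps; how they are gathered is left open, and the function name is hypothetical. Because a live production table keeps changing after the backup is taken, it is usually more meaningful to compare against counts recorded at backup time.

```rust
use std::collections::BTreeMap;

/// Compares per-table row counts from production and from the restored
/// instance. Returns one message per mismatch; an empty list means every
/// table listed for production was restored with the same count.
pub fn row_count_mismatches(
    production: &BTreeMap<String, u64>,
    restored: &BTreeMap<String, u64>,
) -> Vec<String> {
    let mut problems = Vec::new();
    // BTreeMap iterates in key order, so messages come out sorted by table name.
    for (table, &expected) in production {
        match restored.get(table) {
            Some(&actual) if actual == expected => {}
            Some(&actual) => problems.push(format!(
                "{table}: expected {expected} rows, restored {actual}"
            )),
            None => problems.push(format!("{table}: missing from restored instance")),
        }
    }
    problems
}
```

A quarterly job could fail loudly whenever the returned list is non-empty, which turns the manual check into a regression test for the backup process.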