Robust Error Handling and Post-Build Actions

Most pipeline outages are not caused by one failing command. They come from weak failure classification, blind retries, and missing post-build cleanup.

Robust CI/CD design means:

  1. Classifying failures clearly.
  2. Applying controlled recovery.
  3. Running deterministic post-build actions every time.

Classify failures into categories that each map to a specific response:

  1. Transient infrastructure failures:
    • Agent disconnects, registry timeouts, temporary DNS issues.
  2. Deterministic code/test failures:
    • Unit test failures, lint violations, build compile errors.
  3. Dependency/system failures:
    • Upstream package registry outage, database unavailable.
  4. Policy/security failures:
    • Secrets scan failure, SAST policy block, unsigned artifact.
  5. Deployment/runtime verification failures:
    • Canary health checks fail, error budget burn spikes.

Without this taxonomy, pipelines either retry everything or fail with no remediation.
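
As a rough sketch, a helper in a Jenkins shared library (or a script { } block) can map raw error text onto these classes before deciding whether to retry, stop, or escalate. The function name, match patterns, and class labels below are assumptions; adapt them to your own log output:

// Hypothetical shared-library helper; patterns and labels are illustrative.
def classifyFailure(String message) {
    def text = (message ?: '').toLowerCase()
    if (text.contains('timed out') || text.contains('connection reset') || text.contains('name resolution')) {
        return 'transient-infrastructure'
    }
    if (text.contains('test failed') || text.contains('lint') || text.contains('compilation error')) {
        return 'deterministic-code'
    }
    if (text.contains('registry unavailable') || text.contains('connection refused')) {
        return 'dependency-system'
    }
    if (text.contains('secret detected') || text.contains('policy violation') || text.contains('unsigned artifact')) {
        return 'policy-security'
    }
    if (text.contains('canary') || text.contains('health check failed')) {
        return 'deployment-verification'
    }
    return 'unclassified'   // unknown failures should escalate, never auto-retry
}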

Core handling principles:

  1. Fail early for deterministic correctness issues.
  2. Retry only transient failures with strict caps.
  3. Always run cleanup and evidence-collection steps.
  4. Keep post-build side effects idempotent.
  5. Preserve forensic data for debugging and audit.

Recommended flow:

  1. Validate inputs and environment quickly.
  2. Execute quality gates in order of cost and signal.
  3. Stop immediately on non-recoverable errors.
  4. Retry transient operations with exponential backoff.
  5. In post phase, always run cleanup, publish reports, and emit status.

Retry Patterns That Do Not Hide Real Problems

Good retry guardrails:

  1. Retry count is low (usually 2-3).
  2. Backoff increases per attempt.
  3. Retry only on known transient codes/messages.
  4. Log each retry reason explicitly.
  5. Escalate after retry exhaustion.
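
A minimal sketch of such a guarded retry, written for a script { } block or shared library; the helper name, the transient patterns, and the delay values are assumptions:

// Hypothetical helper: retry only known transient errors, with exponential backoff.
def retryTransient(int maxAttempts, Closure body) {
    def transientPatterns = ['timed out', 'connection reset', '503 service unavailable']
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            body()
            return
        } catch (Exception e) {
            def msg = (e.message ?: '').toLowerCase()
            boolean isTransient = false
            for (p in transientPatterns) {
                if (msg.contains(p)) { isTransient = true }
            }
            if (!isTransient || attempt == maxAttempts) {
                echo "Attempt ${attempt}: not retryable or retries exhausted (${e.message})"
                throw e   // escalate instead of hiding the failure
            }
            int delaySeconds = 10 * (2 ** (attempt - 1))   // 10s, 20s, 40s...
            echo "Attempt ${attempt} hit a transient error (${e.message}); retrying in ${delaySeconds}s"
            sleep time: delaySeconds, unit: 'SECONDS'
        }
    }
}

// Usage: retryTransient(3) { sh 'docker push registry.example.com/app:latest' }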

Bad pattern:

  1. Wrapping the entire pipeline in a generic retry block.

This hides deterministic regressions and wastes compute.

Fail-fast stages:

  1. Lint and static checks.
  2. Unit tests and schema validation.
  3. Security policy gates for release branches.

Fail-safe behaviors:

  1. Preserve artifacts/logs when the build fails.
  2. Revoke temporary credentials.
  3. Ensure workspace cleanup.
  4. Publish final status to chat/ticketing.
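
A hedged sketch of a post section covering these behaviors; the revocation and notification scripts are placeholders for whatever your environment provides:

post {
    unsuccessful {
        // Publish the final status to chat/ticketing; the script name is a placeholder.
        sh "./scripts/notify-owners.sh 'FAILED ${env.JOB_NAME} #${env.BUILD_NUMBER}'"
    }
    always {
        // Preserve evidence even when the build fails.
        archiveArtifacts artifacts: 'logs/**', allowEmptyArchive: true
    }
    cleanup {
        // Runs after the other post conditions: revoke short-lived credentials
        // (hypothetical script), then remove the workspace.
        sh './scripts/revoke-temp-credentials.sh || true'
        deleteDir()
    }
}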

Practical Jenkins controls:

  1. Use options { timeout(...) } to bound hung runs.
  2. Use retry(n) only around known flaky network actions.
  3. Use catchError for non-critical reporting steps (see the sketch after this list).
  4. Use post { always { ... } } for mandatory cleanup.
  5. Use archiveArtifacts and junit across all outcomes where the results are useful.
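
For example, a non-critical reporting step can be wrapped in catchError so its failure marks the stage but does not fail the build; the make target here is a placeholder:

stage('Publish Coverage Report') {
    steps {
        // Non-critical: record the stage as failed but let the pipeline continue.
        catchError(buildResult: 'SUCCESS', stageResult: 'FAILURE') {
            sh 'make coverage-report'
        }
    }
}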

Every pipeline should define post-build actions by severity and outcome.

Always:

  1. Clean up temporary files and secrets.
  2. Archive logs and test reports.
  3. Emit build metadata (commit SHA, image digest, duration).
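
A small sketch of the metadata step as a post fragment; the file name and fields are assumptions, and an image digest would come from whatever your build step records:

post {
    always {
        script {
            // Write minimal, machine-readable build metadata alongside the logs.
            writeFile file: 'build-metadata.txt',
                      text: "commit=${env.GIT_COMMIT}\n" +
                            "job=${env.JOB_NAME}\n" +
                            "build=${env.BUILD_NUMBER}\n" +
                            "result=${currentBuild.currentResult}\n" +
                            "duration=${currentBuild.durationString}\n"
            archiveArtifacts artifacts: 'build-metadata.txt'
        }
    }
}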

On failure:

  1. Attach failure classification.
  2. Route alert to owning team.
  3. Link runbook and recent similar incidents.

On success:

  1. Publish release evidence.
  2. Update deployment ledger/change record.

Retention should be policy-driven:

  1. Keep release artifacts longer than non-release artifacts.
  2. Keep security scan outputs with auditable retention windows.
  3. Delete sensitive workspace intermediates aggressively.
  4. Keep enough logs for triage, not unlimited noisy data.
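
In Jenkins this often maps to a buildDiscarder policy; the counts below are illustrative, with release jobs keeping artifacts longer:

options {
    // Keep logs for the last 30 builds but artifacts only for the last 10;
    // a release job would use larger retention values.
    buildDiscarder(logRotator(numToKeepStr: '30', artifactNumToKeepStr: '10'))
}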

Notification quality matters more than volume.

Include:

  1. Pipeline name, stage, and failure class.
  2. First failing command or gate.
  3. Last successful run reference.
  4. Suggested immediate action and runbook link.
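
A minimal sketch using the core mail step; the recipient, runbook URL, and classification source are assumptions:

failure {
    script {
        // Route a single, context-rich alert to the owning team.
        def lastGood = currentBuild.previousSuccessfulBuild?.absoluteUrl ?: 'none'
        mail to: 'team-owners@example.com',
             subject: "FAILED: ${env.JOB_NAME} #${env.BUILD_NUMBER}",
             body: "Failure class: see archived classification/log\n" +
                   "Last successful run: ${lastGood}\n" +
                   "Runbook: https://example.com/runbooks/ci-failures"
    }
}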

Avoid:

  1. Sending all failures to all channels.

For deployment pipelines, post-build should include rollback readiness:

  1. Store last-known-good artifact reference.
  2. Trigger automated rollback for failed canary verification.
  3. Reconcile environment state after rollback.
  4. Record rollback reason and blast radius.
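
A hedged sketch of that logic inside a deployment stage; the make targets and the last-known-good marker file are assumptions:

stage('Canary Verification') {
    steps {
        script {
            try {
                sh 'make verify-canary'
                // Record the verified revision as last-known-good for future rollbacks.
                writeFile file: 'last-known-good.txt', text: "${env.GIT_COMMIT}\n"
                archiveArtifacts artifacts: 'last-known-good.txt'
            } catch (err) {
                echo "Canary verification failed: ${err.message}; rolling back"
                sh 'make rollback'
                error 'Rolled back after failed canary verification; see rollback record'
            }
        }
    }
}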

Track metrics that improve reliability:

  1. Failure rate by stage and category.
  2. Retry rate and retry success ratio.
  3. Mean time to detect and resolve pipeline failures.
  4. Post-build action success rate.
  5. Flake rate for tests and external dependencies.

Use this data to fix systemic noise, not just individual runs.
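
One lightweight way to collect these numbers, assuming a Prometheus Pushgateway is available, is to push a counter from the post block; the metric name and gateway address are placeholders:

post {
    always {
        // Push per-run outcome data so failure and retry rates can be graphed per job.
        sh "echo 'ci_pipeline_runs_total{result=\"${currentBuild.currentResult}\"} 1' | curl --silent --data-binary @- http://pushgateway.example.com:9091/metrics/job/${env.JOB_NAME}"
    }
}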

A complete Jenkinsfile tying these controls together: bounded timeouts, targeted retries, and mandatory post-build archiving and cleanup.

pipeline {
    agent any
    options {
        // Bound hung runs and timestamp log lines for easier triage.
        timeout(time: 45, unit: 'MINUTES')
        timestamps()
    }
    stages {
        stage('Lint') {
            steps {
                sh 'make lint'
            }
        }
        stage('Unit Test') {
            steps {
                sh 'make test'
            }
            post {
                always {
                    junit testResults: 'reports/junit/*.xml', allowEmptyResults: true
                }
            }
        }
        stage('Build Artifact') {
            steps {
                // Bounded retry: build and publish touch external registries.
                retry(2) {
                    sh 'make build'
                }
            }
        }
        stage('Publish') {
            steps {
                retry(2) {
                    sh 'make publish'
                }
            }
        }
    }
    post {
        always {
            // Mandatory evidence collection and cleanup, regardless of outcome.
            archiveArtifacts artifacts: 'reports/**,dist/**,logs/**', allowEmptyArchive: true
            sh 'rm -rf .tmp || true'
        }
        success {
            echo 'Build success: release evidence published.'
        }
        failure {
            echo 'Build failed: classify error and notify owner.'
        }
    }
}

Common anti-patterns:

  1. Retrying non-idempotent deployment steps without safeguards.
  2. Skipping report publication on failure.
  3. Keeping long-lived credentials in environment variables.
  4. No timeout settings, causing stuck executors.
  5. Using manual cleanup that fails silently and leaves residue.

A practical adoption checklist:

  1. Define failure classes and route each to a clear action.
  2. Add bounded retries only to transient operations.
  3. Add mandatory post { always { ... } } cleanup/reporting.
  4. Improve alerts with ownership, context, and runbook links.
  5. Track failure/retry metrics and remove top reliability bottlenecks.

Robust pipelines are built on deterministic behavior under both success and failure. If you classify failures well, apply targeted retries, and enforce post-build actions every run, delivery becomes faster and more reliable over time.