Robust Error Handling and Post-Build Actions
Table of Contents
- Overview
- Failure Taxonomy for CI/CD
- Design Principles
- Pipeline Error-Handling Strategy
- Retry Patterns That Do Not Hide Real Problems
- Fail-Fast vs Fail-Safe Decisions
- Jenkins Patterns
- Post-Build Action Framework
- Artifact and Log Retention
- Notifications and Incident Routing
- Recovery and Rollback Hooks
- Observability and Reporting
- Reference Jenkinsfile
- Common Anti-Patterns
- Implementation Checklist
- Conclusion
Overview
Most pipeline outages are not caused by one failing command. They come from weak failure classification, blind retries, and missing post-build cleanup.
Robust CI/CD design means:
- Classifying failures clearly.
- Applying controlled recovery.
- Running deterministic post-build actions every time.
Failure Taxonomy for CI/CD
Use categories that map to specific responses:
- Transient infrastructure failures: agent disconnects, registry timeouts, temporary DNS issues.
- Deterministic code/test failures: unit test failures, lint violations, compile errors.
- Dependency/system failures: upstream package registry outage, database unavailable.
- Policy/security failures: secrets scan failure, SAST policy block, unsigned artifact.
- Deployment/runtime verification failures: failing canary health checks, error budget burn spikes.
Without this taxonomy, pipelines tend to either retry everything or fail without pointing to a remediation.
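One way to make the taxonomy actionable is a small classifier that a pipeline calls when a step fails. The sketch below is only an illustration: the function name `classifyFailure`, the class labels, and every regular expression are assumptions that would need tuning against real logs. It expects the caller to pass captured output or an error message as a string.

```groovy
// Hypothetical helper for a Jenkinsfile or shared library.
// Maps an error message to one of the failure classes listed above.
// Labels and patterns are illustrative assumptions, not a complete rule set.
@NonCPS
String classifyFailure(String message) {
    // Ordered from most specific to most general; the first match wins.
    def rules = [
        ['policy-security',          /(?i)(secret.*detected|policy violation|unsigned artifact)/],
        ['deployment-verification',  /(?i)(canary|health check failed)/],
        ['dependency-system',        /(?i)(503 service unavailable|registry unavailable|database unavailable)/],
        ['transient-infrastructure', /(?i)(agent.*disconnected|timed out|temporary failure in name resolution)/],
        ['deterministic-code-test',  /(?i)(tests? failed|lint (error|violation)|compilation (error|failed))/],
    ]
    for (rule in rules) {
        // A Matcher in boolean context is true when the pattern is found anywhere in the text.
        if ((message ?: '') =~ rule[1]) {
            return rule[0]
        }
    }
    return 'unclassified'
}
```

Keeping an explicit `unclassified` bucket is a deliberate choice in this sketch: a growing share of unclassified failures signals that the rules need attention.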
Design Principles
- Fail early for deterministic correctness issues.
- Retry only transient failures with strict caps.
- Always run cleanup and evidence-collection steps.
- Keep post-build side effects idempotent.
- Preserve forensic data for debugging and audit.
Pipeline Error-Handling Strategy
Recommended flow:
- Validate inputs and environment quickly.
- Execute quality gates in order of cost and signal.
- Stop immediately on non-recoverable errors.
- Retry transient operations with exponential backoff.
- In the `post` phase, always run cleanup, publish reports, and emit status.
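The first steps of this flow can be sketched as a declarative pipeline that validates input before running any gate, then orders gates from cheapest to most expensive. The `RELEASE_TAG` parameter and its `v1.2.3` format are assumptions made for illustration; the `make` targets mirror those used elsewhere in this guide.

```groovy
pipeline {
    agent any
    parameters {
        // RELEASE_TAG is a hypothetical parameter used only for illustration.
        string(name: 'RELEASE_TAG', defaultValue: '', description: 'Tag to build, for example v1.4.2')
    }
    stages {
        stage('Validate') {
            steps {
                script {
                    // Cheapest check first: reject malformed input before any gate runs.
                    if (!(params.RELEASE_TAG ==~ /v\d+\.\d+\.\d+/)) {
                        error "RELEASE_TAG must look like v1.2.3, got '${params.RELEASE_TAG}'"
                    }
                }
            }
        }
        // Gates ordered by cost and signal: fast checks before slower ones.
        stage('Lint')      { steps { sh 'make lint' } }
        stage('Unit Test') { steps { sh 'make test' } }
    }
}
```

Because `error` fails the stage immediately, the later stages are skipped and no compute is spent on a run that could never succeed.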
Retry Patterns That Do Not Hide Real Problems
Good retry guardrails:
- Retry count is low (usually 2-3).
- Backoff increases per attempt.
- Retry only on known transient codes/messages.
- Log each retry reason explicitly.
- Escalate after retry exhaustion.
Bad pattern:
- Wrapping entire pipeline in a generic retry block.
This hides deterministic regressions and wastes compute.
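The built-in `retry(n)` step retries on any failure and has no backoff of its own, so the guardrails above need a little scripting. The helper below is a sketch under stated assumptions: `shWithBackoff` is a hypothetical name, the caller supplies the exit codes it considers transient, and the registry URL is a placeholder. The exit codes in the usage example are curl's codes for DNS resolution failure (6), connection failure (7), and timeout (28).

```groovy
// Hypothetical helper: runs a shell command and retries only when the exit
// code is in the caller-supplied list of transient codes.
def shWithBackoff(String command, List<Integer> transientCodes,
                  int maxAttempts = 3, int baseDelaySeconds = 5) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
        int status = sh(script: command, returnStatus: true)
        if (status == 0) {
            return
        }
        boolean retryable = transientCodes.contains(status) && attempt < maxAttempts
        if (!retryable) {
            // Non-transient code or retries exhausted: fail loudly so the caller can escalate.
            error "Exit code ${status} on attempt ${attempt}/${maxAttempts}: ${command}"
        }
        int delay = baseDelaySeconds * (1 << (attempt - 1))   // 5s, 10s, 20s, ...
        echo "Transient exit code ${status} on attempt ${attempt}; retrying in ${delay}s"
        sleep time: delay, unit: 'SECONDS'
    }
}

pipeline {
    agent any
    stages {
        stage('Fetch Dependency') {
            steps {
                script {
                    // Placeholder URL; retry only on curl's DNS, connect, and timeout codes.
                    shWithBackoff('curl --fail --silent --show-error -O https://registry.example.com/pkg.tgz', [6, 7, 28])
                }
            }
        }
    }
}
```

Matching on exit codes rather than on the exception message is intentional here: when `sh` fails, the error it raises reports the exit code but not the command's output.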
Fail-Fast vs Fail-Safe Decisions
Fail-fast stages:
- Lint and static checks.
- Unit tests and schema validation.
- Security policy gates for release branches.
Fail-safe behaviors:
- Preserve artifacts/logs when build fails.
- Revoke temporary credentials.
- Ensure workspace cleanup.
- Publish final status to chat/ticketing.
Jenkins Patterns
Practical Jenkins controls:
- Use `options { timeout(...) }` to bound hung runs.
- Use `retry(n)` only around known flaky network actions.
- Use `catchError` for non-critical reporting steps.
- Use `post { always { ... } }` for mandatory cleanup.
- Use `archiveArtifacts` and `junit` in all outcomes where useful.
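As an illustration of the `catchError` control, the sketch below wraps a non-critical reporting step so that its failure is visible without failing the build. The `upload-coverage` target is a hypothetical example of such a step, and the timeout value is a placeholder.

```groovy
pipeline {
    agent any
    options {
        // Bound the whole run so a hung step cannot hold an executor indefinitely.
        timeout(time: 30, unit: 'MINUTES')
    }
    stages {
        stage('Unit Test') {
            steps { sh 'make test' }
        }
        stage('Upload Coverage') {
            steps {
                // Non-critical: a failed upload marks this stage UNSTABLE
                // but leaves the overall build result unchanged.
                catchError(buildResult: 'SUCCESS', stageResult: 'UNSTABLE') {
                    sh 'make upload-coverage'   // hypothetical target
                }
            }
        }
    }
}
```

Reserving `catchError` for steps like this keeps it from masking failures in quality gates, which should still stop the pipeline.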
Post-Build Action Framework
Every pipeline should define post-build actions by severity and outcome.
Always:
- Cleanup temporary files and secrets.
- Archive logs and test reports.
- Emit build metadata (commit SHA, image digest, duration).
On failure:
- Attach failure classification.
- Route alert to owning team.
- Link runbook and recent similar incidents.
On success:
- Publish release evidence.
- Update deployment ledger/change record.
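A minimal sketch of this framework in declarative syntax follows. It assumes the Git plugin has populated `GIT_COMMIT`, and the file name `build-metadata.txt` is an arbitrary choice; the image digest is left out because how it is obtained depends on the build tooling. The `echo` lines stand in for real routing and publishing steps.

```groovy
pipeline {
    agent any
    stages {
        stage('Build') { steps { sh 'make build' } }
    }
    post {
        always {
            // Evidence first: record metadata and archive before anything is deleted.
            writeFile file: 'build-metadata.txt',
                      text: "commit=${env.GIT_COMMIT}\nresult=${currentBuild.currentResult}\nduration=${currentBuild.durationString}\n"
            archiveArtifacts artifacts: 'build-metadata.txt,reports/**,logs/**', allowEmptyArchive: true
        }
        failure {
            echo 'Build failed: attach failure class, alert the owning team, link the runbook.'
        }
        success {
            echo 'Build succeeded: publish release evidence and update the change record.'
        }
        cleanup {
            // Runs after all other post conditions, whatever the outcome.
            sh 'rm -rf .tmp'
        }
    }
}
```

Putting deletion in `cleanup` rather than `always` is the design choice worth noting: it guarantees that archiving and notification have already run before temporary files disappear.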
Artifact and Log Retention
Retention should be policy-driven:
- Keep release artifacts longer than non-release artifacts.
- Keep security scan outputs with auditable retention windows.
- Delete workspace-sensitive intermediates aggressively.
- Keep enough logs for triage, not unlimited noisy data.
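In Jenkins, a baseline retention policy can be expressed with the `buildDiscarder` option. The numbers below are placeholders, not recommendations; a real policy would set them per job type so that release jobs keep artifacts longer than others.

```groovy
pipeline {
    agent any
    options {
        // Placeholder values: keep build records longer than the artifacts attached to them.
        buildDiscarder(logRotator(
            numToKeepStr: '50',          // build records (logs, metadata) to keep
            daysToKeepStr: '30',         // maximum age of build records
            artifactNumToKeepStr: '10',  // builds whose artifacts are kept
            artifactDaysToKeepStr: '14'  // maximum age of kept artifacts
        ))
    }
    stages {
        stage('Build') { steps { sh 'make build' } }
    }
}
```

This option only governs what Jenkins itself stores. Audit-grade retention for security scan outputs usually belongs in external storage with its own lifecycle rules.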
Notifications and Incident Routing
Notification quality matters more than volume.
Include:
- Pipeline name, stage, and failure class.
- First failing command or gate.
- Last successful run reference.
- Suggested immediate action and runbook link.
Avoid:
- Sending all failures to all channels.
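The sketch below assembles a failure notification containing the fields listed above and sends it to a single owner. It assumes the `mail` step is available and SMTP is configured on the controller. `OWNER_EMAIL`, `RUNBOOK_URL`, `FAILED_STAGE`, and `FAILURE_CLASS` are hypothetical names introduced for this example, and the failure class is hard-coded per stage rather than derived from logs.

```groovy
pipeline {
    agent any
    environment {
        OWNER_EMAIL = 'team-owner@example.com'            // hypothetical address
        RUNBOOK_URL = 'https://runbooks.example.com/ci'   // hypothetical URL
    }
    stages {
        stage('Unit Test') {
            steps { sh 'make test' }
            post {
                failure {
                    script {
                        // Record where and how the run failed for the final notification.
                        env.FAILED_STAGE = env.STAGE_NAME
                        env.FAILURE_CLASS = 'deterministic-code-test'
                    }
                }
            }
        }
    }
    post {
        failure {
            script {
                def lastGood = currentBuild.previousSuccessfulBuild
                def lines = [
                    "Pipeline: ${env.JOB_NAME} #${env.BUILD_NUMBER}",
                    "Failed stage: ${env.FAILED_STAGE ?: 'unknown'}",
                    "Failure class: ${env.FAILURE_CLASS ?: 'unclassified'}",
                    "Last successful run: ${lastGood ? lastGood.absoluteUrl : 'none on record'}",
                    "Console log: ${env.BUILD_URL}console",
                    "Runbook: ${env.RUNBOOK_URL}",
                ]
                // One owner, one message: avoids broadcasting every failure to every channel.
                mail to: env.OWNER_EMAIL,
                     subject: "[${env.FAILURE_CLASS ?: 'unclassified'}] ${env.JOB_NAME} failed",
                     body: lines.join('\n')
            }
        }
    }
}
```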
Recovery and Rollback Hooks
For deployment pipelines, post-build should include rollback readiness:
- Store last-known-good artifact reference.
- Trigger automated rollback for failed canary verification.
- Reconcile environment state after rollback.
- Record rollback reason and blast radius.
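A stage-level `post` block is one place to attach these hooks. In the sketch below, every `make` target and both file names are hypothetical. A real setup would keep the last-known-good reference in durable storage rather than in the workspace, and would add the state reconciliation step that is omitted here.

```groovy
pipeline {
    agent any
    stages {
        stage('Deploy Canary') {
            steps { sh 'make deploy-canary' }   // hypothetical target
        }
        stage('Verify Canary') {
            steps { sh 'make verify-canary' }   // hypothetical target
            post {
                failure {
                    // Roll back to the recorded last-known-good artifact reference.
                    sh 'make rollback ARTIFACT="$(cat last-known-good.txt)"'
                    echo "Rollback triggered: canary verification failed in build ${env.BUILD_NUMBER}"
                }
                success {
                    // Promote the verified artifact reference for future rollbacks.
                    sh 'cp current-artifact.txt last-known-good.txt'
                }
            }
        }
    }
}
```

In this sketch the rollback fires only when verification itself fails. A failure in `Deploy Canary` skips the verification stage, so it would need its own handling.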
Observability and Reporting
Track metrics that improve reliability:
- Failure rate by stage and category.
- Retry rate and retry success ratio.
- Mean time to detect and resolve pipeline failures.
- Post-build action success rate.
- Flake rate for tests and external dependencies.
Use this data to fix systemic noise, not just individual runs.
Reference Jenkinsfile
```groovy
pipeline {
    agent any

    options {
        // Bound the whole run so a hung step cannot hold an executor indefinitely.
        timeout(time: 45, unit: 'MINUTES')
        timestamps()
    }

    stages {
        stage('Lint') {
            steps {
                sh 'make lint'
            }
        }

        stage('Unit Test') {
            steps {
                sh 'make test'
            }
            post {
                always {
                    // Publish test results even when the tests fail.
                    junit testResults: 'reports/junit/*.xml', allowEmptyResults: true
                }
            }
        }

        stage('Build Artifact') {
            steps {
                // Bounded retry; only appropriate if the build's failures are transient (e.g. network fetches).
                retry(2) {
                    sh 'make build'
                }
            }
        }

        stage('Publish') {
            steps {
                // Bounded retry; assumes publishing is idempotent.
                retry(2) {
                    sh 'make publish'
                }
            }
        }
    }

    post {
        always {
            // Preserve evidence first, then clean up temporary files.
            archiveArtifacts artifacts: 'reports/**,dist/**,logs/**', allowEmptyArchive: true
            sh 'rm -rf .tmp || true'
        }
        success {
            echo 'Build success: release evidence published.'
        }
        failure {
            echo 'Build failed: classify error and notify owner.'
        }
    }
}
```
Common Anti-Patterns
- Retrying non-idempotent deployment steps without safeguards.
- Skipping report publication on failure.
- Keeping long-lived credentials in environment variables.
- Running without timeouts, which leaves executors stuck on hung builds.
- Using manual cleanup that fails silently and leaves residue.
Implementation Checklist
- Define failure classes and route each to a clear action.
- Add bounded retries only to transient operations.
- Add mandatory `post { always { ... } }` cleanup/reporting.
- Improve alerts with ownership, context, and runbook links.
- Track failure/retry metrics and remove top reliability bottlenecks.
Conclusion
Robust pipelines are built on deterministic behavior under both success and failure. If you classify failures well, apply targeted retries, and enforce post-build actions every run, delivery becomes faster and more reliable over time.