Role, Responsibilities And Key Accountabilities
- Leads investigation and resolution of complex incidents escalated to the team. Runs post-incident review sessions and implements fixes and improvements.
- Leads service transition activities, including establishing metrics to track performance, setting up monitoring, updating runbooks, executing Game Day/OAT, and training the support team.
- Maintains services once they are live by measuring and monitoring availability, latency, and overall system health.
- Scales systems sustainably through mechanisms like automation and observability, evolving systems by advocating for changes that improve reliability and velocity.
- Maintains scalable and efficient CI/CD pipelines for application enhancements and fixes.
- Conducts regular capacity and FinOps reviews based on usage trends and growth projections.
- Develops disaster recovery (DR) plans and conducts regular DR testing to validate recovery procedures and identify areas for improvement.
- Ensures application compliance with regulatory and security requirements.
- Provides mentorship to other team members on handling availability and performance of critical services, on building automation to prevent problem recurrence, and on building automated responses for non-exceptional service conditions.
- Proactively builds and applies relevant domain knowledge, such as workflows, data pipelines, business policies, configurations, and constraints.
- Coordinates access management for security principals and triages security issues.
Qualifications And Experience
- Degree in Computer Science, Software Engineering, Electronics/Electrical Engineering, or equivalent.
- 10+ years of experience working as a site reliability engineer or DevOps engineer responsible for application availability and reliability, implementing automation, and optimizing system performance.
- Extensive hands-on experience with Azure services, preferably Microsoft Fabric and Purview.
- Familiarity with infrastructure-as-code tools such as Terraform and Azure Resource Manager.
- Scripting and automation skills using Python, PowerShell, or other languages.
- Strong knowledge of the ITIL framework and best practices for incident, change, configuration, and problem management.
- Good understanding of REST APIs.
- Excellent English communication skills. Must be able to work with stakeholders located globally.
- Excellent troubleshooting skills and ability to analyze complex issues.