Lead the design, implementation, and operation of highly available and scalable infrastructure solutions to support our organization's applications and services.
Support services before they go live through activities such as system design consulting, capacity planning, and launch reviews.
Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
Scale systems sustainably through mechanisms like automation; evolve systems by pushing for changes that improve reliability and velocity.
Improve the monitoring, alerting, and resilience of systems.
Practice sustainable incident response and blameless postmortems.
What You’ll Need:
5-10 years of experience in a Site Reliability Engineering (SRE) role, with a proven track record of designing and implementing scalable and reliable infrastructure solutions.
Systematic problem-solving approach, coupled with effective communication skills and a sense of drive.
Experience in designing, analyzing, and troubleshooting microservices.
Understanding of monitoring, logging, and tracing systems, such as ELK, Prometheus, Grafana, and Jaeger, that help teams quickly detect problems.
It’d Be Great if You Have:
Linux and network administration skills for troubleshooting.
Familiarity with a cloud platform (AWS or Google Cloud) and Kubernetes.
Experience programming in Go or a similar language is an advantage.
Experience designing and managing MongoDB and MySQL databases is an advantage.
Knowledge of security and security testing is an advantage.
Skills
Logging
Application Services
Leadership
Troubleshooting
Prometheus
Problem Solving
Functions
Engineering
Job Overview
Job Type:
Hybrid
Company
LINE MAN Wongnai
Industry:
Consumer Goods, Retail & E-Commerce