We're looking for a Senior Site Reliability Engineer to join our Infrastructure team. This engineer will enable our developers to work efficiently as they build a vibrant ecosystem for the Avalanche Blockchain. You'll help teams across several business units and engineering groups design, optimize, and implement greenfield technology for a variety of use cases. This role will play a key part in our release schedule and production monitoring.
WHAT YOU WILL DO
- Develop and optimize highly reliable and scalable infrastructure grounded in SRE principles.
- Implement and maintain monitoring, logging, and tracing tools to gain insights into service behavior and health.
- Uphold SLOs (Service Level Objectives), SLIs (Service Level Indicators), and error budgets for critical systems.
- Enhance the reliability and resiliency of critical systems by identifying single points of failure and implementing best practices.
- Collaborate with software developers to build reliability and performance into applications from inception.
- Automate and streamline incident management processes to minimize service disruption and improve response times.
- Participate in on-call rotations, ensuring quick restoration of services and fostering a blameless post-mortem culture.
- Foster a continuous improvement mindset by analyzing and learning from incidents and implementing preventive measures.
- Leverage cloud technologies and infrastructure-as-code (IaC) tools to ensure scalability and repeatability.
- Advocate for best practices in reliability, security, and maintainability within the team.