As a Senior Site Reliability Engineer, you will be responsible for the reliability, scalability, and performance of our systems and services. You will lead the design, implementation, and maintenance of our infrastructure and automation, and strengthen reliability engineering practices across the organization.
Responsibilities:
- Lead the design, implementation, and maintenance of highly available, scalable, and secure infrastructure solutions to support our products and services.
- Improve the automation tools and processes used for deployment, monitoring, and incident response to increase operational efficiency and reliability.
- Collaborate closely with cross-functional teams to define and implement best practices for reliability, performance, and scalability, ensuring alignment with business objectives.
- Conduct performance analysis, capacity planning, and system tuning to optimize resource utilization and meet performance targets.
- Define and implement monitoring, alerting, and logging solutions to proactively identify and address potential issues, minimizing downtime and service disruptions.
- Lead post-incident reviews and root cause analyses, and drive the resulting improvements to prevent incidents from recurring.
- Stay current with industry trends, emerging technologies, and best practices in site reliability engineering, and apply them to improve our systems and processes.
- Mentor and coach junior engineers, fostering a culture of learning, collaboration, and excellence within the team.