This role is for one of Weekday's clients
Min Experience: 4 years
Location: Bengaluru
Job Type: Full-time
We are looking for a seasoned Site Reliability Engineer (SRE) to join our infrastructure team. As an SRE, you will play a critical role in ensuring the reliability, scalability, and performance of our systems, particularly across bare metal infrastructure and containerized environments. You will bridge the gap between software development and operations by applying a software engineering mindset to system administration problems. This role is ideal for someone passionate about automation, observability, infrastructure as code, and production excellence.
Requirements
Key Responsibilities:
- Design, build, and maintain scalable and reliable infrastructure across bare metal environments.
- Develop and manage containerized services using Docker and orchestrate them using Kubernetes.
- Leverage Terraform to implement and manage infrastructure as code, enabling consistent, repeatable deployments.
- Create, maintain, and improve monitoring, alerting, and visualization systems using Grafana and other observability tools.
- Collaborate closely with development teams to ensure new services are scalable, observable, and deployable.
- Automate routine operational tasks to improve efficiency and reduce the risk of human error.
- Troubleshoot complex production issues spanning applications, systems, networks, and services.
- Participate in incident management, root cause analysis, and postmortem reviews to continuously improve system reliability.
- Ensure high availability and performance of production systems and services.
Key Skills and Experience Required:
- 4–8 years of hands-on experience in site reliability, DevOps, or infrastructure engineering roles.
- Strong experience managing bare metal servers, including provisioning, configuration, and lifecycle management.
- Deep understanding of Docker containers and orchestration using Kubernetes, including managing multi-node clusters in production environments.
- Proficiency in using Terraform to build and manage infrastructure across cloud and on-prem environments.
- Hands-on experience with Grafana for monitoring and visualization, along with Prometheus or other metrics tools.
- Solid understanding of system internals (Linux), networking concepts, and distributed system patterns.
- Experience with CI/CD pipelines and automating deployment workflows.
- Proficiency in at least one scripting or programming language such as Python, Bash, or Go.
- Familiarity with logging, alerting, and tracing tools, and with the principles of observability.
- Strong problem-solving and analytical skills, with the ability to work independently and as part of a team.
Good to Have:
- Exposure to hybrid or multi-cloud environments.
- Experience with performance tuning and capacity planning.
- Background in security best practices for infrastructure.
- Familiarity with configuration management tools like Ansible, Chef, or Puppet.