Role Overview:
We are looking for a proactive, detail-oriented Site Reliability Engineer (SRE) to support the reliability and performance of our systems. You will be responsible for monitoring, incident response, and continuous improvement of our infrastructure and services.
Key Responsibilities:
- Perform day-to-day operations monitoring and incident management.
- Respond to and drive resolution for P1/P2 incidents.
- Implement observability and monitoring tools to ensure system health.
- Support internal and external users with performance monitoring, troubleshooting, and root cause analysis.
- Build and maintain dashboards for operational and user metrics using Grafana.
- Contribute to automation efforts to reduce manual tasks and improve incident response.
- Participate in testing during the Handover to Support process.
- Assist with cloud administration across Azure, WCNP, and Edge environments.
- Collaborate with cross-functional teams to enhance system stability and reliability.
- Troubleshoot and resolve software issues efficiently.
- Embrace and apply SRE best practices to improve platform resilience.
Required Skills:
- Solid understanding of SRE methodologies.
- Experience with Java, MVC Pattern, JDBC, RESTful APIs, and Spring Boot.
- Familiarity with Azure Cloud and Python scripting.
- Knowledge of observability tools.
- Proficiency with ServiceNow, Slack/Teams, xMatters, and Grafana.
- Strong analytical and communication skills.