Overall Objectives of Job: (If multiple sections, accord weightage to each section)
- Proven experience in an SRE or infrastructure engineering role with a focus on monitoring, automation, and orchestration.
- Good understanding of the Networking and Security domain, with the ability to critically analyse infrastructure designs and propose innovative improvements to enhance performance, reliability, stability, and security.
- Strong Linux administration skills.
- Expertise in monitoring tools (Prometheus, ELK, Grafana, etc.) with the ability to optimize monitoring systems and integrate ML/AI models to improve visibility, anomaly detection, and proactive issue resolution.
- Extensive hands-on experience with automation tools such as Terraform, Ansible, and Jenkins, along with proficiency in CI/CD pipelines, to efficiently streamline and optimize network operations and workflows.
- Proficiency in scripting languages (Bash, Python, Go).
- Proficiency with containerization and orchestration (Docker, Kubernetes).
- Understanding of cloud platforms such as AWS, Azure, or Google Cloud.
- Familiarity with microservices architecture and distributed systems.
Weightage: 100%
Duties and Responsibilities
List in order of importance and state approximate weightage accorded to each.
Work closely with developers, QA, and operations teams to foster a DevOps culture focused on security, reliability, and automation.
Monitoring & Alerting:
- Design, implement, and manage comprehensive monitoring solutions using tools like Prometheus, Grafana, ELK stack, etc.
- Develop and maintain alerting systems that proactively provide insights into system health and performance.
- Integrate ML/Gen AI models for anomaly detection, trend analysis, and proactive alerts to enhance observability.
- Identify and implement innovative features to improve visibility into system performance and reliability.
- Define and track SLIs, SLOs, and SLAs for critical services and ensure continuous compliance.
Automation & Infrastructure Management:
- Automate infrastructure provisioning and management using tools such as Ansible or Terraform to eliminate manual intervention.
- Build and maintain CI/CD pipelines (GitLab CI) to streamline deployments and ensure system consistency.
- Implement automated testing and validation processes for infrastructure and applications.
Weightage: 30%
Orchestration & Infrastructure as Code:
- Leverage containerization and orchestration technologies (Docker, Kubernetes) to manage scalable, resilient, and fault-tolerant services.
- Use Infrastructure as Code (IaC) to automate and standardize environment provisioning and configuration management.
Weightage: 20%
Networking & Security:
- Review network designs and propose enhancements using emerging technologies and industry best practices for efficiency and innovation.
- Ensure the security and compliance of infrastructure by implementing best practices in network security, including encryption, firewall management, access controls, and intrusion detection.
- Perform regular security audits and vulnerability assessments to identify and mitigate risks.
- Monitor network traffic and optimize performance through network tuning and troubleshooting.
Weightage: 20%
Reliability Engineering:
- Develop high-availability and disaster recovery solutions for mission-critical services.
- Conduct postmortems for major incidents, perform root cause analysis, and implement preventive measures.
- Collaborate with development teams to optimize applications for performance and security.
- Continuously improve operational processes by identifying bottlenecks, automating workflows, and enhancing security measures.
Weightage: 30%
Qualification, Experience, Technical and Functional Skills
- Candidate with 10+ years of experience.
- Strong knowledge of the Networking and Security domain, with the ability to critically analyse network designs and propose innovative improvements to enhance performance, reliability, stability, and security.
- Expertise in monitoring tools (Prometheus, ELK) with the ability to optimize monitoring systems and integrate ML/AI models to improve visibility, anomaly detection, and proactive issue resolution.
- Proven experience in an SRE, DevOps, or infrastructure engineering role with a focus on monitoring, automation, and orchestration.
- Extensive hands-on experience with automation tools such as Terraform, Ansible, and Jenkins, along with proficiency in CI/CD pipelines, to efficiently streamline and optimize network operations and workflows.
- Proficiency in scripting languages (Bash, Python, Go).
- Proficiency with containerization and orchestration (Docker, Kubernetes).
- Understanding of cloud platforms such as AWS, Azure, or Google Cloud.
- Familiarity with microservices architecture and distributed systems.
Soft Skills
- Excellent verbal and non-verbal communication skills.
- Should be a team player.
- Good analytical and problem-solving skills.
- Leadership skills.
Key Competencies
- Strong knowledge of the Networking and Security domain, with the ability to critically analyse network designs and propose innovative improvements to enhance performance, reliability, stability, and security.
- Proven experience in an SRE, DevOps, or infrastructure engineering role with a focus on monitoring, automation, and orchestration.
- Expertise in monitoring tools (Prometheus, ELK) with the ability to optimize monitoring systems and integrate ML/AI models to improve visibility, anomaly detection, and proactive issue resolution.
- Extensive hands-on experience with automation tools such as Terraform, Ansible, and Jenkins, along with proficiency in CI/CD pipelines, to efficiently streamline and optimize network operations and workflows.
- Proficiency in scripting languages (Bash, Python, Go).
- Proficiency with containerization and orchestration (Docker, Kubernetes).
- Understanding of cloud platforms such as AWS, Azure, or Google Cloud.
- Familiarity with microservices architecture and distributed systems.
72694 | IT & Tech Engineering | Professional | Non-Executive | Allianz Technology | Full-Time | Permanent.