Who You'll Work For
We are seeking an experienced and analytically minded Site Reliability Engineer to join our organisation on a permanent, remote basis from Ireland. In this role, you will be instrumental in building, deploying, and operating critical production systems with a steadfast commitment to scalability, reliability, observability, and security. You will work collaboratively with cross-functional teams to ensure our infrastructure remains resilient, efficient, and future-ready. This is an excellent opportunity for a detail-oriented professional who thrives in a dynamic environment and is passionate about solving complex infrastructure challenges.
What You'll Do
- Design, build, and deploy production systems with a focus on scalability, reliability, observability, and performance, ensuring systems meet stringent security standards
- Develop and maintain comprehensive automation solutions to eliminate toil and streamline operational efficiency across production environments
- Proactively monitor production systems, establish intelligent alerting strategies, and implement automated incident response mechanisms to minimise downtime
- Create and maintain detailed incident response runbooks; conduct thorough postmortem analyses following incidents to identify root causes and prevent recurrence
- Collaborate with software engineering teams to identify and resolve infrastructure bottlenecks, designing solutions that improve product deployment workflows
- Manage and optimise monitoring infrastructure using industry-standard tools, ensuring comprehensive visibility across all systems
- Plan, communicate, and execute maintenance windows on production systems with minimal disruption to service availability
- Triage platform and infrastructural issues with decisiveness and analytical rigour; engage with third-party vendors and support teams as required
- Deploy new systems and updates in a staged, risk-managed manner, ensuring safe and incremental rollouts
- Survey and adopt best practices in infrastructure and platform management to maintain secure, scalable, and fault-tolerant systems
- Study the design and implementation details of open-source systems to enhance troubleshooting capabilities and accelerate issue resolution
- Work transparently with stakeholders to communicate system status, planned maintenance, and infrastructure improvements
#automation #Ansible #Terraform #observability #Prometheus #Grafana #cloud platforms #AWS #GCP #Azure #container #orchestration #Kubernetes #Docker #CI/CD #Jenkins #GitLab