ScyllaDB is looking for experienced and dynamic individuals to join our Cloud Operations & Site Reliability Engineering (SRE) team.
As a Scylla Cloud Operations & SRE Engineer, you'll play a crucial role in maintaining the operational excellence of our cutting-edge NoSQL database platform, Scylla Cloud.
Using your expertise in cloud infrastructure, Kubernetes, and system operations, you'll ensure the reliability, scalability, and performance of our cloud services. If you're passionate about working in a fast-paced environment, collaborating with cross-functional teams, and driving continuous improvement, this role is ideal for you.
Responsibilities:
- Collaborate with the Cloud Operations & SRE team to ensure the smooth day-to-day operation of Scylla Cloud.
- Monitor system health, troubleshoot issues, and proactively address any operational challenges.
- Assist with and perform upgrades for Scylla Cloud, including Scylla database versions, OS upgrades, and security patches.
- Collaborate with DevOps/Cloud Engineering to ensure seamless upgrade processes.
- Participate in scaling Scylla Monitor and Scylla Manager servers up and down based on demand.
- Employ proactive monitoring strategies to identify and address potential performance bottlenecks and resource constraints.
- Act as a liaison with the Support Organization to address cloud platform-related issues.
- Respond to tasks and tickets escalated by Support Staff, and collaborate to ensure timely resolutions.
- Develop and maintain a comprehensive runbook that Support Staff can use to troubleshoot and resolve common issues more efficiently.
- Create scripts and automation solutions to streamline operational tasks and enhance efficiency.
- Contribute to the development of automation strategies for cloud infrastructure management.
- Collaborate with the Cloud Engineering team to define and create feature requests that enhance the functionality and performance of Scylla Cloud.
- Conduct regular cluster health and performance audits, identifying areas for optimization.
- Implement strategies to enhance the efficiency and reliability of Scylla Cloud clusters.
- Work closely with the Customer Success team to ensure that provisioned resources align with customer needs and purchased packages. Provide insights into potential scaling opportunities and usage optimization.
- Demonstrate a deep understanding of public cloud environments (AWS, GCP, Azure), Kubernetes, Linux system operations, and NoSQL database deployment/management. Apply this knowledge to resolve complex technical challenges.
- Use scripting languages and infrastructure-as-code tools such as Python, Bash, Terraform, and Ansible to create automation tools that enhance operational efficiency.
- Collaborate closely with Support and Engineering teams to address issues, drive improvements, and implement customer-focused solutions.