Site Reliability Engineering (SRE) combines software and systems engineering to build and run scalable, massively distributed, fault-tolerant systems. As part of the team, you will work on ensuring that Tetrate’s platform has reliability and uptime appropriate to users’ needs, as well as a fast rate of improvement. Additionally, much of our engineering effort focuses on building infrastructure, improving the platform’s troubleshooting capabilities, and eliminating toil through automation.
- Fundamentals-based problem-solving skills; you drive decisions by function, with a first-principles mindset. We are not “title” driven and we value results over process
- Demonstrate a bias to action and avoid analysis paralysis; drive work to the finish line with high quality and on time
- You are egoless when searching for the best ideas; you contribute effectively outside of your specialty; you think about solving problems from the standpoint of the best outcome for the team
- Value autonomy and results over process
- Systematic problem-solving approach, coupled with excellent communication skills and a sense of ownership and drive
- Strong fundamentals in distributed systems and networking
- Ability to debug, optimise code, and automate routine tasks
- Experience programming in at least one of the following languages: C++, Rust, Python, Go
- Familiarity with the concepts of quantifying failure and availability in a prescriptive manner using SLOs and SLIs
- Production experience with Kubernetes (K8s); familiarity with Istio or another service mesh
- Experience in performance analysis and tuning is a plus
Location: We are worldwide and fully remote with access to offices in SF, Boston, Barcelona and Bandung/Tangerang.