Chef Software is the industry leader in IT automation and DevOps solutions. We are a dynamic and rapidly growing software company with a strong sense of dedication to our customers and the Chef community. We work hard but try not to take ourselves too seriously. This is a very collaborative and inclusive work environment where individuals strong on aptitude and attitude will have an opportunity to grow their professional careers through working with some of the most advanced technology and talented developers in the business. We provide competitive compensation, generous benefits, and a professional yet relaxed atmosphere.
At Chef, we develop the world's best products for managing applications and infrastructure at scale, and we deploy them against real problems in all kinds of industries. We’re writing the rules of the cloud - rules the world’s top engineers live, breathe and contribute to. Our platform is used to enable hundreds of millions of people around the world to chat, fly, present, bank, game, shop, and learn. Chances are the web applications you use every day have infrastructure built, deployed, secured and managed with our code.
We are seeking a highly motivated, results-oriented individual with strong Site Reliability Engineering skills and experience in cloud technologies to join our platform engineering team. This person will play a lead role in designing, implementing, and supporting the platform for Chef Cloud services. They will also have a key influence on our future processes and platform design.
This position is located in Belfast, NI with the option to work in Chef’s Belfast office or from home.
What you'll do:
Lead a new team of SREs, serving as a point of technical escalation and mentor, while helping organize and guide the work of others.
You will build, operate, and maintain a platform for Chef Cloud services. This will include technologies such as AWS services (ECS, EKS, S3, and more), Kubernetes, service mesh (Linkerd or Envoy), Postgres/RDS, graph databases and GraphQL, API gateways, authentication services, third-party integrations, and more.
Collaborate with other Engineering teams to support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.
Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
You’ll implement modern systems observability solutions including monitoring, alerting, metrics, logging, and APM & distributed tracing.
Scale systems sustainably through automation, and evolve systems by pushing for changes that improve reliability and velocity.
Be on-call for services that the SRE team owns.
Practice sustainable incident response and post-incident analysis by acting as an incident manager. You’ll follow our existing incident management process, and recommend improvements to that process.
Who you are (Our ideal candidate will have some or all of these qualifications):
You have a Bachelor's degree in Computer Science or related field and 5+ years relevant experience (or equivalent combination of education and experience).
You have a solid understanding of and experience with configuration management and compliance automation.
You have an expert-level understanding of, and at least 2 years of working experience with, containerization using Docker and Kubernetes in a production environment.
You’re comfortable deploying and operating services using AWS technologies, and have an expert understanding of the various offerings available.
You’ve built and supported systems using cloud-native (CNCF) technologies at scale.
You are interested in designing, analyzing and troubleshooting large-scale distributed systems.
You understand what it means to operate infrastructure as code, and have experience developing services and automation to do so.
You have a great ability to debug and optimize code, and automate routine tasks to eliminate toil.
You have a systematic problem-solving approach, coupled with strong communication skills and a sense of ownership, initiative, grit, and drive.
You have designed and implemented applications and systems that scale, are resilient to failure, and are observable.
You have practical experience developing and improving applications written in Go, Rust, or other modern languages.
Our SRE team is 100% remote and distributed across multiple locations. This position will be located in an offset timezone together with our team in Belfast. This allows us not only to live where we will be most productive, but also to create a work environment that celebrates all of our humanity. We celebrate the difference of perspective this brings and the barriers it removes.
Work isn’t just about work though. We offer competitive salaries, equity, fantastic benefits, and flexible hours. You’ll also get to work with some of the most down-to-earth, driven, intelligent and well-read engineers in the industry!
Our successful Belfast site is growing, and we want to continue to grow a team with a diverse worldview that is excited to build a high-quality DevOps engineering environment in an exciting, enjoyable and creative place to work.
If this sounds like a team you’d enjoy working with and technology you want to spend time on, for a product you can get behind, get in touch!
At Chef, we celebrate and support our differences. We know employing a team rich in diverse thoughts, experiences, and opinions allows our employees, our products and our community to flourish. Chef is honored to be an equal opportunity workplace. We are dedicated to equal employment opportunities regardless of race, color, ancestry, religion, sex, national origin, age, citizenship, marital status, disability, gender identity, sexual orientation or Veteran status.
Travel to the United States will be required occasionally.