Lead Site Reliability (DevOps) Engineer

Boston, MA

Lead Site Reliability Engineer

We’re looking for a top-notch, hands-on SRE to lead our small and talented infrastructure engineering team and help us elevate our game when it comes to designing, building, and operating high-performance, highly available systems. We’re backed by Insight Venture Partners and Iconiq Capital and on a path to $1B in 2019. We’ll get there, and even more surely if you come help us.

Every engineer is responsible for the software they build, and SREs play a critical part in providing the tools, practices, and expertise to help them succeed.

Our production systems are hosted on AWS and run a large Ruby on Rails web application and a handful of smaller services written in Ruby, Node.js, and Java. We currently deploy 3-5 times a day. Our systems are stable and fire drills are rare. Technologies we’re currently using include:

Amazon Web Services (EC2, ELB, S3, RDS, ElastiCache) and Ubuntu Linux

Postgres, Redis, Memcached, Elasticsearch

Chef, Serverspec, Terraform, New Relic, Datadog, Sumo Logic, and Test Kitchen

In this mission-critical role, you would:

Design, build, and maintain the core infrastructure of our product

Actively manage the backlog for our infrastructure team, and coach and mentor the other SREs on the team

Help us increase developer productivity and get to true continuous delivery

Develop operational and security standards and champion operational excellence and secure coding practices

Partner with engineering teams closely to educate and consult

Participate in solution design for new features, products, systems and tooling

Debug complex problems across the whole stack

Continually monitor application and system performance and costs, generate actionable insights, and either implement them or advocate for them

Participate in on-call rotations, along with every member of the engineering team

Ruthlessly eliminate repetitive manual tasks and recurring errors

Ensure we are always employing best-of-breed tooling for all our infrastructure and automation needs

Collaboratively plot the course for the maturation and growth of our infrastructure

Participate (and sometimes run point) in handling production incidents

Work closely with engineering teams to conduct root cause analysis for production incidents, and evolve infrastructure and tooling based on the findings

This role might be that rare opportunity if you:

Thrive in a highly collaborative, no red-tape, rapid-growth environment

Love building tooling and infrastructure to help developers be more productive

Love eliminating repetitive manual tasks through automation

Have a healthy appreciation of what it means to work in production

Have solid Unix command line and systems chops

Have experience with substantial, distributed SaaS or eCommerce systems

Can point to a solid track record of success leading small-to-medium infrastructure teams

Have vision and well-informed opinions about how to build infrastructure for a high-growth, technology-driven company that’s headed towards the $1B mark