Lead Site Reliability Engineer/DevOps

Boston, MA

We’re looking for a top-notch, hands-on SRE to lead our small, talented infrastructure engineering team and help us elevate our game in designing, building and operating high-performance, highly available systems.

Every engineer is responsible for the software they build, and SREs play a critical part in providing the tools, practices, and expertise to support them.

Our production systems are hosted on AWS and run a large Ruby on Rails web application plus a handful of smaller services written in Ruby, Node.js, and Java. We currently deploy an average of 5 times a day. Our systems are stable, and fire drills are rare.

Technologies we’re currently using include:

Amazon Web Services (EC2, ELB, S3, RDS, ElastiCache) and Ubuntu Linux

PostgreSQL, Redis, Memcached, Elasticsearch

Chef, Serverspec, Terraform, New Relic, Datadog, Sumo Logic and Test Kitchen

In this mission-critical role, you’ll:

Design, build and maintain the core infrastructure

Actively manage the backlog for our infrastructure team and work closely with other SREs on the team to provide coaching and mentorship

Help us increase developer productivity and get to true continuous delivery

Develop operational and security standards and champion operational excellence and secure coding practices

Partner with engineering teams closely to educate and consult

Participate in solution design for new features, products, systems, and tooling

Debug complex problems across the whole stack

Continually monitor application/system performance and costs, generate actionable insights and either implement or advocate for them

Participate in on-call rotations, along with every member of the engineering team

Ruthlessly eliminate repetitive manual tasks and recurring errors

Ensure we are always employing best-of-breed tooling for all our infrastructure and automation needs

Collaboratively plot a course for the maturation and growth of our infrastructure

Participate in (and sometimes run point on) handling production incidents

Work closely with engineering teams to conduct root cause analysis for production incidents, and evolve infrastructure and tooling in response

This role might be that rare opportunity if you:

Thrive in a highly collaborative, no red-tape, rapid-growth environment

Love building tooling and infrastructure to help developers be more productive

Love eliminating repetitive manual tasks through automation

Have a healthy appreciation of what it means to work in production

Have solid Unix command line and systems chops

Have experience with substantial, distributed SaaS or eCommerce systems

Can point to a solid track record of success leading small-to-medium infrastructure teams

Have vision and well-informed opinions about how to build infrastructure for a high-growth, technology-driven company that’s headed towards the $1B mark

What you’ll get from us:

Importantly, you’ll get sane working hours and a huge amount of flexibility around work/life balance. Have people in your life – of any age – who always, often, or sometimes need your help? We make room for that. Have a bad thing or a good thing happen to you? We make room for that, too.

Oh, and here’s what else you’ll get: Market salary, stock options you’ll help make worth a lot, the usual holidays, all-you-can-eat vacation, 401(k), health/dental/FSA, long-term disability insurance, subsidized T-passes, a great office smack-dab in Boston’s Downtown Crossing, a tremendous amount of responsibility and autonomy, wicked awesome co-workers, cupcakes (and many more goodies), and knowing that you helped get this rocket ship to the moon.