Reply to http://perfcap.blogspot.com/2012/03/ops-devops-and-noops-at-netflix.html

3/20/2012 - 7:18 PM


Adrian -

I appreciate that you spent time in writing this post. I know I've been up until 2am writing similarly long ones as well. I will take responsibility for having what is likely an irrational response (I blame Twitter for that) to the term "NoOps", but I invite you to investigate why that might be. I'm certainly not the only one who feels this way, apparently, and thus far have decided this issue is easily the largest distraction in my field I've encountered in recent years. I have had the option to simply ignore my opposition to the term, and just let the chips fall where they may with how popular the term "NoOps" may or may not get. I have obviously not taken that option in the past, but I plan to in the future.

You're not an analyst saying "NoOps". Analysts are easy (for me) to ignore, because they're not practitioners. We have expectations of engineering maturity from practitioners in this field of web engineering, especially those we consider leaders. I don't have any expectations from analysts, other than that they are annoying.

So: I'm responding here (and have on Twitter) to your remarks because I have been surprised with respect to my expectations, and because I think that our field can get better when there is reasonable discourse, especially when it comes to domain expertise and organizational evolution. Having said that, it's possible that this particular issue has strayed from being useful. Therefore, I'll just make this last comment on it here.

A couple of thoughts on your post, in no real order:

tl;dr - As many have pointed out, the issue people have with "NoOps" is the presence of "No" in it. This is because you (and others) are intermingling titles (and therefore organizational groups) with the often blurred domain expertise usually associated with Operations (though that expertise is not evenly distributed across a diverse spectrum of organizations). I'm going to attempt below to illustrate that what you describe as the "Ops" in your term "NoOps" is what most of us in the community describe as Doing It Wrong™. Therefore, if you want to have actual influence in the field of large-scale web engineering, I suggest you do this: 's/NoOps/OpsDoneMaturelyButStillOps/g'

Regardless, you're still doing what almost everyone I know would call "Ops".
"Tomato" or "toe-mah-toe" appears to be the crux of the issue.

So: A datacenter full of high-end IBM, Oracle, and SAN? No wonder AWS seems cheaper to you. :) I don't know any large-scale (or maybe even medium-scale) web outfits anymore with those three pieces in combination, or even the architecture that it implies. Ouch. One might think that Netflix actually tried to build the most expensive on-premise solution to begin with. It's possible that you can then build an extremely inefficient AWS footprint and still spend less than you did before. It sounds like it was awful, I'm sorry about that. :/

"We tried bringing in new ops managers, and new engineers, but they were always overwhelmed by the fire fighting needed to keep the current systems running."

This is obviously not an unfamiliar scenario. This is essentially why the DevOps concept(s) are popular: to avoid and prevent this from happening. And many organizations find success in digging themselves out of fire-fighting/reactive mode and into proactive/future-facing mode through better cooperation and collaboration. Note that this situation isn't Ops-specific. Organizations can easily find themselves in such firefighting modes with no Ops people on staff. If you think developers can't unwittingly turn themselves into overwhelmed fire-fighters, I have a bridge I'd like to sell you.

Mark Imbriaco mentioned something to me with regard to being overwhelmed with reactive activity:

"The operations folks being overwhelmed by fire fighting. Who developed the software that they're trying to operate that caused the fires? It's a two way street here, when you run an organization that has an explicit handoff from dev to ops and you complain that ops is too slow because of fire fighting, you have to consider that they're fighting the fires set by developers. That's not to say that the developers are in any way malicious or incompetent, but only that they were far too isolated from the result of their actions.

Congratulations, you solved that by instituting the same 'developers own their code throughout the entire lifecycle' approach that the rest of us have been using for years. Welcome to the party, we like it here."

Back to your post:

"They never have to have a meeting with ITops, or file a ticket asking someone from ITops to make a change to a production system, or request extra capacity in advance."

At Etsy, developers make production changes themselves, and this has the effect of bringing them closer to production, which enables having an operability mindset. This is opposed to having a ship-to-QA-and-consider-it-done mindset. Developers deploying their own code also brings accountability, responsibility, and the requisite authority to influence production. No Operations engineer stands in the way of a Development engineer deploying. As I look now, yesterday there were 25 production web changes that involved 50 unique committers, an average of 4 unique committers per deploy, and 17 unique deployers (developers will sometimes self-organize to deploy small changes together). This isn't counting 4 production deploys to our Search (Lucene) infra, 2 to our Blog infra, and 14 application config deploys (dark launches, A/B rampups, etc.). This also isn't counting the many new changes to our Chef cookbooks and recipes per day, which are made largely by the Ops team but can be and are made by development teams. (These numbers are not meant to impress; we obviously favor a continuous stream of small and frequent changes to production instead of scheduled releases. This benefits Etsy, it may not benefit Netflix. I don't want to be sidetracked onto a conversation about continuous deployment. Its mechanics are only related to my main point here; they're not the main point.)

So, Etsy has an Operations org, and people with "Operations" in their title, and yet doesn't have a culture of red tape like the one you describe. The premise you're implying (that Ops are grumpy people who say no all the time and are a source of frustration and holdups) doesn't exist here, and I'm willing to bet it's the same in other companies. Maybe not all, but likely all the truly successful ones that you've heard about.

"Notice that we didn't use the typical DevOps tools Puppet or Chef to create builds at runtime. This is largely because the people making decisions are development managers, who have been burned repeatedly by configuration bugs in systems that were supposed to be identical."

This statement appears to imply that non-development managers (presumably Ops folks) are prone to choosing and running systems that would introduce inconsistencies, that those people can't learn from being burned in the past by such things, and that 'development managers' are immune to making mistakes. Which is, of course, silly. Along the way, you passively throw Puppet and Chef under the bus. If this is actually the case, then the reality is that you'd be hiring the wrong people and using the tools incorrectly, not that one type of engineer is immune from making mistakes. Getting consistent infrastructure spun up quickly from bare metal has been a possibility since the late 90s. All self-respecting systems people have known the work of Mark Burgess for years, and HPC clusters have built themselves consistently and quickly for over a decade, designed by people who would call themselves "Operations" engineers.

"The developers used to spend hours a week in meetings with Ops discussing what they needed, figuring out capacity forecasts and writing tickets to request changes for the datacenter."

I'm trying to think of how to reply to this piece, but failing. Concisely, I'll simply assert: you were doing it wrong to begin with. That's not Operations, that's an organization that failed to see Operations as a way to enable the business and not hinder it. My definition of Ops involves the responsibility to make it safe to make whatever change to production is necessary, at the rate that is necessary for the business to evolve.

"They use a web based portal to deploy hundreds of new instances running their new code alongside the old code, put one "canary" instance into traffic, if it looks good the developer flips all the traffic to the new code. If there are any problems they flip the traffic back to the previous version (in seconds) and if it's all running fine, some time later the old instances are automatically removed. This is part of what we call NoOps."

Ok so you've just described an engineer deploying a change to production, choosing and relying on signals and metrics to increase or decrease confidence in said change, taking action (or not) based on those signals and metrics, and then adjusting resources to make the change pseudo-permanent. Congratulations, you've just described one of the most basic patterns of modern Operations, one that aligns quite nicely with Col. Boyd's OODA loop (http://en.wikipedia.org/wiki/OODA_loop), well-known to the Operations community. No matter how many different titles you give it, no matter who that engineer reports to, this is Operations. This is the Operations that is known currently, in the field. Really, ask anyone who's been doing this at scale. Ask Theo Schlossnagle. Ask Jesse Robbins. Ask Benjamin Black. Ask Artur Bergman. Ask Jonathan Heiliger. Ask Jon Jenkins.

You have the person responsible for the code deploy the code and take responsibility for it in production. Excellent! This is how many of us have been doing it for years.
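
To make the pattern above concrete, here is a minimal sketch in Python of the observe/decide/act step in a canary deploy. It is an illustration only, not Netflix's or Etsy's tooling: the `Metrics` shape, the `decide` function, the error-rate signal, and the threshold values are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Signals observed for one group of instances (hypothetical shape)."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Treat "no traffic yet" as a zero error rate to avoid dividing by zero.
        return self.errors / self.requests if self.requests else 0.0


def decide(baseline: Metrics, canary: Metrics,
           min_requests: int = 1000, tolerance: float = 0.005) -> str:
    """Return 'wait', 'promote', or 'rollback' for a canary deploy.

    min_requests and tolerance are illustrative defaults, not recommendations.
    """
    # Observe: not enough canary traffic yet to trust the signal.
    if canary.requests < min_requests:
        return "wait"
    # Decide: the canary is clearly worse than the old code, so flip traffic back.
    if canary.error_rate > baseline.error_rate + tolerance:
        return "rollback"
    # Act: confidence is high enough to send all traffic to the new code.
    return "promote"


if __name__ == "__main__":
    old_code = Metrics(requests=100_000, errors=100)
    print(decide(old_code, Metrics(requests=2_000, errors=2)))
    print(decide(old_code, Metrics(requests=2_000, errors=40)))
```

Whoever runs this loop, and whatever their title, choosing the signal, setting the threshold, and acting on the result is the Operations work being described.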

The Operations that you describe (if I'm hearing you correctly) that makes roadblocks, gatekeepers, endless or long-standing meetings and process is the Operations of the 1990s, and the frustrations that arise from that were the genesis of the DevOps concept. As an aside, the capacity planning and change management you're describing was in place when you wrote your book on capacity planning, not when I wrote mine.

What you describe as the CORE team at Netflix is a subset of what Operations at Etsy does. Maybe this can help. At a high level, it breaks down like this:

Etsy Operations is responsible for:

Responding to outages, takes on-call
Alerting systems thresholding, design
Architecture design and review
Building metrics collection
Application configuration
Infrastructure buildout/management

Etsy Development is responsible for:

Responding to outages, takes on-call
Alerting systems thresholding, design
Architecture design and review
Building metrics collection
Application configuration
Shipping public-facing code

Neither of those lists is comprehensive, I'm sure I'm missing something there. Etsy Ops has made production-facing application changes; they're few but real (and sometimes quite deep). Etsy Dev makes Chef changes; they're few but real. If there's so much overlap in responsibilities, why the difference, you might ask? Domain expertise and background. Not many Devs have deep knowledge of how TCP slow start works, but Ops does. Not many Ops have a comprehensive knowledge of sorting or relevancy algorithms, but Dev does. Ops has years of experience in forecasting resource usage quickly with acceptable accuracy, Dev doesn't. Dev might not be aware of the pros and cons of the options for distributing workload across all of layers 1-7, maybe only just at 7; Ops is. Entity-relationship modeling may come naturally to a developer, it may not to ops. In the end, they both discover solutions to various forms of Byzantine failure scenarios and resilience patterns, at all tiers and layers.

As a result, Etsy doesn't have to endure a drama-filled situation (like you allude to) with arguments concerning stability, availability, risk, and shipping new features and making change, between the two groups. Why is this? Because these (sometimes differing) perspectives are heralded as important and inform each other as the two groups equally take responsibility in allowing Etsy to work as effectively and efficiently as it needs to in our market.

These differences in domain expertise turn out to be important in practice, and we have both because it's beneficial for Etsy. If it weren't, we wouldn't have both. They constantly influence each other, and educate each other, informing the decisions we make with different and complementary perspectives. As we continue (as Netflix does, it sounds like) to evolve our processes and tooling, it's my job (as well as that of the CTO and VP of Engineering) to keep this flow strong and balanced.

"There is no ops organization involved in running our cloud,"

There is, it's just not called Operations. Tomato.

"no need for the developers to interact with ops people to get things done"

I'm willing to bet that you have developers discussing amongst themselves how best to build automation, fault-tolerance, metrics collection, alerting mechanisms, resource allocation, etc. to allow Netflix to move fast. Yes? Your cloud architecture sounds like it's evolved to be better over time. During that evolution, I bet one engineer talked to another, which means that someone interacted with Ops.

"I think that's different to the way most DevOps places run,"

On the contrary, I think what you've described above is basically identical to how many large-scale and successful organizations run. It's how Flickr was run, how Etsy is run, how Amazon, Facebook, Google, and all the other companies are run. Being good at Operations (no matter what your title) is a competitive advantage.

To sum up: the term NoOps sounds like it suits you. Super. I would simply suggest that you consider the possibility that how you're using the words "No" and "Ops" together is almost certainly in opposition to how almost everyone I respect uses them, and that this may reflect on how others view Netflix engineering as a whole.

This post has helped me get my thoughts down further, but I'm not willing to comment more on the topic. I sincerely hope this helps clarify at least my thoughts on the term. There's been enough back and forth on it, and I've got work to do in Operations Engineering, making Etsy a nice place to visit on the internet. :)

For anyone who's read this far: you ought to get a trophy. You at least deserve cake.
