There's simply no substitute for Kanban processes and for proactive communication from engineers. In a small team without dedicated customer support, a manager takes the customer call, decides whether it's legitimately a bug, creates a ticket to track it and prioritizes it in the Kanban queue. An engineer takes the ticket, fixes it, ships it, communicates that they shipped something to the rest of their team, is responsible for monitoring it in production afterwards, and only takes a new ticket from the queue when they're satisfied that the change is working. But the proactive communication is key: other engineers on the team are also shipping, and everyone needs to understand what production looks like. Management is responsible for balancing support and feature tasks by balancing the priority of tasks in the Kanban queue.
Solution: don’t. If a bug has been introduced by the currently running long process, forward it back. This is not distracting, this is very much on topic.
And if a bug is discovered after the cycle ends - then the teams swap anyway and the person who introduced the issue can still work on the fix.
Additionally, defensive devs have brutal SLAs, and are frequently touching code with no prior exposure to the domain.
They got known as "platform vandals" by the feature teams, & we eventually put an end to the separation.
Good management means finding the right balance for the team, product, and business context that you have, rather than inflexibly trying to force one strategy to work because it’s supposedly the best.
The thing I found most surprising about this article was this phrasing:
> We instruct half the team (2 engineers) at a given point to work on long-running tasks in 2-4 week blocks. This could be refactors, big features, etc. During this time, they don’t have to deal with any support tickets or bugs. Their only job is to focus on getting their big PR out.
This suggests that this pair of people only release 1 big PR for that whole cycle - if that's the case this is an extremely late integration and I think you'd benefit from adopting a much more continuous integration and deployment process.
I think that's a too-literal reading of the text.
The way I took it, it was meant to be more of a generalization.
Yes, sometimes it really does take weeks before one can get an initial PR out on a feature, especially when working on something that is new and complex, and especially if it requires some upfront system design and/or requirements gathering.
But other times, surely, one also has the ability to pump out small PRs on a more continuous basis, when the work is more straightforward. I don't think the two possibilities are mutually exclusive.
The important part is that you're not interrupted during your large-scale tasks, not the absolute length of those tasks.
I don't think it suggests how the time block translates into PRs. It could very well be a series of PRs.
In any case, the nature of the product / features / refactorings usually dictates the minimum size of a PR.
Why not split the big tickets into smaller tickets which are delivered individually? There's cases where you literally can't but in my experience those are the minority or at least should be assuming a decently designed system.
Because it is already the smallest increment you can make. Or because splitting it further would add a lot of overhead.
> There's cases where you literally can't but in my experience those are the minority
I think in this sentence, there's a hidden assumption that most projects look like your project(s). That's likely false.
You left out the part of that quote where I explained my assumption very clearly: A decently designed system.
In my experience if you cannot split tasks into <1 week the vast majority of the time then your code has massive land mines in it. The design may be too inter-connected, too many assumptions baked too deeply, not enough tests, or various other issues. You should address those landmines before you step on them rather than perpetually trying to walk around them. Then splitting projects down becomes much much easier.
That's one possible reason. Sometimes software is designed badly from the ground up, sometimes it accumulates a lot of accidental complexity over years or decades. Solving that problem is usually out of your control in those cases, and only sometimes there's a business driver to fix it.
But there are many other cases. You have software with millions of lines of code, decades of commit history. Even if the design is reasonable, there will be a significant amount of both accidental and essential complexity - from certain size/age you simply won't find any pristine, perfectly clean project. Implementing a relatively simple feature might mean you will need to learn the interacting features you've never dealt with so far, study documentation, talk to people you've never met (no one has a complete understanding either). Your acceptance testing suite runs for 10 hours on a cluster of machines, and you might need several iterations to get them right. You have projects where the trade-off between velocity and tolerance for risk is different from yours, and the processes designed around it are more strict and formal than you're used to.
A lot of the time it is true that ticket == PR, but it is not a law.
It sometimes makes sense to separate subtasks under a ticket but that is only if it makes sense in business context.
That's only late if there are other big changes going in at the same time. The vast majority of operational/ticketing issues have few code changes.
I'm glad I had the experience of working on a literal waterfall software project in my life (e.g. plan out the next 2 years first, then we "execute" according to a very detailed plan that entire time). Huge patches were common in this workflow, and only caused chaos when many people were working in the same directory/area. Otherwise it was usually easier on testing/integration - only 1 patch to test.
In my experience whenever that happens someone always finds an "oh @#$&" case where a bug is actually far more serious than everyone thought.
It is an approach that's less productive than slowing down and delivering quality, but it's also completely inevitable once a team/company grows to a sufficient size.
Small, in-person, high-trust teams have the advantage of not falling into bad offense habits.
Additionally, a slower shipping pace simply isn’t an option, seeing as the only advantage we have over our giant competitors is speed.
Wouldn't they be incentivized to maintain discipline because they will be the defensive engineers next week when their own code breaks?
Tell that to seemingly every engineering manager and product manager coming online over the last 8-10 years.
I first noticed in 2016 there seemed to be a direct correlation between more private equity and MBAs getting into the field and the decline of software quality.
So now you have a generation of managers (and really executives) who know little of the true tradeoffs between quality and quantity because they only ever saw success pushing code as fast as possible regardless of its quality and dealing with the aftermath. This led them to promotions, new jobs, etc.
We did this to ourselves really, by not becoming managers and executives ourselves as engineers.
> The result is that our product breaks more often than we’d like. The core functionality may remain largely intact but the periphery is often buggy, something we expect will improve only as our engineering headcount catches up to our product scope.
I really resonate with this problem. It was fun to read. We've tried different methods to balance customers and long-term projects too.
Some more ideas that can be useful:
* Make quality projects an explicit monthly goal.
For example, when we noticed the edges of our surface area got too buggy, we started a 'Make X great' goal for the month. This way you don't only have to react to users reporting bugs, but can be proactive.
* Reduce Scope
Sometimes it can help to reduce scope; for example, before adding a new 'nice to have feature', focus on making the core experience really great. We also considered pausing larger enterprise contracts, mainly because it would take away from the core experience.
---
All this to say, I like your approach; I would also consider a few others (make quality projects a goal, and cut scope)
What are some proactive ways? Ideally that cannot easily be gamed?
I suppose test coverage and such things, and an internal QA team. What I thought the article was about (before having read it) was having half of the developers do red team penetration testing, or looking for UX bugs, of things the other half had written.
Any more ideas? Do you have any internal definitions of "a quality project"?
This is akin to having a boat that isn't seaworthy, so the suggestion is to have a rowing team and a bucket team. One rows, and the other scoops the water out. While missing the actual issue at hand. Instead, focus on creating a better boat. In this case, that would mean investing in testing: unit tests, integration tests, and QA tests.
Have staff engineers guide the teams and make their KPI reducing incidents. Increase the quality and reduce the bugs, and there will be fewer outages and issues.
Even when they rotate - who wants to clock in to wade through a fresh swamp they've never seen? Don't make the swamp: if you're moving too slow shipping things without sinking half the ship each PR then raise your budget to better engineers - they exist.
This premise is like advocating for tech debt loan sharks; I really hope TFA was ironic. Sure, it makes sense from a business perspective as a last gasp to sneakily sell off your failed company but you would never blog "hey here at LLM-4-YOU, Inc. we're sinking".
Sometimes as an engineer I like frantically scooping water while we try to scale rapidly because it means leadership's vision is to get an exit for everyone as fast as possible. If leadership said "let's take a step back and spend 3 months stabilizing everything and creating a testing/QA framework" I would know they want to ride it out til the end.
It shouldn't have ever come to the point where incidents, outages, and bugs have become prominent enough to warrant a team.
Either have kickass dev(s), although improbable, thus the second level: Implement mitigations, focus on testing, and have staff engineers with KPIs to lower incidents. Give them the space but be prepared to let them go if incidents don't go down.
There is no stopping of development. Refactoring by itself doesn't guarantee better code or fewer incidents. But don't allow bugs, or known issues, as they can be death by thousand cuts.
This viewpoint is not from an engineer. Having constant incidents doesn't show confidence or competence to investors and customers, as it diverts attention from creating business value into firefighting, which has zero business value and is bad for morale.
Thus, tech investment rather than debt always pays off if implemented right.
Agreed - this is a survival mode tactic in every company I’ve been at when it’s happened. If you’re permanently in the described mode and you’re small sized, you might as well be dead.
If mid to large and temporary, this might be acceptable to right the ship.
How does one get to that state?
* avg tenure / skill level of team is relatively uniform
* team is small with high-touch comms (eg: same/near timezone)
* most importantly - everyone feels accountable and has agency for work others do (eg: codebase is small, relatively simple, etc)
Where I would expect to see this fall apart is when these assumptions drift and holding accountability becomes harder. When folks start to specialize, something becomes complex, or work quality is sacrificed for short-term deliverables, the folks that feel the pain are the defense folks and they don't have agency to drive the improvements.
The incentives for folks on defense are completely different than folks on offense, which can make conversations about what to prioritize difficult in the long term.
This is basically how we ran things for the reliability team at Netflix. One person was on call for a week at a time. They had to deal with tickets and issues. Everyone else was on backup and only called for a big issue.
The week after you were on call was spent following up on incidents and remediation. But the remaining weeks were for deep work, building new reliability tools.
The tools that allowed us to be resilient enough that being on call for one week straight didn't kill you. :)
On call for a week at a time only really works if you only get paged at night once a week max. If you get paged every night, you will die from sleep deprivation.
To borrow a football term, sometimes company structure seems like it’s playing the “long ball” game. Everyone sitting back in defence, then the occasional hail mary long pass up to the opposite end. I would love to see a more well developed understanding within companies that certain teams, and the processes that they have are defensive, others are attacking, and others are “mid field”, i.e. they’re responsible for developing the foundations on which an attacking team can operate (e.g. longer term refactors, API design, filling in gaps in features that were built to a deadline). To win a game you need a good proportion of defence, mid field and attack, and a good interface between those three groups.
You need trust in your team to make this work but you also need trust in your team to make any high velocity system work. Personally, I find the ideas here extremely compelling and optimizing for distraction minimization sounds like a really interesting framework to view engineering from.
For prioritization, use a triage queue because it aims the whole team at the most valuable work. This needs to be the mission-critical MVP & PMF work, rather than what the article describes as "event driven" customer requests i.e. interruptions.
We ended up with a system where we break work up into things that take about a day. If someone thinks something is going to take a long time then we try to break it down until some part of it can be done in about a day. So we kinda side-step the problem of having people able to focus on something for weeks by not letting anything take weeks. The same person will probably end up working on the smaller tasks, but they can more easily jump between things as priorities change, and pretty often after doing a few of the smaller tasks either more of us can jump in or we realize we don't actually need to do the rest of it.
It also helps keep PRs reasonably sized (if you do PRs).
That said, there's some credence to what the author is describing. Although I haven't personally worked under the exact system described, I have worked in environments where engineers take turns being the first point of contact for support. In my experience, it worked pretty well. People know your bandwidth is going to be a bit shorter when you're on support, and so your tasks get dialed back a bit during that period.
I think the author, and several people in the comments, make the mistake of assuming that an "engineer on support" necessarily can fix any given problem they are approached with. Larger firms could allocate a complete cross-functional team of support engineers, but this is very costly for small outfits. If you have mobile apps, in-house hardware products and/or integrations with third-party hardware, it's basically guaranteed that your support engineer(s) will eventually be given a problem that they don't have the expertise to solve.
In that situation, the support engineer still has the competencies to figure out who does know how to fix the problem. So, the support engineer often acts more as a dispatcher than a singular fixer of bugs. Their impact is still positive, but more subtle than "they fix the bugs." The support engineer's deep system knowledge allows them to suss out important details before the bug is dispatched to the appropriate dev(s), thereby minimizing downtime for the folks who will actually implement the fix.
- "and our “lean” team is more a product of our inability to identify and hire great engineers, rather than an insistence on superhuman efficiency."
Can we all at some point have a serious discussion on hiring and training? It seems that many teams are understaffed or at least not satisfied with the quality and quantity of their team. Why is that? Why does it seem to be the norm?
- what about mitigating bugs in the first place? Shouldn't someone be assigned to that? Yeah, sure, bugs are a given. They are going to happen. But production bugs are something real, paying customers shouldn't experience. At the very least, what about feature flags? That is, something new is introduced to a limited number of users. If there's a bug and it's significant enough, the flag is flipped and the new feature withdrawn. Then the bug can be sorted when someone is available.
Perhaps the profession just is what it is? Some teams are almost miraculously better than others? Maybe that's luck, individuals, product, and/or the stack? Maybe like plumbers and shit there are just things that engineering teams can't avoid? I'm not suggesting we surrender, but that we become more realistic about expectations.
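To make the feature-flag idea concrete, here is a minimal sketch in Python. It assumes a hypothetical in-memory `FeatureFlag` class (not any real library's API), a made-up flag name `new_checkout`, and a 10% rollout; a real setup would keep flag state somewhere shared so it can be flipped without a deploy.

```python
import hashlib


class FeatureFlag:
    """Hypothetical in-memory flag, for illustration only."""

    def __init__(self, name: str, rollout_percent: int, enabled: bool = True):
        self.name = name
        self.rollout_percent = rollout_percent  # share of users who see the feature
        self.enabled = enabled  # acts as the kill switch

    def is_on_for(self, user_id: str) -> bool:
        if not self.enabled:
            return False
        # Hash flag name + user id so each user lands in a stable bucket 0-99.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < self.rollout_percent

    def kill(self) -> None:
        # Withdraw the feature for everyone; the bug can be fixed later.
        self.enabled = False


# Assumed example data: 1000 made-up user ids and a 10% rollout.
flag = FeatureFlag("new_checkout", rollout_percent=10)
users = [f"user-{i}" for i in range(1000)]
exposed = [u for u in users if flag.is_on_for(u)]

flag.kill()  # a significant bug was reported
exposed_after_kill = [u for u in users if flag.is_on_for(u)]
```

Hashing the user id keeps each user's experience stable between requests, so only the limited group ever hits the bug, and flipping the switch takes the feature away from them without interrupting whoever is doing focused work.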
And the team should check the balances once in a while, and maybe rethink the strategy, to avoid overworking someone and underworking someone else, thus creating bottlenecks and vacuums.
At least this is the way i have worked and organised such teams - 2-5 ppl covering everything. Frankly, we never had many customers :/ but even one is enough to generate plenty of "noise" - which sometimes is just noise, but if good customer, will be mostly real defects and generally under-tended parts. Also, good customers accept a NO as answer. So, do say more NOs.. there is some psychological phenomenon in software engineering in saying yes and promising moonshots when one knows it cannot happen NOW, but looks good..
have fun!
[0] https://svilendobrev.com/rabota/orgpat/OrgPatterns-patlets.h...
The aim is generally not to provide a perfect fix but an MVP fix and raise tickets in the queue for regular planning.
It rotates round every week or so.
My company's not very devops so it's not on-call, but it's 'point of contact'.
Specifically, we show that individuals following clock-time [where tasks are organized based on a clock**] rather than event-time [where tasks are organized based on their order of completion] discriminate less between causally related and causally unrelated events, which in turn increases their belief that the world is controlled by chance or fate. In contrast, individuals following event-time (vs. clock-time) appear to believe that things happen more as a result of their own actions.[0]
** - in my experience, clock based organisation seems to be very characteristic of what OP describes as defensive, when you become driven by incoming priorities and meetings.
A broader article about the impact of schedules at [1] is also highly relevant and worth the read.
[0] - https://psycnet.apa.org/record/2014-44347-001
[1] - https://hbr.org/2021/06/my-fixation-on-time-management-almost-broke-me
Some engineers are more likely to avoid interrupting others because they can sympathize.
But the fact that this explicit split makes the choice visible is clearly an upside.
But it sounds like there has to be a lot of micromanagement involved. With a team of 4 it is easy to keep up, but as soon as you go to 20, and that increase also means many more customer requests, it will fall apart.
You want the defensive team to work on automating away stuff that pays off for itself in the 1-4 week timeframe. If they get any slack to do so!
Reptile was my favorite Mortal Kombat character, and our ISP added a G before all the sub accounts. They put a P in front of my dad's.
My team had a bunch of stability work, and bug fixes (and there were a lot of bugs and a lot of tech debt, and very little organizational enthusiasm to fix the latter).
Guess where their morale was, compared to some of the other teams?
Edit: I mean an ongoing split, not a rotation
> At the end of the cycle, we swap.
They swap teams every 2-4 weeks so nobody will always be on team defense.
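A rough sketch of what that swap looks like, in Python. The engineer names and the even split into two halves are assumptions for illustration; the article only says that half the team is on each side and that they swap at the end of a cycle.

```python
def defense_team(engineers: list[str], cycle_index: int) -> list[str]:
    """Return who plays defense in a given cycle, alternating between halves."""
    half = len(engineers) // 2
    first_half, second_half = engineers[:half], engineers[half:]
    return first_half if cycle_index % 2 == 0 else second_half


def offense_team(engineers: list[str], cycle_index: int) -> list[str]:
    """Offense is whoever plays defense in the neighbouring cycle."""
    return defense_team(engineers, cycle_index + 1)


# Hypothetical 4-person team, matching the team size in the article.
engineers = ["ana", "ben", "cho", "dee"]
schedule = [
    (cycle, defense_team(engineers, cycle), offense_team(engineers, cycle))
    for cycle in range(4)
]
```

Each entry in `schedule` is `(cycle, defense, offense)`, and the two halves trade places every cycle, so nobody stays on defense for two cycles in a row.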
Putting a couple of buzzwords on a practice being performed for at least 15 years now doesn't make you clever. Quite the opposite in fact.
There are 2 fundamental aspects of software engineering:
Get it right
Keep it right
You have only 4 engineers on your team. That is a tiny team. The entire team SHOULD be playing "offense" and "defense" because you are all responsible for getting it right and keeping it right. Part of the challenge sounds like poor engineering practices and shipping junk into production. That is NOT fixed by splitting your small team's cognitive load. If you have warts in your product, then all 4 of you should be aware of it, bothered by it and working to fix it.
Or, if it isn't slowing growth and core metrics, just ignore it.
You've got to be comfortable with painful imperfections early in a product's life.
Product scope is a prioritization activity not a team organization question. In fact, splitting up your efforts will negatively impact your product scope because you are dividing your time and creating more slack than by moving as a small unit in sync.
You've got to get comfortable telling users: "that thing that annoys you, isn't valuable right now for the broader user base. We've got 3 other things that will create WAY MORE value for you and everyone else. So we're going to work on that first."
It's just a support rota at the end of the day. Everyone does it, but not all the time, freeing you up to focus on more challenging things for a period without interruption.
This was an established business (although small), with some big customers, and responsive support was necessary. There was no way we could just say "that thing that annoys you, tough, we are working on something way more exciting." Maybe that works for startups.
As developers we like to focus. But there is a vast difference between "manager time" and "builder time" and what you are experiencing.
You are creating immense value with every single customer interaction!
CUSTOMER FACING FIXES ARE NOT 'MANAGER TIME'!!!!!!
They are builder time!!!!
The only reason I'm insisting is because I've lived through it before and made every mistake in the book...it was painful scaling an engineering and product team to >200 people the first time I did it. I made so many mistakes. But at 4 people you are NOT yet facing any real scaling pain. You don't have the team size where you should be solving things with organizational techniques.
I would advise that you have a couple of columns in a kanban board: Now, Next, Later, Done & Rejected. And communicate it to customers. Pull up the board and say: "here is what we are working on." When you lay out the priorities to customers you'd be surprised how supportive they are and if they aren't...tough luck.
Plus, 2-3 weeks feels like an eternity when you are on defense. You start to dread defense.
And, it also divorces the core business value into 2 separate outcomes rather than a single outcome. If a bug helps advance your customers to their outcome, then it isn't "defense" it is "offense". If it doesn't advance your customer, why are you doing it? If you succeed, all of your ugly, monkey patched code will be thrown away or phased out within a couple of years anyway.
Many, many people I’ve dealt with in these roles don’t or can’t, and seem to think their sole task is to mainline customer needs into dev teams. The PMs I’ve had who _actually_ do manage back properly had happier dev teams, and ultimately happier clients, it’s not a mystery, but for some reason it’s a rare skill.
I’m assuming that the OP is a founder and can actually make these calls.
- saying No, and sticking to it when it matters — what you’ve mentioned.
- knowing how the product gets built — knowing *the why behind the no*.
PMs don’t usually have the technical understanding to do the second one. so the first one falls flat because why would someone stick to their guns when they do not understand why they need to say No, and keep saying No.
there are cases where talking to customer highlights a mistaken understanding in the *why we’re saying No*. those moments are gold because they’re challenging crucial assumptions. i love those moments. they’re basically higher level debugging.
but, again, without the technical understanding a PM can’t notice those moments.
they end up just filling up a massive backlog of everything because they don’t know how to filter wants vs. needs and stuff.
— also i agree with a lot of what you’ve said in this chain of discussion.
get it right first time, then keep it right is so on point these days. especially for smaller teams. 90% of teams are not the next uber and don’t need to worry about massive growth spurts. most users don’t want the frontend changing every single day. they want stability.
worry about getting it right first. be like uber/google if you need to, when you need to.
Yes, but you've got to spend time talking to users to say that. Many engineering teams have incoming "stuff". Depending on your context that might be bug reports from your customer base, or feature requests from clients etc. You don't want these queries (that take half an hour and are spread out over the week) to be repeatedly interrupting your engineering team, it's not great for getting stuff done and isn't great for getting timely helpful answers back to the people who asked.
There's a few approaches. This post describes one ("take it in turns"). In some organisations, QA is the first line of defence. In my team, I (as the lead) do as much of it as I can because that's valuable to keep the team productive.
I feel similar things about the product and business side, it often feels like people are trying to pass their job off to you and if you push back then you’re the asshole. For example, sending us unfinished designs and requirements that haven’t been fully thought through.
I imagine this is exactly how splitting teams into offense and defense will go.
Oh man. Once had a founder who did this to the dev team: blurry, pixelated screenshots with 2 or 3 arrows and vague “do something like <massively under specified statement>”.
The team _requested_ that we have a bit more detail and clarity in the designs, because it was causing us significant slowdown and we were told “be quiet, stop complaining, it’s a ‘team effort’ so you’re just as at fault too”.
Unsurprisingly, morale was low and all the good people left quickly.
And why should team members be collaborative amongst their team? E.g. why should the "offence" team members suddenly help each other if it's not happening generally?
This sounds a lot like JDD - Jock Driven Development.
Perhaps the underlying problems of "don't touch it because we don't understand it" should be solved before engaging in fake competition to increase the stress levels.
The idea has nothing to do with creating artificial competition and it is actually designed as a form of collaboration.
Some work requires concentration and the defensive team is there to maintain the conditions for this concentration, i.e. prevent the offensive team from getting interrupted.
Then perhaps the terminology - for me - has a different meaning.
That’s easy to fix with an exception: you won’t have to worry about support for X time unless you’re the one who recently made the bug.
It turns out that once they’re responsible for their bugs, there won’t actually be that many bugs and so interruptions to a focused engineer will be rare.
That's how we do it in my startup. We have six engineers, most are even pretty junior. Only one will be responsible for support in any given sprint and often he’ll have time left over to work on other things e.g. updating dependencies.
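The exception described above (a bug goes back to whoever recently introduced it, everything else goes to the person on support) could be sketched like this in Python. The `Bug` fields, the names, and the 14-day "recent" window are all assumptions for illustration; the comment leaves "X time" and "recently" unspecified.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Bug:
    title: str
    introduced_by: Optional[str]   # None when the author is unknown
    introduced_on: Optional[date]  # None when the date is unknown


def assign_bug(
    bug: Bug,
    on_support: str,
    today: date,
    recent_window: timedelta = timedelta(days=14),  # assumed value of "recently"
) -> str:
    """Route a bug to its recent author, otherwise to the support engineer."""
    if (
        bug.introduced_by is not None
        and bug.introduced_on is not None
        and today - bug.introduced_on <= recent_window
    ):
        return bug.introduced_by
    return on_support


# Hypothetical data: "ben" is on support this sprint.
today = date(2024, 1, 15)
recent_bug = Bug("export crashes", "ana", date(2024, 1, 10))
old_bug = Bug("typo in footer", "ana", date(2023, 11, 1))

recent_owner = assign_bug(recent_bug, on_support="ben", today=today)
old_owner = assign_bug(old_bug, on_support="ben", today=today)
```

Here the recent bug goes back to its author while the old one stays with support, so a focused engineer is only interrupted for problems they themselves just shipped.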