Supporting a product team reorg with a code reorg

Nicolas Grisey Demengel
Published in malt-engineering
15 min read · Jan 12, 2021


Malt is ever growing. As it does, it evolves and constantly searches for the best way to organize its product team so that it can build a great product. This is the story of our latest evolution, and of how we reorganized our codebase and services to enable it.

The path to maturity

To understand where we started, please allow me a digression about the growth of Malt’s product team.

Emojis: two men with laptops
Hugo and JB, 2 of the 3 founders of Malt (Emojis from https://openmoji.org/)

When Malt started in late 2012, there were only two engineers — two of the founders — working on it with their guts. Then a designer joined the team, as well as a product manager and some more engineers in 2015, which allowed Malt’s product team to build and maintain a more solid vision of the product.
But the product was still built the same way: everyone would work on whatever subject we decided to invest in. Everyone was responsible for the whole product and therefore the whole codebase. Bugs and user issues were handled as they appeared, by whoever could take them.

Emojis: a woman and three men with laptops
A small product team, with yours truly somewhere in the middle (Emojis from https://openmoji.org/)

Fast forward to 2017: Malt hired some more product managers and developers, and we formed feature teams that would work on small projects with more focus. Two developers (front-end and back-end) were designated each day to handle support issues and let the other developers focus on their tasks.
It was also the year we hired our first ops engineer, in charge of forming a team that would build a platform upon which the feature teams would build their services.

Emojis: many men and women with laptops
Malt product team keeps growing (Emojis from https://openmoji.org/)

At some point around 2018, we came to the conclusion that bugs and support issues would be better (and faster) handled by the team that had built the offending feature. From then on, the “support devs” of the day would mostly triage tickets and assign them to the right team when possible, only working on the remaining tickets themselves.

A problem remained: we were mostly taking on one project after another, and while we did have some KPIs, we weren’t good at monitoring the results of those projects, which prevented us from spotting much-needed improvements or even, sometimes, actual failures. How could we improve on a subject when everyone was already 100% dedicated to the next one?

We started 2020 with a resolution: three build teams would be in charge of building key features, and they would have a few weeks dedicated to monitoring and improving a feature after it launched, before moving on to another subject. A separate run team would be in charge of constantly improving existing features, so that they wouldn’t be left behind. Those four teams would still work hand in hand with the platform team to achieve their goals.
Finally, a pioneer team would be in charge of preparing the next key project for the build teams.

Our plan to take over the market was ready… or so we thought

As 2020 progressed with its unprecedented events, we quickly realized that our resolution wasn’t enough and that we needed a deeper change. In the end, what we want isn’t to deliver episodic projects but to forever improve the experience of the same typical users (personas) of our platform, to offer them the best possible product. Products over projects.

And this is how we came to the decision to form three tribes permanently in charge of addressing the three main populations of our users (freelancers, clients, and prospects). They would still rely on a platform tribe, which would now be composed of several specialized squads (shared services, data, ops).
Using the terminology of Team Topologies, the business tribes would be stream-aligned teams, while the squads of the platform tribe would act as either platform or enabling teams.
Each tribe should be autonomous and would have complete latitude in how to address its mission.

Sounds like a plan, but…

While the previous statement looked good on paper, the reality was that there wasn’t any chance tribes could be autonomous with the current state of the codebase. Please let me give you the second half of the context and describe our codebase and services as they were three months ago.

Malt’s codebase has grown somewhat empirically: while we’ve sometimes identified logical modules right from the start, more often than not new code has been placed near existing logic that looked similar enough. Consequently, some business modules had very blurry boundaries, were overloaded, and had thus become a mandatory dependency of everything else.

Also, many modules and applications had a structure where the technical concepts (the how) are visible first, while the business concepts (the what) are spread out and hidden behind those technical layers. Those are remnants of a time when going fast and CRUD was just the right thing to do (though going CRUD doesn’t mean tech must appear first), whereas nowadays we know our domains well enough to grant them more attention and apply appropriate recipes (DDD, hexagonal architecture, event sourcing, you name it).

3 different views of the same code. 1st view: 1st level = domains. 2nd view: 1st level = tech. 3rd view: no clear structure.
The Good-enough, the Bad, and the Ugly for 3 domains (naive, CRUDy example). Which one would you prefer to work with?
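To make the difference concrete, here is a naive sketch of the first two views for a single hypothetical domain (names are made up for illustration):

```
# Domain-first ("good-enough") structure: the what comes first,
# the how is a detail inside each domain.
billing/
    Invoice
    BillingService
    persistence/
        InvoiceRepository

# Tech-first ("bad") structure: the same domain, scattered across layers.
controllers/
    BillingController
services/
    BillingService
repositories/
    InvoiceRepository
```

With the domain-first layout, moving the whole `billing` domain elsewhere is a single directory move; with the tech-first one, it means hunting through every layer.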

Also, the initial monolith had been split into several (macro) services a long time ago, but here again we had sometimes given birth to technical services embedding business logic rather than the other way around. For instance, we had a service in charge of running periodic jobs, whatever they were, and those jobs were directly implemented within that service. Given that jobs are a pretty common need, that technical service soon depended on all our business modules.

Our front-end is also split between several applications, and since some parts are common to all pages, the code in charge of building those parts was shared too. And that code may sometimes require details from many domains! For instance, our menu heavily depends on who the user is and which features she has access to.

Various screenshots of Malt’s menu, where we can see that different entries are present depending on who the user is.
Some examples — highlighted with green markers — of how Malt’s menu changes depending on the user and the features available to her.

While we knew about — and sometimes used — different techniques to handle such issues (view composition, derived data, micro-frontend, etc.), until 2020 the cost of it was still acceptable, compared to the work needed to improve the situation.

At the end of 2019, we started being concerned by the situation as it slowed us down, and consequently we addressed some points. For instance, our menu is now built from derived data, fed without depending on other domains at render time. We also agreed on guidelines for writing new code in a more appropriate way, defining bounded contexts and applying suitable tactical patterns where needed. And we started writing architectural decision records (ADRs).
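To illustrate the derived-data idea behind the menu, here is a minimal sketch (all names are hypothetical and our real implementation obviously differs): each domain publishes events, a projection maintains a per-user list of menu entries, and the menu renderer reads only that projection.

```java
import java.util.*;

// Sketch of a "derived data" menu: domains push events into a projection,
// and rendering the menu never queries the domains themselves.
public class MenuProjection {
    private final Map<String, Set<String>> entriesByUser = new HashMap<>();

    // Called whenever a domain emits a relevant event, e.g. "feature enabled".
    public void onFeatureEnabled(String userId, String menuEntry) {
        entriesByUser.computeIfAbsent(userId, k -> new TreeSet<>()).add(menuEntry);
    }

    public void onFeatureDisabled(String userId, String menuEntry) {
        entriesByUser.getOrDefault(userId, Collections.emptySet()).remove(menuEntry);
    }

    // The menu renderer only reads the projection: no dependency on the
    // domains that produced the entries.
    public List<String> menuFor(String userId) {
        return new ArrayList<>(entriesByUser.getOrDefault(userId, Collections.emptySet()));
    }

    public static void main(String[] args) {
        MenuProjection menu = new MenuProjection();
        menu.onFeatureEnabled("freelancer-42", "Projects");
        menu.onFeatureEnabled("freelancer-42", "Invoices");
        menu.onFeatureDisabled("freelancer-42", "Invoices");
        System.out.println(menu.menuFor("freelancer-42"));
    }
}
```

The cost is an extra projection to keep up to date, but in exchange the shared front-end code only depends on one read model.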

But while those actions have great value, day to day the existing system encouraged us to continue as usual, and it would have taken a lot of will and effort to go the “right way”.

At this point I think you get the idea, so I’ll complete the picture by emphasizing some key numbers about our codebase in October:

  • 1.8M lines of code, according to cloc (if you ask me, that’s starting to be something)
  • 304 maven modules
  • 26 applications, none with a clear owner
  • 120 undifferentiated “core” shared modules (i.e. upon which business logic may depend in one way or another)
  • 50 technical modules, that rarely change
  • 32% of “business logic” commits edited a “core” module
  • delivering a feature would almost always require deploying several applications (more than 3 not being uncommon)

Needless to say, it wasn’t a good breeding ground for our new organization.

Your mission, should you decide to accept it…

This is how we decided on a real investment: we would fully dedicate two engineers for two months to reorganizing the codebase (my colleague Cédric and yours truly).

Two months is both a lot and not much. It’s a lot of effort to invest in a subject whose business benefit isn’t immediately visible. And it’s not much to untangle 1.8M lines of code.
We felt like two months was the right amount of time to give us a chance to do something meaningful, and to then assess the outcome and reorient ourselves accordingly.
Finally, two months was also the time remaining before our new team organization was fully in place, so we had better have something by then.

Objects on a table suggesting exploration and adventure: a hat, a compass, a notebook and a pen.
We weren’t sure yet how to equip ourselves…

OK, so where do we start?

How could we maximize our impact in that amount of time?

As stated earlier, our number one objective was to make our new tribes as autonomous and agile as possible. As far as code is concerned, it meant to us that each tribe should only work on modules and applications it owns.

This in turn meant that a tribe’s applications should only deal with domains the tribe is responsible for, and as a corollary that most domains should only be manipulated by a single application (or at the very least only by applications of the same tribe).

That way, a tribe could deliver features while changing fewer modules, meaning less build time and fewer applications to deploy, and without interfering with other tribes’ work.

Our plan was now clear:

  1. Map our code: list the domains present in each application, and the tribe that should own each of them.
  2. Within applications, isolate domains from each other.
  3. Move domains between applications according to the target. Create more meaningful applications and destroy meaningless ones where needed.
    Reaching this point for 80% of the code concerned would already be a significant improvement.
  4. Move any module related to a domain that was previously shared into its new “home” application. Where not possible, at least clearly make it a module owned by the right tribe, and only used by that tribe’s applications.

And we started!

Mapping the codebase

It took us more than a week to list all the domains present in our applications. Sometimes we had to get a head start on the next stage and move code around, to form clusters and make sense of them.

The outcome was a simple document containing a section for each application, bullet points for domains, colors to visualize tribes owning the domains, and comments about where and how to possibly relocate a given domain.

Several pages of our mapping document, zoomed out to see the overall structure, but otherwise unreadable.
A look at our document at some point during the process

Isolating domains

Before we could move any logic around, we had to make consistent packages of it. If you remember the introduction, parts of our domain-related code could face two different problems:

  1. it could be hidden behind technical layers
  2. it could be intertwined with other unrelated code

The idea here was to pull domains up as first-level modules/packages/directories/you name it, depending on the technology concerned: Java/Kotlin, JS/TypeScript, JSP, Vue.js…
That way, it would then be much easier to move them.

Relocating domains

Now comes the interesting part! We spent 5 weeks moving domains between applications according to the target.

As said previously, in the process we created new applications when no existing one would be a good home for a domain. For many domains though, as we’re mostly not doing “micro-services” but rather “rightly-sized applications”, we simply moved them into an existing application that was suitable to host them (and most importantly, an application that would be owned by the right tribe!).
We could also kill services that had existed for the wrong reasons.

A spreadsheet listing applications and their new purpose and responsible tribe.
A listing of our applications: black ones have been killed, red ones have dramatically decreased in size, green ones are new. All other applications have been kept but reworked.

I would be lying if I said it was an easy task! Especially as there were some pitfalls we wanted to avoid.

The first one was our users experiencing downtime! We had to come up with a strategy for moving logic between applications without downtime.

The second pitfall, closely related to the first one as you’ll see below, was losing our git history for the parts of the code that would be moved.

We also added another — temporal — constraint: we wanted to avoid creating web services as much as possible during that step, as it would take too much time for a “quick win”. We would inevitably encounter logic that couldn’t be moved to some other app “as is” because of some in-process dependency, but the idea was to restrain our urge to solve that problem at this precise moment, where possible.

I should mention here that we had one advantage: our mono-repo helped us dramatically! Maybe you’ll figure out why as I quickly present two techniques we used to avoid the pitfalls listed above.

Aside 1: how to move stuff without breaking things

At Malt, we’re continuously delivering changes, and as such we’re used to moving things or changing contracts in several steps, in order not to break anything.

I like to call our strategy “expand/propagate/contract”. The general script goes as follows:

  1. Expand. Introduce/modify the provider side: add new stuff, be it a column in DB, a field in a payload, etc.
  2. Deploy provider. At this point, consumers may rely on the old contract or the new one.
  3. Update consumers to make use of the new stuff/contract. It can mean changing some code and deploying it, and/or migrating data in DB.
  4. Contract. Remove dead code or data.

The strategy can be adapted more precisely to what’s being changed, and there are times when we must use stronger tools: feature flags, live bidirectional migration scripts… but the general idea is the same. I won’t go into more detail as it would deserve an entire article (and as a matter of fact, I once gave a very short (French) talk on that very subject).
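As an illustration, here is a minimal sketch of the expand/propagate/contract idea applied to renaming a payload field (field names and helpers are made up for the example): during the transition, the producer emits both representations, and consumers prefer the new one while falling back to the old one, so each side can be deployed independently.

```java
import java.util.*;

// Expand/propagate/contract on a payload field (hypothetical names):
// "name" (old contract) is being replaced by "displayName" (new contract).
public class PayloadMigration {

    // Expand: the producer now emits both the old and the new field.
    static Map<String, String> producePayload(String name) {
        Map<String, String> payload = new HashMap<>();
        payload.put("name", name);          // old contract, kept alive for now
        payload.put("displayName", name);   // new contract
        return payload;
    }

    // Propagate: consumers are updated to prefer the new field,
    // but still tolerate payloads from not-yet-updated producers.
    static String readDisplayName(Map<String, String> payload) {
        String value = payload.get("displayName");
        return value != null ? value : payload.get("name"); // fallback to old
    }

    // Contract (last step, once everything is deployed): stop writing "name",
    // then delete the fallback above.

    public static void main(String[] args) {
        System.out.println(readDisplayName(producePayload("Malt")));
        // An old producer that only knows the old field still works:
        System.out.println(readDisplayName(Map.of("name", "Malt")));
    }
}
```

The same shape applies to a DB column or an HTTP contract: at every intermediate step, both the old and the new version of each side can coexist.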

For our code migration of interest, we had several types of logic to move: HTTP endpoints, RabbitMQ listeners, Quartz jobs…

Let’s focus on HTTP endpoints. Following our strategy here meant:

  1. Duplicating the logic into its new home application.
  2. Deploying that application.
  3. Directing traffic to the new endpoint.
  4. Removing the old endpoint.

Nothing fancy there. We had to be careful to precisely target the right paths when re-configuring the external traffic, but it was otherwise very easy thanks to the way our Kubernetes ingresses are configured. Re-configuring our internal cross-services traffic could have been a bit easier, but this is due to our lack of a proper service registry, a problem we should solve in the upcoming months.

The real problem with this zero-downtime approach is that applying it naively could mean losing a fair amount of git history, as we introduced new files (the copied ones) rather than moving existing ones.

Aside 2: how to duplicate files with their git history

When one duplicates a file within git, the new file has a virgin history, and therefore there’s no way to understand how the content evolved to its state at the time of the copy. (For the sake of completeness, using git log --follow -- the_file does show the pre-copy history, but git blame --follow -- the_file doesn’t. Also, using the --follow option may take a long time depending on your codebase.)

Searching the Web, we found a solution to make git see that history. It’s a bit involved, but it does work:

  1. Create a branch for the copy:
    git checkout -b xxx-copy
  2. Move (don’t copy) the code to its new place:
    git mv foo/xxx bar/xxx
    git commit -m "Move xxx from foo to bar"
  3. Restore the previous version:
    git checkout HEAD^ -- foo
    git commit -m "Temporarily restore xxx into foo"
  4. Move back to your initial branch:
    git checkout -
  5. And finally merge the copy branch:
    git merge --no-ff xxx-copy

Now, using git blame -- bar/xxx works perfectly, and without the --follow option. Unfortunately that option is still required for git log --follow -- bar/xxx, but overall the goal has been achieved and we can easily understand what happened to the content.

Back to the context of this article: once the migration was over for a given part of the code, we would then remove the original files (foo/xxx).

Reducing the volume of shared code

Repeating our recipe for all identified domains, we reached our first target of moving most of them to a suitable application in five weeks. At that point we concluded that our last two weeks would be better spent reducing the big blob of code shared between applications.

When only a single application would now use some code, we would simply pull that code up into the application.

When that wasn’t possible, we would attempt to at least clearly make a tribe responsible for those modules and ensure they would only be used within the applications of that tribe.

Finally, the remaining shared modules would be reorganized into clear groups forming layers of dependencies: a module from a lower layer wouldn’t be authorized to depend on a module of a higher layer. Those groups would also allow us to better reflect, in the future, on the impact of putting code in one of them rather than another.

A graph showing dependencies between “layers” of modules. Top to bottom: apps, […], shared services, platform, tech starters…
Our module layers. Dependencies go top to bottom. Most layers are clearly assigned to a tribe. Some others remain to be sorted out, or are bits for which we’re OK with continuing to share common ownership.

Enforcing rules and ownership

We were now at the end of the allotted time. While we were pretty happy with how far we had gone, it was clear that: 1) the journey wasn’t over, 2) we would need some safeguards to collectively continue progressing in the right direction.

So while the new code organization already gave very visible examples of how to do things from now on, we added simple automated tests to verify that dependencies do indeed continue to go as presented above. (This article being already long enough, I may present the way we did it in another post.)

A unit test written in Kotlin + Junit5 declaring that a certain group of modules may only depend on specific other groups.
Example of a test enforcing our dependency rules
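To give an idea of what such a test looks like, here is a heavily simplified sketch in plain Java (our actual test is written in Kotlin with JUnit 5 and inspects the real Maven modules; the layer and module names below are hypothetical):

```java
import java.util.*;

// Simplified dependency-rule check: layers are ordered top to bottom,
// and a dependency may only go downwards (or stay within the same layer).
public class LayerRulesTest {

    // Layers from top to bottom; a lower index means a higher layer.
    static final List<String> LAYERS =
            List.of("apps", "shared-services", "platform", "tech-starters");

    static boolean isAllowed(String fromLayer, String toLayer) {
        return LAYERS.indexOf(fromLayer) <= LAYERS.indexOf(toLayer);
    }

    // Returns human-readable errors for every forbidden dependency.
    static List<String> violations(Map<String, List<String>> dependencies,
                                   Map<String, String> layerOfModule) {
        List<String> errors = new ArrayList<>();
        dependencies.forEach((module, deps) -> {
            for (String dep : deps) {
                String from = layerOfModule.get(module);
                String to = layerOfModule.get(dep);
                if (!isAllowed(from, to)) {
                    errors.add(module + " (" + from + ") must not depend on "
                            + dep + " (" + to + ")");
                }
            }
        });
        return errors;
    }

    public static void main(String[] args) {
        Map<String, String> layers = Map.of(
                "freelancer-app", "apps",
                "billing-service", "shared-services",
                "mongo-starter", "tech-starters");
        // A shared service depending on an app is a violation:
        Map<String, List<String>> deps = Map.of(
                "freelancer-app", List.of("billing-service", "mongo-starter"),
                "billing-service", List.of("freelancer-app"));
        System.out.println(violations(deps, layers));
    }
}
```

In the real test, the module-to-layer mapping and the dependency graph come from the Maven build itself, so the check can never drift away from reality.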

Most importantly, now that we had clearly identified owners (tribes) both for features and parts of the code, we could entirely rework the way we handled support issues and the technical monitoring of our platform.

First, there’s no more global “support dev”: support issues are directly assigned to the right tribe, as it’s now far easier to do so before they even land on the product side.

Then, each tribe now has a dedicated monitoring channel on Slack (and also an email address) to which alerts concerning it are redirected. All technical issues raised by Sentry are now directed to a tribe, based on the impacted application. But not only those.

For instance, when a message fails to be processed by a RabbitMQ listener, or a command repeatedly won’t succeed within our custom command execution queue, such items are “moved to quarantine”, as we call it. We now know for sure which tribe is responsible for each item, and thus an alert is emitted to the tribe’s channel when it has items in quarantine.

Depending on the tribe, other tools have been instructed to notify its monitoring channel as well.

Screen capture of a Slack channel, where 3 alerts can be seen, from Sentry, Bamboo, and Datadog
Extract of a tribe’s monitoring channel

Finally, while it’s still permitted for any tribe to work on the code of another one, it should now be an exception, and it’s expected to be done mostly through either pull requests or full collaboration. There’s a tolerance for very small modifications though, as we would like to avoid useless blockers and bureaucracy.
That being said, we thought a nice feature would be to make any change to a tribe’s code highly visible to that tribe. Consequently, we quickly hacked together a GitHub webhook that publishes commits amending a tribe’s code to a Slack channel dedicated to that tribe, highlighting the modules impacted.

Screenshot of a Slack channel, where commits are listed with impacted parts modules highlighted.
A tribe’s commits channel
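The heart of that webhook can be sketched as a pure routing function (the module prefixes and tribe names below are hypothetical, and the GitHub payload parsing and Slack posting are omitted): given the files touched by a push, find which tribes’ code is impacted so the commit can be published to the right channels.

```java
import java.util.*;

// Sketch of commit-to-tribe routing for the notification webhook.
public class CommitRouter {

    // Module path prefix -> owning tribe (hypothetical mapping).
    static final Map<String, String> OWNERS = Map.of(
            "freelancer/", "freelancer-tribe",
            "client/", "client-tribe",
            "platform/", "platform-tribe");

    // Returns the set of tribes whose modules are touched by the changed files.
    static Set<String> impactedTribes(List<String> changedFiles) {
        Set<String> tribes = new TreeSet<>();
        for (String file : changedFiles) {
            OWNERS.forEach((prefix, tribe) -> {
                if (file.startsWith(prefix)) tribes.add(tribe);
            });
        }
        return tribes;
    }

    public static void main(String[] args) {
        // In the real webhook these paths come from the GitHub push payload;
        // here we just hard-code an example.
        System.out.println(impactedTribes(List.of(
                "freelancer/profile/src/Profile.java",
                "platform/mongo-starter/pom.xml")));
    }
}
```

Because the mapping follows the module layout, keeping the notification rules up to date mostly amounts to keeping the code in the right place.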

Again, maybe I will share our solution in another post, but right now this article is officially too long :-)
I’ve no doubt we’ll find many other ways to improve ownership and collaboration in the future, and maybe we’ll share them as well.

Let’s conclude, shall we?

As I said previously, the journey is not over, but that two-month effort is already a huge improvement. Also, the hardest part of anything is often getting started, and I would say we’ve now reached cruising speed.

Here are some first results after one month, to be compared to the numbers presented in the introduction:

  • 324 maven modules (vs 304 before)
  • 27 applications (vs 26), 26 having a clear owner.
  • 86 shared “core” modules (vs 120), among which 31 are still to be sorted out
  • 43 non-application modules now have a clear owner (vs none before), and half of them aren’t shared anymore
  • 63 technical modules that rarely change (vs 50 before, but this is actually a good thing)
  • 22% of “business logic” commits edited a “core” module (vs 32% before), but 2/3 of those modules are ones that haven’t been sorted out yet
  • unfortunately I can’t provide more of those right now

While those numbers may not seem that impressive, we already feel their impact daily (in particular, amending less shared code also means shorter build/deploy pipelines).

Moreover, our codebase now shows the way, and many frequently-changed parts of the code are in the right place for our new tribes to work efficiently and autonomously!

And I can’t stress this enough: most of our code now has a clear owning tribe, and that tribe is fully responsible for it. This will improve code quality, and it should give tribes an incentive to continue either moving the code they use closer to them, or collaborating with each other in a more structured way.

This shows us it was definitely the right move :-)

I hope you enjoyed this account of our growth. If you happen to be interested in such challenges and more, don’t hesitate to contact us: we’re hiring software engineers!


Passionate software engineer building www.malt.com | @NicolasDemengel | @NicolasGriseyDemengel@piaille.fr | demengel.net