Jan 05, 2021

How Segment redesigned its core systems to solve an existential scaling crisis

Segment, the startup Twilio bought last fall for $3.2 billion, was just beginning to take off in 2015 when it ran into a scaling problem: It was growing so quickly that the tools it had built to process marketing data on its platform were starting to outgrow the original system design.

Inaction would cause the company to hit a technology wall, managers feared. Every early-stage startup craves growth, and Segment was no exception, but it also needed to begin thinking about how to make its data platform more resilient or risk reaching a point where it could no longer handle the data moving through the system. It was, in a real sense, an existential crisis for the young business.

Segment’s engineering team began thinking hard about what a more robust and scalable system would look like. As it turned out, that vision would evolve in a number of ways between the end of 2015 and today, and with each iteration the team took a leap in how efficiently it allocated resources and processed the data moving through the system.

The project that came out of their efforts was called Centrifuge, and its purpose was to move data through Segment’s data pipes to wherever customers needed it quickly and efficiently at the lowest operating cost. This is the story of how that system came together.

Growing pains

The systemic issues became apparent the way they often do: when customers began complaining. When Tido Carriero, Segment’s chief product development officer, came on board at the end of 2015, he was charged with finding a solution. The issue involved the original system design, which, like many early iterations from startups, was built to get the product to market with little thought given to future growth. Now the technical debt payment was coming due.

“We had [designed] our initial integrations architecture in a way that just wasn’t scalable in a number of different ways. We had been experiencing massive growth, and our CEO [Peter Reinhardt] came to me maybe three times within a month and reported various scaling challenges that either customers or partners of ours had alerted him to,” said Carriero.

The good news was that the company was attracting customers and partners to the platform at a rapid clip, but it could all have come crashing down if Segment didn’t improve the underlying system architecture to support that robust growth. As Carriero reports, that made it a stressful time, but having come from Dropbox, he was in a position to understand that it’s possible to completely rearchitect a business’s technology platform and live to tell about it.

“One of the things I learned from my past life [at Dropbox] is when you have a problem that’s just so core to your business, at a certain point you start to realize that you are the only company in the world kind of experiencing this problem at this kind of scale,” he said. For Dropbox that was related to storage, and for Segment it was processing large amounts of data concurrently.

In the build-versus-buy equation, Carriero knew that he had to build his way out of the problem. There was nothing out there that could solve Segment’s unique scaling issues. “Obviously that led us to believe that we really need to think about this a little bit differently, and that was when our Centrifuge V2 architecture was born,” he said.

Building the imperfect beast

When the company began measuring system performance, it was processing 8,442 events per second. By the time it began building V2 of the architecture, that number had grown to an average of 18,907 events per second.

Oct 09, 2020

How Roblox completely transformed its tech stack

Picture yourself in the role of CIO at Roblox in 2017.

At that point, the gaming platform and publishing system that launched in 2005 was growing fast, but its underlying technology was aging, consisting of a single data center in Chicago and a bunch of third-party partners, including AWS, all running bare metal (nonvirtualized) servers. At a time when users have precious little patience for outages, your uptime was just two nines, or 99% (five nines, or 99.999%, is considered optimal).
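To put those figures in perspective, each additional nine cuts the permitted downtime by a factor of ten. As a back-of-the-envelope sketch (our arithmetic, not from the article), here is what the gap between two nines and five nines means over a year:

```typescript
// Annual downtime implied by "N nines" of uptime.
const MINUTES_PER_YEAR = 365 * 24 * 60; // 525,600

function downtimeMinutesPerYear(nines: number): number {
  const uptime = 1 - Math.pow(10, -nines); // 2 nines -> 0.99, 5 nines -> 0.99999
  return MINUTES_PER_YEAR * (1 - uptime);
}

console.log(downtimeMinutesPerYear(2)); // 5,256 minutes: roughly 3.7 days of outages
console.log(downtimeMinutesPerYear(5)); // about 5.3 minutes
```

In other words, two nines tolerates nearly four days of outages a year, while five nines allows barely five minutes.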

Unbelievably, Roblox was popular in spite of this, but the company’s leadership knew performance like that couldn’t continue, especially as the platform was rapidly gaining in popularity. The company needed to call in the technology cavalry, which is essentially what it did when it hired Dan Williams in 2017.

Williams has a history of solving these kinds of intractable infrastructure issues, with a background that includes a gig at Facebook between 2007 and 2011, where he worked on the technology to help the young social network scale to millions of users. Later, he worked at Dropbox, where he helped build a new internal network, leading the company’s move away from AWS, a major undertaking involving moving more than 500 petabytes of data.

When Roblox approached him in mid-2017, he jumped at the chance to take on another major infrastructure challenge. While the company is still in the midst of the transition to a new modern tech stack today, we sat down with Williams to learn how he put Roblox on the road to a cloud-native, microservices-focused system with its own network of worldwide edge data centers.

Scoping the problem

Jan 05, 2019

How Trulia began paying down its technical debt

As every software company knows, code ages, workarounds build on workarounds, and the code base bloats over time. It becomes ever more difficult to work around the technical debt that has built up along the way. The phenomenon is all but impossible to avoid, but at some point companies realize the debt is so great that it limits their ability to build new functionality. That’s precisely what Trulia faced in 2017, when it began a process of paying down that debt and modernizing its architecture.

Trulia is a real estate site founded back in 2005, an eternity ago in terms of technology. The company went public in 2012 and was acquired by Zillow in 2014 for $3.5 billion, but it has continued to operate as an independent brand under the Zillow umbrella. When engineering began thinking about modernization, the team understood that a lot had changed technologically in the 12 years since the company’s inception. It knew its humongous, monolithic code base was inhibiting its ability to update the site.

The team tried to pull out some of the newer functions as services, but that didn’t really make the site any more nimble, because those services always had to tie back into the monolithic central code base. The development team knew that escaping this coding trap would take a complete overhaul.

Brainstorming broad change

As you would expect, a process like this doesn’t happen overnight; it takes months to plan and implement. It all started back in 2017, when the company held what it called an “Innovation Week” with the entire engineering team. Groups of engineers came up with ideas for solving the problem, but the one that got the most attention was Project Islands, which proposed breaking out the different pieces of the site as individual coding islands that could operate independently of one another.

It sounds simple, but in practice it involved breaking down the entire code base into services. The team would use Next.js and React to rebuild the front end, and GraphQL, an open source API query language, to rebuild the back end.
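The article doesn’t detail Trulia’s schema or endpoints, but as a minimal sketch of the pattern, assuming a hypothetical endpoint and field names, a front-end island could ask a GraphQL back end for exactly the fields it needs:

```typescript
// Minimal sketch of a GraphQL round trip; the URL, query shape and field
// names are hypothetical, not Trulia's actual API.
async function fetchListing(listingId: string) {
  const response = await fetch("https://api.example.com/graphql", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      query: `
        query Listing($id: ID!) {
          listing(id: $id) {
            address
            price
            photos { url }
          }
        }`,
      variables: { id: listingId },
    }),
  });
  const { data } = await response.json();
  return data.listing; // only the requested fields come back
}
```

The appeal for a services split is that each island declares its own data needs in the query, rather than depending on whatever a monolithic page happens to load.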

Deep Varma, Trulia’s VP of engineering, pointed out that because the company was founded in 2005, the site was built on PHP and MySQL, two popular development technologies of that era. Whenever his engineers made a change to any part of the site, Varma says, they needed to do a complete system release, and that caused a major bottleneck.

What they really needed to do was move to a completely modern microservices architecture that allowed engineering teams to work independently in a continuous delivery approach without breaking any other team’s code. That’s where the concept of islands came into play.

Islands in the stream

The islands were actually microservices. Each one could communicate with a set of central common services: authentication, A/B testing, the navigation bar, the footer, all of the pieces that every mini code base would need. Meanwhile, the teams building the islands could work independently, without a huge rebuild every time they added a new element or changed something.
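Trulia hasn’t published this code, so the sketch below is only an illustration of the shape of that split; every component and hook name here is invented for the example:

```tsx
import React from "react";

// Stand-ins for the shared "app shell" services described above. In the real
// architecture these would live in a common package that every island imports;
// the names and implementations here are hypothetical.
const NavBar = () => <nav>shared navigation</nav>;
const Footer = () => <footer>shared footer</footer>;
const useExperiment = (_name: string): boolean => true; // stub A/B-test hook

// The island itself: a self-contained page that a single team can build,
// test and deploy without rebuilding the rest of the site.
export function NeighborhoodsIsland({ city }: { city: string }) {
  const showNewMap = useExperiment("neighborhoods-new-map");
  return (
    <>
      <NavBar />
      <main>{showNewMap ? `New map view of ${city}` : `Classic view of ${city}`}</main>
      <Footer />
    </>
  );
}
```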

The harsh reality of this kind of overhaul came into focus as the teams realized they had to write the new pieces while the old system was still in place and running. In a video the company made describing the effort, one engineer likened it to changing the engine of a 747 in the middle of a flight.

Varma says he didn’t try to do everything at once, as he needed to see if the islands approach would work in practice first. In November 2017, he pulled the first engineering team together, and by January it had built the app shell (the common services piece) and one microservice island. When the proof of concept succeeded, Varma knew they were in business.

Building out the archipelago

It’s one thing to build a single island, but it’s another matter to build a chain of them, and that would be the next step. By last April, engineering had shown enough progress to present the entire idea to senior management and get the go-ahead to move forward with a more complex project.

First, it took some work with the Next.js team to get the framework working the way Trulia wanted. Varma said he brought the Next.js team in to work with his engineers, and together they needed to figure out how to stitch the various islands together and resolve dependencies among the different services. The Next.js team actually changed its development roadmap for Trulia, speeding up delivery of those capabilities on the understanding that other companies would run into similar issues.
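The article doesn’t say what that stitching looked like in code. One way independently deployed Next.js apps are commonly mounted under a single domain is with rewrites in next.config.js; the sketch below is purely illustrative, since the hostnames are invented and the rewrites API shown here shipped in a later Next.js release than Trulia would have been using at the time:

```typescript
// next.config.js (illustrative): route each URL prefix to its own island app.
module.exports = {
  async rewrites() {
    return [
      {
        source: "/neighborhoods/:path*", // hypothetical island
        destination: "https://neighborhoods.internal.example.com/neighborhoods/:path*",
      },
      {
        source: "/off-market/:path*", // hypothetical island
        destination: "https://off-market.internal.example.com/off-market/:path*",
      },
    ];
  },
};
```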

By last July, the company had released Neighborhoods, the first fully independent island functionality on the site. More recently, it moved off-market properties to islands. Off-market properties, as the name implies, are pages with information about properties that are no longer on the market, and Varma says these pages make up a significant portion of the company’s traffic.

While Varma would not say exactly how much of the site has been moved to islands at this point, he said the goal is to move the majority to the new platform in 2019. All of this shows that a complete overhaul of a complex site doesn’t happen overnight, but Trulia is steadily moving off the original system it built in 2005 and onto the more modern, flexible architecture it has created with islands. It may not have paid down its technical debt in full in 2018, but it went a long way toward laying the foundation to do so.
