Have you ever had a month that felt like every day was Monday?
What happened:
We were excited as we prepared to roll out a new upgrade to your ResponsiBid. Beta testers have been using the new bidding system and they’ve been loving it. Before we launched the upgrade we thought, “Hey! We should use a new server configuration that will make our system bulletproof!” About a month ago, we had researched and tested a new “server cluster” system that would make everything on ResponsiBid run faster, smoother, and safe from downtime.
In theory.
As some of you may have noticed, we had more downtime in the last month than in the entire history of ResponsiBid. When we first ran our tests on our test servers, we were so tickled! Everything was fast and smooth, and if a server went down (we even broke servers on purpose to test this), the whole system carried on without skipping a beat! It was so neat to watch it fix itself and just keep humming along. It was an upgrade we were excited to bring to our users, though we thought it was almost sad that no one would even notice the elegance of the infrastructure. It was destined to be a crown jewel that lived beneath the surface.
Until we realized that it was more of a dragon than a crown jewel.
We noticed after about a week that one server had gone down and not healed itself. We had to heal it manually and thought, “That was weird.” And then the oddities just kept coming.
So we did what all good development teams do: we called the founders of the technology we were using and asked them to consult with us on the system. They told us it all looked pretty good, but they gave us a few tips (which contradicted what their own documentation told us to do), and we thought how lucky we were to get the “inside scoop.”
I don’t think I can even bring myself to talk about the trouble it was to keep it going. It was trial after trial. Of course, all of our users were experiencing their spring rush amid this server trouble, but we felt we were on the cusp. It just had to work… after all, the biggest companies in the world are using infrastructure like this. We just knew it had to work. We hit some sticky spots and some resulting slowness, but we were going to make it through.
But then Friday the 22nd came.
And we got hit with a seized-up server. It wasn’t behaving at all, and we ended up going down for about an hour while we sorted it out. We thought we had simply pushed our server to its limit… in hindsight, that was kind of silly to think, since we were using servers about 10 times more powerful than most people would use for ResponsiBid’s load… but we just chalked it up to the spring rush!
But it wasn’t.
Over the weekend we tested out even bigger servers, and they worked so well. We were talking “this will melt your face off” kind of speed. So our plan was to upgrade to the new server on Monday night. But we didn’t make it to Monday night. We made it to about noon Pacific Time. Our alarms started going off, and we thought we could simply make the switch early. But the infrastructure we had been using merged our databases to keep them in sync, and somewhere along the way a discrepancy crept into the syncing. This caused something experts would call “data corruption.” It happens every once in a while, and it can be fixed, but it can also act like a little “gremlin” in the code until you find it and root it out. We had to go through a process to clean the data, which meant taking everything offline for server maintenance all afternoon.
While our users were trying to work.
It was humiliating, frustrating, and intense. The pressure was so thick that we felt sick to our stomachs. I was replying as fast as I could to people who were concerned about what was going on, all while trying to get the cleaning process finished. We also decided to leave the cloud infrastructure behind and ditch the cluster setup. We got 2 new servers and went back to what we knew worked: 1 server for the database and a second for our code. It’s simple, it works well, it’s stable, and we have used it for years. We upgraded to a faster setup than we had before (hopefully you are noticing the speed!) and we feel confident that everything is back to the good ol’ days.
But we are gun shy.
We think we have resolved all the conflicts in the database from the syncing we did throughout the last month. But remember, the database is more than 2 GB of data. It’s more than a human could physically read through in an entire lifetime… so we are going to be watching like hawks. And we hope you never have to experience anything like that again. I know I lost about 5 years of my life in this process… and I don’t have much more to give.
So the bottom line:
We researched it, tested it, and pushed the server cluster into production. We tried to make something really special for our users, and we did everything we could to make it work. But it just didn’t work out. In the end, as of this moment (which I pray is the end), we have gone back to the structure we know well and cleaned up the damage the server cluster caused. We now have the fastest and most robust system we’ve ever had, back on the architecture we have proven to be rock solid.
We love ResponsiBidders! We want you to have the best. And this time the best isn’t necessarily the newest.
We thank all of you for your patience with us, and we hope you know that we only want the best for you. We work as hard as we can to be your partners in success. You are the engine of America, and we want to help propel you forward!
Happy ResponsiBidding,
Curt