This is a very personal post to explain, and apologize to the community for, our very sloppy server response over the last month or so. I will also make a case for how the website works and for how we are envisioning our upcoming web services.
I know some of you will appreciate these explanations, since there have been comments in the forum and requests for someone’s toes to be cut (probably mine). It is not that we haven’t been doing anything about it, we just didn’t communicate properly about what we were doing, mostly because we were too busy getting it done. Therefore I apologize. I am a grown-up bearded man with no problem saying I am sorry.
As many of you know, the Arduino project is growing quickly. We are not experiencing the exponential growth we saw during 2010, but it is following a more or less linear growth that makes my hair get a little grayer every time I look at the statistics on how our website is used. We are about to reach 1,000,000 unique visitors per month, adding up to over 53,000,000 hits on the website each month. On our previous server (we upgraded last week), the 8 cores were at 100% load 90% of the time. Since we were renting a VPS at ServInt, we could go beyond the 8 cores. In other words, we could easily reach a peak of 9.5 virtual cores on weekdays.
This started to happen in late November 2011. That was unexpected, since the end of the year has traditionally been the time when our stats slow down until March, when student projects magically happen at many education centers around the world. But, somehow, the deals made last year throughout the Arduino distribution network brought much more traffic to the website and took me a little by surprise.
When I first noticed this issue, I made two decisions:
1) we need to make more frequent backups of the website, to ensure we have everything there whenever we migrate the website
2) we need to upgrade the server asap
About the backups
I wrote a script to make daily backups of the website. It takes the DB down for a maximum of 10 minutes early in the European morning. That is the message that some of our most committed forum participants see when they go on answering questions at 6AM (man, that is commitment).
You guys might come to the website at the time the script is backing up the whole database. In the future we will make this work differently. We are setting up a whole new backup server that will copy everything as it happens, and it will not stop the server from operating, as the current script does.
You might think … but ServInt is an internationally recognized provider, isn’t it giving you a proper backup tool? Our provider is in fact giving us a great backup system for free, but we want to have our own on top … you never know when it will come in handy. We have 7 years of interaction in the forum and website that we don’t want to lose. Our knowledge base (our as in yours and ours) has been built collaboratively among the Arduino users since 2005, and we have the responsibility of keeping it alive as well as safe so that it doesn’t get lost.
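To give an idea of what "backing up without stopping the server" can look like, here is a minimal sketch. It is not the script described above: it uses SQLite from the Python standard library purely as a stand-in, since the post does not say which database engine the forum runs on, and the function name hot_backup is made up for illustration.

```python
import os
import sqlite3
import tempfile


def hot_backup(source: sqlite3.Connection, dest_path: str,
               pages_per_step: int = 64) -> None:
    """Copy a live SQLite database to dest_path without taking it offline.

    Hypothetical helper: SQLite is only a stand-in for the real forum database.
    """
    dest = sqlite3.connect(dest_path)
    try:
        # Connection.backup copies a few pages per step, so other
        # connections can keep working between steps.
        source.backup(dest, pages=pages_per_step)
    finally:
        dest.close()


if __name__ == "__main__":
    # Build a tiny in-memory "forum" and copy it while it stays open.
    live = sqlite3.connect(":memory:")
    live.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT)")
    live.execute("INSERT INTO posts (body) VALUES ('hello forum')")
    live.commit()

    target = os.path.join(tempfile.mkdtemp(), "forum_backup.db")
    hot_backup(live, target)

    copy = sqlite3.connect(target)
    print(copy.execute("SELECT body FROM posts").fetchall())
    copy.close()
```

The point of the design is that the copy happens in small steps next to normal traffic, instead of locking everybody out for ten minutes.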
About the upgrade
We have finally moved away from a VPS to our own machine!! This is exciting: the demand of the Arduino community is so high that we are now in the situation where we have to rent two machines in a data center to keep the community up and running. It also means that we had to do a migration like we had never done before.
I spent over a week discussing with ServInt how to get this done. I was promised they could do it in 30 minutes. I did an upgrade all by myself in early 2011 with only 5 minutes of downtime, so I thought they were being conservative in their approach. I like it when people don’t try to lie to you to get you in, so I believed they could do our upgrade with 30 minutes of downtime.
I scheduled it for January 7th at 4AM CET, a really good time, since it is when we have the lowest traffic. ServInt came in a day earlier to inform me that they had run out of stock on the machine I wanted for us all to use. It took me a couple of hours to react to this and order the next machine in their line (more expensive, but with more resources), which delayed the upgrade to Sunday the 8th at 4AM CET, the second best time.
Unfortunately, reality turned out worse than ServInt’s worst prediction. Our server was so badly overloaded that it took them over 20 hours to complete the migration. I sat through three shift changes on the technicians’ side, and I also witnessed how they accidentally turned the server off for a 6-hour interval, when they shouldn’t have. This translated into many of you finding the server down when it shouldn’t have been. I got a formal apology from their side; we are good customers to them and they treated us nicely. I appreciated their honesty and took for granted they wouldn’t fail a second time. Despite the issues we faced, we have had a really good relationship with this provider over the last 3 years or so, and I am willing to keep it going as long as they keep up the good work from now on.
About the Forum being dead slow
The forum makes use of a database. Every request you guys make, specific searches or posts or whatever, translates into a database query that uses CPU time. It is not about hard drive and it is not about memory … it is about CPU. All the requests you make weigh on the CPU, and the more visitors we get, the more CPU time they consume. Since we are still growing, the consumption of CPU power is our main concern, as everyone likes to interact with the forum. Correct me if I am wrong, but it is a helpful resource, both for looking for advice and for recreational purposes … people come, comment, joke, complain … and we like it to be like that.
So the upgrade was necessary to keep up with the forum’s growth. But if something can go wrong, it might do so … and in this case it did. A field in the forum’s database got modified, which broke the Personal Messaging feature. At the same time, there were so many files uploaded in the forum’s folders that we reached a protection quota I established about a year ago … both issues came at nearly the same time, probably one day apart, but you experienced them as if they had the same source. Of course, rendering that error also took CPU time, slowing down the server’s reactions to your queries.
Both issues are now fixed and should not bother you for quite some time. The forum has about 100GB in uploads, personal pictures, etc., but the new machine has enough hard drive space to handle this.
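For those curious about how a quota like that can be watched before it bites, here is a small sketch. It assumes the quota is simply a byte limit on an uploads folder; the function names, the 80% warning threshold and the 120GB limit in the example are all invented for illustration and are not the actual server configuration.

```python
import os


def folder_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files below root (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def quota_status(used: int, quota: int, warn_ratio: float = 0.8) -> str:
    """Return 'over', 'warning' or 'ok' for a usage figure against a quota.

    warn_ratio is a hypothetical early-warning threshold.
    """
    if used >= quota:
        return "over"
    if used >= warn_ratio * quota:
        return "warning"
    return "ok"


if __name__ == "__main__":
    gb = 1024 ** 3
    # Illustrative numbers only: about 100GB of uploads against a made-up 120GB quota.
    print(quota_status(100 * gb, 120 * gb))
```

Run from a daily job, a check like this would raise a warning some time before uploads start failing.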
And now what?
Now, we will keep on working on things as usual. We will move the forum to a machine of its own at some point this year. We will implement database replication to completely get rid of those downtimes in the European mornings, and we will make sure the backups of the folders take as little CPU as possible by using differential backup systems (which we aren’t using yet). I will work on bringing in some of the features many of you have requested for the forum over the last months, and we will try to better integrate the different parts of the website with each other (yes, some more graphic design will help).
In the meanwhile I have to credit Cristian (aka Mr. Vacuum) for his very helpful contribution to the Forum and Identification system, and all of you for your patience during this last month or so. We will work on anticipating things much better from now on.
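As a rough illustration of the differential backup idea, the sketch below copies only the files that are new or changed since the last full backup. It is an assumption-laden toy, not the system we will deploy: the names snapshot and differential_backup are hypothetical, and change detection relies on file size and modification time (cheap on CPU) rather than on hashing file contents.

```python
import os
import shutil


def snapshot(root: str) -> dict:
    """Record (size, mtime) for every file below root, keyed by relative path."""
    snap = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.stat(full)
            snap[os.path.relpath(full, root)] = (st.st_size, st.st_mtime_ns)
    return snap


def differential_backup(root: str, full_snapshot: dict, dest: str) -> list:
    """Copy files that are new or changed since the full backup into dest.

    full_snapshot is the result of snapshot(root) taken at full-backup time.
    Returns the sorted list of relative paths that were copied.
    """
    current = snapshot(root)
    changed = sorted(rel for rel, sig in current.items()
                     if full_snapshot.get(rel) != sig)
    for rel in changed:
        target = os.path.join(dest, rel)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.copy2(os.path.join(root, rel), target)  # copy2 keeps timestamps
    return changed
```

Typical use would be to call snapshot once right after a full backup, keep the result, and then run differential_backup against it every night. Unchanged files are only stat'ed, never read, which is where the CPU saving comes from.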
If you want to contribute to the Arduino website, you can use the Playground, which is an open wiki, but you can also be part of the Forum. You can help by volunteering to moderate a forum in your language (we can create boards in any language in the universe, just ask) or translating the Arduino reference to your mother tongue. Send an email to web [at] arduino DOT cc or comment on this board in the forum with your suggestions.