Ran across an excellent synopsis of YouTube lessons learned in scaling their application over at HighScalability.
Four of the "Lessons" really struck me as things every architect on a large software application should learn. Copied for convenience:
Approximate Correctness - Cheat a Little
Another favorite technique. The state of the system is that which it is reported to be. If a user can’t tell a part of the system is skewing and inconsistent, then it’s not.
A real world example. If you write a comment and someone loads the page at the same time, they might not get it for 300-400ms, the user who is reading won’t care. The writer of the comment will care, so you make sure the user who wrote the comment will see it. So you cheat a little bit. Your system doesn’t have to have globally consistent transactions. That would be super expensive and overkill. Not every comment is a financial transaction. So know when you can cheat.
Expert Knob Twiddling
Ask, what do you know about your consistency model? For comments is eventually consistent good enough? Renting a movie is different. When renting there’s money so we’ll do the best we can to never lose that. Different consistency models are needed depending on the data.
Jitter - Add Entropy Back into Your System
Hot word in their group all of the time. If your system doesn’t jitter then you get thundering herds. Distributed applications are really weather systems. Debugging them is as deterministic as predicting the weather. Jitter introduces more randomness because surprisingly, things tend to stack up.
For example, cache expirations. For a popular video they cache things as best they can. The most popular video they might cache for 24 hours. If everything expires at one time then every machine will calculate the expiration at the same time. This creates a thundering herd.
By jittering you are saying randomly expire between 18-30 hours. That prevents things from stacking up. They use this all over the place. Systems have a tendency to self synchronize as operations line up and try to destroy themselves. Fascinating to watch. You get slow disk system on one machine and everybody is waiting on a request so all of a sudden all these other requests on all these other machines are completely synchronized. This happens when you have many machines and you have many events. Each one actually removes entropy from the system so you have to add some back in.
Cheating - Know How to Fake Data
Awesome technique. The fastest function call is the one that doesn’t happen. When you have a monotonically increasing counter, like movie view counts or profile view counts, you could do a transaction every update. Or you could do a transaction every once in awhile and update by a random amount and as long as it changes from odd to even people would probably believe it’s real. Know how to fake data.
Video: Scalability at YouTube
Originating Article: 7 Years Of YouTube Scalability Lessons In 30 Minutes
86b3a34f-61c4-4bea-99fb-87ab754a8028|0|.0
Optimization, Scalability
Scalability