Todd Hoff's picture

Welcome to High Scalability

We started High Scalability to help you build successful scalable websites. This site tries to bring together all the lore, art, science, practice, and experience of building scalable websites into one place so you can learn how to build your system with confidence. Hopefully this site will move you further and faster along the learning curve of success. Please Start Here.

Alternative Memcache Usage: A Highly Scalable, Highly Available, In-Memory Shard Index

While working with Memcache the other night, it dawned on me that it’s usage as a distributed caching mechanism was really just one of many ways to use it. That there are in fact many alternative usages that one could find for Memcache if they could just realize what Memcache really is at its core – a simple distributed hash-table – is an important point worthy of further discussion.

To be clear, when I say “simple”, by no means am I implying that Memcache’s implementation is simple, just that the ideas behind it are such. Think about that for a minute. What else could we use a simple distributed hash-table for, besides caching? How about using it as an alternative to the traditional shard lookup method we used in our Master Index Lookup scalability strategy, discussed previously here.

Todd Hoff's picture

Paper: MapReduce: Simplified Data Processing on Large Clusters

Update: MapReduce and PageRank Notes from Remzi Arpaci-Dusseau's Fall 2008 class . Collects interesting facts about MapReduce and PageRank. For example, the history of the solution to searching for the term "flu" is traced through multiple generations of technology.

With Google entering the cloud space with Google AppEngine and a maturing Hadoop product, the MapReduce scaling approach might finally become a standard programmer practice. This is the best paper on the subject and is an excellent primer on a content-addressable memory future.

Some interesting stats from the paper: Google executes 100k MapReduce jobs each day; more than 20 petabytes of data are processed per day; more than 10k MapReduce programs have been implemented; machines are dual processor with gigabit ethernet and 4-8 GB of memory.

One common criticism ex-Googlers have is that it takes months to get up and be productive in the Google environment. Hopefully a way will be found to lower the learning curve and make programmers more productive faster.

From the abstract:

Todd Hoff's picture

Strategy: Understanding Your Data Leads to the Best Scalability Solutions

In article Building Super-Scalable Web Systems with REST Udi Dahan tells an interesting story of how they made a weather reporting system scale for over 10 million users. So many users hitting their weather database didn't scale. Caching in a straightforward way wouldn't work because weather is obviously local. Caching all local reports would bring the entire database into memory, which would work for some companies, but wasn't cost efficient for them.

So in typical REST fashion they turned locations into URIs. For example: http://weather.myclient.com/UK/London. This allows the weather information to be cached by intermediaries instead of hitting their servers. Hopefully for each location their servers will be hit a few times and then the caches will be hit until expiry.

Scalability Perspectives #5: Werner Vogels – The Amazon Technology Platform

Scalability Perspectives is a series of posts that highlights the ideas that will shape the next decade of IT architecture. Each post is dedicated to a thought leader of the information age and his vision of the future. Be warned though – the journey into the minds and perspectives of these people requires an open mind.

Werner Vogels

Dr. Werner Vogels is Vice President & Chief Technology Officer at Amazon.com where he is responsible for driving the company’s technology vision, which is to continuously enhance the innovation on behalf of Amazon’s customers at a global scale. Prior to joining Amazon, he worked as a researcher at Cornell University where he was a principal investigator in several research projects that target the scalability and robustness of mission-critical enterprise computing systems. He is regarded as one of the world's top experts on ultra-scalable systems and he uses his weblog to educate the community about issues such as eventual consistency. Information Week recently recognized Vogels for this educational and promotional role in Cloud Computing with the 2008 CIO/CTO of the Year award.

Service-Oriented Architecture, Utility Computing and Internet Level 3 Platform in practice

Amazon has built a loosely coupled service-oriented architecture on an inter-planetary scale. They are the pioneers of Utility Computing and Internet Platforms discussed earlier in Scalability Perspectives. Amazon's CTO, Werner Vogels is undoubtedly a thought leader for the coming age of cloud computing.

100% on Amazon Web Services: Soocial.com - a lesson of porting your service to Amazon

Simone Brunozzi, technology evangelist for Amazon Web Services in Europe, describes how Soocial.com was fully ported to Amazon web services.

----------------
This period of the year I decided to dedicate some time to better understand how our customers use AWS, therefore I spent some online time with Stefan Fountain and the nice guys at Soocial.com, a "one address book solution to contact management", and I would like to share with you some details of their IT infrastructure, which now runs 100% on Amazon Web Services!

In the last few months, they've been working hard to cope with tens of thousands of users and to get ready to easily scale to millions. To make this possible, they decided to move ALL their architecture to Amazon Web Services. Despite the fact that they were quite happy with their previous hosting provider, Amazon proved to be the way to go.
-----------------

Read the rest of the article here.

Platform virtualization - top 25 providers (software, hardware, combined)

In this article they present the companies which offers means (mainly, the software and hardware) which powers most of the cloud computing hosting providers, namely virtualization solutions.

Read the entire article about Platform virtualization - top 25 providers (software, hardware, combined) at MyTestBox.com - web software reviews, news, tips & tricks.

Todd Hoff's picture

Paper: Spamalytics: An Empirical Analysisof Spam Marketing Conversion

Under the philosophy that the best method to analyse spam is to become a spammer, this absolutely fascinating paper recounts how a team of UC Berkely researchers went under cover to infiltrate a spam network. Part CSI, part Mission Impossible, and part MacGyver, the team hijacked the botnet so that their code was actually part of the dark network itself. Once inside they figured out the architecture and protocols of the botnet and how many sales they were able to tally. Truly elegant work.

Two different spam campaigns were run on a Storm botnet network of 75,800 zombie computers. Storm is a peer-to-peer botnet that uses spam to creep its tentacles through the world wide computer network. One of the campains distributed viruses in order to recruit new bots into the network. This is normally accomplished by enticing people to download email attachments. An astonishing one in ten people downloaded the executable and ran it, which means we won't run out of zombies soon. The downloaded components include: Backdoor/downloader, SMTP relay, E-mail address stealer, E-mail virus spreader, Distributed denial of service (DDos) attack tool, pdated copy of Storm Worm dropper. The second campaign sent pharmacuticle spam ("libido boosting herbal remedy”) over the network.

Haven't you always wondered who clicks on spam and how much could spammers possibly make? In the study only 28 sales resulted from 350 million spam e-mail messages sent over 26 days. A conversion rate of well under 0.00001% (typical advertising campaign might have a conversion of 2-3%). The average purchase price was about $100 for $2,731.88 in total revenue. The reserchers estimate total daily revenue attributable to Storm’s pharmacy campaign is about $7000 and that they pick up between 3500 and 8500 new bots per day through their Trojan distribution system. And this is with only 1.5% of the entire network in use.

So, the spammers would take in total revenue about $3.5 million a year from one product from one network. Imagine the take with multiple products and multiple networks? That's why we still have spam. And since the conversion rate is already so low, it seems spam will always be with us.

As fascinating as all the spamonomics are, the explanation of the botnet architecture is just as fascinating. Storm uses a three-level self-organizing hierarchy pictured here:

How to Organize a Database Table’s Keys for Scalability

The key (no pun intended) to understanding how to organize your dataset’s data is to think of each shard not as an individual database, but as one large singular database. Just as in a normal single server database setup where you have a unique key for each row within a table, each row key within each individual shard must be unique to the whole dataset partitioned across all shards.

There are a few different ways we can accomplish uniqueness of row keys across a shard cluster. Each has its pro’s and con’s and the one chosen should be specific to the problems you’re trying to solve.

The I.H.S.D.F. Theorem: A Proposed Theorem for the Trade-offs in Horizontally Scalable Systems

Successful software design is all about trade-offs. In the typical (if there is such a thing) distributed system, recognizing the importance of trade-offs within the design of your architecture is integral to the success of your system. Despite this reality, I see time and time again, developers choosing a particular solution based on an ill-placed belief in their solution as a “silver bullet”, or a solution that conquers all, despite the inevitable occurrence of changing requirements. Regardless of the reasons behind this phenomenon, I’d like to outline a few of the methods I use to ensure that I’m making good scalable decisions without losing sight of the trade-offs that accompany them. I’d also like to compile (pun intended) the issues at hand, by formulating a simple theorem that we can use to describe this oft occurring situation.

Second Life Architecture - The Grid

Update:Presentation: Second Life’s Architecture. Ian Wilkes, VP of Systems Engineering, describes the architecture used by the popular game named Second Life. Ian presents how the architecture was at its debut and how it evolved over years as users and features have been added.

Second Life is a 3-D virtual world created by its Residents. Virtual Worlds are expected to be more and more popular on the internet so their architecture might be of interest. Especially important is the appearance of open virtual worlds or metaverses.
What happens when video games meet Web 2.0? What happens is the metaverse.

Information Sources

Platform

What's Inside?

The Stats

  • ~1M active users
  • ~95M user hours per quarter
  • ~70K peak concurrent users (40% annual growth)
  • ~12Gbit/sec aggregate bandwidth (in 2007)

Gigaspaces curbs latency outliers with Java Real Time

Today, most banks have migrated their internal software development from C/C++ to the Java language because of well-known advantages in development productivity (Java Platform), robustness & reliability (Garbage Collector) and platform independence (Java Bytecode). They may even have gotten better throughput performance through the use of standard architectures and application servers (Java Enterprise Edition). Among the few banking applications that have not been able to benefit yet from the Java revolution, you find the latency-critical applications connected to the trading floor. Why? Because of the unpredictable pauses introduced by the garbage collector which result in significant jitter (variance of execution time).

In this post Frederic Pariente Engineering Manager at Sun Microsystems posted a summary of a case study on how the use of Sun Real Time JVM and GigaSpaces was used in the context of of a customer proof-of-concept this summer to ensure guaranteed latency per message under 10 msec, with no code modification to the matching engine.

Risk Analysis on the Cloud (Using Excel and GigaSpaces)

Every day brings news of either more failures of the financial systems or out-right fraud, with the $50 billion Bernard Madoff Ponzi scheme being the latest, breaking all records. This post provide a technical overview of a solution that was implemented for one of the largest banks in China. The solution illustrate how one can use Excel as a front end client and at the same time leverage cloud computing model and mapreduce as well as other patterns to scale-out risk calculations. I'm hoping that this type of approach will reduce the chances for seeing this type of fraud from happening in the future.

tmielika's picture

Ringo - Distributed key-value storage for immutable data

Ringo is an experimental, distributed, replicating key-value store based on consistent hashing and immutable data. Unlike many general-purpose databases, Ringo is designed for a specific use case: For archiving small (less than 4KB) or medium-size data items (<100MB) in real-time so that the data can survive K - 1 disk breaks, where K is the desired number of replicas, without any downtime, in a manner that scales to terabytes of data. In addition to storing, Ringo should be able to retrieve individual or small sets of data items with low latencies (<10ms) and provide a convenient on-disk format for bulk data access.

Ringo is compatible with the map-reduce framework Disco and it was started at Nokia Research Center Palo Alto.

Scalability Strategies Primer: Database Sharding

This article is a primer, intended to shine some much needed light on the logical, process oriented implementations of database scalability strategies in the form of a broad introduction. More specifically, the intent is to elaborate on the majority of these implementations by example.

[ANN] New Open Source Cache System

The SHOP.COM Cache System is now available at http://code.google.com/p/sccache/

The SHOP.COM Cache System is an object cache system that...
* is an in-process cache and external, shared Cache
* is horizontally scalable
* stores cached objects to disk
* supports associative keys
* is non-transactional
* can have any size key and any size data
* does auto-GC based on TTL
* is container and platform neutral

It was built in-house at SHOP.COM (by me) and has powered our website for years. We are open-sourcing it in the hope that it will be useful to others and to get some help in its maintenance.

This is our first open source attempt and we'd appreciate any help and comments.

Todd Hoff's picture

Facebook is Hiring

I thought with the job situation these days that people might be interested in some open jobs at Facebook. Here's what's available:




Facebook is hiring! We are looking for a Systems Engineer/Architect and Site Reliability Engineer. I have attached the job descriptions below. If you are interested, please contact Michelle Bostock mbostock-at-facebook.com. Thanks and Happy Holidays!

Systems Architect
Palo Alto, CA

Description
Facebook is seeking a seasoned Systems Architect to join the Operations team. The position is full-time and is based in our main office in downtown Palo Alto and will report to the Manager of Systems Operations.

Responsibilities

* Analyze application flow and infrastructure design to improve performance and scalability of the site
* Collaborate on design of services infrastructure from servers to networking
* Monitor, analyze, and make recommendations as appropriate to improve site stability and availability
* Evaluate hardware and software technologies to improve site efficiency and performance
* Troubleshoot and solve issues with hardware, applications, and network components
* Lead team efforts from design to implementation, prioritize tasks and resources while interacting with Engineering and Operations
* Document current and future configuration processes and policies
* Participate in 24x7 on-call support

Requirements

* B.S. in Computer Science or equivalent experience
* 4+ years of experience in Operations with large web farms
* Extensive knowledge of web architecture and technologies, including Linux, Apache, MySQL, PHP, TCP/IP, security, HTTP, LDAP and MTAs
* Strong background/interest in application and infrastructure design
* Scripting and programming skills
* Excellent verbal and written communication skills



Site Reliability Engineer
Palo Alto, CA

Description
Facebook is seeking talented operations engineers to join the Site Reliability Engineering team. The ideal candidate will have strong communication skills, a passion for tinkering with Linux, and an almost insane fondness for fast-paced, seat-of-your-pants troubleshooting and crisis management. The position is full-time and is based in our main office in downtown Palo Alto. This position reports to the Manager of Site Reliability Engineering.

Responsibilities

* Monitor the stability and performance of the website
* Remotely troubleshoot and diagnose hardware problems
* Debug issues with Linux software, applications and network
* Resolve technical challenges encountered in LAMP technologies
* Develop and maintain monitoring tools and automation systems
* Predict and respond to utilization variances across multiple datacenters
* Identify and triage all outage related events
* Facilitate communication, coordinate escalation, and work with subject matter experts to implement critical fixes
* Automate and streamline processes
* Track issues and run reports

Requirements

* 2-3 years+ Linux support/sys admin experience in an Internet operations environment
* BA/BS in Computer Science or a related field, or equivalent experience
* Working knowledge of Linux, Cisco, TCP/IP, Apache and mySQL
* Experience working with network management systems and monitoring tools, such as Nagios, Ganglia and Cacti
* Competency in Shell, PHP, Perl or Python. C is a plus
* Solid understanding of web services architecture and commonly employed technologies
* A sense of urgency in responding to and resolving critical issues that relate to the performance of the site and/or core infrastructure
* Excellent verbal and written communication skills
* Participation in a shifted coverage schedule, including working nights and on-call rotations

Scaling MySQL on a 256-way T5440 server using Solaris ZFS and Java 1.7

How to scale MySQL on a 32 core system with 256 threads? Diagonal scalability in a box.
An impressive benchmark that achieved more than 79,000 SQL queries per second on a single 4 RU server! Is this real? If so what is the role of good old horizontal scalability?

The goals of the benchmark:

  1. Reach a high throughput of SQL queries on a 256-way Sun SPARC Enterprise T5440
  2. Do it 21st century style i.e. with MySQL and ZFS , not 20th century style i.e with OraSybInf... and VxFS
  3. Do it with minimal tuning i.e as close as possible as out-of-the-box
Todd Hoff's picture

Strategy: Facebook Tweaks to Handle 6 Time as Many Memcached Requests

Our latest strategy is taken from a great post by Paul Saab of Facebook, detailing how with changes Facebook has made to memcached they have:


...been able to scale memcached to handle 200,000 UDP requests per second with an average latency of 173 microseconds. The total throughput achieved is 300,000 UDP requests/s, but the latency at that request rate is too high to be useful in our system. This is an amazing increase from 50,000 UDP requests/s using the stock version of Linux and memcached.

To scale Facebook has hundreds of thousands of TCP connections open to their memcached processes. First, this is still amazing. It's not so long ago you could have never done this. Optimizing connection use was always a priority because the OS simply couldn't handle large numbers of connections or large numbers of threads or large numbers of CPUs. To get to this point is a big accomplishment. Still, at that scale there are problems that are often solved.

Some of the problem Facebook faced and fixed:

  • Per connection consumption of resources. What works well at low number of inputs can totally kill a system as inputs grow. Memcached uses a per-connection buffer which adds up to a lot of memory that could be used to store data. Nothing wrong with this design choice, but Facebook made changes to use a per-thread shared connection buffer and reclaimed gigabytes of RAM on each server.
  • Kernel lock contention. Facebook discovered under load there was lock contention when transmitting through a single UDP socket from multiple threads. Sockets are data structures too and they are subject to the usual lock contention issues. Facebook got around this issue by maintaining separate reply sockets in different threads so they would not contend with the receive sockets. They found another bottleneck in Linux’s “netdevice” layer that sits in-between IP and device drivers. They changed the dequeue algorithm to batch dequeues so more work was done when they had the CPU.
  • Application lock contention. Nothing brings out lock issues like moving to more cores. Facebook found when they moved to 8 core machines a global lock protecting stats collection used 20-30% of CPU usage. In application that require little processing per request, as does memcached, this is not unexpected, but doing real work with your CPU is a better idea. So they collected stats on a per thread basis and then calculated a global view on demand.
  • Interrupt floods and starvation. With so much traffic directed at a single server the hardware can flood the CPU(s) with interrupts and keep the CPU from doing "real" work. To get around this problem Facebook implements some complicated strategies to load balance IO across all the cores. As I am less clever I might try more network cards with a TCP Offload engine.

    When you read Paul's article keep in mind all the incredible number of man hours that went into profiling the system, not just their application, but the entire software hardware stack. Then add in the research, planning, and trying different solutions to see if anything changed for the better. It's a lot of work. Notice using a nifty new parallel language or moving to a cloud wouldn't have made a bit difference. It's complete mastery of their system that made the difference.

    A summary of potential strategies:

  • Profile everything. Problems are always specific. The understanding of the problem must be specific. The fix must be specific.
  • Burn profiling into your regression tests. Detect when and where performance tanks as a regular part of your build.
  • Use resources in proportion to what grows slowest. This requires multiplexing, but at least your resource usage is more predictable and bounded.
  • Batch work. When you have the CPU do all the work you possibly can in the quantum or the whole system grinds to a halt in processing overhead.
  • Do work and maintain resources per task. Otherwise locking for shared resources takes more and more time when there's less and less time to do the work that needs to be done.
  • Change algorithms. Sometimes you simply need to do things differently. Tweaking will only get you so far.

    You can find their changes on github, the hub that says "git."

  • Todd Hoff's picture

    Latency is Everywhere and it Costs You Sales - How to Crush it

    Update 3:Nati Shalom's Take on this article. Lots of good stuff on designing architectures for latency minimization.
    Update 2: Why Latency Lags Bandwidth, and What it Means to Computing by David Patterson. Reasons: Moore's Law helps BW more than latency; Distance limits latency; Bandwidth easier to sell; Latency help BW, but not vice versa; Bandwidth hurts latency; OS overhead hurts latency more than BW. Three ways to cope: Caching, Replication, Prediction. We haven't talked about prediction. Games use prediction, i.e, project where a character will go, but it's not a strategy much used in websites.
    Update: