Building Five Labs

Posted in Infrastructure on Jul 1st, 2014

What We Learned by Generating 200 Million Personalities in One Week

Five Labs is an experiment from my startup, Five.com. We used a scientific model to predict your personality from your Facebook wall posts. The model might sound complicated, but the web app is easy to understand. You log in with your Facebook account, then we analyze (but never store) your wall posts with machine learning and predict your personality. The stored result looks like this: { openness: 80, conscientiousness: 59, extraversion: 30, agreeableness: 40, neuroticism: 50 }.

The app we built went viral and generated personalities for over 200 million people. This post describes the strategies that worked for us and allowed us to handle, at the peak, over 12,000 new accounts per second with no downtime. (We had coffee breaks, but the app never went offline.)

Infrastructure

Five Labs is built on Amazon Web Services. What we built is not very complicated, and there is room for improvement, but it worked well for us. I had two primary goals:

  1. Cache everything.
  2. What cannot be cached should be easy to scale. Growth should be handled by adding more of the same stuff.

A third goal was to log everything. We accomplished this through a mixture of server-side and client-side logging. On the server side we used Graylog2; on the client side we used a combination of Mixpanel, Google Analytics, and custom error-log handling.

By caching (almost) everything (we used CloudFront's site-wide caching), we were able to use fewer servers and handle larger volumes of traffic. The end result looks something like this:

[Figure: Five infrastructure diagram]

For the majority of requests, the process ends at step 3a. We accomplish this with HTTP caching. For the other steps, the biggest bottleneck is the prediction service.

App and Prediction Server

While most things are cacheable, personality prediction is not. The personality prediction server lives on the same instance as the NodeJS app server, along with RabbitMQ. The share-page image generation process (using PhantomJS) also lives here; each instance contains the entire stack. The personality prediction server is not publicly accessible - it can only be reached locally.

We use RabbitMQ as an RPC message queue. This lets the NodeJS app server send a request to the Python prediction server without the prediction server becoming overloaded when too many messages come in too quickly.
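
In practice this is the classic RabbitMQ RPC pattern: the app publishes a request with a replyTo queue and a correlationId, and the worker publishes the prediction back. Here's a minimal sketch of the Node side using the amqplib client (both the library choice and the "predict" queue name are assumptions; the post doesn't name them):

    // Sketch of RPC over RabbitMQ from the NodeJS app server.
    // Library (amqplib) and queue name ("predict") are assumptions.
    var amqp = require('amqplib/callback_api');

    amqp.connect('amqp://localhost', function (err, conn) {
      if (err) throw err;
      conn.createChannel(function (err, ch) {
        if (err) throw err;
        // Exclusive, auto-named queue where the Python worker replies.
        ch.assertQueue('', { exclusive: true }, function (err, q) {
          if (err) throw err;
          var corrId = Date.now() + '-' + Math.random();
          ch.consume(q.queue, function (msg) {
            // Match the reply to our request before using it.
            if (msg.properties.correlationId === corrId) {
              console.log('prediction:', msg.content.toString());
            }
          }, { noAck: true });
          // Queue the request. The worker pulls messages at its own pace,
          // so a traffic burst piles up in RabbitMQ instead of
          // overwhelming the prediction service.
          ch.sendToQueue('predict',
            new Buffer(JSON.stringify({ posts: ['wall post text'] })),
            { correlationId: corrId, replyTo: q.queue });
        });
      });
    });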

This whole process (NodeJS <-> RabbitMQ <-> Prediction Server) is self-contained in each instance, so we can handle growth by simply spinning up more instances; adding capacity requires no additional configuration or overhead. We created an Amazon Machine Image (AMI - an image of a server) for the instance, so when we need more capacity we spin up more instances and tear them down when we don't need them. Scaling this component is fast: a short script boots a new instance from the AMI, specifies the security group, and adds it to the load balancer. We use the "frozen pizza" model, where everything the server needs is baked into the AMI, so a new instance never has to pull code or updates. As soon as the instance boots up (or the pizza is heated up), it's ready to be added to the load balancer.
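
That script can be quite small. Here's a rough sketch using the AWS SDK for Node (the SDK choice and every identifier here - region, AMI ID, instance type, security group, load balancer name - are placeholders, not our actual values):

    // Scale-up sketch: boot an instance from the AMI and register it
    // with the (classic) load balancer. All IDs are placeholders.
    var AWS = require('aws-sdk');
    AWS.config.update({ region: 'us-east-1' });

    var ec2 = new AWS.EC2();
    var elb = new AWS.ELB();

    ec2.runInstances({
      ImageId: 'ami-xxxxxxxx',          // the "frozen pizza": full stack baked in
      InstanceType: 'c3.large',         // hypothetical instance type
      MinCount: 1,
      MaxCount: 1,
      SecurityGroupIds: ['sg-xxxxxxxx']
    }, function (err, data) {
      if (err) throw err;
      var instanceId = data.Instances[0].InstanceId;
      // The ELB health check keeps the instance out of rotation
      // until it finishes booting, so we can register it right away.
      elb.registerInstancesWithLoadBalancer({
        LoadBalancerName: 'five-labs-lb',
        Instances: [{ InstanceId: instanceId }]
      }, function (err) {
        if (err) throw err;
        console.log('Added ' + instanceId + ' to the load balancer');
      });
    });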

Alternatively, the prediction component could be its own service and have a separate RabbitMQ machine. For many applications this would likely be a better solution, but the architecture should fit the use case. For us, the extra overhead, scaling, and maintenance of a separate service would have been overkill.

MongoDB

With little configuration, MongoDB worked great for us. One early decision was to use short field names (e.g., "u" instead of "username"). This is painful, but the extra letters end up taking a lot of unnecessary space when there are hundreds of millions of records. Admittedly, it's a micro-optimization, but the savings add up at that scale. We also store each user's friends list as an array on the user document (our data model could be improved slightly, but either way we have to store the social graph for each user). Overall, the total database size is over 130 GB, but since our indices still fit into memory we haven't noticed any performance hits in the app.
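
For illustration, here's what a write with shortened field names might look like using the Node MongoDB driver (the document shape shown is hypothetical, not our actual schema):

    // Illustrative only: hypothetical shortened-field-name schema.
    var MongoClient = require('mongodb').MongoClient;

    MongoClient.connect('mongodb://localhost/five', function (err, db) {
      if (err) throw err;
      var user = {
        u: 'barackobama',                          // "username"
        p: { o: 80, c: 59, e: 30, a: 40, n: 50 },  // personality scores
        f: ['friendone', 'friendtwo']              // friends list, per user
      };
      // "username" vs "u" costs 7 extra bytes per document; across
      // 200M documents that's roughly 1.4 GB for a single field name.
      db.collection('users').insert(user, function (err) {
        if (err) throw err;
        db.close();
      });
    });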

Eventually, we would need to shard the database if the app continued to grow by another order of magnitude, but we knew up front what the limitations and goals were. For our case, sharding would have been over-architecture. It's helpful to weigh architectural solutions against the product's goals - a large part of me wanted to build the most robust and "scalable" system possible, but that's not always the wisest choice when it doesn't fit the use case.

Redis

As a single point of failure, Redis is probably the most vulnerable part of our infrastructure. However, Redis is awesome. We use it exclusively for caching and for storing sessions. Redis is blazingly fast and takes significant load off our database servers. While it is possible to shard and replicate Redis, a single node was adequate for our needs. With little configuration out of the box, it has performed flawlessly, has yet to crash, and has never been a bottleneck.
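
The caching pattern is simple enough to sketch. Here's a hypothetical get-or-set helper using node_redis (our actual caching code isn't shown in this post, so treat this as an illustration):

    // Get-or-set cache helper: serve from Redis on a hit, otherwise
    // compute the value, store it with a TTL, and return it.
    var redis = require('redis');
    var client = redis.createClient();

    function cached(key, ttlSeconds, fetch, callback) {
      client.get(key, function (err, value) {
        if (!err && value) return callback(null, JSON.parse(value)); // hit
        fetch(function (err, result) {                               // miss
          if (err) return callback(err);
          client.setex(key, ttlSeconds, JSON.stringify(result));
          callback(null, result);
        });
      });
    }

    // Usage: serve a profile from Redis for 15 minutes instead of
    // querying Mongo every request (fetchUserFromMongo is hypothetical).
    cached('user:barackobama', 900, fetchUserFromMongo, function (err, user) {
      // ... render the profile
    });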

Frontend to Backend

Building a scalable backend requires building a scalable frontend. "Cache everything." You'll see this phrase throughout this post, as it's the core concept that allowed us to scale cheaply.

Separate API calls from Page Rendering

Five Labs is a single-page JavaScript app (built with Backbone + Marionette); the app server runs NodeJS. Rendering pages does not require a user's session (although some API endpoints do). For instance, the page labs.five.com/barackobama can be requested by any client regardless of session, making caching extremely easy. Only after the page has loaded does it make an Ajax request to /api/user, an endpoint which does require a user's session ID to be passed in (e.g., via a cookie). Essentially, all pages a user visits are completely cacheable. The frontend app makes a separate API request for user-specific content when necessary (e.g., to log in), but many API calls are also cached (e.g., http://labs.five.com/api/users/barackobama).
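
Here's a minimal sketch of that split, assuming Express and some session middleware (neither of which this post specifies):

    var express = require('express');
    var app = express();
    // ... session/cookie middleware assumed to be configured here ...

    // Page rendering: no session needed, so the response is identical
    // for every client and can be cached by CloudFront.
    app.get('/:username', function (req, res) {
      res.set('Cache-Control', 'public, max-age=900');
      res.sendfile(__dirname + '/public/index.html'); // the app shell
    });

    // User-specific API: requires a session; called via Ajax after load.
    app.get('/api/user', function (req, res) {
      if (!req.session || !req.session.userId) return res.send(401);
      res.json(getUser(req.session.userId)); // getUser is hypothetical
    });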

Cache All The Things

Five Labs was architected around the idea that everything should be as cacheable as possible, which made site-wide caching with CloudFront easy. CloudFront will honor whatever HTTP headers you set. This means the first request to http://labs.five.com/barackobama hits the database and gets cached, but every subsequent request to that endpoint is returned directly from the CDN, without ever hitting our application servers.

In general, there are three levels you can cache at.

  1. (Low level) The function level (functions / methods within your application code) with memoization. An example would be storing the results of calls in a recursive function (see the sketch after this list).
  2. (Higher level) The application level - e.g., storing variables in memory or caching database results with something like Redis.
  3. (Outside of your app) HTTP caching, which you can use to avoid hitting your app at all.

A mixture of all three is ideal.
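
As an example of the first level, here's a small memoization helper (illustrative; not from our codebase):

    // Level 1: function-level caching via memoization. Results are stored
    // by argument, so repeated (and recursive) calls are answered from
    // the cache instead of being recomputed.
    function memoize(fn) {
      var cache = {};
      return function (n) {
        if (!(n in cache)) cache[n] = fn(n);
        return cache[n];
      };
    }

    var fib = memoize(function (n) {
      return n < 2 ? n : fib(n - 1) + fib(n - 2); // recursive calls hit the cache
    });

    fib(50); // fast: each subproblem is computed exactly once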

HTTP Caching

If your app has to do expensive work every time a client requests a page, you have to throw more servers and more money at the problem to keep the site up and responsive. There's another way. Even if your server has to receive a handwritten letter from the White House when you request Barack Obama's profile, if it can do that once instead of every time his profile is accessed, you're not in terrible shape. Sure, the first person to request the profile might have to wait a little while. However, as soon as the server sends the response, it becomes cached for every subsequent call to that endpoint.

HTTP Caching Overview

One of the biggest things that allowed us to scale cheaply was making heavy use of HTTP caching. It gives you several wins.

First, any revisits to your site will be faster for the client, because browsers will use the locally cached content (which also takes load off your servers).

Second (and importantly for us), you can use an HTTP reverse proxy like Varnish (or CloudFront site-wide caching) to keep your application servers from being hit at all after the initial request is made. We use CloudFront, which not only does this but also gives us a Content Delivery Network (CDN). With the CDN, servers are geographically closer to end users, making the site even faster.

A reverse proxy works like this:

[Figure: reverse proxy request flow]

A request is made to your domain (Step 1). If the same resource has already been requested and exists in the cache, the response is immediately returned without hitting your app server (Step 2a). If the response does not yet exist in the cache, the request is sent to your app server (Step 2b). Your server generates a response, saves it in the cache (Step 3), then sends the response to the client (Step 4).

The tricky part is cache invalidation. How long should the proxy server keep that page in the cache? If you want the cache to last for 7 days but change the page on day 4, you have to wait 3 days until the proxy server will hit the server for the freshest version. Here are a couple of ways to get around this:

  1. Set a short cache expiration for things that may change, and a long expiration for things that never change. This works, but is a manual process and prone to error.
  2. Set a short cache expiration for the main page, and add a query parameter to all the resources that page references; the resources themselves get a long expiration. E.g., index.html could have an expiration of a few minutes, and all references to resources would carry a query parameter (e.g., <link href="style.css?v=1">). Anytime you upload a new version of a resource, simply change the query parameter (to v=2, v=3, etc.). This way the resources themselves can have a long expiration, and you never need to invalidate a resource itself - a new resource is requested because it has a new reference (as sketched below).
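
Here's what the second approach might look like in a NodeJS app, again assuming Express (the version scheme and paths are hypothetical):

    var express = require('express');
    var app = express();
    var APP_VERSION = '2'; // bumped on each deploy (hypothetical scheme)

    // The page shell gets a short expiration...
    app.get('/', function (req, res) {
      res.set('Cache-Control', 'public, max-age=300'); // 5 minutes
      res.send('<link rel="stylesheet" href="/style.css?v=' + APP_VERSION + '">');
    });

    // ...while the assets it references get a long one. Changing ?v= in
    // the page creates a brand-new cache key, so old assets never need
    // to be invalidated.
    app.use(express.static(__dirname + '/public', {
      maxAge: 365 * 24 * 60 * 60 * 1000 // one year, in milliseconds
    }));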

CloudFront Site-Wide Caching

I mentioned CloudFront site-wide caching a few times. It worked fantastically for us. You can set custom error pages by status code, and even issue invalidation requests if you need to. You can also set up logging, allowing more visibility into what's going on. Lastly, you can set custom behaviors per route.

Custom behaviors let you define how certain routes are handled. You can pass cookies or query parameters to some routes but not others, force a route to have a specific TTL, etc. Forwarding cookies is important so you can have an endpoint like /api/user fetch your own information without seeing another user's info. However, passing cookies to a route also costs you a huge benefit of caching: the request is cached based on the cookie, so it will only be cached at a per-user level - other clients making a request with a different cookie will hit the server.

The TTL specifies how long a resource should live on CloudFront before a request hits your servers again. To control it, we set the Cache-Control header with a public, max-age=X property (where X is a number of seconds). For instance, to have a resource live in CloudFront for 15 minutes, we would set the Cache-Control header to public, max-age=900. Postman is a useful Chrome extension that makes it easy to send any kind of HTTP request to an endpoint and view / set headers (Chrome's Network tab in the Developer Tools is helpful for this as well).
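
To make per-route TTLs declarative, a tiny helper middleware works well (a hypothetical helper, assuming Express; renderProfile stands in for a real handler):

    var express = require('express');
    var app = express();

    // Declare a Cache-Control TTL per route.
    function ttl(seconds) {
      return function (req, res, next) {
        res.set('Cache-Control', 'public, max-age=' + seconds);
        next();
      };
    }

    function renderProfile(req, res) {
      res.sendfile(__dirname + '/public/index.html');
    }

    // Profile pages live in CloudFront for 15 minutes before a
    // request reaches the app again.
    app.get('/:username', ttl(900), renderProfile);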

Conclusion

Initially, designing an app and the infrastructure to support 200 million profiles seemed daunting. However, proper planning early on makes it easier, even within a short timeframe. Leveraging HTTP caching was the right solution for us, and making it easy to spin EC2 instances up and down allowed us to scale smoothly to handle traffic.

See anything that could be improved? Reach me on Twitter @enoex.

Five Labs was an experiment to demonstrate how people express themselves online. We'll be taking down Five Labs on July 20th and removing all stored personality data. You can download your generated personality as a PDF before all profiles are deleted. We believe online interactions can be better, which is why we're building Five.com.
