Dogfooding: How we run our own Website on Giant Swarm

By Marian Steinbach in tech

Our product is created by developers for developers. To make our product great, we believe that eating one’s own dog food is required. In this article, I’d like to share with you how our own website, giantswarm.io, is currently set up on Giant Swarm. It provides an example of a real-world use of our product. Additionally, it might give you some good examples of how to solve your own problems, given that you intend to use some of the building blocks we do.

A disclaimer upfront: Some of our team members feel it’s a bit odd to expose internals of our website here, which we aren’t particularly proud of in a technical sense (read till the end to find some suggestions for improvements). It has been built with limited resources, as every developer at Giant Swarm focuses on our core product. Yet, it does the job. We decided in favor of exposing ourselves because it helps us talk about the status of our product. Plus, it might even help us while searching for our first dedicated web developer.

To follow this article, a little background knowledge on how our product is designed comes in handy. You should know that “running an application on Giant Swarm” means running a collection of Docker containers on our cluster. A Docker container plus some additional configuration forms a “component”. Components can be scaled for parallel execution; thus, we speak of “component instances” once deployed. Components are grouped into services. Multiple services ultimately form an application. To read more about our basic concepts, I recommend our introductory article What is Giant Swarm.

So let’s take a look at the building blocks of our website. It’s a Giant Swarm application currently built from four distinct components:

  • A Python Flask webapp (from here on called webapp)
  • A Redis Cache (rediscache)
  • A Redis Session store (redissessions)
  • An nginx proxy in front of the webapp (nginx)

When visiting https://giantswarm.io/ you connect to the nginx proxy component. Below is an overview that shows how the components work together.

What’s missing from the overview above are some external APIs we communicate with, mainly our CRM system (when you hit “Request Invite” or get invited to create an account) and our account system (when you actually create an account).

Also not represented in detail is the content delivery network (CDN). It helps us keep requests for static resources away from our own servers and instead lets them be handled by servers in data centers that are potentially closer to you, hopefully reducing roundtrip and download times and giving you a better experience.

Scaled component instances

Another detail not visible in the overview above is the fact that we run more than one instance of the webapp component. The following diagram is closer to the actual setup.

The diagram reveals that there are actually three instances of the webapp component. The webapp, as you might guess, is the component that carries the highest load here. We run three identical instances of it for two reasons: the first is load balancing; the second is that it simplifies updates without downtime. We’ll get into that in more detail further below.

Giant Swarm allows you to look at an application at different levels of abstraction. While the first diagram at the very top represents how the application is set up in configuration, the second diagram (the one right above) shows the scaled components (more than one instance per component), as the application might be configured dynamically in operation. And there is a third view, which I, as the owner of the site, don’t have to care about: it represents the actual communication between the instances.

We could call this third representation the “Behind the Scenes” view. It reminds us that scaling the webapp component to more than one instance means that nginx needs a way to discover the webapp instances and has to distribute requests to all of them. This function, simply represented by dots in the visualization, is provided automatically by Giant Swarm without users having to think about it. It actually consists of several services, one acting as an ambassador, another as a load balancer. These invisible services in between also automatically take care of instances becoming unavailable or being moved to different machines, which results in internal IP address changes.

Now let’s take a closer look at the individual components. Then it will become clearer how the interplay of components is handled in configuration.

The nginx HTTP proxy component

We use nginx as a proxy in front of our web application. This allows us to solve a few things we want to keep off our Python app, like:

  • HTTP connection handling: General handling of many connections in parallel, HTTP keep-alive etc.
  • GZip: Compression of HTML, JavaScript, and CSS responses.
  • Expires headers: Setting the HTTP Expires header for static resources to a date in the future, allowing clients and the CDN to cache them and prevent future requests.
  • Error pages: Serving custom error pages, e.g. for 404 errors.
  • Caching: Caching of static resources, so that the webapp doesn’t have to deal with serving static content too much.
  • Logging: Creating a common log for all requests, including time spent to fulfil the request.

Some of these functions resemble capabilities of a CDN. However, for us it feels nice to have some control over how these things are handled and to be independent of a specific CDN provider.

As said in the introduction, on Giant Swarm, as of now, the heart of each component is a Docker container. For the nginx component, we use the official nginx Docker image as a basis and create our own image from it. Here is our Dockerfile as of the writing of this article:

FROM nginx
MAINTAINER Marian Steinbach <marian@giantswarm.io>
 
ADD content /content
RUN chown nginx -R /content
 
ADD nginx.conf /etc/nginx/nginx.conf
RUN rm /etc/nginx/conf.d/*
 
EXPOSE 80
 
ENTRYPOINT ["nginx"]

The content directory we add to the container holds some static files like robots.txt, favicons, and error pages. We also add our custom nginx.conf configuration file.

We have a Gist set up for you that contains the nginx.conf and the Dockerfile for your convenience.

Have a closer look at the nginx configuration file in the Gist. Note that it contains only one proxy_pass directive and only one upstream directive to our web application:

upstream WEBAPP {
    server webapp:8000;
}
…
location / {
    proxy_pass http://WEBAPP;
    …
}

So, although we have multiple instances of webapp running, from the nginx perspective it’s as if we simply connect to one host named “webapp” with port 8000 exposed.

How do we know there is a host with the name “webapp”? Here the concept of “dependencies” comes into play (see our documentation). Our nginx component requires the webapp component to work, so it has a dependency configured for it. With this dependency in place, Giant Swarm connects the nginx component with the webapp component so that requests are forwarded to one or several instances of the component, depending on how many are set up in the configuration. This automatically results in round-robin load balancing between all available instances. Nice, eh? The dependency also provides us with a hostname configuration (like an /etc/hosts entry) we can use to connect to. For those familiar with Docker container linking: this is the same concept. We also provide the auto-generated environment variables you’re used to from Docker links.
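Just to illustrate, a component written in Python could pick up those variables like this (the variable names follow Docker’s link conventions for a dependency called webapp on port 8000; this is a sketch, not code from our stack):

import os

# Docker-link style variables for a dependency named "webapp" on port 8000;
# the hostname entry works as a fallback.
host = os.environ.get("WEBAPP_PORT_8000_TCP_ADDR", "webapp")
port = int(os.environ.get("WEBAPP_PORT_8000_TCP_PORT", "8000"))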

That’s about it for the Giant Swarm specialties in our nginx component. As a minor thing, if you decide to use nginx in a Giant Swarm application, you might also want to take note of these tweaks to logging. In the http context, we use this:

log_format custom '"$request" '
  '$status $body_bytes_sent $request_time '
  '"$remote_addr" "$http_x_forwarded_for" '
  '"$http_user_agent" "$http_referer"';
access_log /dev/stdout custom;

This ensures that normal log entries are written to STDOUT and can be read using our swarm command line interface (CLI). Additionally, since log entries in Docker are automatically prefixed with a timestamp, the log format defined above doesn’t include an additional timestamp field.

To log errors to STDERR, one additional configuration line is required at the top level:

error_log stderr warn;

The rest is pretty much up to your taste. Test, tweak, and share your thoughts as you like.

The webapp component

This component represents the heart of our website, our main web application. It is written in Python, using the Flask web application framework (or microframework, as they sometimes say).

Similar to other frameworks, Flask comes with its own web server, which is pretty useful for development, but not recommended for production use. Among the invaluable features is the automatic re-initialization of the app whenever some file is changed. In addition, the Flask server provides great debugging capabilities. You definitely want to make use of these. At the same time, you want your application to run inside the docker container that is used for production, even during development (Dev/prod parity, remember?). This way you save yourself from surprises that happen when you deploy your app to a server and find out that a few things (like your Python interpreter, a Python package, or some external binary) were different from the development machine.

For production use, we use gunicorn to run our Flask app, which is a popular choice, proven in many setups.

So what we need are two different ways to run our Docker container. We go with the following: the default command of our Docker container (see the Dockerfile) runs the gunicorn process, which results in “production mode”. When in development, we override the command and instead start a little Python script, devserver.py, with not much in it:

import webapp
webapp.flaskapp.run(debug=True, host='0.0.0.0', port=8000)

To get the full benefit of the development mode and have the server restart upon every file change, we need an additional tweak: we mount our webapp code directory as a Docker volume inside the container.

The most convenient way to run all our components linked together locally is docker-compose (formerly fig). Here is the docker-compose YAML file we use for development:

webapp:
  image: registry.giantswarm.io/giantswarm/giantswarmio-webapp
  links:
   - rediscache:rediscache
   - redissessions:redissessions
  ports:
   - "8000:8000"
  environment:
    WEBAPP_DEV: "1"
    BASE_URL: "http://192.168.59.103:8000/"
  volumes:
   - webapp/:/webapp/
  command: "python -u devserver.py"
rediscache:
  image: redis
redissessions:
  image: redis

Every web developer who has ever tried to run an app with a database and perhaps more attached resources should make sure to get familiar with Docker and docker-compose. I currently can’t think of an easier and faster way to bring up two Redis servers (both listening on their default port) together with my webapp. A nice detail: the log output of all processes is printed to the terminal, and stopping everything takes only Ctrl+C.

With the environment variable WEBAPP_DEV, we provide a switch for our Flask app to decide which configuration to load. There are still a few differences between development and production, for example the log level or the combining of CSS and JavaScript assets. Thanks to Flask configuration objects, we don’t have to duplicate the entire configuration. Instead, both the production configuration and the dev configuration inherit from a base configuration and only tweak a few settings.
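To illustrate the pattern, here is a minimal sketch of such a configuration hierarchy (class and setting names are made up; our actual configuration contains more):

import os

from flask import Flask

class BaseConfig(object):
    # Defaults shared by all environments
    LOG_LEVEL = "WARNING"
    COMBINE_ASSETS = True

class ProductionConfig(BaseConfig):
    pass  # production keeps the base defaults

class DevelopmentConfig(BaseConfig):
    DEBUG = True
    LOG_LEVEL = "DEBUG"
    COMBINE_ASSETS = False  # serve CSS/JS individually for easier debugging

flaskapp = Flask(__name__)

# WEBAPP_DEV is the switch set in the docker-compose file above.
if os.environ.get("WEBAPP_DEV") == "1":
    flaskapp.config.from_object(DevelopmentConfig)
else:
    flaskapp.config.from_object(ProductionConfig)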

Coming up next are our two Redis server components. After that, we’ll come back to the webapp and show how we link webapp, rediscache, and redissessions together.

The Redis cache component

For performance reasons, it’s a great idea to have a web application store frequently used things in a cache. In our Flask webapp, we cache the response of most views, effectively preventing the application from parsing Jinja2, YAML, and Markdown on every request when in fact nothing has changed.

Of course, for the fastest access, the webapp could use a built-in RAM cache. But since there are multiple instances of the webapp, this would result in a decreasing cache hit rate with every additional instance, plus redundant memory consumption in each instance. A common, external cache, in contrast, stores each key only once and provides the maximum hit rate.

The most popular choice for this sort of service is probably Memcached. However, I have had bad experiences with Memcached in combination with Flask before (or rather with the library accessing Memcached): it is picky when it comes to the encoding of keys (unicode? utf8?), and its handling of TTL settings can be confusing. We decided to go with Redis once more and haven’t regretted it so far.
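To give an idea of what this view caching looks like, here is a minimal sketch (the cached decorator and the key scheme are made up for illustration; our actual implementation differs in the details):

import functools

import redis
from flask import request

# "rediscache" is the hostname provided by the dependency (or by the
# docker-compose link shown above).
cache = redis.StrictRedis(host="rediscache", port=6379)

def cached(ttl=300):
    """Cache a view's rendered response in Redis, keyed by request path."""
    def decorator(view):
        @functools.wraps(view)
        def wrapper(*args, **kwargs):
            key = "view:" + request.path
            hit = cache.get(key)
            if hit is not None:
                return hit  # cache hit: skip all parsing and rendering
            body = view(*args, **kwargs)
            cache.setex(key, ttl, body)
            return body
        return wrapper
    return decorator

A view decorated with @cached(ttl=600) is rendered at most once every ten minutes, no matter how many requests come in.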

We treat our cache as ephemeral; we don’t expect the data to survive server restarts. (Redis still automatically persists data to disk unless explicitly told not to do so.)

Redis turns out to be pretty easy to deal with in a dockerized environment. We simply run the official Redis Docker image and link the webapp to it. No config file required.

In development, the only thing to take care of is that we don’t get confused by cached data. To empty the cache, I simply delete and re-create the rediscache container whenever needed, which is a matter of seconds. The alternative would be to disable the cache completely in development. For my taste, that would be too big a difference between the dev and production setups, leaving the risk that some problems can’t be discovered during development.
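If deleting the container ever feels too heavy, the cache database can also be emptied in place with redis-py; a one-liner, assuming the rediscache hostname from the compose file:

import redis

# FLUSHDB empties only the currently selected Redis database.
redis.StrictRedis(host="rediscache", port=6379).flushdb()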

The Redis session store component

Almost everything said above applies to the session store. It has to be external to the actual web application so that multiple instances have access to the same data. Giant Swarm’s internal load balancing is stateless, so you never know which instance is hit by a given request. As a result, you have to make your web application stateless, too. At least internally.

You might have wondered why we use two different Redis components for cache and session store. Couldn’t that be handled by one Redis server?

Sure it could, but there is a good reason to keep them separate. Our cache database has a limited size, and the Redis server is expected to dispose of keys that haven’t been used for a long time. It’s also ephemeral, as mentioned above. Losing the cache data is no big deal, so when the Redis server needs to be restarted, we don’t expect the data to persist.

The session store has different requirements. Keys (sessions) should live as long as a session is still valid and are only removed by the garbage collector of the Flask session module. Data has to survive a server restart, which means that persisting data to disk is a requirement.
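For illustration, a Redis-backed server-side session store could be wired up with the Flask-Session extension like this (an assumption for the sketch; our webapp’s session handling differs in its details):

import redis
from flask import Flask
from flask_session import Session  # assumption: the Flask-Session extension

flaskapp = Flask(__name__)
flaskapp.config.update(
    SESSION_TYPE="redis",
    # "redissessions" is the hostname provided by the dependency.
    SESSION_REDIS=redis.StrictRedis(host="redissessions", port=6379),
    PERMANENT_SESSION_LIFETIME=60 * 60 * 24 * 7,  # sessions stay valid a week
)
Session(flaskapp)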

An additional thought: In a dockerized environment, it’s hardly more work to set up two independent Redis components instead of one. However, having them separate gives us the additional benefit that we can easily monitor the cache independently from the session store. When one of them fails, the other one might still work. And if we want to replace one without the other, it’s easier when they are separated already. This is a piece of microservices thinking that I am getting more and more used to and which we are really trying to foster at Giant Swarm.

Remember the disclaimer upfront? The Redis components are a major aspect when it comes to possible improvements to our website architecture. Both Redis services are currently single points of failure. There are ways to set up Redis clusters based on Docker containers; however, this will have to wait for another time. Losing the entire cache is a risk we can easily accept right now. Even a loss of our session store wouldn’t be critical, since the only transaction we currently handle is the creation of a new account, which takes several minutes at most and would probably affect only a handful of users. Once we have users logging in and expecting to stay logged in for a while, we will have to improve this.

Connecting everything locally

To make use of the Redis session store and cache, the webapp has to know how to connect to them.

We already explained above how the nginx component learns about the webapp component via “dependencies”. The same applies here. In our application configuration, which we’ll reveal further down, we define that our webapp depends on two components: rediscache and redissessions. This again gives us corresponding hostnames to use inside the webapp component. So all we have to do inside our Flask app is create two Redis connections to the appropriate hostnames and ports.
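In redis-py terms, that boils down to little more than this sketch:

import redis

# The hostnames match the dependency names in swarm.json
# (and the link aliases in the docker-compose file above).
rediscache = redis.StrictRedis(host="rediscache", port=6379)
redissessions = redis.StrictRedis(host="redissessions", port=6379)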

Again, to see how this linking is done locally, see the docker-compose YAML above.

We will now have a look at how things are configured for the Giant Swarm deployment.

The application configuration

Here is what the actual application configuration we’re using on Giant Swarm looks like.

{
  "app_name": "giantswarmio",
  "services": [
    {
      "service_name": "default",
      "components": [
        {
          "component_name": "nginx",
          "image": "registry.giantswarm.io/giantswarm/giantswarmio-nginx:latest",
          "ports":[80],
          "domains": {
            "$domain": 80
          },
          "dependencies": [
            {
              "name": "webapp",
              "port": 8000
            }
          ]
        },
        {
          "component_name": "webapp",
          "image": "registry.giantswarm.io/giantswarm/giantswarmio-webapp:$webapp_tag",
          "ports": [8000],
          "scaling_policy": {"min": $webapp_scaling_min},
          "env": [
            "BASE_URL=$scheme://$domain"
          ],
          "dependencies": [
            {
              "name": "rediscache",
              "port": 6379
            },
            {
              "name": "redissessions",
              "port": 6379
            }
          ]
        },
        {
          "component_name": "rediscache",
          "image": "redis",
          "ports": [6379]
        },
        {
          "component_name": "redissessions",
          "image": "redis",
          "ports": [6379]
        }
      ]
    }
  ]
}

Take special note of how we use a special kind of variable (keys starting with $) in the swarm.json file. These variables are defined in the swarmvars.json file, with different values depending on the environment. For example, during testing we might not want the webapp to scale to 3 instances, we might use plain HTTP instead of HTTPS, and we might use a different Docker image than in production.

{
  "giantswarm/production": {
    "scheme": "https",
  	"domain": "giantswarm.io",
    "webapp_tag": "latest",
    "webapp_scaling_min": 3
  },
  "giantswarm/testing": {
    "scheme": "http",
    "domain": "website-testing.giantswarm.io",
    "webapp_tag": "signup9",
    "webapp_scaling_min": 1
  }
}

To find out more about our application configuration options, I’d recommend our documentation.

Rolling updates

As I said, we run multiple instances of our Flask web application component for load balancing purposes, but also to perform rolling updates: updating the webapp component without taking down the site.

Creating a new release of a component means building a new Docker image and pushing it to our registry. We use the common :latest tagging scheme, so whenever a component is started, the registry is checked for the image with the :latest tag; it gets pulled and then started. That happens when you issue swarm start.

Here is the catch: swarm stop stops all instances of a component, and swarm start starts them all again. So how are rolling updates done?

Full disclosure: rolling updates are not yet possible for normal users of our product. It’s simply not implemented yet. We know it has to be possible, and we promise we will work on it. Bear with us.

To update the instances one by one, I log into a machine on our production cluster via SSH and run fleetctl stop and fleetctl start on each of the three instances, of course waiting for each one to finish before proceeding with the next. This way it’s guaranteed that at least two instances remain functional during the update, which is plenty given our current amount of traffic.

What’s next?

We already talked about the fact that we will have to make our session store redundant. Currently, we aren’t aware of any solution that would allow us to use several identical instances, just as we do with the webapp. The webapp is much simpler, since there doesn’t have to be any communication between its instances. Database clusters are different in that their members usually need a way to communicate with their peers. Maybe an intermediate step will be to explicitly define several database nodes which are linked to each other.

A different topic on the list is a message queue. We want to make communication with external APIs more robust, allowing for retries and, in some cases, making it possible for the webapp not to have to wait for a result (e.g. when sending an email). We will use Celery for that purpose, with RabbitMQ as the broker and likely another Redis as the result store; a rough sketch follows the list below.

With this alone, our app will have at least the following additional components:

  • RabbitMQ as Celery message broker
  • A Celery worker component, which we can scale to as many instances as we need to handle the load
  • Flower, the Celery admin interface, for monitoring and occasional manual actions
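To sketch how that could look in code (the task, the post_to_crm helper, and the hostnames are made up; only the Celery API itself is real):

from celery import Celery

# Broker and result store hostnames would again be provided through
# Giant Swarm dependencies; the names here are illustrative.
celery = Celery(
    "webapp",
    broker="amqp://guest@rabbitmq:5672//",
    backend="redis://redisresults:6379/0",
)

@celery.task(bind=True, max_retries=5, default_retry_delay=60)
def notify_crm(self, payload):
    """Talk to the external CRM API; retry on transient failures."""
    try:
        post_to_crm(payload)  # hypothetical helper wrapping the API call
    except IOError as exc:
        raise self.retry(exc=exc)

The webapp would then call notify_crm.delay(payload) and respond to the user immediately, while a worker instance handles the API call in the background.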

We’ll likely share more insights as we go along. Feel free to comment on what we built so far, ask questions or give suggestions on what we might improve. And if you want to try it out yourself, request an invite now if you haven’t yet.

Marian Steinbach
Like everybody at Giant Swarm, Marian is responsible for your user experience with our products. As the design guy on the team, he’s constantly coping with two brain hemispheres battling for domination.
