<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Architecture on Jason Wilder&#39;s Blog</title>
        <generator uri="https://gohugo.io">Hugo</generator>
        <link>http://jasonwilder.com/categories/architecture/</link>
        
        <language>en-us</language>  
        <updated>Tue, 15 Jul 2014 00:00:00 UTC</updated>
        
        <item>
            <title>Docker Service Discovery Using Etcd and Haproxy</title>
            <link>http://jasonwilder.com/blog/2014/07/15/docker-service-discovery/</link>
            <pubDate>Tue, 15 Jul 2014 00:00:00 UTC</pubDate>
            
            <guid>http://jasonwilder.com/blog/2014/07/15/docker-service-discovery/</guid>
            <description>

&lt;p&gt;In a previous post, I showed a way to create an &lt;a href=&#34;http://jasonwilder.com/blog/2014/03/25/automated-nginx-reverse-proxy-for-docker/&#34;&gt;automated nginx reverse proxy&lt;/a&gt; for docker containers running on the same host.  That setup works fine for front-end web apps, but is not ideal for backend services since they are typically spread across multiple hosts.&lt;/p&gt;

&lt;p&gt;This post describes a solution to the backend service problem using service discovery for docker containers.&lt;/p&gt;

&lt;p&gt;The architecture we&amp;rsquo;ll build is modelled after &lt;a href=&#34;http://nerds.airbnb.com/smartstack-service-discovery-cloud/&#34;&gt;SmartStack&lt;/a&gt;, but uses &lt;a href=&#34;http://coreos.com/using-coreos/etcd/&#34;&gt;etcd&lt;/a&gt; instead of &lt;a href=&#34;http://zookeeper.apache.org/&#34;&gt;Zookeeper&lt;/a&gt; and two docker containers running &lt;a href=&#34;https://github.com/jwilder/docker-gen&#34;&gt;docker-gen&lt;/a&gt; and &lt;a href=&#34;http://www.haproxy.org/&#34;&gt;haproxy&lt;/a&gt; instead of &lt;a href=&#34;https://github.com/airbnb/nerve&#34;&gt;nerve&lt;/a&gt; and
&lt;a href=&#34;https://github.com/airbnb/synapse&#34;&gt;synapse&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&#34;how-it-works:5152b7114a679e7779eb00e9843508ec&#34;&gt;How It Works&lt;/h2&gt;

&lt;p&gt;&lt;img src=&#34;http://jasonwilder.com/images/docker-service-discovery.png&#34; alt=&#34;Docker Service Discovery&#34; /&gt;&lt;/p&gt;

&lt;p&gt;Similar to SmartStack, we have components to serve as a registry (etcd), a registration side-kick process (docker-register), a discovery side-kick process (docker-discover), some backend services (whoami) and finally a consumer (ubuntu/curl).&lt;/p&gt;

&lt;p&gt;The registration and discovery components work as appliances alongside the application containers, so there is no embedded registration or discovery code in the backend or consumer containers.  They just listen on ports or connect to other local ports.&lt;/p&gt;

&lt;h2 id=&#34;service-registry-etcd:5152b7114a679e7779eb00e9843508ec&#34;&gt;Service Registry - Etcd&lt;/h2&gt;

&lt;p&gt;Before anything can be registered, we need some place to track registration entries (i.e. IP and ports of services).  We&amp;rsquo;re using etcd because it has a simple programming model for service registration and supports TTLs for keys and directories.&lt;/p&gt;

&lt;p&gt;Usually, you would run 3 or 5 etcd nodes but I&amp;rsquo;m just using one to keep things simple.&lt;/p&gt;

&lt;p&gt;There is no reason why we could not use &lt;a href=&#34;http://consul.io&#34;&gt;Consul&lt;/a&gt; or any other storage option that supports TTL expiration.&lt;/p&gt;

&lt;p&gt;To start etcd:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ docker run -d --name etcd -p 4001:4001 -p 7001:7001 coreos/etcd
&lt;/code&gt;&lt;/pre&gt;
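
&lt;p&gt;As a sanity check, we can exercise the TTL behavior the registry relies on directly against etcd&amp;rsquo;s HTTP API.  The key and value below are purely illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ # Set a key with a 15 second TTL, read it back, then let it expire
$ curl -s -XPUT http://127.0.0.1:4001/v2/keys/backends/example -d value=&#39;10.0.0.5:8000&#39; -d ttl=15
$ curl -s http://127.0.0.1:4001/v2/keys/backends/example
$ sleep 16 &amp;amp;&amp;amp; curl -s http://127.0.0.1:4001/v2/keys/backends/example   # key has expired by now
&lt;/code&gt;&lt;/pre&gt;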

&lt;h2 id=&#34;service-registration-docker-register:5152b7114a679e7779eb00e9843508ec&#34;&gt;Service Registration - docker-register&lt;/h2&gt;

&lt;p&gt;Registering service containers is handled by the &lt;a href=&#34;https://registry.hub.docker.com/u/jwilder/docker-register/&#34;&gt;jwilder/docker-register&lt;/a&gt; container.  This container registers other containers running on the same host in etcd.
Containers we want registered must expose a port.  Containers running the same image on different hosts are grouped together in etcd and will form
a load-balanced cluster.  How containers are grouped is somewhat arbitrary and I&amp;rsquo;ve chosen the container image name for this walkthrough.  In
a real deployment, you would likely want to group things by environment, service version, or other meta-data.&lt;/p&gt;

&lt;p&gt;(&lt;em&gt;The current implementation only supports one port per container and assumes it is TCP. There is no reason why multiple ports and protocols could not be supported, as well as different grouping attributes.&lt;/em&gt;)&lt;/p&gt;

&lt;p&gt;docker-register uses &lt;a href=&#34;https://github.com/jwilder/docker-gen&#34;&gt;docker-gen&lt;/a&gt; along with a &lt;a href=&#34;https://github.com/jwilder/docker-register/blob/master/etcd.tmpl&#34;&gt;python script&lt;/a&gt; as a template.
It dynamically generates a script that, when run, will register each container&amp;rsquo;s IP and PORT under a &lt;code&gt;/backends&lt;/code&gt; directory.&lt;/p&gt;

&lt;p&gt;docker-gen takes care of monitoring docker events and calling the generated script on an interval to ensure TTLs are kept up to date.
If docker-register is stopped, the registrations expire.&lt;/p&gt;

&lt;p&gt;To start docker-register, we need to pass in the host&amp;rsquo;s external IP, where other hosts can reach its containers, as well as the address of your etcd host.  docker-gen requires access to the docker daemon in order to call its API, so we bind mount the docker unix socket into the container as well.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ HOST_IP=$(hostname --all-ip-addresses | awk &#39;{print $1}&#39;)
$ ETCD_HOST=w.x.y.z:4001
$ docker run --name docker-register -d -e HOST_IP=$HOST_IP -e ETCD_HOST=$ETCD_HOST -v /var/run/docker.sock:/var/run/docker.sock -t jwilder/docker-register
&lt;/code&gt;&lt;/pre&gt;
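
&lt;p&gt;Once docker-register is running, we can verify what it has written by listing the &lt;code&gt;/backends&lt;/code&gt; directory.  The exact key layout is an assumption based on the grouping described above:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ # Entries carry TTLs, so they disappear if docker-register stops refreshing them
$ curl -s http://$ETCD_HOST/v2/keys/backends?recursive=true
&lt;/code&gt;&lt;/pre&gt;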

&lt;h2 id=&#34;service-discovery-docker-discover:5152b7114a679e7779eb00e9843508ec&#34;&gt;Service Discovery - docker-discover&lt;/h2&gt;

&lt;p&gt;Discovering services is handled by the &lt;a href=&#34;https://registry.hub.docker.com/u/jwilder/docker-discover/&#34;&gt;jwilder/docker-discover&lt;/a&gt; container.
docker-discover polls etcd periodically and generates an haproxy config with listeners for each type of registered service.&lt;/p&gt;

&lt;p&gt;For example, containers running &lt;a href=&#34;https://registry.hub.docker.com/u/jwilder/whoami/&#34;&gt;jwilder/whoami&lt;/a&gt; are registered under &lt;code&gt;/backends/whoami/&amp;lt;id&amp;gt;&lt;/code&gt; and are exposed on host port 8000.&lt;/p&gt;

&lt;p&gt;Other containers that need to call the &lt;a href=&#34;https://registry.hub.docker.com/u/jwilder/whoami/&#34;&gt;jwilder/whoami&lt;/a&gt; service can send requests to the docker bridge IP or the host IP on port 8000.&lt;/p&gt;

&lt;p&gt;If any of the backend services goes down, haproxy health checks remove it from the pool and will retry the request on a healthy host.
This ensures that backend services can be started and stopped as needed, and that inconsistencies in the registration information are handled, while keeping client impact minimal.&lt;/p&gt;

&lt;p&gt;Finally, stats can be viewed by accessing port 1936 on the docker-discover container.&lt;/p&gt;
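
&lt;p&gt;To make the haproxy side concrete, the generated config might look roughly like the fragment below.  This is a hand-written sketch, not the actual template output; server names and IPs are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;listen whoami
    bind *:8000
    mode tcp
    balance roundrobin
    server 736ab83847bb 10.170.71.227:49153 check
    server 4eb0498e5207 10.170.71.228:49154 check

listen stats
    bind *:1936
    mode http
    stats enable
    stats uri /
&lt;/code&gt;&lt;/pre&gt;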

&lt;p&gt;To run docker-discover:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ETCD_HOST=w.x.y.z:4001
$ docker run -d --net host --name docker-discover -e ETCD_HOST=$ETCD_HOST -p 127.0.0.1:1936:1936 -t jwilder/docker-discover
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We&amp;rsquo;re using &lt;code&gt;--net host&lt;/code&gt; so that the container uses the host&amp;rsquo;s network stack.  When this container binds port 8000, it&amp;rsquo;s actually binding
on the host&amp;rsquo;s network.  This simplifies the proxy setup.&lt;/p&gt;

&lt;h2 id=&#34;aws-demo:5152b7114a679e7779eb00e9843508ec&#34;&gt;AWS Demo&lt;/h2&gt;

&lt;p&gt;We&amp;rsquo;ll run the full thing on four AWS hosts: an etcd host, a client host and two backend hosts.  The &lt;a href=&#34;https://registry.hub.docker.com/u/jwilder/whoami/&#34;&gt;backend service&lt;/a&gt; is a simple
Golang HTTP server that returns its hostname (container ID).&lt;/p&gt;

&lt;h3 id=&#34;etcd-host:5152b7114a679e7779eb00e9843508ec&#34;&gt;Etcd Host&lt;/h3&gt;

&lt;p&gt;First we start our etcd registry:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ hostname --all-ip-addresses | awk &#39;{print $1}&#39;
10.170.71.226

$ docker run -d --name etcd -p 4001:4001 -p 7001:7001 coreos/etcd
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Our etcd address is &lt;code&gt;10.170.71.226&lt;/code&gt;.  We&amp;rsquo;ll use that on the other hosts.  If we were running this in a live environment, we could assign an EIP and
DNS address to it to make it easier to configure.&lt;/p&gt;

&lt;h3 id=&#34;backend-hosts:5152b7114a679e7779eb00e9843508ec&#34;&gt;Backend Hosts&lt;/h3&gt;

&lt;p&gt;Next, we start the services and docker-register on each host.  The service is configured to listen
on port 8000 in the container and we let docker publish it on a random host port.&lt;/p&gt;

&lt;p&gt;On backend host 1:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ docker run -d -p :8000 --name whoami -t jwilder/whoami
736ab83847bb12dddd8b09969433f3a02d64d5b0be48f7a5c59a594e3a6a3541
$ docker run --name docker-register -d -e HOST_IP=$(hostname --all-ip-addresses | awk &#39;{print $1}&#39;) -e ETCD_HOST=10.170.71.226:4001 -v /var/run/docker.sock:/var/run/docker.sock -t jwilder/docker-register
77a49f732797333ca0c7695c6b590a64a7d75c14b5ffa0f89f8e0e21ae47ae3e

$ docker ps
CONTAINER ID        IMAGE                            COMMAND                CREATED             STATUS              PORTS                     NAMES
736ab83847bb        jwilder/whoami:latest            /app/http              48 seconds ago      Up 47 seconds       0.0.0.0:49153-&amp;gt;8000/tcp   whoami
77a49f732797        jwilder/docker-register:latest   &amp;quot;/bin/sh -c &#39;docker-   28 minutes ago      Up 28 minutes                                 docker-register
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;On backend host 2:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ docker run -d -p :8000 --name whoami -t jwilder/whoami
4eb0498e52076275ee0702d80c0d8297813e89d492cdecbd6df9b263a3df1c28
$ docker run --name docker-register -d -e HOST_IP=$(hostname --all-ip-addresses | awk &#39;{print $1}&#39;) -e ETCD_HOST=10.170.71.226:4001 -v /var/run/docker.sock:/var/run/docker.sock -t jwilder/docker-register
832e77c83591cb33bba53859153eb91d897f5a278a74d4ec1f66bc9b97deb221

$ docker ps
CONTAINER ID        IMAGE                            COMMAND                CREATED             STATUS              PORTS                     NAMES
4eb0498e5207        jwilder/whoami:latest            /app/http              59 seconds ago      Up 58 seconds       0.0.0.0:49154-&amp;gt;8000/tcp   whoami
832e77c83591        jwilder/docker-register:latest   &amp;quot;/bin/sh -c &#39;docker-   34 minutes ago      Up 34 minutes                                 docker-register
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&#34;client-host:5152b7114a679e7779eb00e9843508ec&#34;&gt;Client Host&lt;/h3&gt;

&lt;p&gt;On the client host, we need to start docker-discover and a client container.  For the client container,
I&amp;rsquo;m using Ubuntu Trusty and will make some &lt;code&gt;curl&lt;/code&gt; requests.&lt;/p&gt;

&lt;p&gt;First start docker-discover:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ docker run -d --net host --name docker-discover -e ETCD_HOST=10.170.71.226:4001 -p 127.0.0.1:1936:1936 -t jwilder/docker-discover
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Then, start a sample client container and pass in a HOST_IP.  We&amp;rsquo;re using the eth0 address but could also use the docker0 IP.  We&amp;rsquo;re passing
this in as an environment variable since it is &lt;a href=&#34;http://12factor.net/config&#34;&gt;configuration that will vary between deploys&lt;/a&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ docker run -e HOST_IP=$(hostname --all-ip-addresses | awk &#39;{print $1}&#39;) -i -t ubuntu:14.04 /bin/bash
root@2af5f52de069:/# apt-get update &amp;amp;&amp;amp; apt-get -y install curl
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Then, make some requests to the whoami service on port 8000 to see them load-balanced.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;root@2af5f52de069:/# curl $HOST_IP:8000
I&#39;m 4eb0498e5207
root@2af5f52de069:/# curl $HOST_IP:8000
I&#39;m 736ab83847bb
root@2af5f52de069:/# curl $HOST_IP:8000
I&#39;m 4eb0498e5207
root@2af5f52de069:/# curl $HOST_IP:8000
I&#39;m 736ab83847bb
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We can start some more instances on the backends:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ docker run -d -p :8000 --name whoami-2 -t jwilder/whoami
$ docker run -d -p :8000 --name whoami-3 -t jwilder/whoami

$ docker ps
CONTAINER ID        IMAGE                            COMMAND                CREATED             STATUS              PORTS                     NAMES
5d5c12c96192        jwilder/whoami:latest            /app/http              3 seconds ago       Up 1 seconds        0.0.0.0:49156-&amp;gt;8000/tcp   whoami-2
bb2a408b8ec5        jwilder/whoami:latest            /app/http              21 seconds ago      Up 20 seconds       0.0.0.0:49155-&amp;gt;8000/tcp   whoami-3
4eb0498e5207        jwilder/whoami:latest            /app/http              2 minutes ago       Up 2 minutes        0.0.0.0:49154-&amp;gt;8000/tcp   whoami
832e77c83591        jwilder/docker-register:latest   &amp;quot;/bin/sh -c &#39;docker-   36 minutes ago      Up 36 minutes                                 docker-register
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And make some requests again on the client host:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;root@2af5f52de069:/# curl $HOST_IP:8000
I&#39;m 736ab83847bb
root@2af5f52de069:/# curl $HOST_IP:8000
I&#39;m 4eb0498e5207
root@2af5f52de069:/# curl $HOST_IP:8000
I&#39;m bb2a408b8ec5
root@2af5f52de069:/# curl $HOST_IP:8000
I&#39;m 5d5c12c96192
root@2af5f52de069:/# curl $HOST_IP:8000
I&#39;m 736ab83847bb
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Finally, we can shut down some containers and routes will be updated.  This kills everything on backend host 2.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ docker kill 5d5c12c96192 bb2a408b8ec5 4eb0498e5207

root@2af5f52de069:/# curl $HOST_IP:8000
I&#39;m 736ab83847bb
root@2af5f52de069:/# curl $HOST_IP:8000
I&#39;m 67c3cccbb8ba
root@2af5f52de069:/# curl $HOST_IP:8000
I&#39;m 736ab83847bb
root@2af5f52de069:/# curl $HOST_IP:8000
I&#39;m 67c3cccbb8ba
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If we wanted to see how haproxy is balancing traffic or monitor for errors, we can access port 1936 on the client
host in a web browser.&lt;/p&gt;
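
&lt;p&gt;haproxy can also emit its stats as CSV, which is handy for scripting.  This assumes the generated config serves the stats page at &lt;code&gt;/&lt;/code&gt; as sketched earlier:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ curl -s &#39;http://localhost:1936/;csv&#39; | head
&lt;/code&gt;&lt;/pre&gt;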

&lt;h2 id=&#34;wrapping-up:5152b7114a679e7779eb00e9843508ec&#34;&gt;Wrapping Up&lt;/h2&gt;

&lt;p&gt;While there are a lot of different ways to implement &lt;a href=&#34;http://jasonwilder.com/blog/2014/02/04/service-discovery-in-the-cloud/&#34;&gt;service discovery&lt;/a&gt;, SmartStack&amp;rsquo;s sidekick style
of registration and proxying keeps application code simple, is easy to integrate in a distributed
environment, and fits really well with Docker containers.&lt;/p&gt;

&lt;p&gt;Similarly, Docker&amp;rsquo;s event and container APIs facilitate service registration and discovery with registration services such as etcd.&lt;/p&gt;

&lt;p&gt;The code for &lt;a href=&#34;https://github.com/jwilder/docker-register&#34;&gt;docker-register&lt;/a&gt; and
&lt;a href=&#34;https://github.com/jwilder/docker-discover&#34;&gt;docker-discover&lt;/a&gt; is on github.  While both are functional,
there is a lot that can be improved.  Please feel free to submit or suggest improvements.&lt;/p&gt;
</description>
        </item>
        
        <item>
            <title>Open-Source Service Discovery</title>
            <link>http://jasonwilder.com/blog/2014/02/04/service-discovery-in-the-cloud/</link>
            <pubDate>Tue, 04 Feb 2014 00:00:00 UTC</pubDate>
            
            <guid>http://jasonwilder.com/blog/2014/02/04/service-discovery-in-the-cloud/</guid>
            <description>

&lt;p&gt;Service discovery is a key component of most distributed systems and service oriented architectures.
The problem seems simple at first: &lt;em&gt;How do clients determine the IP and port for a service that
exists on multiple hosts?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Usually, you start off with some static configuration which gets you pretty far.
Things get more complicated as you start deploying more services.  With a live
system, service locations can change quite frequently due to auto or manual scaling,
new deployments of services, as well as hosts failing or being replaced.&lt;/p&gt;

&lt;p&gt;Dynamic service registration and discovery becomes much more important in these scenarios in
order to avoid service interruption.&lt;/p&gt;

&lt;p&gt;This problem has been addressed in many different ways and is continuing to evolve.  We&amp;rsquo;re going to look at some open-source or openly-discussed solutions to this problem to understand how they work.  Specifically,
we&amp;rsquo;ll look at whether each solution uses strongly or weakly consistent storage, its runtime dependencies, its client
integration options, and what the tradeoffs of those choices might be.&lt;/p&gt;

&lt;p&gt;We&amp;rsquo;ll start with some strongly consistent projects such as &lt;a href=&#34;http://zookeeper.apache.org/&#34;&gt;Zookeeper&lt;/a&gt;,
&lt;a href=&#34;https://github.com/ha/doozer&#34;&gt;Doozer&lt;/a&gt; and &lt;a href=&#34;https://github.com/coreos/etcd&#34;&gt;Etcd&lt;/a&gt; which are typically
used as coordination services but also serve as service registries.&lt;/p&gt;

&lt;p&gt;We&amp;rsquo;ll then look at some interesting solutions specifically designed for service registration and discovery.
We&amp;rsquo;ll examine &lt;a href=&#34;http://nerds.airbnb.com/smartstack-service-discovery-cloud/&#34;&gt;Airbnb&amp;rsquo;s SmartStack&lt;/a&gt;,
&lt;a href=&#34;https://github.com/Netflix/eureka&#34;&gt;Netflix&amp;rsquo;s Eureka&lt;/a&gt;, &lt;a href=&#34;http://bitly.github.io/nsq/&#34;&gt;Bitly&amp;rsquo;s NSQ&lt;/a&gt;,
&lt;a href=&#34;http://serfdom.io&#34;&gt;Serf&lt;/a&gt;,
&lt;a href=&#34;http://labs.spotify.com/tag/dns/&#34;&gt;Spotify and DNS&lt;/a&gt; and finally &lt;a href=&#34;https://github.com/skynetservices/skydns&#34;&gt;SkyDNS&lt;/a&gt;.&lt;/p&gt;

&lt;h1 id=&#34;the-problem:eabab4b56ce615c8817b5fecb3a27fdf&#34;&gt;The Problem&lt;/h1&gt;

&lt;p&gt;There are two sides to the problem of locating services.  &lt;em&gt;Service Registration&lt;/em&gt; and &lt;em&gt;Service Discovery&lt;/em&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Service Registration&lt;/strong&gt; - The process of a service registering its location in a central registry.
It usually registers its host and port and sometimes authentication credentials, protocols, version numbers,
and/or environment details.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Service Discovery&lt;/strong&gt; - The process of a client application querying the central registry to learn
of the location of services.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Any service registration and discovery solution also has other development and operational aspects to consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Monitoring&lt;/strong&gt; - What happens when a registered service fails?  Sometimes it is unregistered immediately,
after a timeout, or by another process.  Services are usually required to implement a heartbeating mechanism
to ensure liveness and clients typically need to be able to handle failed services reliably.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Load Balancing&lt;/strong&gt; - If multiple services are registered, how do all the clients balance
the load across the services?  If there is a master, can it be determined by a client correctly?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Integration Style&lt;/strong&gt; - Does the registry only provide a few language bindings, for example, only Java?
Does integrating require embedding registration and discovery code into your application or is a
&lt;em&gt;sidekick&lt;/em&gt; process an option?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Runtime Dependencies&lt;/strong&gt; - Does it require the JVM, Ruby or something that is not compatible
with your environment?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Availability Concerns&lt;/strong&gt; - Can you lose a node and still function?  Can it be upgraded without
incurring an outage?  The registry will grow to be a central part of your architecture and could
be a single point of failure.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&#34;general-purpose-registries:eabab4b56ce615c8817b5fecb3a27fdf&#34;&gt;General Purpose Registries&lt;/h1&gt;

&lt;p&gt;These first three registries use strongly consistent protocols and are actually general purpose, consistent
datastores.  Although we&amp;rsquo;re looking at them as service registries, they are typically used for coordination
services to aid in leader election or centralized locking with a distributed set of clients.&lt;/p&gt;

&lt;h2 id=&#34;zookeeper:eabab4b56ce615c8817b5fecb3a27fdf&#34;&gt;Zookeeper&lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;http://zookeeper.apache.org/&#34;&gt;Zookeeper&lt;/a&gt; is a centralized service for maintaining configuration
information, naming, providing distributed synchronization, and providing group services.  It&amp;rsquo;s written in Java, is strongly consistent (CP) and uses the &lt;a href=&#34;http://www.stanford.edu/class/cs347/reading/zab.pdf&#34;&gt;Zab&lt;/a&gt; protocol to coordinate changes across the ensemble (cluster).&lt;/p&gt;

&lt;p&gt;Zookeeper is typically run with three, five or seven members in the ensemble.  Clients use language
specific &lt;a href=&#34;https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZKClientBindings&#34;&gt;bindings&lt;/a&gt; in order to access the ensemble.
Access is typically embedded into the client applications and services.&lt;/p&gt;

&lt;p&gt;Service registration is implemented with &lt;a href=&#34;http://zookeeper.apache.org/doc/r3.2.1/zookeeperProgrammers.html#Ephemeral+Nodes&#34;&gt;ephemeral nodes&lt;/a&gt;
under a namespace.  Ephemeral nodes only exist while the client is connected so typically a backend service registers itself, after startup, with its
location information.  If it fails or disconnects, the node disappears from the tree.&lt;/p&gt;

&lt;p&gt;Service discovery is implemented by listing and watching the namespace for the service.  Clients
receive all the currently registered services as well as notifications when a service becomes unavailable
or new ones register.  Clients also need to handle any load balancing or failover themselves.&lt;/p&gt;
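
&lt;p&gt;A rough sketch of both sides using the stock &lt;code&gt;zkCli.sh&lt;/code&gt; shell; the paths and data are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ zkCli.sh -server zk1:2181
# Registration: an ephemeral node that vanishes when the session ends
create /services/web &#39;&#39;
create -e /services/web/host1 10.0.0.5:8000
# Discovery: list the children and set a watch for changes
ls /services/web true
[host1]
&lt;/code&gt;&lt;/pre&gt;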

&lt;p&gt;The Zookeeper API can be difficult to use properly and language bindings might have subtle differences
that could cause problems.
If you&amp;rsquo;re using a JVM based language, the &lt;a href=&#34;http://curator.apache.org/curator-x-discovery/index.html&#34;&gt;Curator Service Discovery Extension&lt;/a&gt; might be of some use.&lt;/p&gt;

&lt;p&gt;Since Zookeeper is a CP system, when a &lt;a href=&#34;http://wiki.apache.org/hadoop/ZooKeeper/FailureScenarios&#34;&gt;partition&lt;/a&gt; occurs,
parts of your system will not be able to register or find existing registrations, even if they could otherwise function properly during the
partition.  Specifically, on any non-quorum side, reads and writes will return an error.&lt;/p&gt;

&lt;h2 id=&#34;doozer:eabab4b56ce615c8817b5fecb3a27fdf&#34;&gt;Doozer&lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;https://github.com/ha/doozerd&#34;&gt;Doozer&lt;/a&gt; is a consistent, distributed data store.  It&amp;rsquo;s written in Go,
is strongly consistent and uses &lt;a href=&#34;http://research.microsoft.com/en-us/um/people/lamport/pubs/lamport-paxos.pdf&#34;&gt;Paxos&lt;/a&gt;
to maintain consensus.  The project has been around for a number of years but has stagnated for
a while and now has close to 160 forks.  Unfortunately, this makes it
difficult to know what the actual state of the project is and whether it is suitable for production
use.&lt;/p&gt;

&lt;p&gt;Doozer is typically run with three, five or seven nodes in the cluster.  Clients use language
specific bindings to access the cluster and, similar to Zookeeper, integration is embedded
into the client and services.&lt;/p&gt;

&lt;p&gt;Service registration is not as straightforward as with Zookeeper because Doozer does not have any
concept of ephemeral nodes.  A service can register itself under a path but if the service becomes
unavailable, it won&amp;rsquo;t be removed automatically.&lt;/p&gt;

&lt;p&gt;There are a number of ways to address this issue. One option might be to add a timestamp and
heartbeating mechanism to the registration process and
handle expired entries during the discovery process or with another cleanup processes.&lt;/p&gt;

&lt;p&gt;Service discovery is similar to Zookeeper in that you can list all the entries under a path and
then wait for any changes to that path.  If you use a timestamp and heartbeat during registration, you
would ignore or delete any expired entries during discovery.&lt;/p&gt;

&lt;p&gt;Like Zookeeper, Doozer is also a CP system and has the same consequences when a partition occurs.&lt;/p&gt;

&lt;h2 id=&#34;etcd:eabab4b56ce615c8817b5fecb3a27fdf&#34;&gt;Etcd&lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;https://github.com/coreos/etcd&#34;&gt;Etcd&lt;/a&gt; is a highly-available, key-value store for shared configuration and service discovery.  Etcd
was inspired by Zookeeper and Doozer.  It&amp;rsquo;s written in Go, uses &lt;a href=&#34;https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf&#34;&gt;Raft&lt;/a&gt;
for consensus and has an HTTP+JSON based API.&lt;/p&gt;

&lt;p&gt;Etcd, similar to Doozer and Zookeeper, is usually run with three, five or seven nodes in the cluster.
Clients use a language specific binding or implement one using an HTTP client.&lt;/p&gt;

&lt;p&gt;Service registration relies on &lt;a href=&#34;https://github.com/coreos/etcd#using-key-ttl&#34;&gt;using a key TTL&lt;/a&gt; along
with heartbeating from the service to ensure the key remains available.  If a service fails to
update the key&amp;rsquo;s TTL, Etcd will expire it.  If a service becomes unavailable,
clients will need to handle the connection failure and try another service instance.&lt;/p&gt;

&lt;p&gt;Service discovery involves listing the keys under a directory and then waiting for changes on the
directory.  Since the API is HTTP based, the client application keeps a long-polling connection
open with the Etcd cluster.&lt;/p&gt;
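
&lt;p&gt;A minimal sketch of both sides with &lt;code&gt;curl&lt;/code&gt;, assuming a &lt;code&gt;/services/web&lt;/code&gt; directory layout:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ # Registration: set a key with a TTL and re-PUT it periodically as a heartbeat
$ curl -s -XPUT http://etcd:4001/v2/keys/services/web/host1 -d value=&#39;10.0.0.5:8000&#39; -d ttl=15

$ # Discovery: list the directory, then long-poll for any change under it
$ curl -s &#39;http://etcd:4001/v2/keys/services/web?recursive=true&#39;
$ curl -s &#39;http://etcd:4001/v2/keys/services/web?wait=true&amp;amp;recursive=true&#39;
&lt;/code&gt;&lt;/pre&gt;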

&lt;p&gt;Since Etcd uses &lt;a href=&#34;https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf&#34;&gt;Raft&lt;/a&gt;, it
should be a strongly-consistent system.  Raft requires a leader to be elected and all client requests are handled by
the leader. However, Etcd also seems to support reads from non-leaders using this &lt;a href=&#34;https://github.com/coreos/etcd/blob/master/server/v2/get_handler.go#L25&#34;&gt;undocumented consistent parameter&lt;/a&gt; which would improve
availability in the read case.  Writes would still need to be handled by the leader during a partition and could
fail.&lt;/p&gt;

&lt;h1 id=&#34;single-purpose-registries:eabab4b56ce615c8817b5fecb3a27fdf&#34;&gt;Single Purpose Registries&lt;/h1&gt;

&lt;p&gt;These next few registration services and approaches are specifically tailored to service registration
and discovery.  Most have come about from actual production use cases while others are interesting
and different approaches to the problem.  Whereas Zookeeper, Doozer and Etcd could also be used for
distributed coordination, these solutions don&amp;rsquo;t have that capability.&lt;/p&gt;

&lt;h2 id=&#34;airbnb-s-smartstack:eabab4b56ce615c8817b5fecb3a27fdf&#34;&gt;Airbnb&amp;rsquo;s SmartStack&lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;http://nerds.airbnb.com/smartstack-service-discovery-cloud/&#34;&gt;Airbnb&amp;rsquo;s SmartStack&lt;/a&gt; is a combination
of two custom tools, &lt;a href=&#34;https://github.com/airbnb/nerve&#34;&gt;Nerve&lt;/a&gt; and &lt;a href=&#34;https://github.com/airbnb/synapse&#34;&gt;Synapse&lt;/a&gt;
that leverage &lt;a href=&#34;http://haproxy.1wt.eu/&#34;&gt;haproxy&lt;/a&gt; and &lt;a href=&#34;http://zookeeper.apache.org/&#34;&gt;Zookeeper&lt;/a&gt; to handle
service registration and discovery.  Both Nerve and Synapse are written in Ruby.&lt;/p&gt;

&lt;p&gt;Nerve is a &lt;em&gt;sidekick&lt;/em&gt; style process that runs as a separate process alongside the application service.
Nerve is responsible for registering services in Zookeeper.  Applications expose a &lt;code&gt;/health&lt;/code&gt; endpoint,
for HTTP services, that Nerve continuously monitors.  Provided the service is available, it will be
registered in Zookeeper.&lt;/p&gt;
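
&lt;p&gt;In effect, Nerve&amp;rsquo;s HTTP check boils down to something like the following; the port and path are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ curl -sf http://localhost:8080/health &amp;amp;&amp;amp; echo healthy || echo unhealthy
&lt;/code&gt;&lt;/pre&gt;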

&lt;p&gt;The &lt;em&gt;sidekick&lt;/em&gt; model eliminates the need for a service to interact with Zookeeper. It simply needs
a monitoring endpoint in order to be registered.  This makes it much easier to support services in different
languages where robust Zookeeper bindings might not exist.  This also provides many of the benefits of the
&lt;a href=&#34;http://en.wikipedia.org/wiki/Hollywood_principle&#34;&gt;Hollywood principle&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Synapse is also a &lt;em&gt;sidekick&lt;/em&gt; style process that runs as a separate process alongside the service.
Synapse is responsible for service discovery.  It does this by querying Zookeeper for currently
registered services and reconfigures a locally running haproxy instance.  Any clients on the host
that need to access another service always access the local haproxy instance, which will route the
request to an available service.&lt;/p&gt;

&lt;p&gt;Synapse&amp;rsquo;s design simplifies service implementations in that they do not need to implement any client
side load balancing or failover and they do not need to depend on Zookeeper or its language bindings.&lt;/p&gt;

&lt;p&gt;Since SmartStack relies on Zookeeper, some registrations and discovery may fail during a partition.
They point out that Zookeeper is their &amp;ldquo;Achilles heel&amp;rdquo; in this setup.
Provided a service has been able to discover the other services, at least once, before a partition, it should still
have a snapshot of the services after the partition and may be able to continue operating during the
partition.  This aspect improves the availability and reliability of the overall system.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Update: If you&amp;rsquo;re interested in a SmartStack style solution for docker containers, check out &lt;a href=&#34;http://jasonwilder.com/blog/2014/07/15/docker-service-discovery&#34;&gt;docker service discovery&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;

&lt;h2 id=&#34;netflix-s-eureka:eabab4b56ce615c8817b5fecb3a27fdf&#34;&gt;Netflix&amp;rsquo;s Eureka&lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;https://github.com/Netflix/eureka&#34;&gt;Eureka&lt;/a&gt; is Netflix&amp;rsquo;s middle-tier, load balancing and discovery
service.  There is a server component as well as a smart-client that is used within application
services.  The server and client are written in Java which means the ideal use case would be for
the services to also be implemented in Java or another JVM compatible language.&lt;/p&gt;

&lt;p&gt;The Eureka server is the registry for services.  They recommend running one Eureka server in each
availability zone in AWS to form a cluster.  The servers replicate their state to each
other through an asynchronous model, which means each instance may have a slightly different picture
of all the services at any given time.&lt;/p&gt;

&lt;p&gt;Service registration is handled by the client component.  Services embed the client in their application
code. At runtime, the client registers the service and periodically sends heartbeats to renew its leases.&lt;/p&gt;

&lt;p&gt;Service discovery is handled by the smart-client as well.  It retrieves the current registrations from the
server and caches them locally.  The client periodically refreshes its state and also handles load
balancing and failovers.&lt;/p&gt;
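
&lt;p&gt;The smart-client is the normal integration path, but the underlying REST operations look roughly like the following sketch; hostnames, ports, and IDs are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ # Register an instance (details in instance.json), renew its lease, query the app
$ curl -s -X POST -H &#39;Content-Type: application/json&#39; -d @instance.json http://eureka:8080/eureka/v2/apps/MYAPP
$ curl -s -X PUT http://eureka:8080/eureka/v2/apps/MYAPP/i-6d2f4a
$ curl -s http://eureka:8080/eureka/v2/apps/MYAPP
&lt;/code&gt;&lt;/pre&gt;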

&lt;p&gt;Eureka was designed to be very resilient during failures. It favors availability over
strong consistency and can operate under a number of different failure modes.  If there is a partition within the cluster,
Eureka transitions to a self-preservation state.  It will allow services to be discovered and registered
during a partition and when it heals, the members will merge their state again.&lt;/p&gt;

&lt;h2 id=&#34;bitly-s-nsq-lookupd:eabab4b56ce615c8817b5fecb3a27fdf&#34;&gt;Bitly&amp;rsquo;s NSQ lookupd&lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;https://github.com/bitly/nsq&#34;&gt;NSQ&lt;/a&gt; is a realtime, distributed messaging platform. It&amp;rsquo;s written in Go
and provides an HTTP based API.  While it&amp;rsquo;s not a general purpose service registration and discovery tool,
they have implemented a novel model of service discovery in their
&lt;a href=&#34;http://bitly.github.io/nsq/components/nsqlookupd.html&#34;&gt;nsqlookupd&lt;/a&gt; agent
in order for clients to find &lt;a href=&#34;http://bitly.github.io/nsq/components/nsqd.html&#34;&gt;nsqd&lt;/a&gt; instances at
runtime.&lt;/p&gt;

&lt;p&gt;In an NSQ deployment, the nsqd instances are essentially the service.  These are the message stores.
nsqlookupd is the service registry.  Clients connect directly to nsqd instances but since these may
change at runtime, clients can discover the available instances by querying nsqlookupd instances.&lt;/p&gt;

&lt;p&gt;For service registration, each nsqd instance periodically sends a heartbeat of its state to each nsqlookupd
instance.  Their state includes their address and any queues or topics they have.&lt;/p&gt;

&lt;p&gt;For discovery, clients query each nsqlookupd instance and merge the results.&lt;/p&gt;
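
&lt;p&gt;For example, a client might union the producer lists returned by two nsqlookupd instances; the hostnames and topic are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ curl -s &#39;http://nsqlookupd-1:4161/lookup?topic=events&#39;
$ curl -s &#39;http://nsqlookupd-2:4161/lookup?topic=events&#39;
&lt;/code&gt;&lt;/pre&gt;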

&lt;p&gt;What is interesting about this model is that the nsqlookupd instances &lt;em&gt;do not know about each other&lt;/em&gt;.
It&amp;rsquo;s the responsibility of the clients to merge the state returned from each stand-alone nsqlookupd instance to
determine the overall state.  Because each nsqd instance heartbeats its state, each nsqlookupd eventually
has the same information provided each nsqd instance can contact all available nsqlookupd instances.&lt;/p&gt;

&lt;p&gt;All of the previously discussed registry components form a cluster and use strongly or weakly consistent
consensus protocols to maintain their state. The NSQ design is inherently weakly consistent but
very tolerant to partitions.&lt;/p&gt;

&lt;h2 id=&#34;serf:eabab4b56ce615c8817b5fecb3a27fdf&#34;&gt;Serf&lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;http://serfdom.io&#34;&gt;Serf&lt;/a&gt; is a decentralized solution for service discovery and orchestration.  It is also
written in Go and is unique in that it uses a gossip based protocol, &lt;a href=&#34;http://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf&#34;&gt;SWIM&lt;/a&gt;, for membership, failure detection and custom event propagation.  SWIM was designed to address the unscalability of traditional heart-beating protocols.&lt;/p&gt;

&lt;p&gt;Serf consists of a single binary that is installed on all hosts.  It can be run as an agent, where it
joins or creates a cluster, or as a client where it can discover the members in the cluster.&lt;/p&gt;

&lt;p&gt;For service registration, a serf agent is run that joins an existing cluster.  The agent is started
with custom tags that can identify the host&amp;rsquo;s role, env, ip, ports, etc.  Once joined to the cluster, other
members will be able to see this host and its metadata.&lt;/p&gt;

&lt;p&gt;For discovery, serf is run with the &lt;code&gt;members&lt;/code&gt; command which returns the current members of the
cluster. Using
the members output, you can discover all the hosts for a service based on the tags their agent was started
with.&lt;/p&gt;
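
&lt;p&gt;A quick sketch of both sides; the tags and output are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ # Registration: join the cluster with descriptive tags
$ serf agent -node=web-1 -tag role=web -tag port=8000

$ # Discovery: from any member, list the cluster and filter on tags
$ serf members
web-1  10.0.0.5:7946  alive  role=web,port=8000
db-1   10.0.0.9:7946  alive  role=db,port=5432
&lt;/code&gt;&lt;/pre&gt;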

&lt;p&gt;Serf is a relatively new project and is evolving quickly. It is the only project in this post that
does not have a central registry architectural style, which makes it unique.  Since it uses an asynchronous, gossip based protocol, it is inherently weakly-consistent but more fault tolerant and available.&lt;/p&gt;

&lt;h2 id=&#34;spotify-and-dns:eabab4b56ce615c8817b5fecb3a27fdf&#34;&gt;Spotify and DNS&lt;/h2&gt;

&lt;p&gt;Spotify described their use of DNS for service discovery in their post
&lt;a href=&#34;http://labs.spotify.com/tag/dns/&#34;&gt;In praise of &amp;ldquo;boring&amp;rdquo; technology&lt;/a&gt;.  Instead of using a newer, less
mature technology they opted to build on top of DNS.  Spotify views DNS as a
&amp;ldquo;distributed, replicated database tailored for read-heavy loads.&amp;rdquo;&lt;/p&gt;

&lt;p&gt;Spotify uses the relatively unknown &lt;a href=&#34;http://en.wikipedia.org/wiki/SRV_record&#34;&gt;SRV record&lt;/a&gt; which is intended
for service discovery.  SRV records can be thought of as a more generalized MX record.  They allow you
to define a service name, protocol, TTL, priority, weight, port and target host.  Basically, everything
a client would need to find all available services and load balance against them if necessary.&lt;/p&gt;
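
&lt;p&gt;For example, querying a hypothetical SRV record with &lt;code&gt;dig&lt;/code&gt; returns the priority, weight, port and target for each instance:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ dig +short SRV _web._tcp.example.com
10 20 8000 web1.example.com.
10 40 8000 web2.example.com.
&lt;/code&gt;&lt;/pre&gt;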

&lt;p&gt;Service registration is complicated and fairly static in their setup since they manage all zone files
under source control.  Discovery uses a number of different DNS client libraries and custom tools.  They
also run DNS caches on their servers to minimize load on the root DNS server.&lt;/p&gt;

&lt;p&gt;They mention at the end of their post that this model has worked well for them but they are starting
to outgrow it and are investigating Zookeeper to support both static and dynamic registration.&lt;/p&gt;

&lt;h2 id=&#34;skydns:eabab4b56ce615c8817b5fecb3a27fdf&#34;&gt;SkyDNS&lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;https://github.com/skynetservices/skydns&#34;&gt;SkyDNS&lt;/a&gt; is a relatively new project that is written in Go,
uses RAFT for consensus and also provides a client API over HTTP and DNS.  It has some
similarities to Etcd and Spotify&amp;rsquo;s DNS model and actually uses the same RAFT implementation as Etcd,
&lt;a href=&#34;https://github.com/goraft/raft&#34;&gt;go-raft&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;SkyDNS servers are clustered together, and using the RAFT protocol, elect a leader.  The SkyDNS servers
expose different endpoints for registration and discovery.&lt;/p&gt;

&lt;p&gt;For service registration, services use an HTTP based API to create an entry with a TTL.  Services must
heartbeat their state periodically.  SkyDNS also uses SRV records but extends them to also support
service version, environment, and region.&lt;/p&gt;

&lt;p&gt;For discovery, clients use DNS and retrieve the SRV records for the services they need to contact.
Clients need to implement any load balancing or failover and will likely cache and refresh service
location data periodically.&lt;/p&gt;
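
&lt;p&gt;Very roughly, registration and discovery might look like the following; the exact paths, JSON fields and domain layout come from the project README and may differ:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ # Register a service instance over HTTP with a TTL
$ curl -s -X PUT http://localhost:8080/skydns/services/1001 \
    -d &#39;{&amp;quot;Name&amp;quot;:&amp;quot;web&amp;quot;,&amp;quot;Environment&amp;quot;:&amp;quot;production&amp;quot;,&amp;quot;Host&amp;quot;:&amp;quot;10.0.0.5&amp;quot;,&amp;quot;Port&amp;quot;:8000,&amp;quot;TTL&amp;quot;:10}&#39;

$ # Discover it via DNS
$ dig @localhost web.production.skydns.local SRV
&lt;/code&gt;&lt;/pre&gt;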

&lt;p&gt;Unlike Spotify&amp;rsquo;s use of DNS, SkyDNS does support dynamic service registration and is able to do this
without depending on another external service such as Zookeeper or Etcd.&lt;/p&gt;

&lt;p&gt;If you are using &lt;a href=&#34;http://docker.io&#34;&gt;docker&lt;/a&gt;, &lt;a href=&#34;https://github.com/crosbymichael/skydock&#34;&gt;skydock&lt;/a&gt; might be worth checking out to integrate your containers with SkyDNS automatically.&lt;/p&gt;

&lt;p&gt;Overall, this is an interesting mix of old (DNS) and new (Go, RAFT) technology, and it will be interesting
to see how the project evolves.&lt;/p&gt;

&lt;h1 id=&#34;summary:eabab4b56ce615c8817b5fecb3a27fdf&#34;&gt;Summary&lt;/h1&gt;

&lt;p&gt;We&amp;rsquo;ve looked at a number of general purpose, strongly consistent registries (Zookeeper, Doozer, Etcd)
as well as many custom built, eventually consistent ones (SmartStack, Eureka, NSQ, Serf, Spotify&amp;rsquo;s DNS, SkyDNS).&lt;/p&gt;

&lt;p&gt;Many use embedded client libraries (Eureka, NSQ, etc..) and some use separate sidekick processes
(SmartStack, Serf).&lt;/p&gt;

&lt;p&gt;Interestingly, of the dedicated solutions, all of them have adopted a design that
prefers availability over consistency.&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;AP or CP&lt;/th&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;Dependencies&lt;/th&gt;
&lt;th&gt;Integration&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;

&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Zookeeper&lt;/td&gt;
&lt;td&gt;General&lt;/td&gt;
&lt;td&gt;CP&lt;/td&gt;
&lt;td&gt;Java&lt;/td&gt;
&lt;td&gt;JVM&lt;/td&gt;
&lt;td&gt;Client Binding&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;Doozer&lt;/td&gt;
&lt;td&gt;General&lt;/td&gt;
&lt;td&gt;CP&lt;/td&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;Client Binding&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;Etcd&lt;/td&gt;
&lt;td&gt;General&lt;/td&gt;
&lt;td&gt;Mixed (1)&lt;/td&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;Client Binding/HTTP&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;SmartStack&lt;/td&gt;
&lt;td&gt;Dedicated&lt;/td&gt;
&lt;td&gt;AP&lt;/td&gt;
&lt;td&gt;Ruby&lt;/td&gt;
&lt;td&gt;haproxy/Zookeeper&lt;/td&gt;
&lt;td&gt;Sidekick (nerve/synapse)&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;Eureka&lt;/td&gt;
&lt;td&gt;Dedicated&lt;/td&gt;
&lt;td&gt;AP&lt;/td&gt;
&lt;td&gt;Java&lt;/td&gt;
&lt;td&gt;JVM&lt;/td&gt;
&lt;td&gt;Java Client&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;NSQ (lookupd)&lt;/td&gt;
&lt;td&gt;Dedicated&lt;/td&gt;
&lt;td&gt;AP&lt;/td&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;Client Binding&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;Serf&lt;/td&gt;
&lt;td&gt;Dedicated&lt;/td&gt;
&lt;td&gt;AP&lt;/td&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;Local CLI&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;Spotify (DNS)&lt;/td&gt;
&lt;td&gt;Dedicated&lt;/td&gt;
&lt;td&gt;AP&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Bind&lt;/td&gt;
&lt;td&gt;DNS Library&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;SkyDNS&lt;/td&gt;
&lt;td&gt;Dedicated&lt;/td&gt;
&lt;td&gt;Mixed (2)&lt;/td&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;HTTP/DNS Library&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;(1) If using the &lt;code&gt;consistent&lt;/code&gt; parameter, inconsistent reads are possible&lt;/p&gt;

&lt;p&gt;(2) If using a caching DNS client in front of SkyDNS, reads could be inconsistent&lt;/p&gt;
</description>
        </item>
        
        <item>
            <title>Fluentd vs Logstash</title>
            <link>http://jasonwilder.com/blog/2013/11/19/fluentd-vs-logstash/</link>
            <pubDate>Tue, 19 Nov 2013 00:00:00 UTC</pubDate>
            
            <guid>http://jasonwilder.com/blog/2013/11/19/fluentd-vs-logstash/</guid>
            <description>

&lt;p&gt;&lt;a href=&#34;http://fluentd.org&#34;&gt;Fluentd&lt;/a&gt; and &lt;a href=&#34;http://logstash.net&#34;&gt;Logstash&lt;/a&gt; are two open-source projects that
focus on the problem of centralized logging.  Both projects address the &lt;a href=&#34;http://jasonwilder.com/blog/2013/07/16/centralized-logging-architecture/&#34;&gt;collection and transport&lt;/a&gt;
aspect of centralized logging using different approaches.&lt;/p&gt;

&lt;p&gt;This post will walk through a sample deployment to see how each differs from the other.  We&amp;rsquo;ll look
at the dependencies, features, deployment architecture and potential issues.  The point is not to figure out
which one is the best, but rather to see which one would be a better fit for your environment.&lt;/p&gt;

&lt;p&gt;The example setup we&amp;rsquo;ll walk through is collecting web server logs on multiple hosts and archiving
them to S3:&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;http://jasonwilder.com/images/centralized-logging-s3.png&#34; alt=&#34;Centralized Logs With S3&#34; /&gt;&lt;/p&gt;

&lt;p&gt;This type of architecture would be suitable for archival or processing with
&lt;a href=&#34;http://hive.apache.org/&#34;&gt;Hive&lt;/a&gt; or &lt;a href=&#34;http://pig.apache.org/&#34;&gt;Pig&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Another common architecture is storing logs in &lt;a href=&#34;http://www.elasticsearch.org/&#34;&gt;ElasticSearch&lt;/a&gt; to
make them searchable with &lt;a href=&#34;http://www.elasticsearch.org/overview/kibana/&#34;&gt;Kibana&lt;/a&gt;
or &lt;a href=&#34;http://graylog2.org/&#34;&gt;Graylog2&lt;/a&gt;. Setting that up is somewhat independent of using Logstash
or Fluentd so I&amp;rsquo;ve left that out to keep things simple.&lt;/p&gt;

&lt;h2 id=&#34;installation-requirements:d94954b9a471ac6eb86d0c66c558309a&#34;&gt;Installation Requirements&lt;/h2&gt;

&lt;h3 id=&#34;logstash:d94954b9a471ac6eb86d0c66c558309a&#34;&gt;Logstash&lt;/h3&gt;

&lt;p&gt;Logstash is a &lt;a href=&#34;http://jruby.org/&#34;&gt;JRuby&lt;/a&gt; based application which requires the JVM to run.  Since it runs on the JVM, it
can run anywhere the JVM does, which usually means Linux, Mac OSX, and Windows.  The package is shipped
as a single executable jar file which makes it very easy to install.&lt;/p&gt;

&lt;p&gt;One of the downsides of depending on the JVM is that its memory footprint can be higher than you
would want for transporting logs.  Fortunately, &lt;a href=&#34;https://github.com/elasticsearch/logstash-forwarder&#34;&gt;Lumberjack&lt;/a&gt;
can be run on individual hosts to collect and ship logs and Logstash can be run on the
centralized log hosts.&lt;/p&gt;

&lt;p&gt;Lumberjack is a Go based project with a much smaller memory
and CPU footprint. Deployment is as straightforward as Logstash and is basically installing a single
binary.  The project provides &lt;code&gt;deb&lt;/code&gt; and &lt;code&gt;rpm&lt;/code&gt; packages to make it easier to deploy.
An SSL certificate is required to set up authentication between Lumberjack and Logstash, which is
a little more complicated, but has the nice benefit that encrypted transport is the default.&lt;/p&gt;
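
&lt;p&gt;Generating a self-signed certificate for the Lumberjack-to-Logstash link might look like this; the CN is illustrative and must match the host Lumberjack connects to:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ openssl req -x509 -nodes -newkey rsa:2048 -days 365 \
    -keyout logstash.key -out logstash.crt -subj /CN=logs.example.com
&lt;/code&gt;&lt;/pre&gt;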

&lt;h3 id=&#34;fluentd:d94954b9a471ac6eb86d0c66c558309a&#34;&gt;Fluentd&lt;/h3&gt;

&lt;p&gt;Fluentd is a &lt;a href=&#34;http://en.wikipedia.org/wiki/Ruby_MRI&#34;&gt;CRuby&lt;/a&gt; application which requires Ruby 1.9.2 or later.  There is an open-source
version, fluentd, as well as a commercial version, td-agent.  Fluentd runs on Linux and Mac OSX,
but &lt;a href=&#34;http://docs.fluentd.org/articles/faq#does-fluentd-run-on-windows&#34;&gt;does not run on Windows&lt;/a&gt; currently.&lt;/p&gt;

&lt;p&gt;For larger installs, they recommend using &lt;a href=&#34;http://www.canonware.com/jemalloc/&#34;&gt;jemalloc&lt;/a&gt; to
avoid memory fragmentation.  This is included in the &lt;code&gt;deb&lt;/code&gt; and &lt;code&gt;rpm&lt;/code&gt; packages but needs to be installed
manually if using the open-source version.&lt;/p&gt;

&lt;p&gt;If you use the open-source version, you&amp;rsquo;ll need to install Fluentd from source or via &lt;code&gt;gem install&lt;/code&gt;.
Since Fluentd is primarily developed by a commercial company, their &lt;code&gt;deb&lt;/code&gt; and &lt;code&gt;rpm&lt;/code&gt; packages are
configured to send data to their hosted centralized logging platform.&lt;/p&gt;

&lt;p&gt;Apart from Ruby, they also recommend running &lt;code&gt;ntpd&lt;/code&gt; which you should be running anyways.&lt;/p&gt;

&lt;h2 id=&#34;feature-comparison:d94954b9a471ac6eb86d0c66c558309a&#34;&gt;Feature Comparison&lt;/h2&gt;

&lt;p&gt;Logstash supports a number of inputs, codecs, filters and outputs.  Inputs are sources of data.
Codecs essentially convert an incoming format into an internal Logstash representation as well
as convert back out to an output format.  These are usually used if the incoming message is not
just a single line of text.  Filters are processing actions on events and
allow you to modify events or drop events as they are processed.  Finally, outputs are destinations
where events can be routed.&lt;/p&gt;
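
&lt;p&gt;A minimal Logstash config for the S3 archival setup described here might look like the following sketch; the paths, port, and bucket are illustrative, and the s3 output also needs AWS credentials:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;input {
  lumberjack {
    port =&amp;gt; 5043
    ssl_certificate =&amp;gt; &amp;quot;/etc/ssl/logstash.crt&amp;quot;
    ssl_key =&amp;gt; &amp;quot;/etc/ssl/logstash.key&amp;quot;
  }
}
output {
  s3 {
    bucket =&amp;gt; &amp;quot;my-log-archive&amp;quot;
  }
}
&lt;/code&gt;&lt;/pre&gt;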

&lt;p&gt;Fluentd is similar in that it has inputs and outputs and a matching mechanism to route
log data between destinations.  Internally,
log messages are converted to JSON which provides structure to an unstructured log message.
Messages can be tagged and then routed to different outputs.&lt;/p&gt;
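
&lt;p&gt;An equivalent Fluentd sketch, tailing a log file and routing by tag; the paths, tags, and bucket are illustrative, and the S3 output also needs AWS credentials:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;source&amp;gt;
  type tail
  path /var/log/nginx/access.log
  pos_file /var/log/fluentd/access.pos
  tag web.access
  format nginx
&amp;lt;/source&amp;gt;

&amp;lt;match web.**&amp;gt;
  type s3
  s3_bucket my-log-archive
  path logs/
  buffer_path /var/log/fluentd/s3
&amp;lt;/match&amp;gt;
&lt;/code&gt;&lt;/pre&gt;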

&lt;p&gt;Both projects have very similar capabilities and highlighting the difference between them from a
feature standpoint is difficult.  They both have plugin models that allow you to extend their functionality
if needed.  They also have a rich repository of plugins already available.&lt;/p&gt;

&lt;p&gt;Probably the most significant difference between Fluentd and Logstash is their design focus.
&lt;em&gt;Logstash emphasizes flexibility and interoperability&lt;/em&gt; whereas
&lt;em&gt;Fluentd prioritizes simplicity and robustness&lt;/em&gt;.  This does not mean that
Logstash is not robust or Fluentd is not flexible, rather each has prioritized
features differently.&lt;/p&gt;

&lt;p&gt;Fluentd has fewer inputs than Logstash, out of the box, but many of the inputs
and outputs have built-in support for buffering, load-balancing, timeouts and retries.  These
types of features are necessary for ensuring data is reliably delivered.&lt;/p&gt;

&lt;p&gt;For example, the
&lt;a href=&#34;http://docs.fluentd.org/articles/out_forward&#34;&gt;out_forward&lt;/a&gt; plugin used to transfer logs from one
fluentd instance to another has many robustness options that can be configured to ensure messages
are delivered reliably.&lt;/p&gt;

&lt;h2 id=&#34;architecture-comparison:d94954b9a471ac6eb86d0c66c558309a&#34;&gt;Architecture comparison&lt;/h2&gt;

&lt;p&gt;From a deployment architecture standpoint, both frameworks are very similar.  With Logstash, each
web server would be configured to run Lumberjack and tail the web server logs.  Lumberjack would forward the logs
to a server running Logstash with a Lumberjack input.  The Logstash server would also have an
output configured using the &lt;a href=&#34;http://logstash.net/docs/1.2.2/outputs/s3&#34;&gt;S3 output&lt;/a&gt;.  Since Lumberjack
requires SSL certs, the log transfers would be encrypted from the web server to the log server.&lt;/p&gt;

&lt;p&gt;With fluentd, each web server would run fluentd and tail the web server logs and forward them to
another server running fluentd as well.  This server would be configured to receive logs and write
them to S3 using the &lt;a href=&#34;https://github.com/fluent/fluent-plugin-s3&#34;&gt;S3 plugin&lt;/a&gt;.  Fluentd does not
support encryption so logs would be transferred unencrypted.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Update: &lt;a href=&#34;https://twitter.com/repeatedly/status/402869393148760064&#34;&gt;@repeatedly&lt;/a&gt; pointed me to the
&lt;a href=&#34;https://github.com/tagomoris/fluent-plugin-secure-forward&#34;&gt;fluent-plugin-secure-forward&lt;/a&gt;
that some companies are using for encrypted transport.&lt;/em&gt;&lt;/p&gt;

&lt;h3 id=&#34;improving-availability:d94954b9a471ac6eb86d0c66c558309a&#34;&gt;Improving Availability&lt;/h3&gt;

&lt;p&gt;One central log server is a single point of failure.  What happens if we wanted to have more than
one central log server?&lt;/p&gt;

&lt;p&gt;Lumberjack can be configured to use &lt;a href=&#34;https://github.com/elasticsearch/logstash-forwarder#configuring&#34;&gt;multiple servers&lt;/a&gt;
but will only send logs to one until that one fails.  If that happens, previously collected log data
won&amp;rsquo;t be accessible until that host is resurrected.  Essentially, it supports a master with hot-standby
servers.&lt;/p&gt;

&lt;p&gt;Fluentd on the other hand can forward two copies of the logs to each server if needed, load-balance
between multiple hosts or have a master with a hot-standby in case of failure.  There are a lot of
options for not only improving availability but also scalability if your log volume increases
substantially. Keep in mind that if you forward multiple copies, this could create duplicate logs
in S3 which might need to be handled when you analyze them.&lt;/p&gt;
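
&lt;p&gt;For example, a forward output with a hot-standby might be configured like this sketch (hosts are illustrative); forwarding duplicate copies would instead wrap two forward stores in the copy output:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;match web.**&amp;gt;
  type forward
  &amp;lt;server&amp;gt;
    host log1.example.com
  &amp;lt;/server&amp;gt;
  &amp;lt;server&amp;gt;
    host log2.example.com
    standby
  &amp;lt;/server&amp;gt;
&amp;lt;/match&amp;gt;
&lt;/code&gt;&lt;/pre&gt;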

&lt;h2 id=&#34;potential-issues:d94954b9a471ac6eb86d0c66c558309a&#34;&gt;Potential Issues&lt;/h2&gt;

&lt;p&gt;The &lt;a href=&#34;http://logstash.net/docs/1.2.2/tutorials/getting-started-centralized&#34;&gt;Logstash docs&lt;/a&gt; suggests
using &lt;a href=&#34;http://redis.io&#34;&gt;Redis&lt;/a&gt; as the receiving output if you run Logstash (not Lumberjack) on each
host.  This setup is based on Redis Lists and/or Pub/Sub which can lose messages if
the receiver dies after removing the message from Redis and before it has had a chance to forward it
along.  Additionally, Redis would need to be configured with &lt;a href=&#34;http://redis.io/topics/persistence&#34;&gt;AOF&lt;/a&gt;
to minimize the chance of lost messages if Redis were to fail.&lt;/p&gt;

&lt;p&gt;There is a document describing &lt;a href=&#34;http://logstash.net/docs/1.2.2/life-of-an-event&#34;&gt;the life of an event&lt;/a&gt;
that discusses some of the failure modes and how Logstash addresses them.  One important point is
that outputs are responsible for retrying events in the case of errors.  There are also internal,
ephemeral queues within Logstash that can hold up to 20 events.  Depending on the failure, there is
a window for messages to be lost.&lt;/p&gt;

&lt;p&gt;If you absolutely cannot risk losing messages, make sure you investigate all the failure modes and
whether the plugins you are using are implemented correctly to handle them.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Update: &lt;a href=&#34;https://logstash.jira.com/browse/LOGSTASH-1631&#34;&gt;LOGSTASH-1631&lt;/a&gt; is a bug that
demonstrates one way messages can be lost. It appears the internal messaging is going to be
replaced with a more reliable implementation in the future.&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&#34;conclusion:d94954b9a471ac6eb86d0c66c558309a&#34;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Both Logstash and Fluentd are viable centralized logging frameworks that can transfer logs from
multiple hosts to a central location.  Logstash is incredibly flexible with many input and output
plugins whereas fluentd provides fewer input and output sources but multiple options
for reliable and robust transport.&lt;/p&gt;
</description>
        </item>
        
        <item>
            <title>Centralized Logging Architecture</title>
            <link>http://jasonwilder.com/blog/2013/07/16/centralized-logging-architecture/</link>
            <pubDate>Tue, 16 Jul 2013 00:00:00 UTC</pubDate>
            
            <guid>http://jasonwilder.com/blog/2013/07/16/centralized-logging-architecture/</guid>
            <description>

&lt;p&gt;In &lt;a href=&#34;http://jasonwilder.com/blog/2012/01/03/centralized-logging/&#34;&gt;Centralized Logging&lt;/a&gt;, I covered a few tools that help
with the problem of centralized logging.  Many of these tools address only a portion of the problem
which means you need to use several of them together to build a robust solution.&lt;/p&gt;

&lt;p&gt;The main aspects you will need to address are: &lt;em&gt;collection, transport, storage&lt;/em&gt;, and &lt;em&gt;analysis&lt;/em&gt;.  In some special cases, you may
also want to have an &lt;em&gt;alerting&lt;/em&gt; capability as well.&lt;/p&gt;

&lt;h3 id=&#34;collection:a45c0fcecce9d83887cd906ffc8b7137&#34;&gt;Collection&lt;/h3&gt;

&lt;p&gt;Applications create logs in different ways, some log through syslog, others log directly to files.  If you
consider a typical web application running on a Linux host, there will be a dozen or more log files
in /var/log as well as a few application specific logs in home directories or other locations.&lt;/p&gt;

&lt;p&gt;If you are supporting a web based application and your developers or operations staff need access to log data
quickly in order to troubleshoot live issues, you need a solution that is able to monitor changes to log files in
near real-time.  If you are using a file replication based approach where files are replicated to a central server
on a fixed schedule, then you can only inspect logs as frequently as the replication runs.  A one minute rsync cron
job might not be fast enough when your site is down and you are waiting for the relevant log data to be replicated.&lt;/p&gt;
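
&lt;p&gt;For reference, the fixed-schedule approach amounts to something like this cron entry; the paths and host are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;* * * * * rsync -az /var/log/ logs.example.com:/srv/logs/$(hostname)/
&lt;/code&gt;&lt;/pre&gt;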

&lt;p&gt;On the other hand, if you need to analyze log data offline for calculating metrics or other batch related work,
a file replication strategy might be a good fit.&lt;/p&gt;

&lt;h3 id=&#34;transport:a45c0fcecce9d83887cd906ffc8b7137&#34;&gt;Transport&lt;/h3&gt;

&lt;p&gt;Log data can accumulate quickly on multiple hosts.  Transporting it reliably and quickly to your centralized
location may need additional tooling in order to effectively transmit it and ensure data is not lost.&lt;/p&gt;

&lt;p&gt;Frameworks such as &lt;a href=&#34;https://github.com/facebook/scribe&#34;&gt;Scribe&lt;/a&gt;, &lt;a href=&#34;http://flume.apache.org/&#34;&gt;Flume&lt;/a&gt;,
&lt;a href=&#34;https://github.com/mozilla-services/heka&#34;&gt;Heka&lt;/a&gt;, &lt;a href=&#34;http://logstash.net/&#34;&gt;Logstash&lt;/a&gt;,
&lt;a href=&#34;http://incubator.apache.org/chukwa/&#34;&gt;Chukwa&lt;/a&gt;, &lt;a href=&#34;http://fluentd.org/&#34;&gt;fluentd&lt;/a&gt;,
&lt;a href=&#34;https://github.com/bitly/nsq&#34;&gt;nsq&lt;/a&gt; and &lt;a href=&#34;http://kafka.apache.org/&#34;&gt;Kafka&lt;/a&gt; are designed for transporting large
volumes of data from one host to another reliably.  Although each of these frameworks addresses the transport
problem, they do so quite differently.&lt;/p&gt;

&lt;p&gt;For example, &lt;a href=&#34;https://github.com/facebook/scribe&#34;&gt;Scribe&lt;/a&gt;, &lt;a href=&#34;https://github.com/bitly/nsq&#34;&gt;nsq&lt;/a&gt;
and &lt;a href=&#34;http://kafka.apache.org/&#34;&gt;Kafka&lt;/a&gt; require clients to log data via their APIs.  Typically, application
code is written to log directly to these sources, which reduces latency and improves reliability.  If
you want to centralize typical log file data, you would need something to tail the files and stream the lines via their respective APIs.
If you control the app that is logging the data you want to collect, these can be much more efficient.&lt;/p&gt;
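
&lt;p&gt;As a rough sketch of what &amp;ldquo;something to tail and stream&amp;rdquo; means, here is a naive shell loop that
pushes each new log line to nsqd&amp;rsquo;s HTTP publish endpoint.  The hostname and topic are hypothetical, and a real
deployment would use a proper agent with batching and retries:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# follow the log and publish each line to an nsqd topic over HTTP
$ tail -F /var/log/app/access.log | while read line; do
    curl -s -o /dev/null -d &#34;$line&#34; &#34;http://nsqd.example.com:4151/pub?topic=logs&#34;
  done
&lt;/code&gt;&lt;/pre&gt;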

&lt;p&gt;&lt;a href=&#34;http://logstash.net/&#34;&gt;Logstash&lt;/a&gt;, &lt;a href=&#34;https://github.com/mozilla-services/heka&#34;&gt;Heka&lt;/a&gt;,
&lt;a href=&#34;http://fluentd.org/&#34;&gt;fluentd&lt;/a&gt; and &lt;a href=&#34;http://flume.apache.org/&#34;&gt;Flume&lt;/a&gt; provide a number of input
sources and also support tailing files natively and transporting them reliably.  These are a better fit for
general log collection.&lt;/p&gt;
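
&lt;p&gt;For example, a fluentd agent can tail a file and forward each line to a central collector with a few lines of
configuration.  This is a hedged sketch against the classic config syntax; the paths, tag and host are hypothetical:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# tail the application log and remember the read position across restarts
&amp;lt;source&amp;gt;
  type tail
  path /var/log/app/access.log
  pos_file /var/log/fluentd/access.log.pos
  tag app.access
  format none
&amp;lt;/source&amp;gt;

# forward everything tagged app.* to the central collector
&amp;lt;match app.**&amp;gt;
  type forward
  &amp;lt;server&amp;gt;
    host logs.example.com
    port 24224
  &amp;lt;/server&amp;gt;
&amp;lt;/match&amp;gt;
&lt;/code&gt;&lt;/pre&gt;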

&lt;p&gt;While &lt;a href=&#34;http://rsyslog.com/&#34;&gt;rsyslog&lt;/a&gt; and &lt;a href=&#34;http://www.balabit.com/network-security/syslog-ng&#34;&gt;Syslog-ng&lt;/a&gt;
are typically thought of as the de facto log collectors, not all applications use syslog.&lt;/p&gt;

&lt;h3 id=&#34;storage:a45c0fcecce9d83887cd906ffc8b7137&#34;&gt;Storage&lt;/h3&gt;

&lt;p&gt;Now that your log data is being transferred, it needs a destination.  Your centralized storage system needs
to be able to handle the growth in data over time.  Each day adds an amount of storage proportional to the
number of hosts and processes that are generating log data.&lt;/p&gt;

&lt;p&gt;How you store logs depends on a few factors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;How long should it be stored&lt;/em&gt; - If the logs are for long-term, archival purposes and do not require immediate
analysis, &lt;a href=&#34;http://aws.amazon.com/s3/&#34;&gt;S3&lt;/a&gt;, &lt;a href=&#34;http://aws.amazon.com/glacier/&#34;&gt;AWS Glacier&lt;/a&gt;,
or tape backup might be a suitable option since they provide relatively low cost for
large volumes of data.  If you only need a few days&amp;rsquo; or months&amp;rsquo; worth of logs, storing them in a distributed
storage system such as &lt;a href=&#34;http://hadoop.apache.org/docs/stable/hdfs_design.html&#34;&gt;HDFS&lt;/a&gt;,
&lt;a href=&#34;http://cassandra.apache.org/&#34;&gt;Cassandra&lt;/a&gt;, &lt;a href=&#34;http://www.mongodb.org/&#34;&gt;MongoDB&lt;/a&gt; or
&lt;a href=&#34;http://elasticsearch.org&#34;&gt;ElasticSearch&lt;/a&gt; also works well.  If you only need a few
hours&amp;rsquo; worth of retention for real-time analysis, &lt;a href=&#34;http://redis.io&#34;&gt;Redis&lt;/a&gt; might work as well.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;em&gt;Your environment&amp;rsquo;s data volume&lt;/em&gt; - A day&amp;rsquo;s worth of logs for Google is much different than a day&amp;rsquo;s worth of logs for
ACME Fishing Supplies.  The storage system you choose should allow you to scale out horizontally if your
data volume will be large.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;em&gt;How will you need to access the logs&lt;/em&gt; - Some storage is not suitable for real-time or even batch analysis.
AWS Glacier or tape backup can take hours to load a file.  These don&amp;rsquo;t work if you need log access for production
troubleshooting.  If you plan to do more interactive data analysis,
storing log data in &lt;a href=&#34;http://elasticsearch.org&#34;&gt;ElasticSearch&lt;/a&gt; or &lt;a href=&#34;http://hadoop.apache.org/docs/stable/hdfs_design.html&#34;&gt;HDFS&lt;/a&gt;
may allow you to work with the raw data more effectively.  Some log data is
so large that it can only be analyzed in batch-oriented frameworks.  The de facto standard in this case is &lt;a href=&#34;http://hadoop.apache.org/&#34;&gt;Apache Hadoop&lt;/a&gt; along
with &lt;a href=&#34;http://hadoop.apache.org/docs/stable/hdfs_design.html&#34;&gt;HDFS&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&#34;analysis:a45c0fcecce9d83887cd906ffc8b7137&#34;&gt;Analysis&lt;/h3&gt;

&lt;p&gt;Once your logs are stored on a centralized storage platform, you need a way to analyze them.  The most common approach
is a batch-oriented process that runs periodically.  If you are storing log data in
&lt;a href=&#34;http://hadoop.apache.org/docs/stable/hdfs_design.html&#34;&gt;HDFS&lt;/a&gt;, &lt;a href=&#34;http://hive.apache.org/&#34;&gt;Hive&lt;/a&gt;
or &lt;a href=&#34;http://pig.apache.org/&#34;&gt;Pig&lt;/a&gt; can make
analyzing the data easier than writing native MapReduce jobs.&lt;/p&gt;
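
&lt;p&gt;As a small, hedged example of the kind of batch query this enables, the following assumes a hypothetical Hive
table named access_logs with a status column parsed out of web logs:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# count requests per HTTP status code across all collected web logs
$ hive -e &#34;SELECT status, COUNT(*) FROM access_logs GROUP BY status;&#34;
&lt;/code&gt;&lt;/pre&gt;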

&lt;p&gt;If you need a UI for analysis, you can store parsed log data in &lt;a href=&#34;http://elasticsearch.org&#34;&gt;ElasticSearch&lt;/a&gt;
and use a front-end such as &lt;a href=&#34;http://kibana.org/&#34;&gt;Kibana&lt;/a&gt; or &lt;a href=&#34;http://graylog2.org/&#34;&gt;Graylog2&lt;/a&gt;
to query and inspect the data.  The log parsing can be handled by &lt;a href=&#34;http://logstash.net/&#34;&gt;Logstash&lt;/a&gt;,
&lt;a href=&#34;https://github.com/mozilla-services/heka&#34;&gt;Heka&lt;/a&gt; or applications logging JSON
directly.  This approach allows more real-time, interactive access to the data but is not really suited for
mass batch processing.&lt;/p&gt;
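
&lt;p&gt;To make the parsing step concrete, here is a hedged sketch of a Logstash configuration that tails a web log,
parses each line with grok and indexes the result into ElasticSearch.  The file path and host are hypothetical:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;input {
  # follow the web server access log
  file {
    path =&amp;gt; &#34;/var/log/nginx/access.log&#34;
  }
}

filter {
  # parse each line into structured fields (status, bytes, etc.)
  grok {
    match =&amp;gt; [ &#34;message&#34;, &#34;%{COMBINEDAPACHELOG}&#34; ]
  }
}

output {
  # index parsed events so Kibana can query them
  elasticsearch {
    host =&amp;gt; &#34;es.example.com&#34;
  }
}
&lt;/code&gt;&lt;/pre&gt;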

&lt;h3 id=&#34;alerting:a45c0fcecce9d83887cd906ffc8b7137&#34;&gt;Alerting&lt;/h3&gt;

&lt;p&gt;The last component that is sometimes nice to have is the ability to alert on log patterns or calculated metrics
based on log data.  Two common uses for this are error reporting and monitoring.&lt;/p&gt;

&lt;p&gt;Most log data is not interesting, but errors almost always indicate a problem.  It&amp;rsquo;s much more effective
to have the logging system email or notify the respective parties when errors occur instead of having someone repeatedly
watch for the events.  There are several services that solely provide application error logging, such as
&lt;a href=&#34;https://www.getsentry.com/&#34;&gt;Sentry&lt;/a&gt; or &lt;a href=&#34;https://www.honeybadger.io/&#34;&gt;HoneyBadger&lt;/a&gt;.  These can also
aggregate repetitive exceptions, which can give you an idea of how frequently an error is occurring.&lt;/p&gt;

&lt;p&gt;Another use case is monitoring.  For example, you may have hundreds of web servers and want to know if they start
returning 500 status codes.  If you can parse your web log files and record a metric on the status code, you can then trigger
alerts when that metric crosses a certain threshold.   &lt;a href=&#34;http://riemann.io&#34;&gt;Riemann&lt;/a&gt; is designed for detecting
scenarios just like this.&lt;/p&gt;
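
&lt;p&gt;A hedged sketch of what that looks like in a Riemann config: assuming an upstream parser emits an event with a
hypothetical service name of nginx.5xx.rate, the stream below emails on-call when the rate crosses a threshold:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;; riemann.config (Clojure) - addresses and thresholds are made up for illustration
(def email (mailer {:from &#34;riemann@example.com&#34;}))

(streams
  ; alert when the 500-rate metric exceeds 10
  (where (and (service &#34;nginx.5xx.rate&#34;)
              (&amp;gt; metric 10))
    (email &#34;oncall@example.com&#34;)))
&lt;/code&gt;&lt;/pre&gt;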

&lt;p&gt;Hopefully this helps provide a basic model for designing a centralized logging solution for your environment.&lt;/p&gt;
</description>
        </item>
        
        <item>
            <title>Centralized Logging</title>
            <link>http://jasonwilder.com/blog/2012/01/03/centralized-logging/</link>
            <pubDate>Tue, 03 Jan 2012 00:00:00 UTC</pubDate>
            
            <guid>http://jasonwilder.com/blog/2012/01/03/centralized-logging/</guid>
            <description>

&lt;p&gt;Logs are a critical part of any system; they give you insight into what a system is doing as well as what
happened.  Virtually every process running on a system generates logs in some form or another.
Usually, these logs are written to files on local disks.  When your system grows to multiple hosts,
managing the logs and accessing them can get complicated.  Searching for a particular error across
hundreds of log files on hundreds of servers is difficult without good tools.  A common approach to
this problem is to set up a centralized logging solution so that multiple logs can be aggregated in
a central location.&lt;/p&gt;

&lt;p&gt;So what are your options?&lt;/p&gt;

&lt;h3 id=&#34;file-replication:f034079baf6de821e361a928dc85da81&#34;&gt;File Replication&lt;/h3&gt;

&lt;p&gt;A simple approach is to set up file replication of your logs to a central server on a cron schedule.  Usually rsync and
cron are used since they are simple and straightforward to set up.  This solution can work for a while, but it doesn&amp;rsquo;t
provide timely access to log data.  It also doesn&amp;rsquo;t aggregate the logs; it only co-locates them.&lt;/p&gt;

&lt;h3 id=&#34;syslog:f034079baf6de821e361a928dc85da81&#34;&gt;Syslog&lt;/h3&gt;

&lt;p&gt;Another option that you probably already have installed is &lt;a href=&#34;http://en.wikipedia.org/wiki/Syslog&#34;&gt;syslog&lt;/a&gt;.
Most people use &lt;a href=&#34;http://rsyslog.com/&#34;&gt;rsyslog&lt;/a&gt; or &lt;a href=&#34;http://www.balabit.com/network-security/syslog-ng&#34;&gt;
syslog-ng&lt;/a&gt;, which are two syslog implementations.  These daemons allow processes to send log messages to them, and the
syslog configuration determines how they are stored.  In a centralized logging setup, a central syslog daemon is set up
on your network and the client logging daemons are configured to forward messages to the central daemon.  A good write-up
of this kind of setup can be found at:
&lt;a href=&#34;http://urbanairship.com/blog/2010/10/05/centralized-logging-using-rsyslog/&#34;&gt;Centralized Logging Using Rsyslog&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Syslog is great because just about everything uses it and you likely already have it installed on your system.  With a
central syslog server, you will likely need to figure out how to scale the server and make it highly available.&lt;/p&gt;
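
&lt;p&gt;As a hedged sketch of the client side, rsyslog forwarding is typically a one-line change; the central hostname is
hypothetical, and the matching server needs a TCP (or UDP) listener enabled:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# client /etc/rsyslog.conf: forward everything to the central server over TCP (@@ = TCP, @ = UDP)
*.*  @@logs.example.com:514

# central server /etc/rsyslog.conf: accept TCP syslog on port 514
$ModLoad imtcp
$InputTCPServerRun 514
&lt;/code&gt;&lt;/pre&gt;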

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;http://www.balabit.com/network-security/syslog-ng&#34;&gt;syslog-ng&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;http://urbanairship.com/blog/2010/10/05/centralized-logging-using-rsyslog/&#34;&gt;rsyslog&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&#34;distributed-log-collectors:f034079baf6de821e361a928dc85da81&#34;&gt;Distributed Log Collectors&lt;/h3&gt;

&lt;p&gt;A new class of solutions has emerged, designed for
high-volume and high-throughput log and event collection.  Most of these solutions are more general-purpose
event streaming and processing systems, and logging is just one use case that can be solved using them.
All of them have their specific features and differences, but their architectures are fairly similar.
They generally consist of logging clients and/or agents on each specific host.  The agents forward logs to a
cluster of collectors which in turn forward the messages to a scalable storage tier.  The idea is that the
collection tier is horizontally scalable to grow with the increased number of logging hosts and messages.  Similarly,
the storage tier is also intended to scale horizontally to grow with increased volume.  This is a gross simplification
of all of these tools, but they are a step beyond traditional syslog options.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href=&#34;https://github.com/facebook/Scribe&#34;&gt;Scribe&lt;/a&gt;&lt;/strong&gt; - Scribe is a scalable and reliable log aggregation server
used and released by Facebook as open source.  Scribe is written in C++ and uses &lt;a href=&#34;http://thrift.apache.org/&#34;&gt;Thrift&lt;/a&gt;
for the protocol encoding.  Since it uses Thrift, virtually any language can work with it.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href=&#34;https://cwiki.apache.org/FLUME/&#34;&gt;Flume&lt;/a&gt;&lt;/strong&gt; - Flume is an Apache project for collecting,
aggregating, and moving large amounts of log data.  It stores all this data on
&lt;a href=&#34;http://hadoop.apache.org/hdfs/&#34;&gt;HDFS&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href=&#34;http://logstash.net&#34;&gt;logstash&lt;/a&gt;&lt;/strong&gt; - logstash lets you ship, parse and index logs from any source.  It
works by defining inputs (files, syslog, etc.), filters (grep, split, multiline, etc..) and outputs (elasticsearch,
mongodb, etc..).  It also provides a UI for accessing and searching your logs.  See
&lt;a href=&#34;http://logstash.net/docs/1.0.17/getting-started-centralized&#34;&gt;Getting Started&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href=&#34;http://wiki.apache.org/hadoop/Chukwa&#34;&gt;Chukwa&lt;/a&gt;&lt;/strong&gt; - Chukwa is another Apache project that collects
logs onto &lt;a href=&#34;http://hadoop.apache.org/hdfs/&#34;&gt;HDFS&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href=&#34;http://fluentd.org/doc/&#34;&gt;fluentd&lt;/a&gt;&lt;/strong&gt; - Fluentd is similar to logstash in that there are inputs and
outputs for a large variety of sources and destinations.  Some of its design tenets are easy installation and a small
footprint.  It doesn&amp;rsquo;t provide any storage tier itself but allows you to easily configure where your logs should be
collected.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href=&#34;http://incubator.apache.org/kafka/&#34;&gt;kafka&lt;/a&gt;&lt;/strong&gt; - Kafka was developed at LinkedIn for their activity stream
processing and is now an Apache incubator project.  Although Kafka could be used for log collection, this is not its
primary use case.  Setup requires &lt;a href=&#34;http://zookeeper.apache.org/&#34;&gt;Zookeeper&lt;/a&gt;
to manage the cluster state.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href=&#34;http://graylog2.org/&#34;&gt;Graylog2&lt;/a&gt;&lt;/strong&gt; - Graylog2 provides a UI for searching and analyzing logs.  Logs are
stored in &lt;a href=&#34;http://www.mongodb.org&#34;&gt;MongoDB&lt;/a&gt; and/or &lt;a href=&#34;http://elasticsearch.org&#34;&gt;elasticsearch&lt;/a&gt;.
Graylog2 also provides the &lt;a href=&#34;http://graylog2.org/about/gelf&#34;&gt;GELF&lt;/a&gt; logging format to overcome some issues
with syslog messages: the 1024-byte limit and unstructured log messages.  If you are logging long stacktraces,
you may want to look into GELF (see the sketch after this list).&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href=&#34;http://www.splunk.com&#34;&gt;splunk&lt;/a&gt;&lt;/strong&gt; - Splunk is a commercial product that has been around for several years.
It provides a whole host of features for not only collecting logs but also analyzing and viewing them.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
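
&lt;p&gt;For a sense of what GELF looks like on the wire, here is a hedged sketch of a single message; the field values are
made up, and fields prefixed with an underscore are user-defined additions:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  &#34;version&#34;: &#34;1.0&#34;,
  &#34;host&#34;: &#34;web1.example.com&#34;,
  &#34;short_message&#34;: &#34;NullPointerException in OrderService&#34;,
  &#34;full_message&#34;: &#34;java.lang.NullPointerException\n  at com.example.OrderService...&#34;,
  &#34;timestamp&#34;: 1325568000,
  &#34;level&#34;: 3,
  &#34;facility&#34;: &#34;app&#34;,
  &#34;_request_id&#34;: &#34;a1b2c3&#34;
}
&lt;/code&gt;&lt;/pre&gt;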

&lt;p&gt;&lt;em&gt;Update: I wrote a post comparing &lt;a href=&#34;http://jasonwilder.com/blog/2013/11/19/fluentd-vs-logstash/&#34;&gt;Fluentd vs Logstash&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h3 id=&#34;hosted-logging-services:f034079baf6de821e361a928dc85da81&#34;&gt;Hosted Logging Services&lt;/h3&gt;

&lt;p&gt;There are also several hosted &amp;ldquo;logging as a service&amp;rdquo; providers.  The benefit of them is that you only need
to configure your syslog forwarders or agents, and they manage the collection, storage and access to the logs.  All of
the infrastructure that you would otherwise have to set up and maintain is handled by them, freeing you up to focus on your application.
Each service provides a simple setup (usually syslog-forwarding based), an API and a UI to support search and analysis.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;http://loggly.com/&#34;&gt;loggly&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;http://papertrailapp.com&#34;&gt;papertrail&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://logentries.com/&#34;&gt;logentries&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I go into more detail how all of these fit together in
&lt;a href=&#34;http://jasonwilder.com/blog/2013/07/16/centralized-logging-architecture/&#34;&gt;Centralized Logging Architecture&lt;/a&gt;.&lt;/p&gt;
</description>
        </item>
        
    </channel>
</rss>
