Deploying Jaeger with CloudFormation via Bazel

October 31, 2018

by Zachary DiCesare (@ZacharyDiCesare)

We run a fully AWS-based tech stack consisting of dozens of Elastic Beanstalk applications and ECS services. About a year ago, we began migrating some of our infrastructure to AWS CloudFormation, allowing us to manage it as code, under version control in our central repository.

Recently, we began a push to instrument our applications with distributed tracing. This workstream was a great exercise for our CloudFormation setup, and our experience will hopefully be helpful to others looking to deploy Jaeger on AWS.

CloudFormation and Bazel

CloudFormation is an AWS service that lets you manage your infrastructure as code. It does so via templates, defined as JSON or YAML, that describe the configuration of individual AWS resources. When a template is uploaded, AWS provisions the infrastructure it describes as a Stack. If you later change the template, the Stack is updated accordingly.
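
To make this concrete, here's a minimal template sketched as a Python dict; the resource name and properties are illustrative, not from our production templates:

import json

# A template declares Resources; {"Ref": ...} expresses references between
# resources (or, as here, from an output back to a resource).
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "ExampleRepo": {
            "Type": "AWS::ECR::Repository",
            "Properties": {"RepositoryName": "example-repo"},
        },
    },
    "Outputs": {
        "RepoName": {"Value": {"Ref": "ExampleRepo"}},
    },
}

print(json.dumps(template, indent=2))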

At Vistar, we use Bazel, the open source version of Google's build system Blaze, as our build tool. Nearly all of our projects (written in Go, Java, Scala, JavaScript, and Python) are built using Bazel.

Bazel gives us fast, reproducible builds and lets us easily share dependencies between projects and languages. It does so via rules, defined per language, that describe how a set of outputs is produced from a set of inputs.

Bazel also lets you define your own rules. We took advantage of this to write rules that generate CloudFormation templates, rather than managing JSON or YAML by hand.

rules_cloudformation

A CloudFormation template consists of AWS resources defined as JSON together with any references between them. For instance, the definition of an ECS service may reference an Application Load Balancer.

We began defining Bazel rules for each resource type, each taking the resource's possible attributes as inputs and emitting a JSON file as output.

For example, here's a small Bazel rule that generates CF JSON for an ECR repository:

def _ecr_repository_impl(ctx):
  # Default the repository name to the Bazel target name.
  repo_name = ctx.attr.repo_name or ctx.label.name

  props = struct(
    RepositoryName = repo_name,
  )
  resource = named_resource(ctx, "AWS::ECR::Repository", props)
  # Write the resource's JSON to this rule's declared output file.
  ctx.file_action(ctx.outputs.json, resource.to_json())

  # Expose the resource so downstream rules can aggregate it into a template.
  return struct(
    resources = [resource],
  )

ecr_repository = rule(
  _ecr_repository_impl,
  attrs = {
    'repo_name': attr.string(),
  },
  outputs = {'json': '%{name}.json'},
)
"""Reference: http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-ecr-repository.html"""

This rule takes repo_name as an attribute and uses the named_resource helper to generate an object in the format CF expects:

def named_resource(ctx, type, properties, opts = {}):
  """Create a named resource based on the rule label"""
  return struct(
    name = ctx.label.name,
    resource = struct(
      Type = type,
      Properties = properties,
      **opts
    ),
  )

Then, in a BUILD file, we can say:

ecr_repository(
  name = 'example',
  repo_name = 'example-repo'
)

We can pass this resulting definition (a Bazel target) to other CF rules that take an ECR repository. The final CF template will contain all resources linked with their dependencies.
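
For illustration, such an aggregation rule might look roughly like the sketch below; cf_template, its deps attribute, and the merging logic are simplified stand-ins rather than our exact production rule:

def _cf_template_impl(ctx):
  # Collect the `resources` provider from each dependency and merge them
  # into a single template. CloudFormation logical IDs must be alphanumeric,
  # so rule names are assumed to be as well.
  resources = {}
  for dep in ctx.attr.deps:
    for r in dep.resources:
      resources[r.name] = r.resource
  template = struct(
    AWSTemplateFormatVersion = "2010-09-09",
    Resources = struct(**resources),
  )
  ctx.file_action(ctx.outputs.json, template.to_json())

cf_template = rule(
  _cf_template_impl,
  attrs = {'deps': attr.label_list()},
  outputs = {'json': '%{name}.json'},
)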

Additionally, we wrote Go binaries to support validating templates and uploading them to AWS.
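
Our tooling is in Go, but the flow is easy to sketch with boto3 in Python; the stack name and template path here are placeholders:

import boto3

cf = boto3.client("cloudformation")

with open("jaeger.json") as f:
    body = f.read()

# Fail fast on malformed templates before attempting a deploy.
cf.validate_template(TemplateBody=body)

# The first deploy creates the stack; later deploys would call update_stack.
cf.create_stack(StackName="jaeger", TemplateBody=body)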

The result is that we can define various pieces of infrastructure in our BUILD files, automatically aggregate them into a CF template, and create or update that stack during deploys.

Jaeger Tracing

As our stack continued to expand, we felt the need for greater visibility into the different components of our systems. Jaeger is an OpenTracing (OT)-compliant distributed tracing system hosted by the CNCF, consisting of client libraries in various languages as well as infrastructure for collecting and visualizing traces.

Initially, we experimented with using AWS X-Ray as the storage backend and trace visualizer while using OpenTracing via Jaeger for instrumentation. We eventually decided to go 'full Jaeger' as the complexity of converting from OT to X-Ray's proprietary format grew.

Deploying the Jaeger infrastructure to AWS provided a solid, complete use case for our new CF rules. Jaeger consists of several distinct components:

- client libraries, which instrument applications and emit spans
- the agent, a daemon that runs alongside applications and forwards spans to the collector
- the collector, which receives spans and writes them to storage
- a pluggable storage backend, such as Cassandra or Elasticsearch
- the query service and UI, for retrieving and visualizing traces

Jaeger has great support for deploying this infrastructure with container orchestration tools like Kubernetes and OpenShift. Deploying on AWS is mostly straightforward and requires just a little legwork.

Our plan:

- use AWS's managed Elasticsearch service as the storage backend
- run the collector on ECS behind an internal Network Load Balancer
- run the query service on ECS behind an Application Load Balancer
- run the agent as a daemon-scheduled ECS service, and package the agent binary into our Elastic Beanstalk application images

We began by adding new CF primitives in Bazel for Amazon's Elasticsearch Service, allowing us to define a domain directly, like so:

es_cluster(
    name = "tracing-cluster",
    allow_explicit_index = True,
    cache_size = 40,
    elasticsearch_version = "6.2",
    encrypted_at_rest = True,
    instance_count = 2,
    instance_type = "m4.large.elasticsearch",
    record_set_name = "elasticsearch.somedomain.com",
    ttl = "60",
    volume_size = 100,
    volume_type = "gp2",
)
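
Under the hood, the rule emits an AWS::Elasticsearch::Domain resource. A simplified sketch of the implementation, with the DNS record and several attributes trimmed (and not our exact production rule):

def _es_cluster_impl(ctx):
  props = struct(
    ElasticsearchVersion = ctx.attr.elasticsearch_version,
    ElasticsearchClusterConfig = struct(
      InstanceCount = ctx.attr.instance_count,
      InstanceType = ctx.attr.instance_type,
    ),
    EBSOptions = struct(
      EBSEnabled = True,
      VolumeSize = ctx.attr.volume_size,
      VolumeType = ctx.attr.volume_type,
    ),
    EncryptionAtRestOptions = struct(
      Enabled = ctx.attr.encrypted_at_rest,
    ),
  )
  resource = named_resource(ctx, "AWS::Elasticsearch::Domain", props)
  ctx.file_action(ctx.outputs.json, resource.to_json())
  return struct(resources = [resource])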

We already supported ECS services, but needed some small tweaks to allow configuring an ECS service with a Network Load Balancer instead of an Application Load Balancer for the collector. This was necessary because the agents communicate with the collector over TCP, while ALBs only support HTTP and HTTPS.

The result:

service(
    name = "jaeger-collector",
    container_port = 14267,
    cpu_share = 15,
    # We want tasks in at least two AZs
    desired_count = 2,
    domain_names = ["collector.somedomain.com"],
    image = ":collector",
    memory = 256,
    private = True,
    protocol = "TCP",
)

service(
    name = "jaeger-query",
    container_port = 8080,
    cpu_share = 10,
    desired_count = 1,
    domain_names = ["tracing.somedomain.com"],
    image = ":query",
    memory = 1024,
    policies = [":jaeger-query-policy"],
)

Each of these targets registers an ECS service with an associated load balancer, a DNS record, an ECR repository with the Jaeger images, and any needed IAM policies. The image attributes reference the official Jaeger Docker images.

The collector service and ES cluster run within a VPC. The collector is fronted by an internal NLB, while the query service (behind OAuth) uses an ALB.

For the agents, we needed to support applications running both on Elastic Beanstalk and on ECS. AWS recently released CF support for daemon-scheduled services, so we were able to run the Jaeger agent as an ECS service as well.

ecs_container_port_mapping(
    name = "jaeger-agent-port",
    container_port = 6831,
    host_port = 6831,
    protocol = "udp",
)

service(
    name = "jaeger-agent",
    cpu_share = 15,
    image = ":agent",
    memory = 256,
    port_mappings = [":jaeger-agent-port"],
    scheduling_strategy = "DAEMON",
)

For Elastic Beanstalk, we packaged the compiled agent binary into our application images, so the agent and the application run together in a single container.
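
With rules_docker, that packaging can be sketched like so; the labels and the run.sh entrypoint are hypothetical, and our actual images differ in the details:

load("@io_bazel_rules_docker//container:container.bzl", "container_image")

container_image(
    name = "app-with-agent",
    # The application's existing image (hypothetical label).
    base = ":app-image",
    files = [
        # Prebuilt jaeger-agent binary (hypothetical label).
        "@com_github_jaegertracing_jaeger//cmd/agent",
        # Starts jaeger-agent in the background, then execs the application.
        "run.sh",
    ],
    entrypoint = ["/run.sh"],
)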

Altogether, we're able to generate a CF template for a Jaeger deployment consisting of the collector, storage backend, agent, and query service.

Problems

We encountered a handful of tricky spots getting Jaeger running on AWS.

Our collector service runs in a VPC and is fronted by an NLB. Internal NLBs face some limitations on AWS, one of which is that they don't support loopback: an instance cannot reach itself through an NLB it is registered with. For us, this means an agent cannot communicate with a collector running on the same ECS container instance. AWS says a fix is on their roadmap; in the meantime, we work around it by running collector tasks on two separate ECS container instances.
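
In CF terms, one way to express "run on separate instances" is a placement constraint on the collector's ECS service; a sketch of the relevant properties (not a full service definition):

# Spread the collector's two tasks across distinct container instances,
# per the workaround described above.
collector_props = struct(
    DesiredCount = 2,
    PlacementConstraints = [
        struct(Type = "distinctInstance"),
    ],
)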

Packaging the agent's binary directly into our application containers is also not ideal, as the Jaeger agent makes for a natural sidecar or daemon. Elastic Beanstalk applications can run in a multi-container configuration, which would allow the main application to report spans to a sidecar agent. Since we plan to move our Beanstalk applications to ECS, we haven't acted on this.

Conclusion

We're continuing to expand the instrumentation of our applications, but tracing with Jaeger has already become an invaluable tool for monitoring our services. We've used it to identify various performance degradations as well as for root-cause analysis during critical production incidents.

We've been running Jaeger in production in multiple mission-critical services with no issues. Additionally, the value provided by distributed tracing will only increase as we continue to grow our services and our team.

Our CloudFormation rules for Bazel allow us to more transparently manage our infrastructure, and adding a Jaeger deployment to them was straightforward. In this post, we gave just a high-level overview of these rules.

We're making a push to open source more of our work, which can be found on our GitHub. A high priority is to begin open sourcing our CloudFormation rules for Bazel.