The benefits of Apache Avro for your data ecosystem

Posted on Apr 15, 2020

Data formats play an important role in a data ecosystem, both for processing performance and for data exploitability. While columnar formats like Parquet and ORC are often the right choice for storing huge amounts of analytical data that needs to be queried efficiently, they're not the best fit for moving data between systems and processing it record by record.

In this post I will talk about the benefits of Apache Avro as the format for data in transit, and how Avro schemas can become the central source of truth for your data's structure, metadata and documentation.

Introduction to Avro

Apache Avro is an open-source, schema-based data serialization library that allows systems to exchange data in a binary format. A schema is simply a JSON document defining the structure of a given type of data: which fields it contains, their types, whether they're nullable, their default values, and so on. Schemas support a wide range of primitive and complex data types.

For example, when working with clickstream data we might define a basic schema for Click events that looks like the following:

{
  "type": "record",
  "name": "ClickEvent",
  "namespace": "org.mycompany",
  "fields": [
    {
      "name": "user_id",
      "type": "string",
      "doc": "ID of the user initiating the click"
    },
    {
      "name": "page_id",
      "type": "int",
      "doc": "ID of the web page where the click occurred"
    }
  ]
}

An example click event in JSON format matching this schema would be:

{
    "user_id": "bob",
    "page_id": 12345
}

The backend service that publishes click data uses the schema to serialize each individual event to Avro format: if the event matches the schema, serialization succeeds and the output is the same event encoded as a compact binary payload. If, on the other hand, serialization fails (for instance because page_id is provided by mistake as a string instead of an integer), the event is blocked from entering the data ecosystem because it does not conform to its intended structure.

Conversely, any consumer of the click data can use the schema to deserialize the Avro event and transform it back into JSON format. Client libraries that manage serialization and deserialization are available for many programming languages (Java, C++, Python and PHP, among others).
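
To make the round trip concrete, here is a minimal sketch in Python using the fastavro library (an assumption on my part; the official avro package offers equivalent functionality):

import io
from fastavro import parse_schema, schemaless_writer, schemaless_reader

# The ClickEvent schema from above, expressed as a Python dict
click_schema = parse_schema({
    "type": "record",
    "name": "ClickEvent",
    "namespace": "org.mycompany",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "page_id", "type": "int"},
    ],
})

# Producer side: serialize the event to Avro binary
event = {"user_id": "bob", "page_id": 12345}
buffer = io.BytesIO()
schemaless_writer(buffer, click_schema, event)
payload = buffer.getvalue()  # a compact binary payload, ready to be published

# Consumer side: deserialize the binary payload back into a dict
decoded = schemaless_reader(io.BytesIO(payload), click_schema)
assert decoded == event

# An event that does not match the schema is rejected at serialization time
bad_event = {"user_id": "bob", "page_id": "not-an-int"}
try:
    schemaless_writer(io.BytesIO(), click_schema, bad_event)
except Exception as exc:  # the exact exception type depends on the library
    print(f"event rejected: {exc}")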

Schema evolution

One of the most powerful features of Avro is its support for schema evolution, which means updating a schema by adding, removing or changing fields in response to business needs. When evolving a schema, different levels of compatibility can be achieved and enforced:

  • forward compatibility: events written with the new schema can be read with the old schema
  • backward compatibility: events written with the old schema can be read with the new schema
  • full compatibility: the change is both backward and forward compatible

Achieving full compatibility requires constraints such as only adding new fields that have default values and never removing a field that lacks one. It is a really handy property to have, as it ensures that evolving a schema will not break any downstream processes.
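
For example, a fully compatible evolution of the ClickEvent schema above could add an optional referrer field (a hypothetical field, used here purely for illustration) with a default value:

{
  "type": "record",
  "name": "ClickEvent",
  "namespace": "org.mycompany",
  "fields": [
    {
      "name": "user_id",
      "type": "string",
      "doc": "ID of the user initiating the click"
    },
    {
      "name": "page_id",
      "type": "int",
      "doc": "ID of the web page where the click occurred"
    },
    {
      "name": "referrer",
      "type": ["null", "string"],
      "default": null,
      "doc": "URL of the referring page, if any"
    }
  ]
}

Events written with the old schema can still be read with this schema (the default kicks in for the missing field), and events written with this schema can still be read with the old one (the extra field is simply ignored).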

Benefits of using Avro

You might be wondering why you should go through all that trouble to serialize your data in the producer only to deserialize it back in the consumers. It turns out that working with Avro data brings a number of advantages to a data ecosystem.

Decoupling applications in a safe way

In a system of distributed microservices, applications communicate with each other through events carrying actionable information related to the business. This is a great way to decouple services with different concerns and scale them independently, but it adds complexity to the communication between applications.

For instance, imagine a microservice listening to our click events to keep track of the total number of clicks per page over time. When working with plain JSON events, if the producer were to change the structure of the click events in some way, it could break our consumer unexpectedly. Even worse, it could break any of the consumer applications listening to that specific event. A high degree of synchronization would be needed to update all these services at the same time in order to keep everything working without downtime, which is clearly an anti-pattern when working with microservices.

This is not the case when working with Avro and enforcing schema evolution with full (or at least forward) compatibility. Events emitted with a new (compatible) schema would never break the consumer applications, which could be updated independently when most convenient.
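
As a sketch of how this plays out in code (again assuming the fastavro library and the evolved schema from the previous section), a not-yet-upgraded consumer can keep using the old schema as its reader schema:

import io
from fastavro import parse_schema, schemaless_writer, schemaless_reader

old_schema = parse_schema({
    "type": "record",
    "name": "ClickEvent",
    "namespace": "org.mycompany",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "page_id", "type": "int"},
    ],
})

new_schema = parse_schema({
    "type": "record",
    "name": "ClickEvent",
    "namespace": "org.mycompany",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "page_id", "type": "int"},
        {"name": "referrer", "type": ["null", "string"], "default": None},
    ],
})

# The upgraded producer writes events with the new schema
buffer = io.BytesIO()
schemaless_writer(buffer, new_schema, {"user_id": "bob", "page_id": 12345, "referrer": None})

# The not-yet-upgraded consumer reads with the old schema as its reader schema.
# In practice the writer schema travels with the message or comes from a schema registry.
event = schemaless_reader(io.BytesIO(buffer.getvalue()), new_schema, old_schema)
print(event)  # {'user_id': 'bob', 'page_id': 12345} -- the extra field is dropped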

Data quality

Working with schemas ensures that data coming into the platform has the expected structure. This means that downstream consumers of that data, be it other software applications or humans, will know exactly what to expect. This can be especially useful for data analysts and data scientists trying to make sense of the data to build new products on top of it.

Performance

Avro data is in binary format, which is much more compact than plain formats like JSON. In my experience, an event in Avro format can be as little as 30% of the size of the same event in JSON. When working with huge amounts of data this adds up to a lot of saved network traffic (while data is in transit) and disk space (while data is at rest). It's true that there's an overhead in CPU usage to handle serialization and deserialization, but it is pretty contained thanks to Avro's efficiency.
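
As a rough illustration (again a sketch assuming fastavro; the actual savings depend heavily on the shape of your data), you can compare the two encodings yourself:

import io
import json
from fastavro import parse_schema, schemaless_writer

schema = parse_schema({
    "type": "record",
    "name": "ClickEvent",
    "namespace": "org.mycompany",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "page_id", "type": "int"},
    ],
})

event = {"user_id": "bob", "page_id": 12345}

json_size = len(json.dumps(event).encode("utf-8"))

buffer = io.BytesIO()
schemaless_writer(buffer, schema, event)
avro_size = len(buffer.getvalue())

# Avro omits the field names and encodes values compactly,
# so the binary payload is much smaller than the JSON text
print(f"JSON: {json_size} bytes, Avro: {avro_size} bytes")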

Documentation

As part of a schema, it is possible (and advisable) to add a doc attribute to each field explaining what that piece of information is supposed to represent. Furthermore, it is possible to specify a logical type to refine the field's type (for example, to state that a long field actually represents a timestamp in milliseconds). This is valuable information for data analysts who need to understand what the data means at a semantic level.
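
For example, a hypothetical clicked_at field on the ClickEvent schema could be declared with a logical type, making its semantics explicit:

{
  "name": "clicked_at",
  "type": { "type": "long", "logicalType": "timestamp-millis" },
  "doc": "Time at which the click occurred, in milliseconds since the Unix epoch"
}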

Data governance

The importance of metadata is often overlooked in a data ecosystem, but it actually plays a very important role. Knowing what data you have and what it represents is crucial to get the most out of it.

One of the most common kinds of metadata is tags: simple keywords attached to pieces of data. Tags can be used, for instance, to identify PII (Personally Identifiable Information) so that private user information can be properly anonymized in compliance with data privacy laws such as GDPR and CCPA.

From what we have seen so far, Avro schemas can help us define the structure of our data, but they don't give us any more information about what that data means or represents. Sure, the doc field lets us add a description of what is inside a specific field, but that's unstructured text, not easily exploitable by a computer program.

Structured, machine-readable metadata can be added by introducing custom attributes into the schema. Consider the following example:

{
  "type": "record",
  "name": "ClickEvent",
  "namespace": "org.mycompany",
  "fields": [
    {
      "name": "user_id",
      "type": "string",
      "doc": "ID of the user initiating the click",
      "tags": [ "USER_ID" ]
    },
    {
      "name": "page_id",
      "type": "int",
      "doc": "ID of the web page where the click occurred"
    },
    {
      "name": "location",
      "type": "string",
      "doc": "GPS coordinates of the user",
      "tags": [ "USER_GEOLOCATION" ]
    }
  ]
}

Here we are adding a custom attribute called tags to enrich fields with keywords. This information could be parsed from the schema (a custom library would be needed to do that) and used for a variety of business use cases.

Consider for example a data governance service which is in charge of anonymizing click data before passing it along to downstream processes. Tag-based anonymization rules could be used to define the action to apply to fields with specific tags, for instance:

  • USER_ID: hash (obfuscate the id passing it through a hash function)
  • USER_GEOLOCATION: round (transform the exact coordinates to an approximate location)

Combining tag-based rules with tagged Avro schemas would yield a simple yet powerful solution to this kind of use case.
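
Here is a minimal sketch of such a service in Python, assuming the tagged schema is available as a plain dict and that location is a "lat,lon" string; hash_id and round_coordinates are hypothetical implementations of the two rules:

import hashlib

def hash_id(value):
    # Obfuscate an identifier by passing it through a hash function
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def round_coordinates(value):
    # Transform exact "lat,lon" coordinates into an approximate location
    lat, lon = (float(part) for part in value.split(","))
    return f"{round(lat, 1)},{round(lon, 1)}"

# Tag-based anonymization rules: which action to apply to which tag
RULES = {
    "USER_ID": hash_id,
    "USER_GEOLOCATION": round_coordinates,
}

def anonymize(event, schema):
    # Apply the rule matching each field's tags, leaving other fields untouched
    result = dict(event)
    for field in schema["fields"]:
        for tag in field.get("tags", []):
            rule = RULES.get(tag)
            if rule and field["name"] in result:
                result[field["name"]] = rule(result[field["name"]])
    return result

click_schema = {
    "type": "record",
    "name": "ClickEvent",
    "namespace": "org.mycompany",
    "fields": [
        {"name": "user_id", "type": "string", "tags": ["USER_ID"]},
        {"name": "page_id", "type": "int"},
        {"name": "location", "type": "string", "tags": ["USER_GEOLOCATION"]},
    ],
}

event = {"user_id": "bob", "page_id": 12345, "location": "45.4642,9.1900"}
print(anonymize(event, click_schema))
# user_id is hashed, page_id is untouched, location becomes "45.5,9.2"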

Integration with other open-source technologies

Avro is very well integrated with the major open-source technologies for data processing and transport, both for real-time and batch use.

For instance, it has full support in the Hadoop ecosystem and it's available as an official data source in the Apache Spark framework for distributed data processing (here is a small contribution I made to improve the integration between Spark and schema evolution).
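
For example, reading Avro files into a DataFrame with PySpark looks roughly like this (a sketch: the package coordinates and the s3://my-bucket/clicks/ path are assumptions to adapt to your setup):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("click-analytics")
    # the Avro data source ships as a separate package; match it to your Spark version
    .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.0.0")
    .getOrCreate()
)

# Read Avro files into a DataFrame; the schema is derived from the Avro files themselves
clicks = spark.read.format("avro").load("s3://my-bucket/clicks/")
clicks.groupBy("page_id").count().show()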

It is also a first-class citizen in the Apache Kafka-based Confluent ecosystem, which provides a Schema Registry for managing Avro schemas and full support in Kafka Connect and KSQL to handle serialization and deserialization transparently for the user.

Disadvantages

As with any technology, there are also some drawbacks that should be considered:

  • since Avro messages are binary, they cannot be inspected directly: they look like gibberish when opened in a regular text editor. Tools are needed to deserialize them and inspect their content, which can be a bit of a pain when debugging issues
  • as mentioned before, CPU usage and latency can increase slightly due to the overhead of data serialization and deserialization
  • migrating a production data system to using Avro can be quite complex due not only to the migration of software components but also to the complexity of creating schemas for all the (usually many) events being produced

Conclusion

We have seen the main advantages and disadvantages of using Avro as the go-to format for structured data in transit. Although it introduces some development overhead, I believe that is a fair price to pay considering all the benefits in terms of data quality and data governance.