
Apache Avro

Key idea:

Apache Avro — a row-oriented binary data format developed at the Apache Software Foundation (2009). A foundation for serialization in Kafka and other streaming systems. Key features: schema-first design (schemas defined in JSON) and schema evolution (fields can be added or removed backward-compatibly). Compact wire format, fast serialize/deserialize. Used by: Confluent's Kafka Schema Registry, Apache Pulsar, Airbyte.

Below: details, example, related terms, FAQ.


Details

  • Row-oriented (vs Parquet columnar) — better for streaming, individual row writes
  • Schema: JSON-defined, embedded in each Avro file's header; Kafka messages typically carry only a short schema ID (resolved via Schema Registry) rather than the full schema
  • Schema Registry: shared central schemas for Kafka (Confluent standard)
  • Evolution: add optional fields backward-compatible
  • Languages: Java, Python, C++, Go, Rust, JavaScript

Example

{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "email", "type": "string"},
    {"name": "age", "type": ["null", "int"], "default": null}
  ]
}

# Python (fastavro)
import fastavro

# parse the User schema defined above
schema = fastavro.parse_schema({
    "type": "record", "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "email", "type": "string"},
        {"name": "age", "type": ["null", "int"], "default": None},
    ],
})
records = [{"id": 1, "email": "alice@example.com", "age": None}]
with open('users.avro', 'wb') as out:
    fastavro.writer(out, schema, records)


Frequently Asked Questions

Avro vs Protobuf?

Avro keeps the schema with the data (in the file header, or referenced by a registry ID) and resolves it dynamically at read time. Protobuf compiles the schema into generated code. Protobuf offers stronger compile-time type safety; Avro handles streaming pipelines with frequently changing schemas more gracefully.

When Avro vs Parquet?

Avro: streaming (Kafka), one message = one record, efficient row-at-a-time writes. Parquet: batch analytics, columnar scans and compression. Complementary — a common pattern is ingesting as Avro, then storing as Parquet for analytics.

Need Schema Registry?

For production Kafka — yes. It enforces schema-compatibility rules and stops breaking changes before they reach consumers. Run it via Confluent Cloud or self-host it.