Monday 24 April 2017

Avro Schema Registry


Sending the schema of the data along with the data every time (as we effectively do with JSON) is wasteful: it costs memory and network bandwidth. It’s smarter to send a small schema ID along with the data, which the other parties use to look up how the data is encoded.

On serialization: we contact the SR to register the Avro schema of the data we’re going to write (if it isn’t registered already) and get back a unique ID. We write this ID in the first bytes of the payload, then append the Avro-encoded data. Each schema has a unique ID, so many messages share the same schema ID.
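A minimal sketch of this framing, assuming the Confluent wire format (a magic byte 0 followed by the schema ID as a 4-byte big-endian integer); the schemaId parameter stands in for the ID returned by the registration call:

import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroFraming {
    // Frame an Avro-encoded record the way the Confluent serializer does:
    // [magic byte 0x0][4-byte schema ID][Avro binary payload]
    static byte[] serialize(GenericRecord record, Schema schema, int schemaId) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(0x0);                                             // magic byte
        out.write(ByteBuffer.allocate(4).putInt(schemaId).array()); // schema ID from the SR
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();
        return out.toByteArray(); // this becomes the Kafka message value
    }
}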

On deserialization: we read the first bytes of the payload to find the ID of the schema that was used to write the data. We contact the SR with this ID to fetch the schema if we don’t have it yet, parse it into an org.apache.avro.Schema, and read the data using the Avro API and this schema (or we can read with another compatible schema if we know it’s backward/forward compatible).
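The reverse direction under the same wire-format assumption; fetchSchema is a hypothetical helper standing in for the Schema Registry lookup (GET /schemas/ids/{id} on its REST API), whose result should be cached:

import java.nio.ByteBuffer;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;

public class AvroUnframing {
    static GenericRecord deserialize(byte[] payload) throws Exception {
        ByteBuffer buffer = ByteBuffer.wrap(payload);
        if (buffer.get() != 0x0) throw new IllegalArgumentException("Unknown magic byte");
        int schemaId = buffer.getInt();              // the 4-byte schema ID
        Schema writerSchema = fetchSchema(schemaId); // hypothetical SR lookup, to be cached
        // Use new GenericDatumReader<>(writerSchema, readerSchema) instead
        // to read with another compatible schema.
        GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(writerSchema);
        return reader.read(null, DecoderFactory.get()
                .binaryDecoder(payload, 5, payload.length - 5, null));
    }

    static Schema fetchSchema(int id) {
        throw new UnsupportedOperationException("hypothetical: GET /schemas/ids/{id} on the SR");
    }
}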

A subject represents a collection of compatible schemas (according to the configured compatibility rules) in the SR. The Schema Registry depends on ZooKeeper and looks for Kafka brokers; if it can’t find any, it won’t start.

By default, the client caches the schemas it has already seen, to avoid querying the HTTP endpoint every time. The schema compatibility validation is done by the Schema Registry itself according to its configured compatibility level (none, backward, forward, full).
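To make the compatibility levels concrete, here is a sketch with two hypothetical versions of a User schema: adding a field with a default value keeps the new schema backward compatible, i.e. it can still read data written with the old one:

import org.apache.avro.Schema;

public class CompatibilityExample {
    public static void main(String[] args) {
        // v1: the schema the existing data was written with
        Schema v1 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"}]}");

        // v2: adds a field WITH a default, so a reader using v2 can still
        // decode v1 data (the default fills the missing field). The SR
        // accepts this under the backward compatibility level; adding a
        // field without a default would be rejected.
        Schema v2 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":[\"null\",\"int\"],\"default\":null}]}");
    }
}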

A Kafka message is a key-value pair. Consequently, a topic can have two schemas: one for the key and one for the value (registered by default under the subjects <topic>-key and <topic>-value).
Avro fixes those issues:
  • Space and network efficiency thanks to a reduced payload.
  • Controlled schema evolution with compatibility enforcement.
  • Schema visibility, centralization, and reuse via the registry.
  • Compatibility with both Kafka and Hadoop.
Avro Serialization/Deserialization

Specific Avro classes mean that we use Avro’s code generation to generate the record class from the Avro schema file, then populate an instance and produce it to Kafka. The Avro Maven plugin is declared in the pom.xml file, and running  $ mvn clean package  generates the Java classes.
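A minimal producer sketch for the specific-record case; the generated User class, topic name, and addresses are assumptions, and the KafkaAvroSerializer takes care of registering the schema and prefixing the ID:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SpecificProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");           // assumed broker address
        props.put("schema.registry.url", "http://localhost:8081");  // assumed SR endpoint
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");

        // User is the class generated by the Avro Maven plugin from the schema file
        User user = User.newBuilder().setName("alice").build();

        try (KafkaProducer<String, User> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("users", user.getName(), user));
        }
    }
}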

Generic Avro, on the other hand, needs no code generation: only the Avro schema file is required, and we send GenericRecord instances.
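Building the equivalent GenericRecord, as a sketch against the same hypothetical User schema parsed at runtime; it goes through the same serializer and producer call as above:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;

public class GenericExample {
    public static void main(String[] args) {
        // Parse the schema at runtime instead of generating a class from it
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"}]}");

        GenericRecord user = new GenericRecordBuilder(schema)
            .set("name", "alice")
            .build();

        // Sent through KafkaAvroSerializer exactly like a specific record:
        // producer.send(new ProducerRecord<>("users", "alice", user));
    }
}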


References:
https://www.ctheu.com/2017/03/02/serializing-data-efficiently-with-apache-avro-and-dealing-with-a-schema-registry/
https://www.confluent.io/blog/schema-registry-kafka-stream-processing-yes-virginia-you-really-need-one/
http://docs.confluent.io/3.1.2/schema-registry/docs/serializer-formatter.html
https://softwaremill.com/dog-ate-my-schema/
