Sunday 30 April 2017

Kafka Connect

Kafka, being a streaming data platform, acts as a giant buffer that decouples the time-sensitivity requirements between producers and consumers. Kafka itself applies back-pressure on producers, and the consumption rate is driven entirely by the consumers. If producer throughput exceeds that of the consumers, data simply accumulates in Kafka until the consumers catch up.

Kafka can provide "at least once" delivery on its own, and "exactly once" when combined with an external data store that has a transactional model or unique keys.
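As a hedged illustration of the unique-key approach, the sketch below (plain consumer code, not Connect) upserts each record into a relational table keyed by (topic, partition, offset), so replaying a record after a failure overwrites instead of duplicating. The table name, column names, and ON CONFLICT syntax (PostgreSQL-style) are assumptions for the example.

import java.sql.Connection;
import java.sql.PreparedStatement;

import org.apache.kafka.clients.consumer.ConsumerRecord;

public class IdempotentWriter {

    // Hypothetical target table:
    //   events(topic TEXT, kafka_partition INT, kafka_offset BIGINT, payload TEXT,
    //          PRIMARY KEY (topic, kafka_partition, kafka_offset))
    private static final String UPSERT_SQL =
            "INSERT INTO events (topic, kafka_partition, kafka_offset, payload) VALUES (?, ?, ?, ?) "
            + "ON CONFLICT (topic, kafka_partition, kafka_offset) DO UPDATE SET payload = EXCLUDED.payload";

    public static void write(Connection db, ConsumerRecord<String, String> record) throws Exception {
        try (PreparedStatement stmt = db.prepareStatement(UPSERT_SQL)) {
            stmt.setString(1, record.topic());
            stmt.setInt(2, record.partition());
            stmt.setLong(3, record.offset());
            stmt.setString(4, record.value());
            stmt.executeUpdate();  // re-delivery of the same record overwrites, never duplicates
        }
    }
}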

ETL vs ELT

ETL: Extract-Transform-Load, the data pipeline is responsible for making modifications to the data as it passes through.

ELT: Extract-Load-Transform, the data pipeline does only minimal transformation (e.g. data type conversion), with the goal of making sure the data that arrives at the target is as similar as possible to the source data. A data-lake architecture preserves as much of the raw data as possible and allows downstream apps to make their own decisions regarding data processing and aggregation.


Kafka Connect vs. Client APIs

Use Connect when you are connecting Kafka to datastores you did not write and whose code you cannot or do not want to modify. Connect provides out-of-the-box features like configuration management, offset storage, parallelization, error handling, a REST API, etc.
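For example, a connector can be deployed and managed entirely through the workers' REST API, with no custom code beyond a configuration. Below is a minimal sketch that submits the built-in FileStreamSource connector to a worker; the worker address (localhost:8083 is the default port), file path, and topic name are assumptions for the example.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DeployFileConnector {
    public static void main(String[] args) throws Exception {
        // Connector definition: read lines from /tmp/app.log and produce them to the "app-log" topic.
        String json = "{"
                + "\"name\": \"file-source-example\","
                + "\"config\": {"
                + "  \"connector.class\": \"org.apache.kafka.connect.file.FileStreamSourceConnector\","
                + "  \"tasks.max\": \"1\","
                + "  \"file\": \"/tmp/app.log\","
                + "  \"topic\": \"app-log\""
                + "}}";

        // POST the configuration to a Connect worker's REST API.
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}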

Connector
The connector is responsible for:

  1. Determining how many tasks will run for the connector
  2. Deciding how to split the data-copying work between the tasks
  3. Getting configurations for the tasks from the workers and passing them along (see the sketch after this list)
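
A minimal sketch of these responsibilities, assuming a hypothetical source connector that copies a comma-separated list of files into a single topic (its task class, FileLineSourceTask, is sketched under "Offset Management" below):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.source.SourceConnector;

public class FileLineSourceConnector extends SourceConnector {
    private String[] files;
    private String topic;

    @Override
    public void start(Map<String, String> props) {
        // "files" and "topic" are hypothetical config keys for this example.
        files = props.get("files").split(",");
        topic = props.get("topic");
    }

    @Override
    public Class<? extends Task> taskClass() {
        // The task class that does the actual copying (sketched under "Offset Management" below).
        return FileLineSourceTask.class;
    }

    @Override
    public List<Map<String, String>> taskConfigs(int maxTasks) {
        // Split the data-copying work: one file per task, capped at the worker-provided maximum.
        // (A real connector would spread any leftover files across the tasks it did create.)
        int numTasks = Math.min(maxTasks, files.length);
        List<Map<String, String>> configs = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            Map<String, String> config = new HashMap<>();
            config.put("file", files[i]);
            config.put("topic", topic);
            configs.add(config);
        }
        return configs;
    }

    @Override
    public void stop() { }

    @Override
    public ConfigDef config() {
        return new ConfigDef()
                .define("files", ConfigDef.Type.STRING, ConfigDef.Importance.HIGH,
                        "Comma-separated list of files to read")
                .define("topic", ConfigDef.Type.STRING, ConfigDef.Importance.HIGH,
                        "Topic to write the lines to");
    }

    @Override
    public String version() {
        return "0.1";
    }
}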

All tasks are initialized by receiving a context from the worker. The source task context includes an object that lets the task look up the offsets previously stored for its source records, so it knows where to resume; the sink task context includes methods that let the task control the records it receives from Kafka, such as applying back-pressure, retrying, and storing offsets externally for exactly-once delivery.
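For the sink side, here is a minimal sketch of the externally-stored-offsets pattern, assuming a hypothetical lookupStoredOffset() helper that queries the target data store for the last Kafka offset it committed:

import java.util.Collection;
import java.util.Map;

import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

public class ExternalStoreSinkTask extends SinkTask {

    @Override
    public void start(Map<String, String> props) {
        // Connect to the external data store here (omitted).
    }

    @Override
    public void open(Collection<TopicPartition> partitions) {
        // On (re)assignment, tell Connect to resume each partition from the offset
        // recorded in the external store, so writes are neither duplicated nor skipped.
        for (TopicPartition tp : partitions) {
            Long stored = lookupStoredOffset(tp);  // hypothetical external lookup
            if (stored != null) {
                context.offset(tp, stored + 1);    // resume after the last record we wrote
            }
        }
    }

    @Override
    public void put(Collection<SinkRecord> records) {
        // Write each record and its Kafka offset to the external store atomically
        // (single transaction or unique key), so data and offset commit together (omitted).
    }

    @Override
    public void stop() { }

    @Override
    public String version() {
        return "0.1";
    }

    private Long lookupStoredOffset(TopicPartition tp) {
        return null;  // placeholder: query the external store for the last offset written
    }
}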

Worker

Kafka Connect's worker processes are the container processes that execute the connectors and tasks.
If a worker process crashes, other workers will recognize that and will reassign the connectors and tasks that ran on that worker.

Connectors and tasks are responsible for the "moving data" part of data integration, while the workers are responsible for the REST API, configuration management, reliability, and load balancing.

Offset Management

For source connectors, the records that the connector returns to the Connect workers include a logical partition and a logical offset in the source system, e.g. a partition can be a file and an offset can be a line number, or a partition can be a database table and an offset can be the ID of a record.
The worker sends these records to Kafka and, once the brokers acknowledge them, stores the source offsets, usually in a Kafka topic (the storage mechanism is pluggable). This allows connectors to resume processing from the most recently stored offset after a restart or a crash.
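
A minimal sketch of a source task that reads lines from a file (the task class referenced by the connector sketch above); readNextLine() is a hypothetical helper. It looks up its last stored offset through the task context and attaches a {filename, position} partition/offset pair to every record it returns:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

public class FileLineSourceTask extends SourceTask {
    private String filename;
    private String topic;
    private long position;  // line number we will read next

    @Override
    public void start(Map<String, String> props) {
        filename = props.get("file");
        topic = props.get("topic");

        // Ask the worker for the offset it last stored for this logical partition,
        // so we resume where we left off after a restart or crash.
        Map<String, Object> stored = context.offsetStorageReader()
                .offset(Collections.singletonMap("filename", filename));
        position = (stored != null && stored.get("position") != null)
                ? (Long) stored.get("position") : 0L;
    }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        List<SourceRecord> records = new ArrayList<>();
        String line = readNextLine(filename, position);  // hypothetical helper
        if (line == null) {
            Thread.sleep(1000);  // nothing new yet; back off briefly
            return records;
        }
        position++;
        // Logical partition = the file, logical offset = the next line number.
        records.add(new SourceRecord(
                Collections.singletonMap("filename", filename),
                Collections.singletonMap("position", position),
                topic, Schema.STRING_SCHEMA, line));
        return records;
    }

    @Override
    public void stop() { }

    @Override
    public String version() {
        return "0.1";
    }

    private String readNextLine(String filename, long position) {
        return null;  // placeholder: return the line at 'position', or null if none yet
    }
}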





