Monday 10 April 2017

Kafka Streams

Lambda vs Kappa Architecture


  • A scalable high-latency batch system that can process historical data and a low-latency stream processing system that can't reprocess results.
  • Use Kafka to retain the full log of the data you want to be able to reprocess Retaining large amounts of data in kafka is a perfectly natural and economical thing to do and won't hurt performance.


The KStreams DSL is composed of two main abstractions; the KStream and KTable interfaces.
  • A KStream is a record stream where each key-value pair is an independent record. Later records in the stream don’t replace earlier records with matching keys. 
  • A KTable on the other hand is a “changelog” stream, meaning later records are considered updates to earlier records with the same key.
Local State Store:

Kafka streams provide an efficent way to model the application state.
The state store is partitioned the same way as the application's key space. As a result, all the data required to serve the queries that arrive at a particular application instance are available locally in the state store shards. Fault tolerance for this local state store is provided by kafka streams by logging all updates made to the state store, transparently, to a highly-available and durable kafka topic.
Kafka streams uses kafka like a commit log for its local, embedded database.

The aggregation is done on each instance with a local state. Then, the results are written to a new topic with a single partition, which will be read by a single application instance. This type of multi-phase processing is very familiar to those  mapreduce.


Interactive Queries allow us to treat the stream processing layer as a lightweight, embedded database and directly query the state of your stream processing application, without needing to materialize that state to external databases or storage.

Kafka steams simplifies the stream processing architecture by eliminating the need for a separate cluster. These embedded databases act as materialized views of logs.
Materialized views provide better application isolation because they are part of an application's state, also provide better performance.

Interactive Queries enables faster and more efficient use of the application state. There is no duplication of data between the store doing aggregation for stream processing and the store answering queries.

Discovering any instance's stores

In kafka, each instance may expose its endpoint information metadata to other instances of the same application. The IQ APIs allow a developer to obtain the metadata for a given store name and key, and do that across the instances of an application. Then, we can discover where the store is that holds a particular key by examining the metadata.

Event streams are ordered, while records in a table are always considered unordered.
Events, once occured, can never be modified. Instead an additional event is written to the stream, recording a cancellation of previous transaction.

Streams contain a stream of events and each event caused a change. A table contains a current state of the world, a state that is a result of many changes.

If we can capture all the changes that happen to the database table in a stream of events, we can have our stream processing job listen to this stream and update the cache based on database change events.

Stream-table join: one of the streams represents changes to a locally cached table. One stream is joining a stream with a table to enrich all events with info in the table. This is similar to joining a fact table with a dimension.
Stream-stream join: Streams have the same key and happened in the same time windows, also called a windowed-join. They are partitioned on the same keys, which are also the join keys.

Kafka Streams scales by allowing multiple threads of executions within one instance of the application., and supporting load balancing between distributed instances of the application. The number of tasks is determined by streams engine, and depends on the number of partitions in the topics. Each task is responsible for a subset of the partitions. The developer of the application can choose the number of threads each application instance will execute. Each task will independently process events from those partitions and maintain its own local state with relevant aggregates.

Kafka assigns all the partitions needed for one join to the same task, so this task can consume from all the relevant partitions and perform the join independently. Kafka streams requries that all topics that participate in a join operation will hae the same number of partitions and be partitioned based on the join key. Instead of shuffling, Kafka streams re-partitions by writing the events to a new topic with the new keys and partitions. It reduce dependencies between different parts of a pipeline.

Global KTable is more service concept, while KTable is more about streaming concept:

All partitions replicated to all nodes.
Supports N-way join
Blocks until initialized
Doesn't trigger processing(reference resources)

By default KStreams keeps state store backed up on a secondary node.


32 comments:

  1. I wish to show thanks to you just for bailing me out of this particular trouble.As a result of checking through the net and meeting techniques that were not productive, I thought my life was done
    digital marketing training in tambaram

    ReplyDelete
  2. Resources like the one you mentioned here will be very useful to me ! I will post a link to this page on my blog. I am sure my visitors will find that very useful
    Click here:
    python training in tambaram
    Click here:
    python training in annanagar

    ReplyDelete
  3. Thanks for the informative article. This is one of the best resources I have found in quite some time. Nicely written and great info. I really cannot thank you enough for sharing.
    Blueprism training in Chennai

    Blueprism training in Bangalore

    Blueprism training in Pune

    ReplyDelete
  4. Good Post! Thank you so much for sharing this pretty post, it was so good to read and useful to improve my knowledge as updated one, keep blogging.

    Data Science Training in Chennai
    Data science training in bangalore
    Data science online training
    Data science training in pune

    ReplyDelete
  5. Good Post! Thank you so much for sharing this pretty post, it was so good to read and useful to improve my knowledge as updated one, keep blogging.
    blue prism Training in Electronic City

    ReplyDelete
  6. I learned World's Trending Technology from certified experts for free of cost. I Got a job in decent Top MNC Company with handsome 14 LPA salary, I have learned the World's Trending Technology from Python training in pune experts who know advanced concepts which can help to solve any type of Real-time issues in the field of Python. Really worth trying instant approval blog commenting sites

    ReplyDelete
  7. This comment has been removed by the author.

    ReplyDelete
  8. Inspiring writings and I greatly admired what you have to say , I hope you continue to provide new ideas for us all and greetings success always for you..Keep update more information..Your good knowledge and kindness in playing with all the pieces were very useful. I don’t know what I would have done if I had not encountered such a step like this.Data Science Training In Chennai

    Data Science Online Training In Chennai

    Data Science Training In Bangalore

    Data Science Training In Hyderabad

    Data Science Training In Coimbatore

    Data Science Training

    Data Science Online Training


    ReplyDelete
  9. I am impressed by the information that you have on this blog. It shows how well you understand this subject.
    artificial intelligence course in bangalore

    ReplyDelete
  10. Its as if you had a great grasp on the subject matter, but you forgot to include your readers. Perhaps you should think about this from more than one angle.
    artificial intelligence course in bangalore

    ReplyDelete
  11. Nice article and thanks for sharing with us. Its very informative.
    AWS Training in Hyderabad
    AWS Course in Hyderabad

    ReplyDelete
  12. Thanks for posting the best information and the blog is very important ��.data science interview questions and answers

    ReplyDelete
  13. Did you want to set your career towards Oracle? Then Infycle is with you to make this into reality. Infycle Technologies gives the combined and best Oracle course in Chennai, which offers various stages of Oracle such as Oracle PL/SQL, Oracle DBA, etc., along with 100% hands-on training guided by professional tutors in the field. Along with that, the mock interviews will be given to the candidates to face the interviews with complete confidence. Apart from all, the candidates will be placed in the top MNC's with an excellent salary package. To get it all, call 7502633633 and make this happen for your happy life.
    Best Oracle Course in Chennai | Infycle Technologies

    ReplyDelete
  14. This post is so interactive and informative.keep update more information...
    hadoop training in tambaram
    Big data training in chennai

    ReplyDelete
  15. This insightful comparison between Lambda and Kappa architectures in Kafka Streams demonstrates a deep understanding of stream processing. The detailed explanations of concepts like KStreams DSL and Interactive Queries are highly commendable. Well done.
    Data Analytics Courses In Dubai

    ReplyDelete
  16. An excellent explanation of Kafka Streams and the subtle differences between Lambda and Kappa architecture
    Data Analytics Courses in Agra

    ReplyDelete
  17. This was so helpful for me. Keep sharing more.
    Visit - Data Analytics Courses in Delhi

    ReplyDelete
  18. good blog
    Data Analytics Courses In Vadodara

    ReplyDelete
  19. Your blog covers a wide range of topics, making it an invaluable resource for both beginners and experienced data professionals. Keep up the great work.

    Digital marketing courses in illinois

    ReplyDelete
  20. Thanks for sharing incredible and outstanding explanation on Kafka Streams and the subtle differences between Lambda and Kappa architecture.
    data analyst courses in limerick

    ReplyDelete
  21. thanks for sharing such great blog post, very well written
    Digital marketing business

    ReplyDelete
  22. Thanks for that really useful blog post. This was just what I needed.

    Investment banking analyst jobs

    ReplyDelete