Wilson Mar bio photo

Wilson Mar

Hello. Hire me!

Email me Calendar Skype call 310 320-7878

LinkedIn Twitter Gitter Google+ Instagram Youtube

Github Stackoverflow Pinterest

How to make streaming scream


Overview

Kafka is an Apache project at https://kafka.apache.org/

Being Apache, it’s open source, used by many clients. LinkedIn and Netflix have publicly blogged about their use of Kafka for data collection and buffering for downstream systems like Elasticsearch, Amazon EMR, Mantis etc.

Kafka provides scalable stream processing applications that react to events in real-time.

Kafka sits as a “broker” between “Producers” generating event logs, monitoring data, suspicious user actions, etc. and “consumers” of services, much like a messaging system (RabbitMQ, etc.).

Kafka stores streams of data safely in a distributed, replicated, fault-tolerant cluster.

In other words, Kafka is a hybrid of a distributed database and a message queue.

The logo for Kafka is the Greek Lambda letter. “Lambda Architecture” is a data processing architecture and framework designed to address robustness of the scalability and fault-tolerance (human and machine) of big data systems. The lambda symbol has two legs. To a traditional batch layer processing master data is added a speed layer using Spark streaming. These communicate with a Serving layer.

See Pluralsight video course: Spark Kafka Cassandra - applying Lambda Architecture by Ahmad Alkilani

Aggregation

https://eng.uber.com/distributed-tracing/ by Yuri Shkuro February 2, 2017

Monitoring

The Linux utility Iostat can provide Reads/sec, Writes/sec, KiloBytes read/sec, KiloBytes write/sec, Average number of transactions waiting, Average number of active transactions, Average response time of transactions, Percent of time waiting for service, Percent of time disk is busy

Blog https://cwiki.apache.org/confluence/display/KAFKA/Performance+testing was orginally created by Jay Kreps and last modified by John Fung on Jan 29, 2013

There are more “robust” commercial (licensed for money) monitoring solutions, such as: https://www.manageengine.com/apache-kafka/monitor‎

Simulated traffic

Unlike traditional web sites, Kafka reads and writes streams of data.

Measurement of performance under load is important to quantify known risks:

  1. Kafka has a long message storing time (a week, by default), so what is the rate of disk space usage? How much disk space is needed based on projections of usage? Are mechanisms for clearing and managing disk and alerting about additional space adequate?
  2. Kafka was designed for high performance with its sequential I/O. But what is the capacity?
  3. Kafka can transfer data quickly, but at what sustained rate can Kafka process it by using Streaming API.
  4. Kafka makes clustering convenient to achieve “HA” (high availability). But how many nodes are adequate?
  5. How long does it take for traffic to a failed node to be replicated and distributed across the cluster?
  6. How does the system detect a denial of service attack? Can it really? And how quickly?

To simulate traffic enough to identify capacity issues: https://www.blazemeter.com/blog/apache-kafka-how-to-load-test-with-jmeter Dec. 6th, 2017