Wilson Mar bio photo

Wilson Mar

Hello. Hire me!

Email me Calendar Skype call 310 320-7878

LinkedIn Twitter Gitter Google+ Youtube

Github Stackoverflow Pinterest

This can, and does, get big


This page contains unorganized notes about scaling of the Elastic Stack.


Elasticsearch powers many of the most data-rich websites:

  • Wikipedia uses Elasticsearch to provide full-text search with highlighted search snippets, and search-as-you-type and did-you-mean suggestions.

  • The Guardian uses Elasticsearch to combine visitor logs with social -network data to provide real-time feedback to its editors about the public’s response to new articles.

  • Stack Overflow combines full-text search with geolocation queries and uses more-like-this to find related questions and answers.

  • GitHub uses Elasticsearch to query 130 billion lines of code.

  • Klout

  • AppDynamics monitoring



Registering for a free “Basic” license gets you Monitoring (formerly Marvel) and Logstash ArcSight module along with Dev Tools Search Profiler and Grok Debugger. And Upgrade Assistant APIs & UI.

Unlike Splunk, where it’s expensive (millions) after the first 500 MB of free.

Elastic prices subscriptions by the number of nodes managed and monitored at scale (less than Splunk and doesn’t run out of gas).

But Marvel is free until production.


  1. GitHub for documentation and source control.
  2. Linux Ubuntu (Amazon build) running on all servers for DEB packages installed by command dpkg
  3. STunnel to secure communications
  4. Docker to install packages on servers.
  5. Logstash Forwarder on all servers to direct log entry flow to a collector.

  6. JMeter to create load on the system artificially by using Java programs to emulate many real clients.

  7. Maven to package java

  8. Puppet to manage configurations

  9. NGINX to distribute among servers

  10. Logstash collects timestamped logs of various formats, from various sources, parse to filter out junk, index them, and normalize into JSON in a way that’s searchable in a central location. Better than awk, grep, etc. on individual machines.

  11. RabbitMQ queue services between Logstash producers and consumers to ensure scalability by absorbing spikes.

  12. Elasticsearch indexes (inverted) nested aggregations of data in Hadoop.
  13. Curator at https://github.com/elasticsearch/curator to manage Elasticsearch indexes by enabling admins to schedule operations to optimise, close, and delete indexes.

  14. Kibana does data discovery on elasticsearch cluster to identify “actionable insights” and presents visualization (a dashboard).


Instead of piping individual logs such as:

$ log_producer | grep ... | sed ... | awk ... | tee output \ | sort | uniq -c | sort -n

Elasticsearch provides consistency to different time stamp formats.

Kibana “democratizes” data by putting a front-end to access data in a searcheable in fast, meaningful ways.



There is a lighter edition of Logstash.

Kibana & Elasticsearch started as an open source project, built by devops people for devops people.

  • https://github.com/elastic/elasticsearch
  • https://github.com/elastic/kibana
  • https://github.com/elastic/logstash was begun by @jordansissel while at Dreamhost, who continues to create videos https://www.youtube.com/channel/UC1Hc-GPNTYax-vAVCH333ww as Elastic employee. His talk: https://www.youtube.com/watch?v=fwMnb4-t8vo

  • https://github.com/docker-library/kibana

Competitive Features

  • D3 JS library flexibility
  • Watcher -
  • Shield support for security

  • Bulk operations (for indexing and search operations)
  • Percolator (“reversed search” - alerts, classification)
  • Suggesters (“Did you mean …?”)
  • Index aliases (Grouping, filtering or “renaming” of indices every day)
  • Index templates (automatic index configuration)
  • Monitoring API (amount of memory used, number of operations, etc.)

  • Pie charts have nested levels

Elasticsearch Configure

On a Mac with Homebrew installed:

   brew install elasticsearch nginx

Indexes and Shards

Indexes are stored in two types of shards (Apache Lucene instances): primary and replica. Primary shards are where documents are stored. Five primary shards are created for each new index by default. This default can change but not AFTER it is created.

Each primary shard has one replica by default but that can be changed dynamically for scale out or to make an index more resilient.

Elasticsearch cleverly distributes shards across available nodes such that primary and replica shards for an index are not present on the same node that is automatically part of an Elasticsearch cluster.

Elasticsearch moves shards automatically from one node to another in the case of node failure or when new nodes are added.

   # line 32 - read the comments on why you might not want localhost here
# for dev box only
elasticsearch: "http://localhost:9200",

# enable cors for kibana3 + elasticsearch 1.4
vi /usr/local/Cellar/elasticsearch/1.4.3/config/elasticsearch.yml

# kibana 3 compatibility
http.cors.enabled: true
http.cors.allow-origin: http://localhost:8080

# the services command is from the brew/tap at the top, love it
$ brew services restart elasticsearch

# make sure nginx starts by itself
# nginx config is in /usr/local/etc/nginx/nginx.conf if you need to look at it
# it won't need any edits for kibana.  it's just js/html in a directory.

# browse to http://localhost:8080/kibana  (you should see a kibana page)
# Now, let's change the default page to logstash.

cd /usr/local/var/www/kibana/app/dashboards
mv default.json default.json.orig
cp logstash.json default.json

# refresh the kibana page.  It will be logstash's default now.

Docker package

For more scale, between intermediate brokers are

  • Storm
  • Spark cache
  • Samza

Flume can send to HDFS for es-hadoop



Sends notifications via PagerDuty


jargons first : indexes, types, documents , mapping, match, multimatch ,filters ,query-dsl etc

PROTIP: From the bottom up, docs are stored in shards which live in a node. Each node is deployed as a physical machine or VM node. Multiple nodes can run within a physical machine. Node configurations include file path, label, network interface. Node default name is a random Marvel Super Hero name.

On Windows -Des.node.name=

ES balances shards around a cluster of ES nodes.

Docs (in JSON format) from clients are routed to an index. An index is a container for docs (not just a list of pointers). Index configurations include shards, replicas, refresh rates, read only, etc.

Once in a shard, a document remains in that shard.

The number of fixed type shards is fixed at creation.

A replica type shard can be changed dynamically. Having replicas means that

ES is built on top of Apache Lucene which provides index and query execution

There is usually a load balancer to distribute load among nodes within a cluster. http. default ports 9200-9300 for API traffic. transport.tcp.port 9300 for long-lived connections among nodes.

If a machine has multiple network cards, ideally, use a different vlan between nodes.

ES nodes are stateless. ES provides no rollback capabilities like SQL servers.

Logging is done at the cluster.

Each node respond to API calls to set configuration, mapping, templates, aliases, and index setting.

How many:

curl -XPOST 'localhost:9200/myindex/_segments'

If ELK is used only to ingest logs:

curl -XPOST 'localhost:9200/myindex/_optimize?max_num_segments=2'

ES is a Java app. Use the JDK for thread dumps, etc.

In the ES folder:

  • config pointed from path.config in /etc/elasticsearch

  • data pointed from path.data, avoid $ES_HOME/data

  • logs pointed from path.logs in /var/log/elasticsearch, default to $ES_HOME/logs

  • plugins pointed from path. plugins

  • scripts pointed from path.scripts

Log rotation daily by default.

Maxium amount of space consumed.

ES_HEAP_SIZE sets other options for you.

  • Don’t set above 30gb or JVM won’t compress object pointers and loose addressible heap space.
  • Don’t alloc more than 50% of RAM on box to ES (for OS & file system cache).
  • Don’t let JVM swap out disk by setting –boottrap.mlocakall=true in command or ES YML file.

Within Lucene, docs indexed into a temporary buffer. Lucene flushes docs to a segment so it can be searched. It contains an inverted index. _ refreshed every second. Segments are immutable (cannot be changed). To expunge deleted, segments are merged.

Filters are good all the time, and re-indexing is not necessary.

Doc Values sits in separate file on disk At query time there is no memory overhead but is slower.

The elected Master node does the cluster-level configuration. By default all nodes are master eligible. A cluster can potentially be split into two sub clusters (a split brain).


Usually setup 3 dedicated master nodes after 10. This doesn’t affect performance.


http://jakege.blogspot.sg/2014/03/how-to-install-elasticsearch.html How to Install Elasticsearch

https://www.youtube.com/watch?v=G56aE1kiOZw Your Data, Your Search, Elasticsearch SpringDeveloper 60K views

https://www.youtube.com/watch?v=bMi2g3uEAxU ElasticSearch in Production: lessons learned LuceneSolrRevolution 43K views


This is one of a series on Elastic Stack and monitoring:

  1. Elastic Stack ecosystem of people, websites, tutorials
  2. Elastic Stack architecture and installation
  3. Elastic Scaling (the database engine)
  4. Elastic Query (via REST API)
  5. Elastic Kibana (the visualization engine, like Grafana)
  6. Elastic Logstash to assemble and filter data from Beats
  7. Elastic Beats to collect data from servers