
Wilson Mar



It’s more relational than relational databases



The contribution of this article is a meticulously sequenced presentation that curates concise yet deep insights from the many resources about this topic.

NOTE: The content here is my personal opinion and is not intended to represent any employer (past or present). “PROTIP:” here highlights information I haven’t seen elsewhere on the internet because it consists of hard-won, little-known but significant facts based on my personal research and experience.

## Graph databases: the latest thing

Graph databases are the latest step in the evolution of data storage mechanisms to handle complexity.


PROTIP: People using graph databases call themselves “Graphistas”.

Graphs can provide insights not easily found using other technologies.

Graph databases provide an alternative way to store data. Instead of static predefined schemas, which require a shutdown to change, graph databases can be reconfigured dynamically while running.

Graphs are important for visualizing AI/Machine Learning algorithms.

Directed acyclic graphs (DAGs, whose edges are one-way and never loop back) are used in Git and in scheduling algorithms, and form the heart of many neural network (tensor) models and many other modern applications. Their representation of dependencies (precedence relationships) enables their use in the Airflow task-processing app.

## Simpler complex connections, naturally

It’s difficult for SQL to answer questions that were not already expected ahead of time.

SQL databases such as Oracle and MySQL need to join physical tables together using foreign keys and link tables.

Writing SQL to represent a social graph containing 1,000 persons averaging 50 friends each can be difficult due to the need for joins and “de-normalized” physical structures.
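To make the join burden concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table names (`person`, `friendship`), the sample people, and the `friends_of_friends` helper are all assumptions made for illustration; they are not from any particular product. Even a two-hop "friends of friends" question needs the link table joined to itself:

```python
import sqlite3

# In-memory database: one entity table plus a link table for friendships.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE friendship (
        person_id INTEGER NOT NULL REFERENCES person(id),
        friend_id INTEGER NOT NULL REFERENCES person(id),
        PRIMARY KEY (person_id, friend_id)
    );
""")
conn.executemany("INSERT INTO person VALUES (?, ?)",
                 [(1, "Ann"), (2, "Bob"), (3, "Cy"), (4, "Dee")])

# Hypothetical sample data: Ann-Bob, Bob-Cy, Cy-Dee, stored in both directions.
pairs = [(1, 2), (2, 3), (3, 4)]
conn.executemany("INSERT INTO friendship VALUES (?, ?)",
                 pairs + [(b, a) for a, b in pairs])


def friends_of_friends(conn, name):
    """Names exactly two hops from `name`, excluding direct friends and self."""
    rows = conn.execute("""
        SELECT DISTINCT p2.name
        FROM person p0
        JOIN friendship f1 ON f1.person_id = p0.id
        JOIN friendship f2 ON f2.person_id = f1.friend_id
        JOIN person p2     ON p2.id = f2.friend_id
        WHERE p0.name = ?
          AND p2.id <> p0.id
          AND p2.id NOT IN (SELECT friend_id FROM friendship
                            WHERE person_id = p0.id)
        ORDER BY p2.name
    """, (name,)).fetchall()
    return [row[0] for row in rows]


print(friends_of_friends(conn, "Ann"))
```

Each additional hop adds another self-join on `friendship`, which is where queries over a social graph become hard to write and to tune.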

## Graph Faster

It is time-consuming for traditional relational databases to process complex indexed queries (even if everything is in cache). Graph databases, however, can process complex data structures efficiently because they follow pointers instead of performing table lookups (“index-free adjacency”). A comparison VIDEO:
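The idea behind index-free adjacency can be sketched in a few lines of Python. This is a simplified illustration, not how Neo4j or any relational engine is actually implemented: the `Node` class, `hop_pointers`, and `hop_table` are hypothetical names, and a plain scan over edge rows stands in for a table lookup.

```python
class Node:
    """A graph node that holds direct references to its neighbors."""

    def __init__(self, name):
        self.name = name
        self.neighbors = []  # direct object references, no lookup table

    def connect(self, other):
        # Undirected link: each node points straight at the other.
        self.neighbors.append(other)
        other.neighbors.append(self)


def hop_table(edge_rows, name):
    """Table-style hop: search the shared edge table for matching rows."""
    return sorted(end for start, end in edge_rows if start == name)


def hop_pointers(node):
    """Pointer-style hop: work depends only on this node's own degree."""
    return sorted(neighbor.name for neighbor in node.neighbors)


ann, bob, cy = Node("Ann"), Node("Bob"), Node("Cy")
ann.connect(bob)
bob.connect(cy)

# The same graph as rows in a link table (both directions stored).
edge_rows = [("Ann", "Bob"), ("Bob", "Ann"), ("Bob", "Cy"), ("Cy", "Bob")]

print(hop_table(edge_rows, "Bob"))
print(hop_pointers(bob))
```

Both functions return the same neighbors. The difference is that `hop_table` has to consult a structure whose size grows with the whole data set, while `hop_pointers` touches only the node in hand.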

| Database | # persons | Query time |
|---|---|---|
| Relational database | 1,000 | 200 ms |
| Neo4j | 1,000 | 2 ms |
| "Supernodes" in Neo4j | 1,000,000 | 2 ms |

## More relational than relational databases

Whereas SQL data is stored in separate tables joined together using complex queries, graph databases are “whiteboard-friendly” because they store data the same way it is drawn in the data model. Graph database diagrams look like ER (Entity-Relationship) diagrams for SQL databases. The example below uses data from the MovieLens database containing 62,000 movies with 25 million ratings and one million tag applications applied by 162,000 users:


Graph databases manage nodes (data entities) in relationship to other nodes.

Instead of elaborate joins, labeled relationships connect red nodes (movie titles) to green nodes (actor names). Red and green differentiate entity types. Titles and actor names label the nodes. “ACTED_IN” and “DIRECTED” are the types of the relationships.

Node objects are also called “vertices” and relationships are also called “edges”.
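As a rough sketch of this data model, the snippet below represents nodes and typed relationships as plain Python dataclasses. The class names, the `properties` dictionaries, and the `people_related_to` helper are assumptions for illustration only and do not mirror any graph database's API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str  # entity type, e.g. "Movie" or "Person"
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: Node
    rel_type: str  # e.g. "ACTED_IN" or "DIRECTED"
    end: Node
    properties: dict = field(default_factory=dict)


# Illustrative sample data.
matrix = Node("Movie", {"title": "The Matrix"})
keanu = Node("Person", {"name": "Keanu Reeves"})
lana = Node("Person", {"name": "Lana Wachowski"})

relationships = [
    Relationship(keanu, "ACTED_IN", matrix, {"role": "Neo"}),
    Relationship(lana, "DIRECTED", matrix),
]


def people_related_to(title, rel_type):
    """Names of people connected to the movie `title` by `rel_type`."""
    return [
        rel.start.properties["name"]
        for rel in relationships
        if rel.rel_type == rel_type and rel.end.properties.get("title") == title
    ]


print(people_related_to("The Matrix", "ACTED_IN"))
print(people_related_to("The Matrix", "DIRECTED"))
```

The question "who acted in this movie?" is answered by following relationships of one type, with no join table in sight.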

In neural network computation graphs, vertices are neurons (simple building blocks) and edges are tensors (data items).

Each vertex has an ID (identifier).

Each edge has a weight. In a graph whose edges represent segments of a road being built, the Shortest Path (Dijkstra’s) algorithm reveals the least-cost set of road segments between two points.
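A minimal sketch of Dijkstra's algorithm, using Python's `heapq` module as the priority queue, is shown here. The `roads` graph and its weights are made-up sample data, and the function assumes all weights are non-negative.

```python
import heapq


def dijkstra(graph, source):
    """Least total weight from `source` to every reachable vertex.

    `graph` maps vertex -> list of (neighbor, weight) pairs.
    Weights must be non-negative.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry; a shorter route was already found
        for v, w in graph.get(u, []):
            candidate = d + w
            if candidate < dist.get(v, float("inf")):
                dist[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return dist


# Hypothetical road segments with construction costs as weights.
roads = {
    "A": [("B", 4), ("C", 1)],
    "B": [("A", 4), ("C", 2), ("D", 5)],
    "C": [("A", 1), ("B", 2), ("D", 8)],
    "D": [("B", 5), ("C", 8)],
}

print(dijkstra(roads, "A"))
```

Here the cheapest way from A to B goes through C (cost 3) rather than over the direct segment (cost 4).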

Adjacency lists make sense for large, sparsely connected graphs.

Adjacency sets make sense for small, densely connected graphs.
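The two representations differ only in the container kept per vertex. The sketch below builds both from the same made-up edge list; the function names are illustrative.

```python
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]


def adjacency_list(edges):
    """Vertex -> list of neighbors (keeps insertion order, cheap to iterate)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def adjacency_set(edges):
    """Vertex -> set of neighbors (fast 'is there an edge?' checks)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


as_list = adjacency_list(edges)
as_set = adjacency_set(edges)

print(as_list["C"])
print("D" in as_set["C"], "D" in as_set["A"])
```

With a set, checking whether two vertices are connected is a hash lookup, which matters more as each vertex gains more neighbors.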

To order all nodes so that every precedence relationship is satisfied, use a topological sort, which can be implemented with a simple iterative algorithm.
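One common iterative approach is Kahn's algorithm, sketched below. It is offered as one possible implementation, not necessarily the one the sources for this article had in mind. The task names are hypothetical.

```python
from collections import deque


def topological_sort(dependencies):
    """Order tasks so each one appears before the tasks that depend on it.

    `dependencies` maps task -> list of tasks that must come after it.
    Raises ValueError if the graph contains a cycle.
    """
    indegree = {}
    for task, followers in dependencies.items():
        indegree.setdefault(task, 0)
        for follower in followers:
            indegree[follower] = indegree.get(follower, 0) + 1

    # Start with tasks that nothing precedes (sorted for a stable result).
    ready = deque(sorted(t for t, d in indegree.items() if d == 0))
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for follower in dependencies.get(task, []):
            indegree[follower] -= 1
            if indegree[follower] == 0:
                ready.append(follower)

    if len(order) != len(indegree):
        raise ValueError("cycle detected: no valid ordering exists")
    return order


tasks = {"extract": ["transform"], "transform": ["load"], "load": []}
print(topological_sort(tasks))
```

The cycle check at the end is what makes the "acyclic" in DAG matter: if any tasks are left with unmet prerequisites, no ordering can satisfy all precedence relationships.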

Spanning tree algorithms find a set of edges that connects all nodes without forming a cycle. The minimum spanning tree is the one with the lowest sum of edge weights. Prim’s (greedy) algorithm works only for connected (weighted undirected) graphs. Kruskal’s algorithm works even for disconnected graphs, where it yields a minimum spanning forest.
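A compact sketch of Kruskal's algorithm with a union-find structure follows. The vertices, edges, and weights are made-up sample data; vertices E and F are deliberately disconnected from the rest to show the forest case.

```python
def kruskal(vertices, edges):
    """Minimum spanning forest of an undirected weighted graph.

    `edges` is a list of (weight, u, v) tuples.
    Returns (total_weight, chosen_edges) with edges as (u, v, weight).
    """
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent links to the component root, shortening the path.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    total, chosen = 0, []
    for weight, u, v in sorted(edges):  # cheapest edges first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # edge joins two components: no cycle
            parent[root_u] = root_v
            total += weight
            chosen.append((u, v, weight))
    return total, chosen


vertices = ["A", "B", "C", "D", "E", "F"]
edges = [
    (1, "A", "B"), (3, "B", "C"), (2, "A", "C"),
    (4, "C", "D"), (5, "E", "F"),
]

total, chosen = kruskal(vertices, edges)
print(total, chosen)
```

The B-C edge is skipped because A-B and A-C already connect those vertices, and E-F is kept as its own tree.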

## Traversing graphs indirectly

The advantage of graph databases appears when working with complex indirect relationships.

Relationships and nodes can be associated with name/value pair properties used to narrow searches.
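The sketch below shows one way an indirect (multi-hop) traversal with a property filter might look in plain Python. The node names, the `KNOWS` and `WORKS_WITH` relationship types, the `city` property, and the `reachable` function are all hypothetical; a real graph database would express this in its query language instead.

```python
from collections import deque

# Node name -> name/value properties (made-up sample data).
nodes = {
    "ann": {"city": "Austin"},
    "bob": {"city": "Boston"},
    "cy": {"city": "Austin"},
    "dee": {"city": "Austin"},
}

# (start, relationship type, end, relationship properties)
relationships = [
    ("ann", "KNOWS", "bob", {"since": 2015}),
    ("bob", "KNOWS", "cy", {"since": 2019}),
    ("cy", "KNOWS", "dee", {"since": 2021}),
    ("ann", "WORKS_WITH", "dee", {"since": 2020}),
]


def reachable(start, rel_type, max_hops, node_filter=lambda props: True):
    """Nodes within `max_hops` of `start` along `rel_type` relationships,
    keeping only those whose properties satisfy `node_filter`."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        current, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for rel_start, kind, rel_end, _props in relationships:
            if rel_start == current and kind == rel_type and rel_end not in seen:
                seen.add(rel_end)
                queue.append((rel_end, hops + 1))
                if node_filter(nodes[rel_end]):
                    found.append(rel_end)
    return found


in_austin = lambda props: props["city"] == "Austin"
print(reachable("ann", "KNOWS", 3, in_austin))
print(reachable("ann", "KNOWS", 2, in_austin))
```

Bob is traversed but not reported because his `city` property fails the filter, while Cy and Dee are found only through him.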


Third-party add-ons can add a GUID to each entity.

## Which graph database and language?

This ranking by db-engines.com lists Neo4j as the most popular graph database, with Microsoft Cosmos catching up quickly. Notice that Cosmos and others are called “Multi-model” (providing a document store, key-value store, and wide-column store as well as a graph database).

The Gremlin graph traversal machine (GTM) is to graph computing what the Java virtual machine (JVM) is to general-purpose computing. Gremlin was developed (beginning in 2009) by Apache TinkerPop, a project of the Apache Software Foundation. Thus, it is Apache-2.0 licensed.

## Cloud SaaS Graph database services

Instead of running a local instance, a team of developers should consider a cloud service for its always-on availability, on-demand scalability, and support:

  • GraphStory.com provides single-node cloud Neo4j databases on Azure, AWS, and GCP, with its own dashboard, for $299/month and up. Plans with monitoring and HA multi-zone failover protection start at $899/month.

  • Microsoft’s Cosmos graph database running within an Azure HDInsight Spark 2.0 cluster. VIDEO, slides

  • Apache TinkerPop Gremlin queries on DataStax Enterprise Graph

  • Neptune (named after the ice-giant planet in our solar system) is Amazon’s managed graph database cloud service (documentation). A Pluralsight video course by Jeff Hoopper covers the use of (non-prod) CloudFormation to establish a cluster of $250/month db.r4.large (or larger) EC2 instances in several availability zones within a region. IAM is used, but the cluster is accessible only from within a VPC, such as from a Lambda service. The read-only Reader endpoint can access up to 16 read replicas behind separate IP addresses. The modules used in the course are from release 2018.3:

    • apache-tinkerpop-gremlin-console-3.3.2 for Gremlin queries
    • eclipse-rdf4j-2.3.2 for SPARQL queries against the RDF triplestore

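    AWS Neptune can bulk-load UTF-8 graph data from S3. The load is started with an HTTP POST (for example, using curl), and several formats are accepted:

    • CSV (for property-graph data queried with Gremlin)
    • N-Triples
    • N-Quads
    • RDF/XML
    • Turtle (these last four are RDF formats, queried with SPARQL)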

  • GraphStory can stand up Enterprise Neo4j Causal Clusters on Google Cloud Platform. Also on the GCP Marketplace. INTRO VIDEO

  • Neo4j’s own Aura cloud offering runs on GCP.
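For the Neptune bulk load mentioned in the list above, the POST can also be built from Python instead of curl. The sketch below only constructs the request and does not send it. The endpoint host, bucket, role ARN, and region are hypothetical placeholders, and the `build_loader_request` helper is an illustration; check the JSON field names and port against the current Neptune loader documentation before relying on them.

```python
import json
import urllib.request


def build_loader_request(endpoint, s3_uri, data_format, iam_role_arn, region):
    """Build (but do not send) a POST to a Neptune cluster's /loader endpoint."""
    body = {
        "source": s3_uri,            # S3 location of the data files
        "format": data_format,       # e.g. "csv", "ntriples", "nquads", "rdfxml", "turtle"
        "iamRoleArn": iam_role_arn,  # role that lets Neptune read the bucket
        "region": region,
        "failOnError": "FALSE",
    }
    return urllib.request.Request(
        url=f"https://{endpoint}:8182/loader",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# All values below are hypothetical placeholders.
request = build_loader_request(
    endpoint="your-neptune-endpoint",
    s3_uri="s3://your-bucket/graph-data/",
    data_format="csv",
    iam_role_arn="arn:aws:iam::123456789012:role/YourNeptuneLoadRole",
    region="us-east-1",
)

print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Calling `urllib.request.urlopen(request)` would submit the load job, but only from a machine inside the cluster's VPC, which matches the access restriction described above.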

## Gremlin language

The Gremlin language is implemented by a wide variety of vendors, including Neo4j. Gremlin is popular largely because it is supported by the open-source Apache TinkerPop and its TinkerGraph (docs, TigerGraph analytics cloud). TinkerGraph is one of two open-source graph databases covered here. The other is JanusGraph (http://janusgraph.org), open-sourced in 2017 under The Linux Foundation, with participants from Google, Hortonworks, IBM, Amazon, GRAKN.AI, Expero Labs, etc. JanusGraph is a distributed graph database with multiple scalable storage backends:

  • Apache Cassandra®
  • Apache HBase®
  • Google Cloud Bigtable
  • Oracle BerkeleyDB

“It’s harder to get started with Gremlin than Neo4j’s Cypher. Gremlin has a SQL-like syntax (SELECT, WHERE, etc.). But Gremlin helps you understand graphs better than Cypher. And it’s available on free open-source software and most portable and available among vendors.” – John Ptacek [24:25] into “THAT Conference ‘19: Introduction to Graph Databases”


## More about Python

This is one of a series about Python:

  1. Python install on MacOS
  2. Python install on MacOS using Pyenv
  3. Python install on Raspberry Pi for IoT

  4. Python tutorials
  5. Python Examples
  6. Python coding notes
  7. Pulumi controls cloud using Python, etc.
  8. Jupyter Notebooks provide commentary to Python

  9. Python certifications

  10. Test Python using Pytest BDD Selenium framework
  11. Test Python using Robot testing framework
  12. Testing AI uses Python code

  13. Microsoft Azure Machine Learning makes use of Python

  14. Python REST API programming using the Flask library
  15. Python coding for AWS Lambda Serverless programming
  16. Streamlit visualization framework powered by Python
  17. Web scraping using Scrapy, powered by Python
  18. Neo4j graph databases accessed from Python