It’s more relational than relational databases
The contribution of this article is a meticulously sequenced presentation that curates concise yet deep insights from the many resources about this topic.
NOTE: Content here is my personal opinion, and is not intended to represent any employer (past or present). “PROTIP:” here highlights information I haven’t seen elsewhere on the internet because it consists of hard-won, little-known but significant facts based on my personal research and experience.
Graph databases: the latest thing
Graph databases are the latest in the evolution of data storage mechanisms to handle complexity.
PROTIP: People using graph databases call themselves “Graphistas”.
Graphs can provide insights not easily found using other technologies.
Graph databases provide an alternative way to store data. Instead of static predefined schemas which require a shutdown to change, graph databases can be configured dynamically while running.
Directed acyclic graphs (DAGs), whose edges are one-way and never loop back, are used in Git and scheduling algorithms, and form the heart of many neural network (tensor) models and many other modern applications. Their representation of dependencies (precedence relationships) enables their use in the Airflow task processing app.
Simpler complex connections, naturally
It’s difficult for SQL to answer questions that were not already expected ahead of time.
SQL databases from Oracle, MySQL, etc. need to join physical tables together using foreign keys and link tables.
Writing SQL to represent a social graph containing 1,000 persons averaging 50 friends each can be difficult due to the need for joins and “de-normalized” physical structures.
## Graph Faster
It is time-consuming for traditional relational databases to process complex indexed queries (even if the data is all in cache). However, graph databases can process complex data structures efficiently because they use pointers instead of table lookups (called “index-free adjacency”). A comparison VIDEO:
| Database | # persons | Query time |
|---|---|---|
| Relational database | 1,000 | 200 ms |
| "Supernodes" in Neo4j | 1,000,000 | 2 ms |
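The difference between table lookups and pointer following can be sketched in plain Python. This is a toy illustration only, not how Neo4j or any relational engine is actually implemented; the data and the names `FRIEND_ROWS`, `friends_by_table_scan`, `Node`, and `friends_of_friends` are hypothetical.

```python
# Hypothetical friendship data, stored the way a link table would hold it.
FRIEND_ROWS = [
    ("ann", "bob"), ("ann", "cat"), ("bob", "dan"), ("cat", "dan"), ("dan", "eve"),
]

def friends_by_table_scan(rows, person):
    """Relational style: every lookup searches the link table."""
    return {b for a, b in rows if a == person} | {a for a, b in rows if b == person}

class Node:
    """Graph style: each node holds direct references to its neighbors."""
    def __init__(self, name):
        self.name = name
        self.neighbors = []

def build_nodes(rows):
    """Turn link-table rows into node objects wired together by reference."""
    nodes = {}
    for a, b in rows:
        node_a = nodes.setdefault(a, Node(a))
        node_b = nodes.setdefault(b, Node(b))
        node_a.neighbors.append(node_b)
        node_b.neighbors.append(node_a)
    return nodes

def friends_of_friends(node):
    """Follow references two hops out, skipping the node itself and direct friends."""
    direct = {n.name for n in node.neighbors}
    result = set()
    for friend in node.neighbors:
        for fof in friend.neighbors:
            if fof is not node and fof.name not in direct:
                result.add(fof.name)
    return result

nodes = build_nodes(FRIEND_ROWS)
print(sorted(friends_by_table_scan(FRIEND_ROWS, "ann")))
print(sorted(friends_of_friends(nodes["ann"])))
```

In the second style each hop costs only a pointer dereference per neighbor, regardless of how many rows the whole data set contains.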
## More relational than relational databases
Whereas SQL data is stored in separate tables joined together using complex queries, graph databases are “white-board friendly” because they store data the same way it is illustrated in the data model. Graph database diagrams look like ER (Entity-Relationship) diagrams for SQL databases. The example below uses data from the MovieLens database containing 62,000 movies with 25 million ratings and one million tag applications applied by 162,000 users:
Graph databases manage nodes (data entities) in relationship to other nodes.
Instead of elaborate joins, labeled relationships connect red nodes representing movie titles with green nodes representing actor names. Red and green differentiate entity types. Titles and actor names are attached to nodes. “ACTED_IN” and “DIRECTED” are the types of the relationships.
Node objects are also called “vertices” and relationships are also called “edges”.
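As a rough sketch of this data model, nodes and relationships can be represented with two small Python classes. The class and field names (`Node`, `Relationship`, `rel_type`, `people_related_to`) are my own illustrative choices, not the API of Neo4j or any other product, and the three sample records are just a tiny hand-made example.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: int
    label: str                       # entity type, e.g. "Movie" or "Person"
    properties: dict = field(default_factory=dict)

@dataclass
class Relationship:
    start: Node
    rel_type: str                    # e.g. "ACTED_IN" or "DIRECTED"
    end: Node
    properties: dict = field(default_factory=dict)

# Illustrative sample data.
matrix = Node(1, "Movie", {"title": "The Matrix"})
keanu = Node(2, "Person", {"name": "Keanu Reeves"})
lana = Node(3, "Person", {"name": "Lana Wachowski"})

relationships = [
    Relationship(keanu, "ACTED_IN", matrix, {"role": "Neo"}),
    Relationship(lana, "DIRECTED", matrix),
]

def people_related_to(movie, rel_type, rels):
    """Names of people connected to a movie by a given relationship type."""
    return [
        r.start.properties["name"]
        for r in rels
        if r.end is movie and r.rel_type == rel_type
    ]

print(people_related_to(matrix, "ACTED_IN", relationships))
print(people_related_to(matrix, "DIRECTED", relationships))
```

No join is involved: the relationship itself carries references to both of its end nodes.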
In Neural Network Computation Graphs:
vertices are neurons (simple building blocks) and edges are tensors (data items).
Each vertex has an ID (identifier).
Each edge has a weight. In a graph of edges representing segments of a road being built, the Shortest Path (Dijkstra’s) algorithm reveals the least-cost set of road segments between two points.
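A minimal sketch of Dijkstra’s algorithm using Python’s standard `heapq` module follows. It assumes all edge weights are non-negative; the `ROADS` data and the function name `shortest_path` are hypothetical.

```python
import heapq

def shortest_path(graph, source, target):
    """Return (total_cost, path) from source to target, or (None, []) if unreachable.

    graph maps each node to a list of (neighbor, cost) pairs; costs must be >= 0.
    """
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist[node]:
            continue  # stale heap entry; a cheaper route was already found
        for neighbor, cost in graph.get(node, []):
            new_dist = d + cost
            if new_dist < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_dist
                prev[neighbor] = node
                heapq.heappush(heap, (new_dist, neighbor))
    if target not in dist:
        return None, []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]

# Hypothetical road segments with construction costs (undirected, so listed both ways).
ROADS = {
    "A": [("B", 4), ("C", 1)],
    "B": [("A", 4), ("C", 2), ("D", 5)],
    "C": [("A", 1), ("B", 2), ("D", 8)],
    "D": [("B", 5), ("C", 8)],
}

cost, path = shortest_path(ROADS, "A", "D")
print(cost, path)
```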
Adjacency Lists (or Adjacency Sets) make sense for large, sparsely connected graphs.
An Adjacency Matrix makes sense for small, densely connected graphs.
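The storage trade-off can be sketched by building two common in-memory representations of the same small undirected graph: an adjacency list (swap the lists for sets to get adjacency sets) and an adjacency matrix. The sample `EDGES` and the function names are hypothetical.

```python
EDGES = [(0, 1), (0, 2), (1, 2), (2, 3)]
NUM_NODES = 4

def to_adjacency_list(num_nodes, edges):
    """One list of neighbors per node: memory grows with the number of edges."""
    adjacency = {n: [] for n in range(num_nodes)}
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    return adjacency

def to_adjacency_matrix(num_nodes, edges):
    """One cell per node pair: memory grows with the square of the node count."""
    matrix = [[0] * num_nodes for _ in range(num_nodes)]
    for a, b in edges:
        matrix[a][b] = 1
        matrix[b][a] = 1
    return matrix

adj_list = to_adjacency_list(NUM_NODES, EDGES)
adj_matrix = to_adjacency_matrix(NUM_NODES, EDGES)

stored_in_list = sum(len(neighbors) for neighbors in adj_list.values())
stored_in_matrix = NUM_NODES * NUM_NODES
print(stored_in_list, stored_in_matrix)
```

With few edges per node the list stores far fewer entries than the matrix; as the graph approaches full connectivity the two converge and the matrix’s constant-time edge check becomes attractive.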
To order all nodes so that all precedence relationships are satisfied, use a topological sort, which can be implemented using a simple iterative algorithm.
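One such iterative approach is Kahn’s algorithm, sketched below: repeatedly emit a node that has no remaining unmet prerequisites. The `TASKS` pipeline is a made-up example, and the function name is my own.

```python
from collections import deque

def topological_sort(dependencies):
    """Order tasks so every task appears before the tasks that depend on it.

    dependencies maps each task to the list of tasks that must run after it.
    Raises ValueError if the graph contains a cycle.
    """
    indegree = {}
    for task, successors in dependencies.items():
        indegree.setdefault(task, 0)
        for successor in successors:
            indegree[successor] = indegree.get(successor, 0) + 1

    ready = deque(sorted(t for t, count in indegree.items() if count == 0))
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for successor in dependencies.get(task, []):
            indegree[successor] -= 1
            if indegree[successor] == 0:
                ready.append(successor)

    if len(order) != len(indegree):
        raise ValueError("graph has a cycle; no topological order exists")
    return order

# Hypothetical pipeline of tasks and the tasks that follow each one.
TASKS = {
    "extract": ["transform"],
    "transform": ["load", "report"],
    "load": ["notify"],
    "report": ["notify"],
}

print(topological_sort(TASKS))
```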
Spanning tree algorithms find a set of edges that connects all nodes without forming a cycle. The minimum spanning tree is the one that has the lowest sum of edge weights. Prim’s (greedy) algorithm works only for connected (weighted undirected) graphs. Kruskal’s algorithm works even for disconnected graphs, where it yields a minimum spanning forest (one tree per connected component).
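A minimal sketch of Kruskal’s algorithm, using a simple union-find structure to detect cycles, is shown here. The sample graph is hypothetical and deliberately disconnected (nodes E and F form their own component) to show the forest behavior.

```python
def kruskal(nodes, edges):
    """Return the edges of a minimum spanning forest.

    edges is a list of (weight, node_a, node_b) tuples for an undirected graph.
    """
    parent = {n: n for n in nodes}

    def find(n):
        # Walk up to the component's root, compressing the path along the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    chosen = []
    for weight, a, b in sorted(edges):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:          # joining two components cannot form a cycle
            parent[root_a] = root_b
            chosen.append((weight, a, b))
    return chosen

NODES = list("ABCDEF")
EDGES = [
    (1, "A", "B"), (3, "A", "C"), (2, "B", "C"), (4, "C", "D"),
    (5, "E", "F"),                    # separate component
]

forest = kruskal(NODES, EDGES)
print(forest, sum(weight for weight, _, _ in forest))
```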
Traversing graphs indirectly
The advantage of graph databases appears when working with complex indirect relationships.
Relationships and nodes can be associated with name/value pair properties used to narrow searches.
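To illustrate, here is a sketch of a multi-hop traversal that uses relationship properties to narrow the search. It treats relationships as directed and uses breadth-first search; the `RELATIONSHIPS` data, the `since` property, and the `reachable` function are all hypothetical.

```python
from collections import deque

# Hypothetical data: each relationship is (start, type, end, properties).
RELATIONSHIPS = [
    ("ann", "KNOWS", "bob", {"since": 2015}),
    ("bob", "KNOWS", "cat", {"since": 2019}),
    ("cat", "KNOWS", "dan", {"since": 2021}),
    ("ann", "KNOWS", "eve", {"since": 2022}),
    ("eve", "KNOWS", "dan", {"since": 2023}),
]

def reachable(start, rels, max_hops, rel_filter):
    """Return {node: hops} for nodes within max_hops of start.

    Only relationships for which rel_filter(rel_type, properties) is true are followed.
    """
    outgoing = {}
    for a, rel_type, b, props in rels:
        if rel_filter(rel_type, props):
            outgoing.setdefault(a, []).append(b)

    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue                  # do not expand beyond the hop limit
        for nxt in outgoing.get(node, []):
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                queue.append(nxt)
    del hops[start]
    return hops

everyone = reachable("ann", RELATIONSHIPS, 2, lambda rel_type, props: True)
long_standing = reachable(
    "ann", RELATIONSHIPS, 3,
    lambda rel_type, props: rel_type == "KNOWS" and props["since"] < 2020,
)
print(everyone)
print(long_standing)
```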
Third-party add-ons can add a GUID to each entity.
Which graph database and language?
This ranking by db-engines.com lists Neo4j as the most popular graph database, with Microsoft Cosmos catching up quickly. Notice that Cosmos and others are called “Multi-model” (providing a Document store, Key-value store, wide-column store as well as graph database).
The Gremlin traversal machine (GTM) is to graph computing what the Java virtual machine (JVM) is to general-purpose computing. Gremlin was developed (beginning in 2009) by Apache TinkerPop of the Apache Software Foundation. Thus, it is Apache-2 licensed.
Cloud SaaS Graph database services
Instead of a local instance, if you’re working as a team of developers, consider always-on availability, on-demand scalability, and support:
GraphStory.com provides single-node cloud Neo4j graph databases on Azure, AWS, and GCP with their dashboard for $299/month (and up), or $899/month and up with monitoring and HA multi-zone failover protection.
Apache TinkerPop Gremlin queries on DataStax Enterprise Graph
Neptune (named after the ice-giant planet in our solar system) is Amazon’s managed graph database cloud service (documentation). A Pluralsight video course by Jeff Hoopper covers use of (non-prod) CloudFormation to establish a cluster of $250/month db.r4.large (or larger) EC2 instances in several availability zones within a region. IAM is used, but the cluster is accessible only via a VPC, from a Lambda service. The read-only Reader can access up to 16 read replicas behind separate IP addresses. The modules used in the course are from release 2018.3:
- apache-tinkerpop-gremlin-console-3.3.2 for Gremlin queries
- eclipse-rdf4j-2.3.2 for SPARQL queries against the triplestore
AWS Neptune can bulk-load UTF-8 graph data from S3, triggered by using curl to POST a load request, in several formats:
- Turtle to load SPARQL
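As a hedged sketch, the same kind of POST that curl would send to Neptune’s bulk loader can be constructed in Python with the standard library. The request body fields (`source`, `format`, `iamRoleArn`, `region`, `failOnError`) follow my understanding of the Neptune loader API, so check them against the current AWS documentation; the endpoint, bucket, and role ARN below are placeholders I made up. The code only builds the request and does not send it.

```python
import json
import urllib.request

def build_neptune_load_request(endpoint, s3_uri, data_format, iam_role_arn, region):
    """Build (but do not send) a POST request for Neptune's bulk loader endpoint."""
    payload = {
        "source": s3_uri,             # S3 location of the data files
        "format": data_format,        # e.g. "turtle" for RDF data queried with SPARQL
        "iamRoleArn": iam_role_arn,   # role that lets Neptune read from the bucket
        "region": region,
        "failOnError": "FALSE",
    }
    return urllib.request.Request(
        url=f"https://{endpoint}:8182/loader",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# All values below are hypothetical placeholders.
request = build_neptune_load_request(
    endpoint="my-cluster.cluster-example.us-east-1.neptune.amazonaws.com",
    s3_uri="s3://my-example-bucket/graph-data/",
    data_format="turtle",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleNeptuneLoadRole",
    region="us-east-1",
)
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request), run from inside the cluster's VPC.
```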
Neo4j’s own Aura cloud offering runs on GCP.
The Gremlin language is implemented by a wide variety of vendors, including Neo4j. Gremlin is popular largely because it is supported by the open-source Apache TinkerPop/TinkerGraph (docs; see also the TigerGraph analytics cloud). TinkerPop is one of two open-source graph projects; the other is JanusGraph (http://janusgraph.org), open-sourced in 2017 under The Linux Foundation, with participants from Google, Hortonworks, IBM, Amazon, GRAKN.AI, Expero Labs, etc. JanusGraph’s distributed graph database has multiple scalable storage backends:
- Apache Cassandra®
- Apache HBase®
- Google Cloud Bigtable
- Oracle BerkeleyDB
“It’s harder to get started with Gremlin than Neo4j’s Cypher. Gremlin has a SQL-like syntax (SELECT, WHERE, etc.). But Gremlin helps you understand graphs better than Cypher. And it’s available on free open-source software and most portable and available among vendors.” – John Ptacek [24:25] into “THAT Conference ‘19: Introduction to Graph Databases”
The “Working with Graph Algorithms in Python” video tutorial on Pluralsight by Janani Ravi explains sample Python 3.5.1 code (not using Neo4j or Gremlin).