Graph databases

It’s more relational than relational databases

Overview

The contribution of this article is a maticulously sequenced presentation that curates concise yet deep insights from the many resources about this topic.

NOTE: Content here are my personal opinions, and not intended to represent any employer (past or present). “PROTIP:” here highlight information I haven’t seen elsewhere on the internet because it is hard-won, little-know but significant facts based on my personal research and experience.

Graph databases: the latest thing

Graph type databases is the latest in the evolution of data storage mechanisms to handle complexity.

PROTIP: People using graph databases call themselves “Graphistas”.

Below is a (probably controversial) table comparing the different types of stores and their strengths and weaknesses.

Types:	Key-value	Column	Document	Relational	Graph
Complexity	none	low	low	moderate	high
Performance	high	high	high	high	variable
Scalability	high	high	high / variable	high	variable
Flexibility	high	moderate	high	high	high

Google Cloud Spanner is a cloud-based relational database that’s a horizontally scalable, globally immediate consistent. Released in 2017.

References:

Wikipedia: Graph_database
https://itnext.io/getting-started-with-graph-databases-azure-cosmosdb-with-gremlin-api-and-python-80e57cbd1c5e
https://www.c-sharpcorner.com/article/what-is-a-graph-database/
VIDEO: Neo4j on GKE via GCP Marketplace at https://bit.ly/2u6ryt3

Faster and Better?

It is time consuming for traditional relational databases to process complex indexed queries (even if it’s all in cache). However, graph databases can process complex data structures efficiently because it uses pointers instead of table lookups (for “index free adjacency”). A comparison VIDEO:

	# persons	query time
Relational database	1,000	200 ms
Neo4j	1,000	2 ms
"Supernodes" in Neo4j	1,000,000	2 ms

Where graph databases are needed

Graphs can provide insights not easily found using other technologies.

Graph databases provides an alternative way to to store data. Instead of static predefined schemas which require shutdown to change, graph databases can be configured dynamically while running.

Where graphs are important to visualizing AI/Machine Learning algorithms:

Directed acyclic (one-way) graphs (DAGs) are used in Git, scheduling algorithms, and form the heart of many neuro network (Tensor) models in many other modern applications. Its representation of dependencies (precedence relationships) enable its use in the Airflow task processing app.

https://www.techtarget.com/searchcio/tip/Expect-graph-database-use-cases-for-the-enterprise-to-take-off

Simpler complex connections, naturally

It’s difficult for SQL to answer questions that were not already expected ahead of time.

SQL databases from Oracle, MySQL, etc. need to join physical tables together using foreign keys and link tables.*

Writing SQL to represent a social graph containing 1,000 persons averaging 50 friends each can be difficult due to the need for joins and “de-normalized” physical structures.

More relational than relational databases?

Whereas SQL data is stored in separate tables joined together using complex queries, Graph databases are “white-board friendly” because it stores data the same way as illustrated by its data model. Graph database diagrams look like ER (Entity-Relation) diagrams for SQL databases. The example below uses data from Movielens database containing 62,000 movies with 25 million ratings and one million tag applications applied by 162,000 users:

Graph databases manage nodes (data entities) in relationship to other nodes.

Instead of elaborate joins, labeled relationships between red nodes defining movie titles and green nodes defining actor names. Red and green differentiate entity types. Titles and actor names labels to nodes. “ACTED_IN” and “DIRECTED” are attributes of relationships.

Nodes objects are also called “vertices” and relationships are also called “edges”.

In Neural Network Computation Graphs:

vertices are neurons (simple building blocks) and edges are tensors (data items).

Each vertex has an ID (identifier).

Each edge has a weight. In a graph of edges representing segments of a road being built, the Shortest Path (Djisktra’s) algorithm reveals the least-cost set of road segments.

Adjacency Lists makes sense for large, sparsely connected graphs.

Adjacency Sets makes sense for small, densely connected graphs.

To order all nodes that satisfies all precedence relationships, a topological sort, implemented using a simple iterative algorithm.

Spanning Tree Algorithms find a path through all nodes. The minimum spanning tree is one that has the lowest sum of weights. Prim’s (greedy) algorithm works only for connected (weighted undirected) graphs. Krushal’s algorithm works even for disconnected graphs.

Traversing graphs indirectly

The advantage of graph databases appears when working with complex indirect relationships.

Relationships and nodes can be associated with name/value pair properties used to narrow searches.

neo4j-co-example-1154x345

Third-party add-ons can add a GUID to each entity.

Which graph database and language?

This ranking by db-engines.com lists Neo4j as the most popular graph database, with Microsoft Azure Cosmos catching up quickly.

https://db-engines.com/en/system/GraphDB%3BMicrosoft+Azure+Cosmos+DB%3BNeo4j

Notice that Cosmos and others are called “Multi-model” (providing a Document store, Key-value store, wide-column store as well as graph database).

Gremlin language

The Gremlin language traversal machine (GTM) is to graph computing as what the Java virtual machine (JVM) is to general purpose computing. The Gremlin language is implemented by a wide variety of vendors, including Neo4j. Gremlin is popular largely because it is supported (beginning in 2009) by the open-source Apache Tinkerpop/TinkerGraph (docs

“It’s harder to get started with Gremlin than Neo4j’s Cypher. Gremlin has a SQL-like syntax (SELECT, WHERE, etc.). But Gremlin helps you understand graphs better than Cypher. And it’s available on free open-source software and most portable and available among vendors.” – John Ptacek [24:25] into “THAT Conference ‘19: Introduction to Graph Databases”

References:

https://www.codeproject.com/Articles/1066378/Introduction-to-Graph-Databases-using-Neo-J-and-it

TigerGraph

TigerGraph analytics cloud calls itself “the most scalable graph database-as-a-service for your connected data analytics”. It is Apache-2 licensed. It’s purpose-built for loading terabytes of data in hours and, in real-time, analyzing 10+ hops deep in to relationships.

https://www.tigergraph.com/google-cloud/
https://docs.tigergraph.com/tigergraph-server/current/getting-started/cloud-images/gcp

JanusGraph on GKE with Cloud Bigtable

http://janusgraph.org was open-sourced in 2017 under The Linux Foundation, with participants from Google, Hortonworks, IBM, Amazon, GRAKN.AI, Expero Labs, etc. Named customers include Ebay and Target.

The Janusgraph distributed graph database has multiple scalable storage backends:

Apache Cassandra®
Apache HBase®
Google Cloud Bigtable
Oracle BerkeleyDB

https://db-engines.com/en/system/Google+Cloud+Datastore%3BGraphDB%3BNeo4j Google Cloud Datastore

JanusGraph uses a pluggable indexing backend to provide full-text indexing for vertex and edge properties.

Janusgraph runs within GKE with ElasticSearch as the indexing backend running in Pods in a StatefulSet.

Bigtable as the underlying backend storage layer.

For fast and deep graph traversals among relationships, JanusGraph stores data as an adjacency list.

The above diagram[1] shows the logical storage structure for a small graph fragment with two vertex rows. Two example rows represent two vertices. The first vertex is labeled with a single vertex property and is related to two other vertices by two separate edges. The second vertex holds columns containing two properties and one edge.

Each row represents a vertex, any adjacent vertices (edges), and property metadata about the vertices and edges. A row key is the unique identifier for each vertex. Each relationship between the vertex and another vertex and any properties that further define the relationship are stored as an edge or edge-property column. Both the column qualifier and column value store data that defines the edge, in accordance with Bigtable best practices. Each vertex property is stored as a separate column, again using both the column qualifier and the column value to define the property.

RDF Triplestore

RDF stands for Resource Description Framework and is a standard for describing web resources and data interchange, developed and standardized with the World Wide Web Consortium (W3C). While there are many conventional tools for dealing with data and more specifically for dealing with the relationships between data, RDF is the easiest, most powerful and expressive standard design.

RDF enables effective data integration from multiple sources, detaching data from its schema. This allows multiple schemas to be applied, interlinked, queried as one and modified without changing the data instances.

RDF is a standard for data interchange used for representing highly interconnected data. Each RDF statement is a three-part structure consisting of resources where every resource is identified by a URI. Representing data in RDF allows information to be easily identified, disambiguated and interconnected by AI systems.

https://www.ontotext.com/knowledgehub/fundamentals/what-is-rdf/

References:

[1] https://connectsaurabhmishra.medium.com/how-google-cloud-deals-with-graph-databases-19790cd1d43e
https://connectsaurabhmishra.medium.com/neo4j-aura-graph-database-extension-on-google-cloud-1b2c7f11fbd2
https://cloud.google.com/architecture/running-janusgraph-with-bigtable

Neo4j Aura:

https://console.cloud.google.com/marketplace/product/endpoints/prod.n4gcp.neo4j.io?project=glass-bridge-285601
https://connectsaurabhmishra.medium.com/neo4j-aura-graph-database-extension-on-google-cloud-1b2c7f11fbd2

Cloud SaaS Graph database services

References:

https://www.g2.com/categories/graph-databases
https://www.g2.com/categories/graph-databases#grid

Instead of a local instance, if you’re working as a team of developers, consider always-on availability, on-demand scalability, and support:

Neo4j’s own Aura cloud offering runs on GCP.
GraphStory.com provides single-node Cloud Graph Neo4j databases from Azure, AWS, and GCP with their dashboard for $299/month (and up). $899/month and up with monitoring with HA multi-zone failover protection. GraphStory can stand up Enterprise Neo4j Causal Clusters on Google Cloud Platform. Also on the GCP Marketplace. INTRO VIDEO
Microsoft’s Cosmos graph database processes Gremlin queries.
Neo4j VMs on MS Azure Marketplace in partnership VIDEO: Graph DataConnect
Apache TinkerPop Gremlin queries on DataStax Enterprise Graph
AWS Neptune (named the ice planet in our solar system) is Amazon’s graph database managed cloud service (documentation). Pluralsight video course by Jeff Hoopper covers use of (non-prod) CloudFormation to establish a cluster of $250/month db.r4.large (or larger) EC2 instances in several availability zones within a region. IAM is used, but is accessible only via a VPC from a Lambda service. The read-only Reader can access Up to 16 read replicas behind separate IP addresses. The modules used in the course are release 2018.3:
- apache-tinkerpop-gremlin-console-3.3.2 for Gremlin queries
- eclipse-rd4jf-2.3.2 for SPARQL queries Triplestore
AWS Neptune can (from S3) use curl to POST bulk-load graph UTF-8 data in several formats:
- CSV
- N-Tuples
- N-Quads
- RDF/XML
- Turtle to load SPARQL
AnzoGraph DB from Cambridge Semantics - a Massively Parallel Processing (MPP) native graph database built for data harmonization and analytics. Horizontally scalable graph database built for online analytics and data harmonization. Take on data harmonization and linked data challenges. It uses SPARQL*/OWL for semantic graphs but also supports Labeled Property Graphs (LPGs). VIDEO
ArangoDB earned high performance scores from Gartner. Natively store data for graph, document and search needs. Utilize feature-rich access with one query language. Map data natively to the database and access it with the best patterns for the job – traversals, joins, search, ranking, geospatial, aggregations – you name it. Polyglot persistence without the costs. Easily design, scale and adapt your architectures to changing needs and with much less effort. Combine the flexibility of JSON with semantic search and graph technology for next generation feature extraction even for large datasets.

Microsoft Cosmos

Microsoft’s Cosmos graph database running within a Azure HDInsight Spark cluster 2.0. VIDEO, slides

https://learn.microsoft.com/en-us/azure/cosmos-db/gremlin/introduction
https://learn.microsoft.com/en-us/azure/cosmos-db/gremlin/support
https://learn.microsoft.com/en-us/azure/cosmos-db/gremlin/modeling
https://learn.microsoft.com/en-us/azure/cosmos-db/gremlin/partitioning
https://www.tomsawyer.com/graph-database-browser/azure
https://towardsdatascience.com/getting-started-with-graph-databases-in-azure-cosmos-db-cbfbf708cda5

Microsoft Graph 365

Microsoft “Graph” product stores user metadata from its 365 product (documents, emails, etc).

It has an API.

https://azure.microsoft.com/en-us/products/graph-data-connect
https://github.com/microsoftgraph/dataconnect-solutions
https://blog.pragmaticworks.com/what-is-microsoft-graph
https://www.infoworld.com/article/3231658/making-sense-of-microsofts-graph-database-strategy.html
https://blog.victoriaholt.co.uk/2018/06/microsoft-graph-useful-tool.html
https://laurakokkarinen.com/the-ultimate-beginners-guide-to-microsoft-graph/
https://blog.ciaops.com/2019/04/17/using-interactive-powershell-to-access-the-microsoft-graph/
https://www.red-gate.com/simple-talk/development/dotnet-development/getting-started-with-microsoft-graph-api/
https://www.epcgroup.net/microsoft-graph-data-connect-pricing-and-features-guide/
https://blog.codewithdan.com/getting-started-calling-the-microsoft-graph-api/

Microsoft SQL Server

SQL Server offers graph database capabilities to model many-to-many relationships. The graph relationships are integrated into Transact-SQL and receive the benefits of using SQL Server as the foundational database management system.

Transact-SQL Graph extensions are fully integrated in Microsoft’s SQL Server engine. Use the same storage engine, metadata, query processor, etc. to store and query graph data. Query across graph and relational data in a single query. Combining graph capabilities with other SQL Server technologies like columnstore, HA, R services, etc. SQL graph database also supports all the security and compliance features available with SQL Server.

All graph operations supported on relational tables are supported on node or edge table. A node table is collection of similar type of nodes.

Transact-SQL Graph extensions allow users to create node or edge tables. Here is an example:

CREATE TABLE Person (ID INTEGER PRIMARY KEY, Name VARCHAR(100), Age INT) AS NODE;
CREATE TABLE friends (StartDate date) AS EDGE;

A graph is a collection of node and edge tables.
One graph per database. Node or edge tables can be created under any schema in the database, but they all belong to one logical graph.

A more complex diagram using Visio and PowerBI:

MATCH clause

The new MATCH clause supports pattern matching and multi-hop navigation through the graph. It uses ASCII-art style syntax for pattern matching:

-- Find friends of John
SELECT Person2.Name 
FROM Person Person1, Friends, Person Person2
WHERE MATCH(Person1-(Friends)->Person2)
AND Person1.Name = 'John';

The Person node table holds all the Person nodes belonging to a graph. An edge table is a collection of similar type of edges. A Friends edge table holds all the edges that connect a Person to another Person.

Since nodes and edges are stored in tables, most of the operations supported on regular tables are supported on node or edge tables.

Both nodes and edges can have properties associated to them.

Sample code

https://www.sqlshack.com/introduction-sql-server-2017-graph-database/

QUESTION: Sample template code for a web app built using C# .NET referencing graph SQL?

https://guyinacube.com/2020/10/08/visualize-graph-data-in-power-bi/

SHORTEST_PATH

The SHORTEST_PATH function finds the shortest path between any 2 nodes in a graph or starting from a given node to all the other nodes in the graph. It can also be used to find a transitive closure or for arbitrary length traversals in the graph.

Edge constraints

To enforce data integrity and specific semantics on the edge tables in SQL Server graph database, use edge constraints to edge tables</a> so that when a new edge is added, the Database Engine enforces that the nodes which the edge is trying to connect, exist in the proper node tables. It is also ensured that a node cannot be dropped, if it is still referenced by an edge. Example:

CONSTRAINT EC_BOUGHT CONNECTION (Customer TO Product, Customer TO Product)

EC_BOUGHT is the constraint_name.
Within parenteses are clauses that define the edge constraint.
https://bvisual.net/2018/02/09/using-visio-and-powerbi-with-graphdatabase-in-sqlserver/

Jobs

https://www.linkedin.com/jobs/microsoft-graph-science-jobs/

Resources

“Working with Graph Algorithms in Python” video tutorial on Pluralsight by Janani Ravi explains sample Python 3.5.1 code (not using Neo4J or Gremlin).
See my Neo4j Cypher language tutorial
QUESTION: Sample code for a web app using Neo4j and Python? Stack Overflow
VIDEO Neo4j CTO Jim Webber

Wilson Mar

Graph databases

Graph databases: the latest thing

Faster and Better?

Where graph databases are needed

Simpler complex connections, naturally

More relational than relational databases?

Traversing graphs indirectly

Which graph database and language?

Gremlin language

TigerGraph

JanusGraph on GKE with Cloud Bigtable

RDF Triplestore

Cloud SaaS Graph database services

Microsoft Cosmos

Microsoft Graph 365

Microsoft SQL Server

MATCH clause

Sample code

SHORTEST_PATH

Edge constraints

Jobs

Resources

You might also enjoy (View all posts)