It’s more relational than relational databases
The contribution of this article is a meticulously sequenced presentation that curates concise yet deep tidbits from the many resources about this topic.
## Graph databases: the newest thing
Graph databases are the latest step in the evolution of data storage mechanisms for handling complexity.
PROTIP: People using graph databases call themselves “Graphistas”.
Graph databases provide an alternative to SQL tables, rows, and columns, or NoSQL documents, for storing data. SQL makes it difficult to answer questions that were not anticipated ahead of time. Instead of static predefined schemas that require a shutdown to change, graph databases can be reconfigured dynamically while running.
Graphs can provide insights not easily found using other technologies.
## Simpler complex connections, naturally
SQL databases from Oracle, MySQL, etc. need to join physical tables together using foreign keys and link tables.
Writing SQL to represent a social graph containing 1,000 persons averaging 50 friends each can be difficult due to the need for joins and “de-normalized” physical structures.
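To make the join burden concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table and column names (`person`, `friendship`, `person_id`, `friend_id`) and the four sample people are illustrative assumptions, not taken from any real schema. Finding friends-of-friends already takes a double join through the link table, and each additional hop would add another join.

```python
import sqlite3

def build_sample_db():
    """Create an in-memory database with a person table and a friendship link table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE friendship (
            person_id INTEGER NOT NULL REFERENCES person(id),
            friend_id INTEGER NOT NULL REFERENCES person(id),
            PRIMARY KEY (person_id, friend_id)
        );
    """)
    conn.executemany("INSERT INTO person VALUES (?, ?)",
                     [(1, "Ann"), (2, "Bob"), (3, "Cat"), (4, "Dan")])
    # Hypothetical friendships, stored once per direction.
    pairs = [(1, 2), (2, 3), (3, 4)]
    conn.executemany("INSERT INTO friendship VALUES (?, ?)",
                     pairs + [(b, a) for a, b in pairs])
    return conn

def friends_of_friends(conn, name):
    """Return names exactly two hops from `name`, excluding direct friends."""
    sql = """
        SELECT DISTINCT p3.name
        FROM person p1
        JOIN friendship f1 ON f1.person_id = p1.id
        JOIN friendship f2 ON f2.person_id = f1.friend_id
        JOIN person p3 ON p3.id = f2.friend_id
        WHERE p1.name = ?
          AND p3.id <> p1.id
          AND p3.id NOT IN (
              SELECT friend_id FROM friendship WHERE person_id = p1.id
          )
        ORDER BY p3.name
    """
    return [row[0] for row in conn.execute(sql, (name,))]

if __name__ == "__main__":
    conn = build_sample_db()
    print(friends_of_friends(conn, "Ann"))
```

A three-hop version of this query would need a third join on `friendship` plus further exclusions, which is the kind of growth the paragraph above describes.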
## Graph Faster
Moreover, traditional relational databases take a long time to process complex indexed queries, even when all the data is in cache. Neo4j, by contrast, processes complex data structures efficiently because it follows pointers instead of performing table lookups (“index-free adjacency”). A comparison VIDEO:
| | # persons | Query time |
|---|---|---|
| Relational database | 1,000 | 200 ms |
| "Supernodes" in Neo4j | 1,000,000 | 2 ms |
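The idea behind index-free adjacency can be sketched in a few lines of Python. This is a conceptual illustration only, not how Neo4j lays out records on disk; the `Node` class, the helper names, and the sample people are hypothetical. Each node keeps direct references to its neighbors, so finding them costs work proportional to that node's own connections, whereas an unindexed link-table lookup has to examine every edge row.

```python
class Node:
    """A graph node that stores direct references to its neighbors."""

    def __init__(self, name):
        self.name = name
        self.neighbors = []  # pointers to other Node objects, no lookup table

def connect(a, b):
    """Link two nodes in both directions."""
    a.neighbors.append(b)
    b.neighbors.append(a)

def neighbors_by_pointer(node):
    """Index-free adjacency: cost grows with this node's own neighbor count."""
    return [n.name for n in node.neighbors]

def neighbors_by_table(edge_rows, name):
    """Unindexed table lookup: cost grows with the total number of edge rows."""
    return [dst for src, dst in edge_rows if src == name]

ann, bob, cat = Node("Ann"), Node("Bob"), Node("Cat")
connect(ann, bob)
connect(ann, cat)

# The same friendships as rows of a link table, one row per direction.
EDGE_ROWS = [("Ann", "Bob"), ("Bob", "Ann"), ("Ann", "Cat"), ("Cat", "Ann")]

if __name__ == "__main__":
    print(neighbors_by_pointer(ann))
    print(neighbors_by_table(EDGE_ROWS, "Ann"))
```

Both functions return the same neighbors here; the difference is in how much data each has to touch as the graph grows.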
## More relational than relational databases
Whereas SQL data is stored in separate tables joined together using complex queries, Neo4j is “whiteboard friendly”: Neo4j data is stored the same way it is drawn in its data model. Graph database diagrams look like the ER (Entity-Relationship) diagrams used for SQL databases. The example below uses data from the MovieLens database, containing 62,000 movies with 25 million ratings and one million tag applications applied by 162,000 users:
Graph databases manage nodes (data entities) in relationship to other nodes.
Instead of elaborate joins, labeled relationships connect red nodes representing movie titles to green nodes representing actor names. Red and green differentiate entity types. Titles and actor names label the nodes. “ACTED_IN” and “DIRECTED” are the types of relationships.
Some (academics) call nodes “vertices” or objects and relationships “edges”.
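As a rough illustration of this model, the sketch below represents nodes and typed relationships as plain Python dataclasses. The class names, field names, and the sample movie and people are made up for the example and do not come from the MovieLens data or from any Neo4j API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    kind: str   # entity type (the red/green distinction), e.g. "Movie" or "Person"
    name: str   # movie title or person name

@dataclass(frozen=True)
class Relationship:
    start: Node
    rel_type: str   # e.g. "ACTED_IN" or "DIRECTED"
    end: Node

def people_with(relationships, rel_type, movie):
    """Return sorted names of people connected to `movie` by `rel_type`."""
    return sorted(r.start.name for r in relationships
                  if r.rel_type == rel_type and r.end == movie)

# Hypothetical sample data.
movie = Node("Movie", "Example Movie")
alice = Node("Person", "Alice")
bob = Node("Person", "Bob")
carol = Node("Person", "Carol")

RELATIONSHIPS = [
    Relationship(alice, "ACTED_IN", movie),
    Relationship(bob, "ACTED_IN", movie),
    Relationship(carol, "DIRECTED", movie),
]

if __name__ == "__main__":
    print(people_with(RELATIONSHIPS, "ACTED_IN", movie))
    print(people_with(RELATIONSHIPS, "DIRECTED", movie))
```

No join table appears anywhere: the relationship itself is a first-class record that names its two endpoints and its type.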
## Traversing graphs indirectly
Relationships and nodes can be associated with name/value pair properties used to narrow searches.
Third-party add-ons can add a GUID to each entity.
The advantage of Neo4j appears when working with complex indirect relationships.
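A small Python sketch can show what an indirect traversal narrowed by properties looks like. It is a generic breadth-first search, not Neo4j's engine or query language; the graph layout, the `KNOWS` relationship type, the `since` property, and the names are all assumptions made for illustration.

```python
from collections import deque

def reachable(graph, start, max_hops, rel_filter=lambda props: True):
    """Breadth-first traversal returning {node: hops} for nodes within max_hops.

    `rel_filter` receives each relationship's property dict and decides
    whether that relationship may be followed.
    """
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for props, neighbor in graph.get(node, []):
            if neighbor not in hops and rel_filter(props):
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    del hops[start]
    return hops

# Hypothetical graph: node -> list of (relationship properties, neighbor).
GRAPH = {
    "Ann": [({"type": "KNOWS", "since": 2010}, "Bob")],
    "Bob": [({"type": "KNOWS", "since": 2018}, "Cat"),
            ({"type": "KNOWS", "since": 2005}, "Dan")],
    "Cat": [({"type": "KNOWS", "since": 2020}, "Eve")],
}

if __name__ == "__main__":
    print(reachable(GRAPH, "Ann", 3))
    print(reachable(GRAPH, "Ann", 3, lambda p: p["since"] >= 2010))
```

Changing the hop limit or the property filter changes only the arguments, whereas the equivalent SQL would need a different number of joins for each depth.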
## Which graph database and language?
This ranking by db-engines.com lists Neo4j as the most popular graph database, with Microsoft Azure Cosmos DB catching up quickly. Notice that Cosmos DB and others are called “Multi-model” (providing a document store, key-value store, and wide-column store as well as a graph database).
The Gremlin traversal machine is to graph computing what the Java virtual machine (JVM) is to general-purpose computing. Gremlin was developed (beginning in 2009) by Apache TinkerPop of the Apache Software Foundation. Thus, it is licensed under Apache 2.0.
## Cloud SaaS Graph database services
If you’re working as a team of developers, consider a cloud service instead of a local instance for always-on availability, on-demand scalability, and support:
GraphStory.com provides single-node cloud Neo4j graph databases on Azure, AWS, and GCP, with their dashboard, for $299/month (and up). Monitoring with HA multi-zone failover protection costs $899/month and up.
Apache TinkerPop Gremlin queries on DataStax Enterprise Graph
Neptune (named after the ice planet in our solar system) is Amazon’s managed graph database cloud service (documentation). A Pluralsight video course by Jeff Hoopper covers using (non-prod) CloudFormation to establish a cluster of $250/month db.r4.large (or larger) EC2 instances in several availability zones within a region. IAM is used, but the cluster is accessible only via a VPC, such as from a Lambda service. The read-only Reader endpoint can access up to 16 read replicas behind separate IP addresses. The modules used in the course are release 2018.3:
- apache-tinkerpop-gremlin-console-3.3.2 for Gremlin queries
- eclipse-rdf4j-2.3.2 for SPARQL queries against the triplestore
AWS Neptune can bulk-load UTF-8 graph data from S3, triggered by a curl POST, in several formats:
- Turtle, to load RDF data for SPARQL
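The bulk-load POST could be assembled in Python roughly as follows. This sketch only builds the request and never sends it. The JSON field names (`source`, `format`, `iamRoleArn`, `region`, `failOnError`), the `/loader` path, and port 8182 follow the Neptune bulk loader request as I understand it, so check them against the current AWS documentation. The endpoint, bucket, and role ARN are placeholders, and `build_loader_request` is a hypothetical helper name.

```python
import json
import urllib.request

def build_loader_request(endpoint, s3_uri, data_format, role_arn, region):
    """Build (but do not send) a bulk-load POST for a Neptune cluster endpoint."""
    body = {
        "source": s3_uri,          # S3 location of the UTF-8 data files
        "format": data_format,     # e.g. "turtle" for RDF queried with SPARQL
        "iamRoleArn": role_arn,    # role allowing the cluster to read from S3
        "region": region,
        "failOnError": "FALSE",
    }
    return urllib.request.Request(
        url=f"https://{endpoint}:8182/loader",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

if __name__ == "__main__":
    # All values below are placeholders, not real resources.
    request = build_loader_request(
        endpoint="your-neptune-endpoint",
        s3_uri="s3://your-bucket/your-data.ttl",
        data_format="turtle",
        role_arn="arn:aws:iam::123456789012:role/YourNeptuneLoadRole",
        region="us-east-1",
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Actually sending it (for example with `urllib.request.urlopen(request)`) would have to happen from inside the VPC, since the cluster is not reachable from outside it.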
Neo4j’s own Aura cloud offering runs on GCP.
The Gremlin language is implemented by a wide variety of vendors, including Neo4j. Gremlin is popular largely because it is supported by the open-source Apache TinkerPop/TinkerGraph (docs TigerGraph analytics cloud). TinkerPop is one of two open-source graph databases; the other is JanusGraph (http://janusgraph.org), open-sourced in 2017 under The Linux Foundation, with participants from Google, Hortonworks, IBM, Amazon, GRAKN.AI, Expero Labs, etc. JanusGraph’s distributed graph database has multiple scalable storage backends:
- Apache Cassandra®
- Apache HBase®
- Google Cloud Bigtable
- Oracle BerkeleyDB
“It’s harder to get started with Gremlin than Neo4j’s Cypher. Gremlin has a SQL-like syntax (SELECT, WHERE, etc.). But Gremlin helps you understand graphs better than Cypher. And it’s available on free open-source software and most portable and available among vendors.” – John Ptacek [24:25] into “THAT Conference ‘19: Introduction to Graph Databases”
## Neo4j Cypher language
See my Neo4j tutorial