Wilson Mar

It’s more relational than relational databases

Overview

This article contributes a meticulously sequenced presentation that curates concise yet deep tidbits from the many resources about this topic.

Graph databases: the newest thing

Graph databases are the latest step in the evolution of data storage mechanisms to handle complexity.

neo4j-evolution-828x394-72052.jpg

PROTIP: People using graph databases call themselves “Graphistas”.

Graph databases provide an alternative to using SQL tables, rows, and columns, or NoSQL documents, to store data. SQL makes it difficult to answer questions that were not anticipated ahead of time. Instead of static predefined schemas, which can require a shutdown to change, graph databases can be reconfigured dynamically while running.

Graphs can provide insights not easily found using other technologies.

Graphs are important to visualizing AI/Machine Learning algorithms: neo4j-ai-graphs-823x589

Simpler complex connections, naturally

SQL databases such as Oracle and MySQL need to join physical tables together using foreign keys and link tables.
neo4j-link-table-488x264.jpg

Writing SQL to represent a social graph containing 1,000 persons averaging 50 friends each can be difficult due to the need for joins and “de-normalized” physical structures.
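
To make the join burden concrete, here is a minimal sketch of a friends-of-friends query against a link table, using Python's built-in sqlite3 module. The table names (`person`, `friendship`), the sample names, and the helper functions are assumptions made up for illustration; they are not taken from any product's sample data.

```python
import sqlite3


def build_social_db(friendships):
    """Create an in-memory SQLite database with a person table and a link table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE friendship ("
        " person_id INTEGER NOT NULL REFERENCES person(id),"
        " friend_id INTEGER NOT NULL REFERENCES person(id),"
        " PRIMARY KEY (person_id, friend_id))"
    )
    names = sorted({name for pair in friendships for name in pair})
    conn.executemany("INSERT INTO person (name) VALUES (?)", [(n,) for n in names])
    ids = {name: pid for pid, name in conn.execute("SELECT id, name FROM person")}
    # Store each friendship in both directions so the link table is symmetric.
    rows = [(ids[a], ids[b]) for a, b in friendships]
    rows += [(ids[b], ids[a]) for a, b in friendships]
    conn.executemany("INSERT OR IGNORE INTO friendship VALUES (?, ?)", rows)
    return conn


# Two hops through the link table already need three joins plus a subquery.
FRIENDS_OF_FRIENDS = """
SELECT DISTINCT p3.name
FROM person AS p1
JOIN friendship AS f1 ON f1.person_id = p1.id
JOIN friendship AS f2 ON f2.person_id = f1.friend_id
JOIN person AS p3 ON p3.id = f2.friend_id
WHERE p1.name = ?
  AND p3.id <> p1.id
  AND p3.id NOT IN (SELECT friend_id FROM friendship WHERE person_id = p1.id)
ORDER BY p3.name
"""


def friends_of_friends(conn, name):
    """Return people exactly two hops away who are not already direct friends."""
    return [row[0] for row in conn.execute(FRIENDS_OF_FRIENDS, (name,))]


conn = build_social_db([("Ann", "Bob"), ("Bob", "Cid"), ("Cid", "Dee"), ("Ann", "Eve")])
print(friends_of_friends(conn, "Ann"))  # ['Cid']
```

Each additional hop adds another pair of joins to the query, which is where this style of SQL becomes hard to write and maintain.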

## Graph Faster

Traditional relational databases also take time to process complex indexed queries, even when all the data is in cache. Neo4j processes connected data efficiently because each node holds direct pointers to its neighbors instead of relying on table lookups (“index-free adjacency”). A comparison VIDEO:

| Database | # persons | Query time |
|---|---|---|
| Relational database | 1,000 | 200 ms |
| Neo4j | 1,000 | 2 ms |
| "Supernodes" in Neo4j | 1,000,000 | 2 ms |

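The idea behind index-free adjacency can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the `Person` class and both lookup functions are hypothetical, and the "relational" side is modeled as an unindexed scan of a link table, whereas real relational engines use indexes whose lookup cost grows more slowly than a full scan. The point is only that following stored references costs work proportional to one node's neighbors, not to the size of the whole data set.

```python
class Person:
    """A node that holds direct references to its neighbours."""

    def __init__(self, name):
        self.name = name
        self.friends = []


def connect(a, b):
    """Record a mutual friendship as references stored on both nodes."""
    a.friends.append(b)
    b.friends.append(a)


def friends_via_pointers(person):
    """Follow stored references; work done equals this person's friend count."""
    return [f.name for f in person.friends], len(person.friends)


def friends_via_table_scan(link_rows, name):
    """Scan every (owner, friend) row; work done equals the table size."""
    matches = [friend for owner, friend in link_rows if owner == name]
    return matches, len(link_rows)


people = {n: Person(n) for n in ["Ann", "Bob", "Cid", "Dee"]}
link_rows = []
for a, b in [("Ann", "Bob"), ("Ann", "Cid"), ("Cid", "Dee")]:
    connect(people[a], people[b])
    link_rows += [(a, b), (b, a)]

print(friends_via_pointers(people["Ann"]))       # (['Bob', 'Cid'], 2)
print(friends_via_table_scan(link_rows, "Ann"))  # (['Bob', 'Cid'], 6)
```

Both calls return the same friends; the second number in each result is the count of items examined to find them.
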
## More relational than relational databases

Whereas SQL data is stored in separate tables joined together using complex queries, Neo4j is “white-board friendly”: Neo4j data is stored the same way it is drawn in its data model. Graph database diagrams look like the ER (Entity-Relationship) diagrams used for SQL databases. The example below uses data from the MovieLens database containing 62,000 movies with 25 million ratings and one million tag applications applied by 162,000 users:

neo4j-movie-graph-1676x702-144758

Graph databases manage nodes (data entities) in relationship to other nodes.

Instead of elaborate joins, labeled relationships connect red nodes representing movie titles to green nodes representing actor names. Red and green differentiate entity types. Titles and actor names appear as captions on the nodes. “ACTED_IN” and “DIRECTED” are the types of the relationships.

Some (academics) call nodes “vertices” or objects and relationships “edges”.
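
The node-and-relationship model described above can be sketched with two small Python dataclasses. This is an illustrative in-memory model, not how Neo4j stores data internally; the class names, field names, and the `people_related_to` helper are assumptions, and the three people and one movie are a tiny hand-written sample rather than MovieLens data.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    label: str                                       # entity type: "Movie" or "Person"
    properties: dict = field(default_factory=dict)   # e.g. {"title": ...} or {"name": ...}


@dataclass
class Relationship:
    start: int      # node_id of the Person node
    rel_type: str   # "ACTED_IN" or "DIRECTED"
    end: int        # node_id of the Movie node


def people_related_to(nodes, relationships, movie_title, rel_type):
    """Names of Person nodes linked to the titled Movie node by rel_type."""
    movie_ids = {
        n.node_id
        for n in nodes.values()
        if n.label == "Movie" and n.properties.get("title") == movie_title
    }
    return [
        nodes[r.start].properties["name"]
        for r in relationships
        if r.rel_type == rel_type and r.end in movie_ids
    ]


nodes = {
    1: Node(1, "Person", {"name": "Keanu Reeves"}),
    2: Node(2, "Person", {"name": "Carrie-Anne Moss"}),
    3: Node(3, "Person", {"name": "Lana Wachowski"}),
    4: Node(4, "Movie", {"title": "The Matrix"}),
}
relationships = [
    Relationship(1, "ACTED_IN", 4),
    Relationship(2, "ACTED_IN", 4),
    Relationship(3, "DIRECTED", 4),
]

print(people_related_to(nodes, relationships, "The Matrix", "ACTED_IN"))
# ['Keanu Reeves', 'Carrie-Anne Moss']
print(people_related_to(nodes, relationships, "The Matrix", "DIRECTED"))
# ['Lana Wachowski']
```

No join table appears anywhere: the relationship itself carries the meaning that a link table would otherwise encode.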

Traversing graphs indirectly

Relationships and nodes can be associated with name/value pair properties used to narrow searches.

neo4j-co-example-1154x345

Third-party add-ons can add a GUID to each entity.

The advantage of Neo4j appears when working with complex indirect relationships.
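
As a rough illustration of an indirect relationship, the sketch below finds how two people are connected through shared movies using a breadth-first search over an adjacency map. The function names and the sample data ("Ann", "Movie 1", and so on) are hypothetical placeholders, and a real graph database would run this kind of traversal inside its own engine rather than in application code.

```python
from collections import defaultdict, deque


def build_adjacency(acted_in):
    """Turn (person, movie) pairs into an undirected adjacency map."""
    adjacency = defaultdict(list)
    for person, movie in acted_in:
        adjacency[person].append(movie)
        adjacency[movie].append(person)
    return adjacency


def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns the nodes on one shortest path, or None."""
    if start == goal:
        return [start]
    previous = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency.get(current, ()):
            if neighbour in previous:
                continue  # already reached by a path that is at least as short
            previous[neighbour] = current
            if neighbour == goal:
                # Walk the back-pointers from goal to start, then reverse.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


adjacency = build_adjacency([
    ("Ann", "Movie 1"), ("Bob", "Movie 1"),
    ("Bob", "Movie 2"), ("Cid", "Movie 2"),
    ("Dee", "Movie 3"),
])

print(shortest_path(adjacency, "Ann", "Cid"))
# ['Ann', 'Movie 1', 'Bob', 'Movie 2', 'Cid']
print(shortest_path(adjacency, "Ann", "Dee"))  # None
```

The same question asked of a relational schema would need one self-join per hop, and the number of hops is not known in advance.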

Which graph database and language?

This ranking by db-engines.com lists Neo4j as the most popular graph database, with Microsoft Cosmos catching up quickly. Notice that Cosmos and others are called “Multi-model” (providing a Document store, Key-value store, wide-column store as well as graph database).

The Gremlin traversal machine is to graph computing what the Java virtual machine (JVM) is to general-purpose computing. Gremlin has been developed (beginning in 2009) as part of Apache TinkerPop, a project of the Apache Software Foundation. Thus, it is licensed under the Apache License 2.0.

Cloud SaaS Graph database services

Instead of a local instance, if you’re working as a team of developers, consider always-on availability, on-demand scalability, and support:

  • GraphStory.com provides single-node cloud Neo4j graph databases on Azure, AWS, and GCP with their dashboard for $299/month (and up), or $899/month and up with monitoring and HA multi-zone failover protection.

  • Microsoft’s Cosmos graph database running within an Azure HDInsight Spark cluster 2.0. VIDEO, slides

  • Apache TinkerPop Gremlin queries on DataStax Enterprise Graph

  • Neptune (named after the ice-giant planet in our solar system) is Amazon’s managed graph database cloud service (documentation). A Pluralsight video course by Jeff Hoopper covers use of (non-prod) CloudFormation to establish a cluster of $250/month db.r4.large (or larger) EC2 instances in several availability zones within a region. IAM is used for access control, but the cluster is accessible only from within a VPC, for example from a Lambda function. The read-only reader endpoint fronts up to 15 read replicas, each behind a separate IP address. The modules used in the course are release 2018.3:

    • apache-tinkerpop-gremlin-console-3.3.2 for Gremlin queries
    • eclipse-rdf4j-2.3.2 for SPARQL queries against the triplestore

    AWS Neptune can bulk-load UTF-8 graph data from S3 (the load is started with a curl POST to the cluster) in several formats:

    • CSV (for Gremlin property-graph data)
    • N-Triples
    • N-Quads
    • RDF/XML
    • Turtle (the last four are RDF formats, queried with SPARQL)

  • GraphStory can stand up Enterprise Neo4j Causal Clusters on Google Cloud Platform. Also on the GCP Marketplace. INTRO VIDEO

  • Neo4j’s own Aura cloud offering runs on GCP.
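
The Neptune bulk load mentioned in the list above is started by POSTing a small JSON document to the cluster's loader endpoint. The sketch below only builds that request with Python's standard library and does not send it. The field names (`source`, `format`, `iamRoleArn`, `region`, `failOnError`), the `/loader` path, and port 8182 follow my understanding of the Neptune loader documentation and should be verified there; the endpoint, bucket, and role ARN are placeholders, and `build_loader_request` is a hypothetical helper.

```python
import json
from urllib.request import Request

# Display name -> value assumed to be expected in the loader's "format" field.
LOADER_FORMATS = {
    "CSV": "csv",
    "N-Triples": "ntriples",
    "N-Quads": "nquads",
    "RDF/XML": "rdfxml",
    "Turtle": "turtle",
}


def build_loader_request(endpoint, s3_uri, data_format, iam_role_arn, region, port=8182):
    """Build (but do not send) the POST request that starts a bulk load."""
    if data_format not in LOADER_FORMATS:
        raise ValueError(f"unsupported format: {data_format!r}")
    payload = {
        "source": s3_uri,
        "format": LOADER_FORMATS[data_format],
        "iamRoleArn": iam_role_arn,
        "region": region,
        "failOnError": "FALSE",
    }
    return Request(
        url=f"https://{endpoint}:{port}/loader",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# All values below are placeholders, not real resources.
req = build_loader_request(
    endpoint="your-neptune-endpoint.example.com",
    s3_uri="s3://hypothetical-bucket/graph-data/",
    data_format="N-Triples",
    iam_role_arn="arn:aws:iam::123456789012:role/HypotheticalNeptuneLoadRole",
    region="us-east-1",
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Because the cluster is reachable only from inside its VPC, a request like this would have to be sent from a host or function running in that VPC.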

Gremlin language

The Gremlin language is implemented by a wide variety of vendors, including Neo4j. Gremlin is popular largely because it is supported by the open-source Apache TinkerPop and its TinkerGraph (docs, TigerGraph analytics cloud). TinkerGraph is one of two open-source graph databases noted here. The other is JanusGraph (@JanusGraph, http://janusgraph.org), open-sourced in 2017 under The Linux Foundation, with participants from Google, Hortonworks, IBM, Amazon, GRAKN.AI, Expero Labs, etc. JanusGraph is a distributed graph database with multiple scalable storage backends:

  • Apache Cassandra®
  • Apache HBase®
  • Google Cloud Bigtable
  • Oracle BerkeleyDB

“It’s harder to get started with Gremlin than Neo4j’s Cypher. Gremlin has a SQL-like syntax (SELECT, WHERE, etc.). But Gremlin helps you understand graphs better than Cypher. And it’s available on free open-source software and most portable and available among vendors.” – John Ptacek [24:25] into “THAT Conference ‘19: Introduction to Graph Databases”
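
To give a feel for how a Gremlin traversal reads as a chain of steps, here is a toy imitation in plain Python. It is not the gremlinpython client or any TinkerPop API: `ToyGraph` and `ToyTraversal` are hypothetical classes written for this sketch, and only the step names (`V`, `has`, `out`, `values`) are borrowed from Gremlin. The sample vertices and edges are made up.

```python
class ToyGraph:
    """A tiny in-memory graph used only to imitate Gremlin-style chaining."""

    def __init__(self):
        self.vertices = {}  # vertex id -> property dict
        self.edges = []     # (out_vertex_id, label, in_vertex_id)

    def add_vertex(self, vertex_id, **properties):
        self.vertices[vertex_id] = properties

    def add_edge(self, out_id, label, in_id):
        self.edges.append((out_id, label, in_id))

    def V(self):
        """Start a traversal positioned on every vertex."""
        return ToyTraversal(self, list(self.vertices))


class ToyTraversal:
    def __init__(self, graph, current):
        self.graph = graph
        self.current = current  # vertex ids, or plain values after values()

    def has(self, key, value):
        """Keep only vertices whose property `key` equals `value`."""
        kept = [v for v in self.current if self.graph.vertices[v].get(key) == value]
        return ToyTraversal(self.graph, kept)

    def out(self, label):
        """Move along outgoing edges that carry the given label."""
        reached = [
            dst
            for v in self.current
            for src, lbl, dst in self.graph.edges
            if src == v and lbl == label
        ]
        return ToyTraversal(self.graph, reached)

    def values(self, key):
        """Replace each vertex with the value of one of its properties."""
        found = [
            self.graph.vertices[v][key]
            for v in self.current
            if key in self.graph.vertices[v]
        ]
        return ToyTraversal(self.graph, found)

    def to_list(self):
        return list(self.current)


g = ToyGraph()
g.add_vertex(1, name="Ann")
g.add_vertex(2, title="Movie 1")
g.add_vertex(3, title="Movie 2")
g.add_edge(1, "ACTED_IN", 2)
g.add_edge(1, "ACTED_IN", 3)

titles = g.V().has("name", "Ann").out("ACTED_IN").values("title").to_list()
print(titles)  # ['Movie 1', 'Movie 2']
```

Each step takes the current set of positions in the graph and returns a new traversal, which is why a query reads left to right as a path being walked.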

Neo4j Cypher language

See my Neo4j tutorial

More about Python

This is one of a series about Python:

  1. Python install on MacOS using Pyenv
  2. Python install on MacOS
  3. Python install on Raspberry Pi for IoT

  4. Test Python using Pytest BDD Selenium framework
  5. Test Python using Robot testing framework

  6. Python certifications
  7. Python tutorials
  8. Python coding notes
  9. Jupyter Notebooks provide commentary to Python

  10. Pulumi controls cloud using Python, etc.
  11. Microsoft Azure Machine Learning makes use of Python
  12. Testing AI uses Python code

  13. Python REST API programming using the Flask library
  14. Python coding for AWS Lambda Serverless programming
  15. Streamlit visualization framework powered by Python
  16. Web scraping using Scrapy, powered by Python
  17. Neo4j graph databases accessed from Python