Today we are going to talk about a very specific topic, but to begin, let's answer a basic question.

What is a Graph Database?

A graph database is one that stores data in a graph-like structure, where nodes represent entities and edges represent relationships between them. It offers greater flexibility than a relational database since it does not rely on tables with rigid schemas.

Having said that, there are two main approaches currently used to represent graphs in graph-DB solutions: RDF (Resource Description Framework) graphs and LPG (Labeled Property Graph) graphs.

Resource Description Framework graphs

https://www.w3.org/RDF/

RDF is more than a language; it is a model for encoding semantic relationships between items of data. It is also a framework, since the name covers a family of W3C specifications designed as a general-purpose data model for metadata.

RDF was first standardized by the W3C in 1999 and revised in 2004. It extends the linking structure of the Web to use URIs to name the relationship between things as well as the two ends of the link (this is usually referred to as a "triple").

A triple is the atomic data entity in RDF. It is a set of three entities that codifies a statement about semantic data.

For example: (John)-(born-in)->(1945), in which John is the subject, born-in (the relationship) is the predicate and 1945 is the object.

This linking structure forms a directed, labeled graph, where the edges represent the named link between two resources, represented by the graph nodes.

In RDF, each node has its own URI: A Uniform Resource Identifier (URI) is a unique sequence of characters that identifies a logical or physical resource.
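To make this concrete, here is a minimal sketch of the (John)-(born-in)->(1945) triple using Python and the rdflib library; the http://example.org/ namespace and the resource names are made up for illustration:

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import XSD

    # Hypothetical namespace used only for this example
    EX = Namespace("http://example.org/")

    g = Graph()
    # Subject (John) and predicate (bornIn) are URIs; the object is a typed literal
    g.add((EX.John, EX.bornIn, Literal(1945, datatype=XSD.integer)))

    # Serialize the single triple as Turtle to see the subject-predicate-object form
    print(g.serialize(format="turtle"))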

RDF is the way to go if you want to follow all the standards that the W3C recommends for integrations. RDF is the model itself, and it is typically used along with XML, RDFS, OWL, SHACL, SPARQL, and SWRL.

The most used RDF graph database platforms are Stardog, Ontotext GraphDB, Amazon Neptune, and RDF4J, among others.

Elements

Now let's describe the different elements that exist in an RDF model, together with their equivalences in LPG.

Here we have an example model to go through each concept.

Object properties

Object properties represent relationships between two different individuals that belong to specific classes.

Convention

The most used convention is to name the object properties with camel case, starting with a lowercase letter. For example: isLocatedIn, hasMenus, isSpecializedIn, etc.

Another common option is to use snake case with all letters in lower case. For example: is_located_in, has_menus, etc.

Equivalence

RDF object properties are equivalent to labeled property graph relationships.

Data Properties

Data properties represent relationships between an individual and a specific primitive type; for example, an individual of the Location class has a city, an address, etc.

Convention

The most used convention, same as with object properties, is to name the data properties with camel case, starting with a lower-case letter. Example: city, address, startDate, endDate, etc.

Equivalence

RDF data properties are equivalent to labeled property graph properties.

Classes

Classes are like a classification of a node. You can add multiple Classes to a Node. It could be seen as the “Class” or “Type” of the node. In our example, we have several classes such as Location, Dish, Food, etc.

Convention

The best convention is to use camel case, starting with an upper-case letter. For example: Restaurant, Menu, Food, etc.

Equivalence

RDF classes are equivalent to labeled property graph labels.

Individuals

An individual represents a specific instance of a certain class; ergo, individuals should represent actual or virtual objects in the domain.

For example: Food is a class but “Caesar Salad” refers to a specific individual or instance of the class Food which has a specific description, ingredients, and price.

Convention

The best convention is to use camel case, starting with an upper-case letter, and appending an id for instances of the primary class. For example: Caesar-Salad, Customer-456, SalesPerson-12, etc.

Equivalence

RDF individuals are equivalent to labeled property graph nodes/vertices.
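Putting the elements above together, here is a minimal sketch in Python with rdflib, assuming the restaurant vocabulary used as the example model in this article (Food, Restaurant, Location, isLocatedIn, price); the ex: namespace and the concrete individuals are illustrative only:

    from rdflib import Graph

    turtle_data = """
    @prefix ex:  <http://example.org/restaurant#> .
    @prefix owl: <http://www.w3.org/2002/07/owl#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

    ex:Food       a owl:Class .              # class (label in LPG)
    ex:Restaurant a owl:Class .
    ex:Location   a owl:Class .

    ex:isLocatedIn a owl:ObjectProperty .    # object property (relationship in LPG)
    ex:price       a owl:DatatypeProperty .  # data property (property in LPG)

    ex:Caesar-Salad a ex:Food ;              # individual (node in LPG)
        ex:price "9.50"^^xsd:decimal .

    ex:JoesPlace a ex:Restaurant ;
        ex:isLocatedIn ex:Downtown .
    ex:Downtown a ex:Location .
    """

    g = Graph()
    g.parse(data=turtle_data, format="turtle")
    print(len(g), "triples loaded")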

Restrictions/Constraints

These represent validations, restrictions, and constraints that all the data inside the graph must follow. There are many different alternatives that can be used.

Some of them are:

  • OWL class restrictions: A restriction describes a class of individuals based on the relationships that members of the class participate in. For example, we could validate that an individual of the class Food has at least one price, one description, and one set of ingredients. In OWL, there are three main types of restrictions that can be placed on classes: quantifier restrictions, cardinality restrictions, and hasValue restrictions. Some examples can be found at https://www.w3.org/TR/owl-ref/#Restriction
  • SHACL shapes: Shapes Constraint Language (SHACL) is a World Wide Web Consortium (W3C) standard language for describing and validating RDF graphs. SHACL has been designed to enhance the semantic and technical interoperability layers of ontologies expressed as RDF graphs. Everything you can do with class restrictions is also available with SHACL, but in an easier and more semantic way (a minimal shape is sketched right after this list). Another interesting feature of SHACL is the capability to design inference rules that can be triggered to expose more implicit knowledge from the graph. References can be found at https://www.w3.org/TR/shacl/ and https://www.w3.org/TR/shacl-af/, and a detailed introduction to SHACL from TopQuadrant is available at https://www.youtube.com/watch?v=_i3zTeMyRzU
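As a minimal sketch of the SHACL approach, the following Python example uses rdflib together with the pySHACL library (an assumption on my side; any SHACL processor would do) to check that every Food individual has at least one price and one description:

    from rdflib import Graph
    from pyshacl import validate

    shapes_ttl = """
    @prefix ex: <http://example.org/restaurant#> .
    @prefix sh: <http://www.w3.org/ns/shacl#> .

    ex:FoodShape a sh:NodeShape ;
        sh:targetClass ex:Food ;
        sh:property [ sh:path ex:price ;       sh:minCount 1 ] ;
        sh:property [ sh:path ex:description ; sh:minCount 1 ] .
    """

    data_ttl = """
    @prefix ex: <http://example.org/restaurant#> .
    ex:Caesar-Salad a ex:Food ;
        ex:price "9.50" .          # the description is missing on purpose
    """

    shapes = Graph().parse(data=shapes_ttl, format="turtle")
    data = Graph().parse(data=data_ttl, format="turtle")

    conforms, _report_graph, report_text = validate(data, shacl_graph=shapes)
    print(conforms)        # False: the missing description is reported as a violation
    print(report_text)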

Languages

Now that we understand the different elements, let's explore the different languages used to model, ingest, query, and validate data in an RDF graph.

Model: To design a data model

Starting with RDF, there are a lot of W3C standard languages to model RDF graphs. Several elements that we defined in the previous section are part of the entire language ecosystem that lives alongside RDF. Some of those languages are RDFS, OWL/OWL2, SWRL, and SHACL.

Stanford University has a very well-known "Ontology Development 101" guide that can be found at https://protege.stanford.edu/publications/ontology_development/ontology101.pdf

Ingest: To insert/delete/update data in the graph

There are several ways to ingest data, and most of them depend on the database management system that you are using to store your graph. Nevertheless, here is a list of a few approaches that are universal and can be used in almost all of the platforms out there:

  • R2RML – RDB to RDF Mapping Language: a language for expressing customized mappings from relational databases to RDF datasets. It is a good way to ingest data coming from relational databases directly to your graph. More info at https://www.w3.org/TR/r2rml/
  • RDF files: Typically, every single graph database allows you to ingest data in your graph using RDF files that contain the triples you want to import.
  • SPARQL 1.1 Update: the standard way to insert, delete, and update triples in an RDF graph using SPARQL. An overview of the SPARQL 1.1 family of specifications can be found at https://www.w3.org/TR/sparql11-overview/ (a minimal ingestion sketch follows this list).
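As a database-agnostic sketch of the last two options, the following Python example loads triples from an RDF (Turtle) file and then applies a SPARQL 1.1 Update, both with rdflib; the file name and the ex: terms are hypothetical, and a real triple store would expose equivalent bulk-load and SPARQL Update endpoints:

    from rdflib import Graph

    g = Graph()
    # Ingest triples from an RDF file (a local Turtle file; the name is illustrative)
    g.parse("restaurants.ttl", format="turtle")

    # Insert new triples with a SPARQL 1.1 Update
    g.update("""
        PREFIX ex: <http://example.org/restaurant#>
        INSERT DATA {
            ex:Greek-Salad a ex:Food ;
                           ex:price "8.00" .
        }
    """)

    print(len(g), "triples after the update")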

A good piece of advice: try to avoid proprietary solutions that do not comply with the standards unless you believe they will give you a clear benefit, such as better performance.

Query: To query data from the graph

As was indicated in the last section, SPARQL is the standard query language for RDF. If you are familiar with SQL, you will find that SPARQL is very similar but also more powerful, since it is more semantic.

One way to go: SPARQL
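As a minimal sketch, here is a SPARQL SELECT run with Python and rdflib against a local graph; the file name and the ex: vocabulary are the hypothetical ones used in the previous examples:

    from rdflib import Graph

    g = Graph()
    g.parse("restaurants.ttl", format="turtle")

    results = g.query("""
        PREFIX ex: <http://example.org/restaurant#>
        SELECT ?food ?price
        WHERE {
            ?food a ex:Food ;
                  ex:price ?price .
        }
        ORDER BY ?price
    """)

    for row in results:
        print(row.food, row.price)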

Validate: To validate information inside the graph

Typically, IDEs and platforms support Integrity Constraint Validation (ICV), which validates that the information inside a graph complies with all the restrictions and validations defined in the model.

As was indicated in the restrictions section, those validations/restrictions/constraints can be written using standard languages such as OWL, RDF, RDFS and, the most recommended way, SHACL.

For example, the TopQuadrant IDE automatically validates that the information you are adding to the graph complies with all the restrictions and SHACL validations, showing violations and warnings, and it can also be extended with custom extensions.

Labeled Property Graph

In LPG-style databases, the graph is composed of nodes and relationships. Each node or relationship has an ID, one or more "labels" describing its type or class, and a set of "properties": key-value pairs that hold its data and allow referencing it.

Intuitively, a relationship always joins two nodes, and these connections form the larger structure of the graph.

There is no standardized query language for all LPG-style databases, but Cypher is the most widely adopted one.

LPG is the recommended approach when you want to query and visualize data in a more performant way.

The most used LPG graph database platforms are Neo4j and Amazon Neptune, among others.

Elements

Now let's analyze the different elements that exist in an LPG graph, as we did with RDF.

Labels

Labels are like a classification of a node. You can add multiple labels to a node. They can be seen as the "Class" or "Type" of the node.

Convention

The best convention is to use camel case, starting with an upper-case letter. In our example, Post, Forum, and Person are examples of labels.

Equivalence

Labels in labeled property graphs are equivalent to Classes in RDF.

Relationships/Edges

Relationships are edges that connect two different nodes.

Convention

Following the neo4j tutorials, the best convention is to use snake case with all letters in upper case. Some examples are: IS_PART_OF, HAS_INTEREST, etc.

Equivalence

Relationships/edges in labeled property graphs are equivalent to Object Properties in RDF.

Properties

Properties represent relationships between an individual and a specific primitive type; for example, a node with the Person label has a name (string type) and an age (integer type). Properties exist for both nodes and relationships. In the following example, the relationship IS_FRIENDS_WITH has a "since" (date type) property.

Convention

Following the neo4j tutorials, the best convention is to use camel case, starting with a lower-case letter. For example: flightNumber, airline, code, etc.

Equivalence

Properties in labeled property graphs are equivalent to Data Type Properties in RDF.
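To make these elements concrete, here is a minimal sketch using the official neo4j Python driver: two nodes with the Person label joined by an IS_FRIENDS_WITH relationship that carries a since property. The connection URI, credentials, and names are placeholders:

    from neo4j import GraphDatabase

    # Placeholder connection details
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    with driver.session() as session:
        session.run(
            """
            MERGE (a:Person {name: $a})            // label + property on a node
            MERGE (b:Person {name: $b})
            MERGE (a)-[r:IS_FRIENDS_WITH]->(b)     // relationship/edge
            SET r.since = date($since)             // property on the relationship
            """,
            a="Ana", b="Bob", since="2020-05-01",
        )

    driver.close()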

Languages

Since there is no standard way to go with LPG, it is important to mention that the most used languages are Cypher (neo4j) and Gremlin.

In this section, to simplify, we are going to show some examples that refer to neo4j and Cypher.

Having said that, let’s explore the different languages used to model, ingest, query and validate data in neo4j.

Model: To design a data model

There is no language to design a model in an abstract way, since LPGs are schema-less. The way to go is to use a graphical tool like Lucidchart, Enterprise Architect, etc.

You may think that not having a language to model the data schema is a missing feature, but it is actually a cool thing: it makes an LPG modeling exercise flexible, visually driven, and natural to the way people think.

A good alternative is to model the graph using RDF, with, for example, Protégé, and visualize the model using another tool like WebVOWL.

Ingest: To insert/delete/update data in the graph

For RDF, I said that there are several ways to ingest data and that most of them depend on the database management system you are using to store your graph. Since we are going to talk about specific ways to ingest data in neo4j, we can be more precise.

Here is a non-exhaustive list of different possibilities that you can use to start ingesting data into your graph (once you have designed your model):

  • Plain clauses using Cypher: It is always possible to create new nodes and relationships in the graph using the Cypher language[1].
    • For example: CREATE (node:Person {name: "John Snow"})

Keep in mind that those queries could be executed using the neo4j browser[2] or the neo4j API. It is the most flexible way, but you must write your queries manually.

  • Import CSV: There are a few different approaches to get CSV data into Neo4j, each with varying criteria and functionality. The option you choose will depend on the data set size, as well as your degree of comfort with various tools.
    • LOAD CSV Cypher command: this command is a great starting point and handles small- to medium-sized data sets (up to 10 million records). It works with any setup, including AuraDB (a minimal example is sketched below).
    • neo4j-admin bulk import tool: command line tool useful for straightforward loading of large data sets. Works with Neo4j Desktop, Neo4j EE Docker image and local installations.
    • Kettle import tool: maps and executes steps for the data process flow and works well for very large data sets, especially if developers are already familiar with using this tool. Works with any setup, including AuraDB.

A detailed tutorial for each approach can be found at: https://neo4j.com/developer/guide-import-csv/
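Here is the minimal LOAD CSV sketch referenced above; the query can be pasted into the Neo4j Browser or, as here, run through the Python driver. The file name, columns, and credentials are illustrative:

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    with driver.session() as session:
        # people.csv is a hypothetical file placed in the Neo4j import directory
        session.run(
            """
            LOAD CSV WITH HEADERS FROM 'file:///people.csv' AS row
            MERGE (p:Person {id: row.id})
            SET p.name = row.name
            """
        )

    driver.close()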

  • Import API using the APOC library: Neo4j has a library called APOC, "Awesome Procedures On Cypher". It was created as an extension library to provide common procedures and functions to developers. This library is especially helpful for complex transformations and data manipulations. One of those helpers is the LOAD JSON procedure (apoc.load.json).

A complete guide with examples could be found at https://neo4j.com/developer/guide-import-json-rest-api/
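As a minimal sketch of the LOAD JSON helper (it assumes the APOC plugin is installed; the REST URL and the shape of the returned JSON are made up):

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    with driver.session() as session:
        session.run(
            """
            CALL apoc.load.json($url) YIELD value    // fetch and parse the JSON
            UNWIND value.items AS item               // 'items' is a made-up field
            MERGE (p:Post {id: item.id})
            SET p.title = item.title
            """,
            url="https://example.org/api/posts.json",
        )

    driver.close()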

  • Programmatically: One of the most recommended ways. It gives you the ability to retrieve data from a relational database (or other tabular structure) and use the Bolt protocol to write it to Neo4j through one of the drivers for your programming language of choice.

A tip here: Spring Data Neo4j for Spring Boot with Java is one of the most complete libraries to be used against a neo4j database server.
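As a language-agnostic sketch of the programmatic approach (using the official Python driver over Bolt rather than Spring Data Neo4j; the rows and credentials are illustrative), a batched write with UNWIND looks like this:

    from neo4j import GraphDatabase

    # Rows as they could come from a relational database or any tabular source
    rows = [
        {"id": 1, "name": "Caesar Salad", "price": 9.5},
        {"id": 2, "name": "Greek Salad",  "price": 8.0},
    ]

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    with driver.session() as session:
        session.run(
            """
            UNWIND $rows AS row
            MERGE (d:Dish {id: row.id})
            SET d.name = row.name, d.price = row.price
            """,
            rows=rows,
        )

    driver.close()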

Query: To query data from the graph

In neo4j, the Cypher language provides several clauses to query nodes from the database. In this graphical example, we are querying for all the people that are loved by "Dan".

As you can see, we are using a MATCH clause to filter (node with Person label and name property with "Dan" value)-(LOVES relationship)-(other node – whom). Remember that triples are always there: (subject)-(predicate)-(object).
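A minimal sketch of that query, run through the Python driver (the connection details are placeholders; the Person label and LOVES relationship follow the example in the text):

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    with driver.session() as session:
        result = session.run(
            """
            MATCH (dan:Person {name: $name})-[:LOVES]->(whom)
            RETURN whom.name AS name
            """,
            name="Dan",
        )
        for record in result:
            print(record["name"])

    driver.close()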

Apart from that, there are other ways to avoid using Cypher directly.

For example, Spring Data Neo4j (https://spring.io/projects/spring-data-neo4j) is a library for the Spring Boot framework (https://spring.io/projects/spring-boot/) that allows you to create Java classes that match your model. You can create repositories for them to access the data without having to write the most common queries manually (find a node by id, find all the nodes that share a specific label, etc.).

Validate: To validate information inside the graph

In neo4j there are two ways to accomplish validation: Cypher constraints and SHACL validation with Neosemantics.

Cypher constraints

The most recommended way to ensure data integrity in neo4j is to use Cypher constraints.

The different types of constraints that you can use are:

  • Unique node property constraints
  • Node property existence constraints
  • Relationship property existence constraints
  • Node key constraints

To learn more about neo4j constraints, go to https://neo4j.com/docs/cypher-manual/current/constraints/
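As a minimal sketch, here is a unique node property constraint created through the Python driver; the FOR ... REQUIRE syntax applies to recent Neo4j versions (older ones use CREATE CONSTRAINT ON ... ASSERT ...), and the names are illustrative:

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    with driver.session() as session:
        # Guarantees that no two Person nodes share the same name
        session.run(
            """
            CREATE CONSTRAINT person_name_unique IF NOT EXISTS
            FOR (p:Person) REQUIRE p.name IS UNIQUE
            """
        )

    driver.close()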

SHACL validation with Neosemantics

You may be thinking: Wait, but SHACL is a language used for RDF graphs, right?

The answer is YES, but neo4j has a very cool addon called neosemantics.

Neosemantics (n10s) is a plugin that enables the use of RDF and its associated vocabularies (OWL, RDFS, SKOS, and others) in Neo4j.

You can use n10s to build integrations with RDF-generating / RDF-consuming components. You can also use it to validate your graph against constraints expressed in SHACL or to run basic inferencing.

To learn more about SHACL validation with neosemantics, go to: https://neo4j.com/labs/neosemantics/4.3/validation/
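As a minimal sketch (procedure names as documented for n10s 4.x at the link above; the shape and connection details are illustrative, and the plugin must be installed first), loading a SHACL shape and validating the graph could look like this:

    from neo4j import GraphDatabase

    # A shape requiring every Person node to have a name. neo4j://graph.schema# is
    # assumed to be the default namespace n10s uses for plain labels and properties.
    shapes_ttl = """
    @prefix sh:  <http://www.w3.org/ns/shacl#> .
    @prefix neo: <neo4j://graph.schema#> .

    neo:PersonShape a sh:NodeShape ;
        sh:targetClass neo:Person ;
        sh:property [ sh:path neo:name ; sh:minCount 1 ] .
    """

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    with driver.session() as session:
        session.run("CALL n10s.validation.shacl.import.inline($ttl, 'Turtle')", ttl=shapes_ttl)
        for record in session.run("CALL n10s.validation.shacl.validate()"):
            print(dict(record))    # one row per violation found

    driver.close()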

Design Concerns

Let's now explore some important concepts that apply to graph DB design in general.

To start, I would like to share this very interesting quote:

“Data models exist to facilitate answering questions from the databases — they are not for creating pristine semantic models of domains”.

Taking that into consideration, a typical error here would be trying to create a great semantic model of a domain instead of focusing on which questions you need the database to answer.

For example, if you can't figure out which query will use the label/class you have in mind, don't use it. You can always apply more labels/classes later.

That is highly related to the YAGNI principle ("You aren't gonna need it"). YAGNI is a principle that arose from extreme programming (XP) and states that a programmer should not add functionality until it is deemed necessary.

Having made that introduction, let’s discuss two general tips related to graph design.

General Tips

Write Your Queries First

Knowing the kinds of questions and queries you want to ask of your data is a great way to determine the structure of your data model.

Understanding the intention of the system or application you are building and then constructing the model around the business need will help you organize it in a more accurate way.

Prioritize Queries

While you may improve certain things, there is no one-size-fits-all solution. Instead, you should determine which model best suits your needs. You may not be able to max out performance on every individual query, but you may be able to get the most out of your system given your resources, time, and code.

Further reading

To finish this section, I would like to list a few links and blog entries that I recommend reading:

Conclusion

All platforms want to achieve compatibility with RDF, Cypher, and Gremlin. For example, Amazon Neptune started to support openCypher in July 2021, while it already had compatibility with Gremlin and RDF. Another good example is Neo4j Neosemantics, a plugin that enables the use of RDF and its associated vocabularies (OWL, RDFS, SKOS, and others) in Neo4j.

In addition, in a future where the industry will want to connect all providers and clients in the supply chain (Industry 4.0) RDF seems to be the way to go, because it has more expressivity and it’s designed to connect data.

On the other hand, neo4j and Cypher seem to be a very good starting point for working with graphs because they are easier to use and more widely adopted.

It is more difficult for a web developer to learn the entire world of RDF, its boundaries and features, pros and cons, than to learn Cypher and neo4j. Also, Gremlin (which we didn't analyze in this blog entry) is more linked to big data graph platforms and solutions, to their central ecosystem, and to its most used applications, which are more complex both by definition and as a consequence.

There are more developers in the market who know and use neo4j/Cypher than any other language in the graph ecosystem; a basic search on LinkedIn jobs can give that answer.

Taking that into consideration, my opinion is that if you want to achieve interoperability and reduce the effort of switching between different graph DB providers, you should go with RDF.

On the other hand, if you want to start quickly, keep things simpler, and focus on performance, you can always start with LPG graphs in neo4j and then evaluate the possibility of switching to RDF later.

Whatever you choose, keep researching the advances across the whole graph ecosystem to keep track of the different approaches and requirements that are more linked to specific solutions and platforms.


[1] https://neo4j.com/developer/cypher/

[2] https://neo4j.com/developer/neo4j-browser/