Loading Cosmos DB Graph Data Using Jupyter Notebooks

David
4 min read · Apr 20, 2021

In this article I’m going to share a simple Jupyter notebook that can be used to easily load data into a Cosmos DB Graph database.

I’ve been doing a lot of work with Cosmos DB’s Graph API recently. It’s a graph engine that uses Cosmos DB’s scalable backend to provide a fully managed graph database that can be queried with the open-source Gremlin API. Gremlin is the graph query language of the Apache TinkerPop graph database project (see Apache TinkerPop).
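If you haven’t come across Gremlin before, a traversal reads as a chain of steps over vertices and edges. For example, against the TinkerPop “modern” toy graph (the same dataset I load later in this article), this query returns the names of the people marko knows:

g.V().has('name','marko').out('knows').values('name')

With the data loaded below, that comes back as vadas and josh.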

At the time of writing, Azure Data Factory does not support loading data into the Cosmos DB Graph API, so the only way to do it is to write some client code. (Theoretically you could also enter graph data manually using the Gremlin console in the portal, but that’s not really practical for much more than individual queries.)

There is already a C# bulk importer sample application that you can use (see Use the graph bulk executor .NET library with Azure Cosmos DB Gremlin API), and that’s great, but sometimes you want something a little quicker, simpler and more interactive; a Jupyter notebook, for instance.

Notebooks in Cosmos DB

One of the nice features about Cosmos DB is integrated Jupyter notebook support, which allows you to create and run Jupyter notebooks right in the Data Explorer part of the Azure Portal where you manage your Cosmos database.

Using the Data Explorer interface you can create a new notebook, import existing notebooks, and run them within the portal without having to worry about provisioning a server. There are a number of notebook runtimes to choose from, but I’ll be using the traditional Python runtime.

From within the notebook you can install the libraries you need to connect to various Azure services, including Cosmos of course. This makes it pretty straightforward to load data into the graph from files in Azure Blob Storage or wherever (there’s a rough sketch of that route just below). In my example I simply connect to my graph database and populate it with one of the standard TinkerPop “toy” datasets, the “modern” graph, using the Gremlin API.
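Since I only load the toy data here, the Blob Storage route is worth a quick sketch. Treat it as exactly that: a sketch. The storage connection string, container, blob and column names below are all placeholders, azure-storage-blob would need to be pip-installed in the same way as the other libraries, and the loop leans on the executeGremlinQuery helper that’s defined further down this article.

import io
import pandas as pd
from azure.storage.blob import BlobServiceClient

# Placeholder connection details for a storage account holding a CSV of people
blob_service = BlobServiceClient.from_connection_string("<STORAGE CONNECTION STRING>")
blob_client = blob_service.get_blob_client(container="graph-data", blob="people.csv")

# Read the CSV straight into a DataFrame without touching local disk
df = pd.read_csv(io.BytesIO(blob_client.download_blob().readall()))

# One addV() per row via the executeGremlinQuery helper defined later on.
# If your graph is partitioned, remember to set the partition key property
# (in my example data that would be 'dept').
for row in df.itertuples():
    executeGremlinQuery(
        f"g.addV('person').property(id,'{row.id}')"
        f".property('name','{row.name}')"
        f".property('dept','{row.dept}')"
    )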

If you want to grab a copy of the notebook, it’s in a GitHub repo here: dgpoulet/CosmosGraphLoader: Notebooks for loading graph data into Cosmos DB (github.com)

I’ll break it down section by section. In the first section we install and import the libraries needed to do Gremlin-ey things, as they aren’t available by default. Note: the nest_asyncio library is required to deal with “Event loop is already running”-type errors that can surface.

import sys, traceback

# Install the Gremlin client and supporting libraries into the notebook kernel
!{sys.executable} -m pip install gremlinpython==3.4.10
!{sys.executable} -m pip install futures
!{sys.executable} -m pip install networkx

# nest_asyncio works around "Event loop is already running" errors caused by
# gremlinpython running inside the notebook's existing event loop
import nest_asyncio
nest_asyncio.apply()

print(sys.version)

Next up is the section of code that actually connects to your Cosmos DB graph.

from gremlin_python.driver import client, serializer

client = client.Client(
    '<GREMLIN ENDPOINT URL>', 'g',
    username="/dbs/<DATABASE NAME>/colls/<GRAPH NAME>",
    password="<COSMOS ACCESS KEY>",
    message_serializer=serializer.GraphSONSerializersV2d0()
)

The Gremlin endpoint URL and the access keys for your Graph account are visible in the Settings->Keys section of the Azure Portal management view for your Cosmos DB account. Note also that the database name and graph name are used as the “username” string.
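As a side note, rather than pasting the key straight into the notebook, you could keep the endpoint and key in environment variables and build the client from those. The snippet below is just a sketch: the account, database and graph names are placeholders, and the wss:// endpoint format is the one shown on the Keys blade.

import os
from gremlin_python.driver import client, serializer

# Placeholder account, database and graph names
endpoint = os.environ.get(
    "COSMOS_GREMLIN_ENDPOINT",
    "wss://<ACCOUNT NAME>.gremlin.cosmos.azure.com:443/"
)
gremlin_client = client.Client(
    endpoint, 'g',
    username="/dbs/<DATABASE NAME>/colls/<GRAPH NAME>",
    password=os.environ["COSMOS_PRIMARY_KEY"],
    message_serializer=serializer.GraphSONSerializersV2d0()
)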

The next section provides a small helper function (plus a lookup of friendly error messages) that we’re going to use to issue queries to the database and handle any errors.

from gremlin_python.driver.protocol import GremlinServerError

cosmosdb_messages = {
    409: 'Conflict exception. You\'re probably inserting the same ID again.',
    429: 'Not enough RUs for this query. Try again.'
}

def executeGremlinQuery(gremlinQuery, message=None, params=None):
    try:
        callback = client.submitAsync(gremlinQuery)
        if callback.result() is not None:
            return callback.result().one()
    except GremlinServerError as ex:
        status = ex.status_attributes['x-ms-status-code']
        print('There was an exception: {0}'.format(status))
        print(cosmosdb_messages.get(status, 'Unexpected status code.'))

And finally, the last section is where I’m writing the data to the graph using the Gremlin API, and then counting the edges and vertices just to prove to myself that it has in fact written out some data!

executeGremlinQuery("g.addV('person').property(id,'v1').property('name','marko').property('age','29').property('dept','accounts').addV('person').property(id,'v2').property('name','vadas').property('age','27').property('dept','accounts').addV('software').property(id,'v3').property('name','lop').property('lang','java').property('dept','accounts').addV('person').property(id,'v4').property('name','josh').property('age','32').property('dept','accounts').addV('software').property(id,'v5').property('name','ripple').property('lang','java').property('dept','sales').addV('person').property(id,'v6').property('name','peter').property('age','35').property('dept','sales')")

executeGremlinQuery("g.V('v1').addE('created').to(g.V('v3')).property(id,'e9').property('weight','0.4')")
executeGremlinQuery("g.V('v4').addE('created').to(g.V('v5')).property(id,'e10').property('weight','1.0')")
executeGremlinQuery("g.V('v4').addE('created').to(g.V('v3')).property(id,'e11').property('weight','0.4')")
executeGremlinQuery("g.V('v6').addE('created').to(g.V('v3')).property(id,'e12').property('weight','0.2')")

executeGremlinQuery("g.V('v1').addE('knows').to(g.V('v2')).property(id,'e7').property('weight','0.4')")
executeGremlinQuery("g.V('v1').addE('knows').to(g.V('v4')).property(id,'e8').property('weight','1.0')")


result = executeGremlinQuery("g.V().count()")
print("Count of vertices: {0}".format(result))

result = executeGremlinQuery("g.E().count()")
print("Count of edges: {0}".format(result))

So there we have it. It’s a very simple example of using the notebook functionality to interact with the Cosmos DB graph, and it’s handy for loading smaller amounts of data into a database or for doing exploratory analysis and data wrangling.
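On the exploratory side, the networkx library installed at the top of the notebook comes in handy. As a rough sketch, assuming Cosmos returns each edge as a plain dict with 'outV', 'inV' and 'label' keys (which is what I see with the GraphSON v2 serializer used above), you could pull the edge list back out and wrangle it as a networkx graph:

import networkx as nx

# Pull the edges back out of the graph; with the toy dataset this is one small batch
edges = executeGremlinQuery("g.E()")

# Build a directed networkx graph keyed on the source/target vertex ids
G = nx.DiGraph()
for e in edges:
    G.add_edge(e['outV'], e['inV'], label=e['label'])

print("Vertices: {0}".format(G.number_of_nodes()))
print("Edges: {0}".format(G.number_of_edges()))
print("Out-degree of marko (v1): {0}".format(G.out_degree('v1')))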

It’s not going to perform as well as the C#-based bulk loading client mentioned above in large-scale load scenarios (thousands of vertices/edges and upwards), but for smaller workloads it’s a handy option to have.

