Backing up Cosmos DB Graph with Azure Data Factory

David
7 min read · Jan 18, 2022


It’s not a well-known fact that you can interact with a Cosmos DB Graph API account using the Core (SQL) API. In this article, I’ll take you through how to use that fact to copy data to and from a Graph account in Cosmos DB using ADF (Azure Data Factory), opening up the possibility of manual backups, restores, copying from one database to another, and so on.

Cosmos DB Graph

A Quick Aside About Graph JSON

If you query a Cosmos DB Graph API account as if it were a Core API account, you operate on the JSON representation of the graph’s vertices and edges: standard Cosmos DB queries simply return them as JSON objects. For instance, here is a graph vertex in its Cosmos DB JSON representation (the _something fields are Cosmos system-internal fields common to all documents in a database):

{
  "label": "person",
  "id": "v1",
  "name": [
    {
      "id": "5143acb6-e4e0-4a11-889f-e3d3b6be3b67",
      "_value": "marko"
    }
  ],
  "age": [
    {
      "id": "0286d27b-4fcd-4392-a775-2dcb2734dd89",
      "_value": "29"
    }
  ],
  "dept": "accounts",
  "_rid": "aOhfAO7-FW8MAAAAAAAAAA==",
  "_self": "dbs/aOhfAA==/colls/aOhfAO7-FW8=/docs/aOhfAO7-FW8MAAAAAAAAAA==/",
  "_etag": "\"2b00100c-0000-1100-0000-61ad134f0000\"",
  "_attachments": "attachments/",
  "_ts": 1638732623
}
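The property arrays take a little unpacking if you ever process these documents yourself. Here is a minimal helper, just a sketch assuming the vertex shape shown above, that flattens the property arrays of {"id", "_value"} objects into plain values:

```python
def flatten_vertex(doc):
    """Flatten a Cosmos DB graph vertex document into simple key/value properties.

    Graph properties are stored as arrays of {"id": ..., "_value": ...} objects;
    system fields (those starting with "_") and plain fields pass through as-is.
    """
    flat = {}
    for key, value in doc.items():
        if (isinstance(value, list) and value
                and isinstance(value[0], dict) and "_value" in value[0]):
            # Gremlin allows multi-valued properties, so collect every _value.
            values = [entry["_value"] for entry in value]
            flat[key] = values[0] if len(values) == 1 else values
        else:
            flat[key] = value
    return flat
```

For the vertex above, `flatten_vertex(doc)["name"]` gives `"marko"` and `flatten_vertex(doc)["age"]` gives `"29"` (note that the age is stored as a string here).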

The edges are also separate JSON objects — the edges indicate the source and target (or “sink”) vertices so the Gremlin query engine can understand how vertices and edges are related. The following edge object shows that it’s a relationship called “knows” between the vertices V1 and V4:

{
  "label": "knows",
  "id": "e8",
  "weight": "1.0",
  "_sink": "v4",
  "_sinkLabel": "person",
  "_sinkPartition": "accounts",
  "_vertexId": "v1",
  "_vertexLabel": "person",
  "_isEdge": true,
  "dept": "accounts",
  "_rid": "aOhfAO7-FW8BAAAAAAAAAA==",
  "_self": "dbs/aOhfAA==/colls/aOhfAO7-FW8=/docs/aOhfAO7-FW8BAAAAAAAAAA==/",
  "_etag": "\"2b00050c-0000-1100-0000-61ad134f0000\"",
  "_attachments": "attachments/",
  "_ts": 1638732623
}

From this it’s hopefully simple to see that to back up and restore a graph, all we need to do is retrieve the JSON representation of all the vertices and edges in one Cosmos DB graph and, when required, populate another Cosmos DB graph account with the same JSON, using the Core API.
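The same idea works outside ADF too. Here is a hedged sketch using the azure-cosmos Python SDK: the URL, key, and database/container names are placeholders, and the `is_edge` helper relies on the `_isEdge` flag visible in the edge document above.

```python
def is_edge(doc):
    """Edge documents carry "_isEdge": true; vertex documents do not."""
    return doc.get("_isEdge") is True


def dump_graph(url, key, database, container):
    """Pull every vertex and edge document from a graph container via the Core (SQL) API."""
    # Imported lazily so is_edge stays usable without the SDK installed.
    from azure.cosmos import CosmosClient  # pip install azure-cosmos

    client = CosmosClient(url, credential=key)
    cont = client.get_database_client(database).get_container_client(container)
    docs = list(cont.query_items("SELECT * FROM c",
                                 enable_cross_partition_query=True))
    vertices = [d for d in docs if not is_edge(d)]
    edges = [d for d in docs if is_edge(d)]
    return vertices, edges
```

Something like `vertices, edges = dump_graph(url, key, "graphdb", "graphcoll")` then leaves you free to serialise the two lists to JSON however you like.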

How to Access Your Graph Account via the Core API

Every standard Cosmos DB account, whether it’s Graph API, Mongo API, Core API etc., has a connection URL that you use to connect to the database, whether from your application code or from a tool like ADF. In addition to the endpoint URL you will also need the access key for authorisation purposes.

In the Azure Portal view for a Graph Cosmos DB account, navigate to the Keys page in the left-hand toolbar. There you can see two URLs: one for Gremlin applications, and one called .NET SDK URI. The latter is the Core API endpoint for this Graph database, and it’s what we’ll use to set up our ADF connection for the backup, so take a copy of it for the ADF setup.

You will also see the PRIMARY KEY field, which shows the access key for this account; make a note of this too for the ADF setup.

The Core API URL and Access Key can be Seen on the KEYS Page in the Portal

Using ADF to Backup Data in the Graph Database

Now that we understand how to find the Core API connection string, it’s pretty straightforward to imagine how we can use it to persuade ADF to copy the raw data out of the database in JSON format. You just treat the graph database as though it were a regular Core API Cosmos DB database, whether reading or writing data.

There are a couple of gotchas that you need to be aware of though, so we’ll step through the process here. In this example I’m going to copy all the graph data out of a Cosmos DB database into Azure Blob Storage — from there it can be archived, copied to another database or used to restore the same database at a later date.

(PS I’m going to assume you are familiar with the basics of ADF; if not I suggest you follow a simple Copy Data tutorial here: Copy data by using the Azure Copy Data tool — Azure Data Factory | Microsoft Docs)

To back up data in the Graph database it’s a simple case of creating an ADF pipeline and using the Copy Data step. When you first create a Copy pipeline you will be asked to provide a Source and a Destination dataset for the copy process to work on. We will configure the SOURCE to be the Cosmos DB database, and the SINK to be a folder in Blob Storage. The Copy Data step will read all the objects in the given Cosmos container and dump their JSON representation into storage.
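For reference, the Copy activity that this wizard produces is ordinary ADF pipeline JSON under the hood. A trimmed sketch is shown below; the activity and dataset names are illustrative placeholders, not values from this walkthrough:

```json
{
  "name": "CopyGraphToBlob",
  "type": "Copy",
  "inputs": [ { "referenceName": "CosmosGraphDataset", "type": "DatasetReference" } ],
  "outputs": [ { "referenceName": "BlobJsonDataset", "type": "DatasetReference" } ],
  "typeProperties": {
    "source": { "type": "CosmosDbSqlApiSource" },
    "sink": {
      "type": "JsonSink",
      "storeSettings": { "type": "AzureBlobStorageWriteSettings" }
    }
  }
}
```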

When Creating a Copy Pipeline, we Need to Define the Source Dataset

When you create the pipeline, navigate to the Source tab of the Copy Data step (see above) and click New. Here we create the Dataset & Linked Service.

Dataset Creation: 1 — Select Azure Cosmos DB SQL API

In the New Dataset tab select Azure Cosmos DB (SQL API) and click Continue. Note that in ADF each Dataset (Datasets — Azure Data Factory & Azure Synapse | Microsoft Docs) that we interact with has a Linked Service (Linked services — Azure Data Factory & Azure Synapse | Microsoft Docs) underneath it that defines how to get to that data.

In the next tab, we are asked to create a new Linked Service that is the source for the dataset.

Dataset Creation: 2 — Create New Linked Service

In the Set properties tab, select “New” under the Linked Service dropdown. In the tab that opens we are going to configure the connection to Cosmos DB using the values noted from the Keys page in our Cosmos DB management pane in the portal (see above).

Dataset Creation: 3 — Configuring the Linked Service to the Cosmos DB Graph Database

Configuring the Linked Service is the key part of this whole process: it needs to be done right for the ADF connection to work with Graph API databases (see above).

Note that you must select Enter manually in the configuration window under Account selection method. If you try to pick the account from the Azure subscription instead, ADF won’t find it.

When you select Enter manually as seen in the screenshot, you can then enter the URL and Access Key for the account that we saved earlier on. You put the database name you want in the Database name field (note, this is the graph database name, not the account name — a linked service only connects to one database in a given Cosmos DB account).
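In JSON form, the Linked Service you’ve just configured boils down to a connection string assembled from exactly those three values. A sketch with placeholder values (the name is illustrative):

```json
{
  "name": "CosmosGraphLinkedService",
  "properties": {
    "type": "CosmosDb",
    "typeProperties": {
      "connectionString": "AccountEndpoint=https://<account>.documents.azure.com:443/;AccountKey=<access-key>;Database=<database-name>"
    }
  }
}
```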

Dataset Creation: 4 — Select the Collection that contains the Datasets Data

Having set up the Linked Service, you then get taken back to finish the Dataset configuration, which just consists of selecting which collection holds the data you want within the Cosmos DB database (a dataset in ADF is a single collection in Cosmos DB).

Now your Source dataset is configured you can click Preview Data in the main window, and ADF will retrieve the JSON data from your graph database:

ADF Pulling Graph Data in JSON Format

At this point it should be plain sailing. You have a source of JSON data representing the contents of your Cosmos DB Graph database. You can do all the standard things you can do with JSON data in ADF. I’m going to dump it out in compressed format to an Azure Blob Storage folder, but you could write it to another Cosmos DB, write it out to SQL Server, or whatever you need to do.

Storing the Data in Azure Blob Storage

Storing this data in Blob Storage is really straightforward. In the Sink tab, create a new Azure Blob Storage dataset. Make sure the file type is JSON:

Store the Graph Data in JSON Format

This ensures the data is stored exactly as it comes out of the Cosmos database, which makes life easy for later restore to Cosmos DB.

Create the Azure Blob Storage connection as you normally would, then define the path to whatever folder you are going to use to store the JSON that comes out of the database:

Setting up the Path to the Backup JSON File

As you can see, the backed up data goes into a single large JSON file. You can choose to compress this file in the Sink configuration tab of the pipeline.

I’ve called it backupfile.json but you would probably use the expression language to create a dynamic name for the backup file (Expression and functions — Azure Data Factory & Azure Synapse | Microsoft Docs) e.g.

@concat('graphbackup-', formatDateTime(utcnow(), 'dd-MM-yyyy'), '.json')

Backup and Restore

If you run the backup pipeline as configured above it will dump all the graph data in your Cosmos DB graph API collection into a JSON file. This is the backup.

Restoring is as simple as creating another pipeline, exactly as above but with the Source and Sink configurations reversed: the Source is the Blob Storage dataset already defined, and the Sink is the Cosmos DB dataset already defined. This way ADF reads the JSON file and knows how to write it directly into Cosmos DB.
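If you ever restore outside ADF, for example with the Python SDK, a tidy approach is to write vertices before the edges that reference them. A sketch, reusing the _isEdge flag from the edge JSON earlier; `container` is assumed to be an azure-cosmos container client:

```python
def restore_order(docs):
    """Sort backed-up graph documents so vertices come before edges.

    Relies on edge documents carrying "_isEdge": true; Python's sort is
    stable, so documents otherwise keep their original order.
    """
    return sorted(docs, key=lambda d: d.get("_isEdge") is True)


def restore_graph(container, docs):
    """Re-create each vertex/edge document via the Core (SQL) API."""
    for doc in restore_order(docs):
        # System fields like _rid/_self/_etag are regenerated by Cosmos on write.
        clean = {k: v for k, v in doc.items()
                 if k not in ("_rid", "_self", "_etag", "_attachments", "_ts")}
        container.upsert_item(clean)
```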
