Data Archival (using Containers)

In large enterprise Insights implementations (>35 million nodes), Neo4j query performance degrades significantly, leading to bottlenecks in dashboard loading. This stems from a limitation of Neo4j Community Edition, where a single server stores all the data, so it cannot scale. Key features such as scaling, high availability, and replication are available only in the paid edition.

To overcome this limitation, Insights needs the capability to scale horizontally when storing data. We use a data-split approach to solve the problem. This enables us to ingest much more data at scale and resolve the performance bottleneck.

The image below shows the flow of data in the solution.

Architecture of the solution

The Data Archival module consists of two agents:

  • Neo4jArchival agent: This agent is responsible for backing up data from the Neo4j data source to Elasticsearch.

  • ElasticTransfer agent: This agent is responsible for creating containers from the data that has been backed up in Elasticsearch.

The Neo4jArchival agent is responsible for storing Neo4j data in Elasticsearch. A prerequisite for the agent is that the nodes have an inSightsTime field; in cases where nodes do not have inSightsTime, convert a date field in the node to inSightsTime. The Elasticsearch version needs to be 6.0 or above.
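If a label's nodes carry only a raw date field, it can be copied into inSightsTime up front. Below is a minimal sketch using the neo4j Python driver; the GIT label and the commitTime epoch property are assumptions for illustration, not fields the agent requires.

    from neo4j import GraphDatabase

    # Placeholder connection details for the source Neo4j instance.
    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))

    # Copy an existing epoch date field (commitTime is an assumed example)
    # into inSightsTime wherever inSightsTime is missing.
    query = """
    MATCH (n:DATA:GIT)
    WHERE n.inSightsTime IS NULL AND n.commitTime IS NOT NULL
    SET n.inSightsTime = n.commitTime
    """

    with driver.session() as session:
        session.run(query)
    driver.close()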

The following are the features of the Neo4jArchival agent:

  • All DATA-labeled nodes containing the fields toolName and inSightsTime are collected into an index.

  • All relationships whose start and end nodes contain inSightsTime are collected.

  • Neo4j label = Elasticsearch index name (e.g., git, jira).

  • Neo4j relationship name = Elasticsearch index name (e.g., from_jira_to_git).

  • Elasticsearch ID for nodes = nodeId + inSightsTime.

  • Elasticsearch ID for relationships = relationshipId + source node inSightsTime.

  • Node schema is stored as a separate index in Elasticsearch (e.g., git_neo4j_schema).

  • Relationship schema is stored as a separate index (e.g., from_jira_to_git_neo4j_schema).

  • The agent can be stopped midway; when it restarts, the data transfer resumes.

  • Neo4j data can be deleted after it has been transferred.

The agent is configured with the following parameters (a sample configuration follows this list):

  • archival_enddate – nodes/relationships older than the specified archival_enddate are selected for migration.

  • Timeperiod – nodes/relationships older than Timeperiod (in days) from the current date are selected for migration.

  • neo4j_label – * to select all labels, or a list of only the required labels.

  • Querylimit – number of nodes/relationships migrated at a time.

  • neo4j_data_delete – true to delete Neo4j nodes after migration.

  • neo4j_host_uri – source Neo4j bolt URL.

  • neo4j_user_id – source Neo4j user ID.

  • neo4j_password – source Neo4j password.

  • elasticsearch_hostname_uri – Elasticsearch URL.

  • elasticsearch_username – Elasticsearch username.

  • elasticsearch_passwd – Elasticsearch password.
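As an illustration, the parameters above map to a configuration like the following, shown here as a Python dict. This is a hypothetical fragment; every value is a placeholder, not a default.

    # Hypothetical Neo4jArchival configuration; every value is a placeholder.
    neo4j_archival_config = {
        "archival_enddate": "2021-01-31",    # migrate data older than this date
        "Timeperiod": 90,                    # or: data older than 90 days from today
        "neo4j_label": "*",                  # all labels, or e.g. "GIT,JIRA"
        "Querylimit": 5000,                  # nodes/relationships per batch
        "neo4j_data_delete": True,           # delete Neo4j data after migration
        "neo4j_host_uri": "bolt://localhost:7687",
        "neo4j_user_id": "neo4j",
        "neo4j_password": "********",
        "elasticsearch_hostname_uri": "http://localhost:9200",
        "elasticsearch_username": "elastic",
        "elasticsearch_passwd": "********",
    }

    # Resulting document IDs follow the scheme described above, e.g.:
    #   node document id         = str(node_id) + str(insights_time)
    #   relationship document id = str(rel_id) + str(source_insights_time)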

 

The ElasticTransfer agent is responsible for creating containers from the data that has been backed up in Elasticsearch. A prerequisite of the agent is that the image used to create the container is present on the Docker server.
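A minimal sketch of checking this prerequisite with the docker Python SDK; the endpoint, image name, and credentials are placeholders that correspond to the dockerHost, dockerPort, dockerImageName, dockerImageTag, and docker_repo_* parameters listed below.

    import docker
    from docker.errors import ImageNotFound

    # Placeholder Docker API endpoint (dockerHost:dockerPort).
    client = docker.DockerClient(base_url="tcp://10.0.0.5:2375")

    image = "neo4j:3.5"  # placeholder dockerImageName:dockerImageTag
    try:
        client.images.get(image)
        print(f"{image} is already present on the Docker server")
    except ImageNotFound:
        # Pull the image using the repository credentials (placeholders).
        client.images.pull(image, auth_config={"username": "docker_repo_username",
                                               "password": "docker_repo_passwd"})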

The following are the features of the ElasticTransfer agent:

  • The agent receives the time range for which the container needs to be created on its subscribed queue, DataArchivalQueue.

  • It extracts the data for that time range from Elasticsearch and creates CSV files from it, which are stored in the %INSIGHTS_HOME%/es_importcsv folder.

  • A container is created using the CSV files, and the labels and relationships are made available in the container's Neo4j database.

The agent is configured with the following parameters:

  • dockerHost – Docker machine's IP.

  • dockerPort – Docker API port running on dockerHost (usually 2375).

  • dockerImageName – Docker image name used for Neo4j container creation.

  • docker_csv_path – shared path on the Docker machine.

  • dockerImageTag – Docker image version/tag.

  • docker_repo_username – Docker repository username used to pull the image on dockerHost.

  • docker_repo_passwd – Docker repository password used to pull the image on dockerHost.

  • neo4j_user_id – Neo4j user ID used to run queries in the container's Neo4j and import the CSV files.

  • neo4j_password – Neo4j password used to run queries in the container's Neo4j and import the CSV files.

  • bindPort – ports used inside each Neo4j container.

  • hostPort – list of ports exposed on the dockerHost machine, one per Neo4j container (currently at most 5 Neo4j containers).

  • hostVolume – list of Neo4j volume name prefixes used to create multiple volumes on the dockerHost machine for each Neo4j container; currently at most 5 (containers) × 4 (volumes) = 20 volumes.

  • mountVolume – list of mount locations inside each Neo4j container for hostVolume.

  • archivalName – name provided in the Data Archival UI at creation.

  • elasticsearch_hostname_uri – source Elasticsearch URL.

  • elasticsearch_username – username of the source Elasticsearch.

  • elasticsearch_passwd – password of the source Elasticsearch.

  • es_indexes – "*":"*" to fetch all indices.

  • fetch_all_data – 1 to create the Neo4j container using all data available in Elasticsearch; 0 to create the container only for the date range specified in the UI.

  • hostAddress – Docker machine's host address used to bind the container ports.

  • no_of_processes – number of Python processes (multiprocessing) used to download nodes/relationships as CSV from Elasticsearch in parallel (usually 2 per physical core). A sketch of this parallel export appears after this list.
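To illustrate how no_of_processes is used, here is a minimal sketch of exporting Elasticsearch indices to CSV files in parallel with Python multiprocessing. The index names, URL, and output layout are assumptions for illustration, not the agent's actual code.

    import csv
    import os
    from multiprocessing import Pool

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import scan

    CSV_DIR = os.path.join(os.environ.get("INSIGHTS_HOME", "."), "es_importcsv")

    def export_index(index):
        """Stream one Elasticsearch index and write it to a CSV file."""
        es = Elasticsearch(["http://localhost:9200"])  # placeholder URL
        out_path = os.path.join(CSV_DIR, f"{index}.csv")
        with open(out_path, "w", newline="") as f:
            writer = None
            for hit in scan(es, index=index, query={"query": {"match_all": {}}}):
                doc = hit["_source"]
                if writer is None:
                    # Take the CSV header from the first document's fields;
                    # later fields not in the header are skipped.
                    writer = csv.DictWriter(f, fieldnames=sorted(doc),
                                            extrasaction="ignore")
                    writer.writeheader()
                writer.writerow(doc)
        return out_path

    if __name__ == "__main__":
        os.makedirs(CSV_DIR, exist_ok=True)
        indexes = ["git", "jira", "from_jira_to_git"]  # assumed index names
        with Pool(processes=4) as pool:                # no_of_processes
            print(pool.map(export_index, indexes))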

To create a Data Archival record, follow these steps:

  1. An ElasticTransfer agent must already be configured in the system; it is a prerequisite for Data Archival record creation. Go to the Data Archival page and click the add button in the top-right corner.

  2. Enter the following details in the Add screen.

    • Archival Name: Name of the archival record.

    • Records from: Start date from which the records are fetched into the container.

    • Records to: End date up to which the records are fetched into the container.

    • Days to Retain: Number of days the container remains active.

    Click the save button in the top-right corner to save the record.

  3. A Data Archival record is created with the state ‘INPROGRESS’, indicating that container creation is underway.

     

  4. After the container is created, the Data Source URL column contains the URL of the container, and the boltport column contains the bolt port for the Neo4j container.

     

    The URL is clickable and opens the Neo4j browser for the container. You will need to provide the bolt port when logging in to the Neo4j browser; a programmatic alternative is sketched after these steps.

  5. After the Data Archival record reaches its expiration date, the container is terminated and the record on the Data Archival page shows ‘TERMINATED’ status.
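Besides the Neo4j browser, an archived container can be queried programmatically over bolt. Here is a minimal sketch with the neo4j Python driver; the host, bolt port, and credentials are placeholders taken from the Data Archival record and the agent configuration.

    from neo4j import GraphDatabase

    # Placeholder Data Source URL host and bolt port from the record.
    driver = GraphDatabase.driver("bolt://10.0.0.5:7601",
                                  auth=("neo4j", "password"))

    with driver.session() as session:
        result = session.run("MATCH (n) RETURN count(n) AS total")
        print(result.single()["total"])
    driver.close()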


©2021 Cognizant, all rights reserved. US Patent 10,410,152