Data Archival (using Containers)

In large enterprise Insights implementations (>35 million nodes), Neo4j query performance degrades significantly, leading to bottlenecks in dashboard loading. This stems from a limitation of Neo4j Community Edition, where a single server stores all the data, so it cannot scale. Key features such as scaling, high availability, and replication are available only in the paid edition.

To overcome this limitation, Insights needs the capability to scale horizontally when storing data. We use a data-split approach to solve the problem. This enables us to ingest much more data at scale and resolve the performance bottleneck.

The image below shows the flow of data in the solution.

Architecture of the solution

The Data Archival module consists of two agents:

  • Neo4jArchival agent: This agent is responsible for backing up data from the Neo4j data source to Elasticsearch.

  • ElasticTransfer agent: This agent is responsible for creating containers from the data that has been backed up in Elasticsearch.

The Neo4jArchival agent is responsible for storing Neo4j data in Elasticsearch. A prerequisite for the agent is that the nodes have an inSightsTime field; in cases where nodes do not have inSightsTime, convert a date field in the node to inSightsTime. The Elasticsearch version needs to be 6.0 or above.
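If a label's nodes carry only a raw date field, it can be copied into inSightsTime up front. Below is a minimal sketch using the neo4j Python driver; the GIT label and the commitTime epoch property are assumptions for illustration, not fields the agent requires.

    from neo4j import GraphDatabase

    # Placeholder connection details for the source Neo4j instance.
    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))

    # Copy an existing epoch date field (commitTime is an assumed example)
    # into inSightsTime wherever inSightsTime is missing.
    query = """
    MATCH (n:DATA:GIT)
    WHERE n.inSightsTime IS NULL AND n.commitTime IS NOT NULL
    SET n.inSightsTime = n.commitTime
    """

    with driver.session() as session:
        session.run(query)
    driver.close()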

The following are the features of the Neo4jArchival agent:

  • All DATA-labeled nodes containing the fields toolName and inSightsTime are collected into an index.

  • All relationships whose start and end nodes contain inSightsTime are collected.

  • Neo4j label = Elasticsearch index name (e.g., git, jira).

  • Neo4j relationship name = Elasticsearch index name (e.g., from_jira_to_git).

  • Elasticsearch ID for nodes = nodeId + inSightsTime.

  • Elasticsearch ID for relationships = relationshipId + source node inSightsTime.

  • Node schema is stored as a separate index in Elasticsearch (e.g., git_neo4j_schema).

  • Relationship schema is stored as a separate index (e.g., from_jira_to_git_neo4j_schema).

  • The agent can be stopped midway; when it restarts, the data transfer resumes.

  • Neo4j data can be deleted after it has been transferred.

The agent is configured with the following parameters (a sample configuration follows this list):

  • archival_enddate – nodes/relationships older than the specified archival_enddate are selected for migration.

  • Timeperiod – nodes/relationships older than Timeperiod (in days) from the current date are selected for migration.

  • neo4j_label – * to select all labels, or a list of only the required labels.

  • Querylimit – number of nodes/relationships migrated at a time.

  • neo4j_data_delete – true to delete Neo4j nodes after migration.

  • neo4j_host_uri – source Neo4j bolt URL.

  • neo4j_user_id – source Neo4j user ID.

  • neo4j_password – source Neo4j password.

  • elasticsearch_hostname_uri – Elasticsearch URL.

  • elasticsearch_username – Elasticsearch username.

  • elasticsearch_passwd – Elasticsearch password.
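As an illustration, the parameters above map to a configuration like the following, shown here as a Python dict. This is a hypothetical fragment; every value is a placeholder, not a default.

    # Hypothetical Neo4jArchival configuration; every value is a placeholder.
    neo4j_archival_config = {
        "archival_enddate": "2021-01-31",    # migrate data older than this date
        "Timeperiod": 90,                    # or: data older than 90 days from today
        "neo4j_label": "*",                  # all labels, or e.g. "GIT,JIRA"
        "Querylimit": 5000,                  # nodes/relationships per batch
        "neo4j_data_delete": True,           # delete Neo4j data after migration
        "neo4j_host_uri": "bolt://localhost:7687",
        "neo4j_user_id": "neo4j",
        "neo4j_password": "********",
        "elasticsearch_hostname_uri": "http://localhost:9200",
        "elasticsearch_username": "elastic",
        "elasticsearch_passwd": "********",
    }

    # Resulting document IDs follow the scheme described above, e.g.:
    #   node document id         = str(node_id) + str(insights_time)
    #   relationship document id = str(rel_id) + str(source_insights_time)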

 

The ElasticTransfer agent is responsible for creating containers from the data that has been backed up in Elasticsearch. A prerequisite of the agent is that the image used to create the container is present on the Docker server.
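A minimal sketch of checking this prerequisite with the docker Python SDK; the endpoint, image name, and credentials are placeholders that correspond to the dockerHost, dockerPort, dockerImageName, dockerImageTag, and docker_repo_* parameters listed below.

    import docker
    from docker.errors import ImageNotFound

    # Placeholder Docker API endpoint (dockerHost:dockerPort).
    client = docker.DockerClient(base_url="tcp://10.0.0.5:2375")

    image = "neo4j:3.5"  # placeholder dockerImageName:dockerImageTag
    try:
        client.images.get(image)
        print(f"{image} is already present on the Docker server")
    except ImageNotFound:
        # Pull the image using the repository credentials (placeholders).
        client.images.pull(image, auth_config={"username": "docker_repo_username",
                                               "password": "docker_repo_passwd"})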

The following are the features of the ElasticTransfer agent:

  • The agent receives the time range for which the container needs to be created on its subscribed queue, DataArchivalQueue.

  • It extracts the data for that time range from Elasticsearch and creates CSV files from it, which are stored in the %INSIGHTS_HOME%/es_importcsv folder.

  • A container is created using the CSV files, and the labels and relationships are made available in the container's Neo4j database.

The agent is configured with the following parameters:

  • dockerHost – Docker machine's IP.

  • dockerPort – Docker API port running on dockerHost (usually 2375).

  • dockerImageName – Docker image name used for Neo4j container creation.

  • docker_csv_path – shared path on the Docker machine.

  • dockerImageTag – Docker image version/tag.

  • docker_repo_username – Docker repository username used to pull the image on dockerHost.

  • docker_repo_passwd – Docker repository password used to pull the image on dockerHost.

  • neo4j_user_id – Neo4j user ID used to run queries in the container's Neo4j and import the CSV files.

  • neo4j_password – Neo4j password used to run queries in the container's Neo4j and import the CSV files.

  • bindPort – ports used inside each Neo4j container.

  • hostPort – list of ports exposed on the dockerHost machine, one per Neo4j container (currently at most 5 Neo4j containers).

  • hostVolume – list of Neo4j volume name prefixes used to create multiple volumes on the dockerHost machine for each Neo4j container; currently at most 5 (containers) × 4 (volumes) = 20 volumes.

  • mountVolume – list of mount locations inside each Neo4j container for hostVolume.

  • archivalName – name provided in the Data Archival UI at creation.

  • elasticsearch_hostname_uri – source Elasticsearch URL.

  • elasticsearch_username – username of the source Elasticsearch.

  • elasticsearch_passwd – password of the source Elasticsearch.

  • es_indexes – "*":"*" to fetch all indices.

  • fetch_all_data – 1 to create the Neo4j container using all data available in Elasticsearch; 0 to create the container only for the date range specified in the UI.

  • hostAddress – Docker machine's host address used to bind the container ports.

  • no_of_processes – number of Python processes (multiprocessing) used to download nodes/relationships as CSV from Elasticsearch in parallel (usually 2 per physical core). A sketch of this parallel export appears after this list.
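To illustrate how no_of_processes is used, here is a minimal sketch of exporting Elasticsearch indices to CSV files in parallel with Python multiprocessing. The index names, URL, and output layout are assumptions for illustration, not the agent's actual code.

    import csv
    import os
    from multiprocessing import Pool

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import scan

    CSV_DIR = os.path.join(os.environ.get("INSIGHTS_HOME", "."), "es_importcsv")

    def export_index(index):
        """Stream one Elasticsearch index and write it to a CSV file."""
        es = Elasticsearch(["http://localhost:9200"])  # placeholder URL
        out_path = os.path.join(CSV_DIR, f"{index}.csv")
        with open(out_path, "w", newline="") as f:
            writer = None
            for hit in scan(es, index=index, query={"query": {"match_all": {}}}):
                doc = hit["_source"]
                if writer is None:
                    # Take the CSV header from the first document's fields;
                    # later fields not in the header are skipped.
                    writer = csv.DictWriter(f, fieldnames=sorted(doc),
                                            extrasaction="ignore")
                    writer.writeheader()
                writer.writerow(doc)
        return out_path

    if __name__ == "__main__":
        os.makedirs(CSV_DIR, exist_ok=True)
        indexes = ["git", "jira", "from_jira_to_git"]  # assumed index names
        with Pool(processes=4) as pool:                # no_of_processes
            print(pool.map(export_index, indexes))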

To create a Data Archival record, follow these steps:

  1. An ElasticTransfer agent must already be configured in the system; it is a prerequisite for Data Archival record creation. Go to the Data Archival page and click the add button in the top-right corner.

  2. Enter the following details in the Add screen.

    • Archival Name: Name of the archival record.

    • Records from: Start date from which the records are fetched into the container.

    • Records to: End date up to which the records are fetched into the container.

    • Days to Retain: Number of days the container remains active.

    Click the save button in the top-right corner to save the record.

  3. A Data Archival record is created with the state ‘INPROGRESS’, indicating that container creation is underway.

     

  4. After the container is created, the Data Source URL column contains the URL of the container, and the boltport column contains the bolt port for the Neo4j container.

     

    The URL is clickable and opens the Neo4j browser for the container. You will need to provide the bolt port when logging in to the Neo4j browser; a programmatic alternative is sketched after these steps.

  5. After the Data Archival record reaches its expiration date, the container is terminated and the record on the Data Archival page shows ‘TERMINATED’ status.
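Besides the Neo4j browser, an archived container can be queried programmatically over bolt. Here is a minimal sketch with the neo4j Python driver; the host, bolt port, and credentials are placeholders taken from the Data Archival record and the agent configuration.

    from neo4j import GraphDatabase

    # Placeholder Data Source URL host and bolt port from the record.
    driver = GraphDatabase.driver("bolt://10.0.0.5:7601",
                                  auth=("neo4j", "password"))

    with driver.session() as session:
        result = session.run("MATCH (n) RETURN count(n) AS total")
        print(result.single()["total"])
    driver.close()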


©2021 Cognizant, all rights reserved. US Patent 10,410,152