RabbitMQ Troubleshooting - Crash due to corrupt queues

If RabbitMQ crashes because of corrupt queues, follow the steps documented below.

 

Unable to access and restart RabbitMQ – Cause and Resolution:

Observation: On one customer implementation box, we observed that data in the Insights agents' data queues was not being consumed by Insights Engine. An attempt to restart Platform Engine failed with an error saying that the box had no disk space left. After Platform Engine was restarted, RabbitMQ became inaccessible from the UI.

Cause:

Because of SSL configuration changes made in Neo4j, InsightsEngine could not push data into the database. As a result, InsightsEngine logged the error continuously to the nohup log file, producing a huge log file that occupied almost all the disk space on the server. With no disk space left, Erlang (erl), the runtime that RabbitMQ runs on, crashed. RabbitMQ therefore stopped abruptly as well, and some of its files were corrupted.
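
A full disk of this kind can usually be confirmed before touching RabbitMQ. The following is a minimal sketch, not the procedure used during the incident: the helper names disk_use_percent and largest_files are hypothetical, and the paths in the usage comment are placeholders to be replaced with the directory where the nohup log actually lives.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not from the original incident) for spotting a full disk.

# Print the usage percentage, without the % sign, of the filesystem holding PATH.
# Usage: disk_use_percent PATH
disk_use_percent() {
    # df -P prints one fixed-format row per filesystem; column 5 is the capacity.
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Print the COUNT largest files under DIR as "size-in-KB<TAB>path", largest first.
# Usage: largest_files DIR [COUNT]
largest_files() {
    local dir="$1" count="${2:-10}"
    # -xdev keeps find on one filesystem; du -k reports each file's size in KB.
    find "$dir" -xdev -type f -exec du -k {} + 2>/dev/null | sort -rn | head -n "$count"
}

# Example calls (placeholder paths, adjust to the affected server):
#   disk_use_percent /
#   largest_files / 5
```

A usage figure close to 100 together with a single very large log file at the top of the list would match the situation described above.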

RabbitMQ returned the following error when accessed from the UI:

Got response code 500 with body {"error":"Internal Server Error","reason":"{error,badarg,\n [{ets,lookup,[rabbit_user,<<\"iSight\">>],[]},\n {rabbit_misc,dirty_read,1,[{file,\"src/rabbit_misc.erl\"},{line,386}]},\n {rabbit_auth_backend_internal,internal_check_user_login,2,\n [{file,\"src/rabbit_auth_backend_internal.erl\"},{line,121}]},\n {rabbit_access_control,try_authenticate,3,\n [{file,\"src/rabbit_access_control.erl\"},{line,88}]},\n {rabbit_access_control,'-check_user_login/2-fun-0-',4,\n [{file,\"src/rabbit_access_control.erl\"},{line,74}]},\n {lists,foldl,3,[{file,\"lists.erl\"},{line,1263}]},\n {rabbit_mgmt_util,is_authorized,6,\n [{file,\"src/rabbit_mgmt_util.erl\"},{line,134}]},\n {webmachine_resource,resource_call,3,\n [{file,\"src/webmachine_resource.erl\"},{line,186}]}]}\n"}

 

Checking the status of RabbitMQ returned the following error:

[root@awsdldevops init.d]# service rabbitmq-server status

Status of node rabbit@awsdldevops ...

Error: unable to connect to node rabbit@awsdldevops: nodedown

DIAGNOSTICS

===========

attempted to contact: [rabbit@awsdldevops]

 

rabbit@awsdldevops:

  • connected to epmd (port 4369) on awsdldevops

  • epmd reports: node 'rabbit' not running at all

                  no other nodes on awsdldevops

  • suggestion: start the node

 

current node details:

  • node name: 'rabbitmq-cli-71@awsdldevops'

  • home dir: /var/lib/rabbitmq

  • cookie hash: ZzJTrGE/rLTNnDSGCcASYg==

In the RabbitMQ logs at “/var/logs/rabbitmq”, the file rabbitmq@awsdldevops.log -20220501 contained the following error detail:

{could_not_start,rabbit,

       {{badmatch,

            {error,

                {{{badmatch,

                      {error,

                          {not_a_dets_file,

                              "/var/lib/rabbitmq/mnesia/rabbit@awsdldevops/recovery.dets"}}},

                  [{rabbit_recovery_terms,open_table,0,

                       [{file,"src/rabbit_recovery_terms.erl"},{line,126}]},

                   {rabbit_recovery_terms,init,1,

                       [{file,"src/rabbit_recovery_terms.erl"},{line,107}]},

                   {gen_server,init_it,2,[{file,"gen_server.erl"},{line,365}]},

                   {gen_server,init_it,6,[{file,"gen_server.erl"},{line,333}]},

                   {proc_lib,init_p_do_apply,3,

                       [{file,"proc_lib.erl"},{line,247}]}]},

                 {child,undefined,rabbit_recovery_terms,

                     {rabbit_recovery_terms,start_link,[]},

                     transient,30000,worker,

                     [rabbit_recovery_terms]}}}},

        [{rabbit_queue_index,start,1,

             [{file,"src/rabbit_queue_index.erl"},{line,464}]},

         {rabbit_variable_queue,start,1,

             [{file,"src/rabbit_variable_queue.erl"},{line,435}]},

         {rabbit_priority_queue,start,1,

             [{file,"src/rabbit_priority_queue.erl"},{line,92}]},

         {rabbit_amqqueue,recover,0,

             [{file,"src/rabbit_amqqueue.erl"},{line,239}]},

         {rabbit,recover,0,[{file,"src/rabbit.erl"},{line,652}]},

         {rabbit_boot_steps,'-run_step/2-lc$^1/1-1-',1,

             [{file,"src/rabbit_boot_steps.erl"},{line,49}]},

         {rabbit_boot_steps,run_step,2,

             [{file,"src/rabbit_boot_steps.erl"},{line,49}]},

         {rabbit_boot_steps,'-run_boot_steps/1-lc$^0/1-0-',1,

             [{file,"src/rabbit_boot_steps.erl"},{line,26}]}]}}
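
The useful part of this trace is the not_a_dets_file marker and the quoted path that follows it. As a rough sketch, a log file can be searched for that marker so the affected file is named directly; the function name corrupt_dets_paths is hypothetical, and the sketch assumes the path appears within two lines of the marker, as it does in the excerpt above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list every DETS file that a RabbitMQ log reports as corrupt.
# Usage: corrupt_dets_paths LOGFILE
corrupt_dets_paths() {
    # -A2 keeps the two lines after the marker, where the quoted path is printed.
    # The second grep extracts only the quoted absolute path ending in .dets.
    grep -A2 'not_a_dets_file' "$1" | grep -o '"/[^"]*\.dets"' | tr -d '"' | sort -u
}
```

The function prints nothing when the log contains no such marker, in which case the startup failure has a different cause than the one described here.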

 

In the Erlang logs at “/var/lib/rabbitmq/”, the erlang_crash.dump file contained the following error:

Slogan: init terminating in do_boot ({could_not_start,rabbit,{{badmatch,{error,{_}}},[{rabbit_variable_queue,start_msg_store,2,[_]},{rabbit_variable_queue,start,1,[_]},{rabbit_priority_queue,start,1,[_]},{rab

 

Based on the Stack Overflow thread below, we concluded that RabbitMQ could not start because some queues had been corrupted by the sudden crash of Erlang and RabbitMQ.

https://stackoverflow.com/questions/25619201/rabbitmq-start-fails

Resolution:

  1. Deleted the log file that was occupying most of the disk space.

  1. Fixed the InsightsEngine scripts so that they no longer log the errors to the nohup file.

  1. As suggested in the Stack Overflow thread, we need to delete the corrupted recovery.dets file in the “/var/lib/rabbitmq/mnesia/rabbit@awsdldevops/” directory. The recovery.dets file is generated every time RabbitMQ restarts. When RabbitMQ restarts or terminates abruptly (in this case because the box ran out of disk space), the recovery.dets file becomes 0 KB. This prevents RabbitMQ and Erlang from restarting. We need to delete this file and restart RabbitMQ.

  1. We also need to delete the queues directory and the msg_store_persistent directory from the “/var/lib/rabbitmq/mnesia/rabbit@awsdldevops/” directory.

  1. When RabbitMQ was started again after the recovery.dets file had been deleted, a fresh recovery.dets file was created.
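
The file operations in the steps above can be sketched as a small script. This is an illustration under stated assumptions, not the exact commands run during the incident: the function names and the quarantine directory are hypothetical, and the sketch moves the three entries aside instead of deleting them, which is a more cautious variation on the deletion described in this article. The broker still starts without the messages that were in those queues, and the quarantine location must be on a disk that has free space.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the function names and the backup location are assumptions.

# Succeed if FILE exists but is zero bytes long, as described for recovery.dets.
# Usage: is_zero_byte_file FILE
is_zero_byte_file() {
    [ -f "$1" ] && [ ! -s "$1" ]
}

# Move recovery.dets, queues and msg_store_persistent from NODE_DIR to BACKUP_DIR.
# Usage: quarantine_rabbit_state NODE_DIR BACKUP_DIR
quarantine_rabbit_state() {
    local node_dir="$1" backup_dir="$2" name
    mkdir -p "$backup_dir" || return 1
    for name in recovery.dets queues msg_store_persistent; do
        # Entries that are already missing are skipped rather than treated as errors.
        if [ -e "$node_dir/$name" ]; then
            mv "$node_dir/$name" "$backup_dir/" || return 1
        fi
    done
}

# Intended order on the affected box (left commented out so nothing runs on paste):
#   service rabbitmq-server stop
#   quarantine_rabbit_state /var/lib/rabbitmq/mnesia/rabbit@awsdldevops /root/rabbitmq-quarantine
#   service rabbitmq-server start
```

Once RabbitMQ is confirmed to be running again, the quarantine directory can be removed to reclaim the space.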

 

Final Observations:

  1. The sudden crash of Erlang and RabbitMQ corrupted some queues and the recovery.dets file.

  1. Because of the corrupted files, RabbitMQ was not able to start.

  1. Deleting the queues directory and the msg_store_persistent directory caused the loss of some data that was in the RabbitMQ data queues.

  1. After deleting the queues directory, the msg_store_persistent directory, and the recovery.dets file, we were able to start RabbitMQ successfully.

 

©2021 Cognizant, all rights reserved. US Patent 10,410,152