Accidentally deleted /etc/pve/local/pve-ssl.key, can't start pve-cluster
Long story short, I removed a node from my cluster and edited the corosync.conf instead of using delnode. After this error I decided it would be best to simply remove all other nodes from the cluster and to just make a new one. So I successfully did that with one server, but for the other one I somehow managed to delete the pve-ssl.key.
I found this guide https://pve.proxmox.com/wiki/Proxmox_SSL_Error_Fixing and followed it with no errors other than it creating a server.csr instead of a server.pem. I tried renaming this file to .pem, and I also tried starting again from the top, leaving it as .csr and ignoring the missing .pem. Running systemctl status pve-cluster.service after following the guide gives a repeated "/etc/pve/local/pve-ssl.key: failed to load local private key", even though the pve-ssl.key file exists. I am not sure why it is failing when the contents of the file look similar to the pve-ssl.key on my other server, which works.
Unfortunately I have quite a few VMs on this server but cannot back up any of them because pve-cluster will not start, and I also cannot access the web GUI. I am now at the point of just scrapping all of my VMs and reinstalling Proxmox on this server, but I was hoping somebody could help me.
thanks
Comments Section
AFAIK the .csr file is only a certificate signing request and not the actual certificate. You seem to have missed the last step of the manual, which creates the server certificate.
There should also exist an openssl command for this step but I can't remember it right now.
Could you simply retry the whole process again and tell us if it worked?
Otherwise I will recreate your scenario in a VM and try to give you a fix, but then you would have to wait until this evening when I get home.
You were right, I didn't do the last step. I redid the process with the last step and pve-cluster still doesn't start, with journalctl -xe saying it failed to start corosync. So I tried systemctl restart corosync.service, and journalctl shows the same "/etc/pve/local/pve-ssl.key: failed to load local private key" error as when I was trying to start pve-cluster before.
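For anyone following along, one way to check whether a key file actually loads and belongs to a given certificate is to compare their public keys with openssl. This is only a sketch under assumptions: it builds a throwaway self-signed pair in a temp directory so it can run anywhere, and the KEY and CERT variables are placeholders. On a real node you would point them at /etc/pve/local/pve-ssl.key and the matching certificate (pve-ssl.pem on a stock install, as far as I know).

```shell
#!/bin/sh
# Sketch: confirm a private key parses and matches a certificate.
# KEY and CERT default to demo files; override them for a real check.
WORK=$(mktemp -d)
KEY=${KEY:-$WORK/demo.key}
CERT=${CERT:-$WORK/demo.pem}

# Generate a demo key and self-signed certificate only if none was supplied.
if [ ! -f "$KEY" ]; then
    openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
        -subj "/CN=demo" -keyout "$KEY" -out "$CERT" 2>/dev/null
fi

# Step 1: does the key parse at all? A truncated or mangled PEM fails here.
if openssl pkey -in "$KEY" -noout 2>/dev/null; then
    echo "key parses"
else
    echo "key does NOT parse"
fi

# Step 2: do the key and the certificate carry the same public key?
key_pub=$(openssl pkey -in "$KEY" -pubout 2>/dev/null | openssl sha256)
cert_pub=$(openssl x509 -in "$CERT" -noout -pubkey 2>/dev/null | openssl sha256)
if [ "$key_pub" = "$cert_pub" ]; then
    RESULT=match
else
    RESULT=mismatch
fi
echo "key/cert: $RESULT"
```

If step 1 fails, the key file itself is damaged; if only step 2 reports a mismatch, the certificate was probably issued for a different key.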
journalctl -xe output
-- Subject: Unit corosync.service has failed
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- Unit corosync.service has failed.
--
-- The result is failed.
Oct 18 01:20:57 x3650m4 systemd[1]: corosync.service: Unit entered failed state.
Oct 18 01:20:57 x3650m4 systemd[1]: corosync.service: Failed with result 'exit-code'.
Oct 18 01:20:57 x3650m4 pveproxy[3387]: worker exit
Oct 18 01:20:57 x3650m4 pveproxy[2172]: worker 3387 finished
Oct 18 01:20:57 x3650m4 pveproxy[2172]: starting 2 worker(s)
Oct 18 01:20:57 x3650m4 pveproxy[2172]: worker 3530 started
Oct 18 01:20:57 x3650m4 pveproxy[2172]: worker 3531 started
Oct 18 01:20:57 x3650m4 pveproxy[3530]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1683.
Oct 18 01:20:57 x3650m4 pveproxy[3531]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1683.
Oct 18 01:20:57 x3650m4 pveproxy[3388]: worker exit
Oct 18 01:20:57 x3650m4 pveproxy[2172]: worker 3388 finished
Oct 18 01:20:57 x3650m4 pveproxy[2172]: starting 1 worker(s)
Oct 18 01:20:57 x3650m4 pveproxy[2172]: worker 3532 started
Oct 18 01:20:57 x3650m4 pveproxy[3532]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1683.
Oct 18 01:20:57 x3650m4 systemd[1]: pve-cluster.service: Service hold-off time over, scheduling restart.
Oct 18 01:20:57 x3650m4 systemd[1]: Stopped The Proxmox VE cluster filesystem.
-- Subject: Unit pve-cluster.service has finished shutting down
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- Unit pve-cluster.service has finished shutting down.
Oct 18 01:20:57 x3650m4 systemd[1]: pve-cluster.service: Start request repeated too quickly.
Oct 18 01:20:57 x3650m4 systemd[1]: Failed to start The Proxmox VE cluster filesystem.
-- Subject: Unit pve-cluster.service has failed
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- Unit pve-cluster.service has failed.
--
-- The result is failed.
Oct 18 01:20:57 x3650m4 systemd[1]: pve-cluster.service: Unit entered failed state.
Oct 18 01:20:57 x3650m4 systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Oct 18 01:20:57 x3650m4 systemd[1]: corosync.service: Start request repeated too quickly.
Oct 18 01:20:57 x3650m4 systemd[1]: Failed to start Corosync Cluster Engine.
-- Subject: Unit corosync.service has failed
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- Unit corosync.service has failed.
--
-- The result is failed.
Oct 18 01:20:57 x3650m4 systemd[1]: corosync.service: Failed with result 'exit-code'.
Check the permissions on the file? Often (private) keys need to be only readable by the owner (root), and unreadable by the group and world. Otherwise the software refuses to use them.
So along with redoing the process and adding the last part, which I had forgotten, I tried:
chown root:root /etc/pve/local/pve-ssl.key
chmod 700 /etc/pve/local/pve-ssl.key
but it still doesn't work. I'm not entirely sure how to modify file permissions, so perhaps I am using chown and chmod incorrectly.
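A note on the mode: 700 also sets the execute bit, which a key file doesn't need; 600 (read/write for the owner only) is the more usual choice for private keys. A minimal sketch of checking and tightening permissions follows. It works on a throwaway temp file, and KEYFILE is a placeholder rather than the real key. Also, as far as I know, files under /etc/pve get their ownership and mode from the cluster filesystem when it is mounted, so chown and chmod there may have no effect.

```shell
#!/bin/sh
# Sketch: inspect and tighten permissions on a key file.
# KEYFILE is a throwaway temp file standing in for the real key.
KEYFILE=$(mktemp)
chmod 700 "$KEYFILE"                  # reproduce the mode from the post

# Show the octal mode, owner:group, and file name (GNU stat syntax).
stat -c '%a %U:%G %n' "$KEYFILE"

# Owner may read and write; group and world get nothing.
chmod 600 "$KEYFILE"
MODE=$(stat -c '%a' "$KEYFILE")
echo "mode is now $MODE"
```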
In the worst case you can still manually back up the VMs and containers. I would try vzdump first to create normal backups, but if even that does not work you can copy the configs and disks by hand.
You'll need to run the pve update certs command to force the cluster to read the new cert. I'm on mobile and can't recall the exact command, but Google "remove node from pve cluster"; it's the last step in the official Proxmox guide. It'll force your cluster to see the new certs.
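A rough sketch of the copy-by-hand route, with every path an assumption about a default install: the cluster filesystem keeps its data (including VM configs) in /var/lib/pve-cluster/config.db, and the default "local" storage keeps disk images under /var/lib/vz/images/<vmid>. VMs on LVM or ZFS storage need a different approach. The helper name copy_if_present and the VM ID 100 are made up for illustration.

```shell
#!/bin/sh
# Sketch: copy the cluster config database and one VM's disk directory.
# All paths assume a default install; adjust them to your storage layout.
copy_if_present() {
    # $1 = source file or directory, $2 = destination directory
    if [ -e "$1" ]; then
        mkdir -p "$2" && cp -a "$1" "$2"/ && echo "copied $1"
    else
        echo "skipped $1 (not found)"
    fi
}

BACKUP_DIR=${BACKUP_DIR:-$(mktemp -d)}   # ideally a separate disk
VMID=${VMID:-100}                        # hypothetical VM ID

# Copy the database only while pve-cluster is stopped, so it is not mid-write.
copy_if_present /var/lib/pve-cluster/config.db "$BACKUP_DIR"
copy_if_present "/var/lib/vz/images/$VMID" "$BACKUP_DIR/images"
```

Shut the VM down before copying its disks; otherwise the image may be inconsistent.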
Comment removed by moderator
How can I find the old name?
How can I find which hostname Proxmox is configured with?
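A few places to look, sketched under the assumption of a standard Debian-based Proxmox install: the running hostname, /etc/hostname, the matching /etc/hosts entry, and the node directories under /etc/pve/nodes (which are only visible while the cluster filesystem is mounted). A leftover directory there for a name the host no longer uses would be a candidate for the old name.

```shell
#!/bin/sh
# Sketch: show which hostname the system and the cluster filesystem know about.
CURRENT_NAME=$(hostname 2>/dev/null || uname -n)
echo "running hostname: $CURRENT_NAME"

if [ -r /etc/hostname ]; then
    echo "/etc/hostname:    $(cat /etc/hostname)"
fi

# The hostname should map to the node's real IP in /etc/hosts.
if [ -r /etc/hosts ]; then
    grep -w -- "$CURRENT_NAME" /etc/hosts || echo "no /etc/hosts entry for $CURRENT_NAME"
fi

# One directory per node name; present only when pve-cluster is running.
if [ -d /etc/pve/nodes ]; then
    ls /etc/pve/nodes
else
    echo "/etc/pve/nodes not available (cluster filesystem not mounted?)"
fi
```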