Hello again. I have just learned that the host that recently had both NVMe drives fail during drive replacement now has new problems: the filesystem reports permanent data errors affecting the databases of both the Matrix server and the Telegram bridge.

I have just rented a new machine and am about to restore the database snapshot from July 26, just in case. All the troubleshooting of the past few days has been very exhausting, but I will try to do this, or at least prepare it, within the next few hours.

Update

After a rescan the errors have gone away; however, the drives logged errors too. The question now is whether the data integrity can be trusted.

Status August 1st

Well ... good question ... optimizations were made last night, the restore was successful, and ... we are back to debugging outgoing federation :(


The new hardware will also be a bit more powerful ... and yes, I have not forgotten that I wanted to update that database. I was just busy debugging federation problems.

References

  • federation issues after restore: https://github.com/matrix-org/synapse/issues/16025
  • why we had to restore initially: https://text.tchncs.de/tchncs/about-the-matrix-incident-on-july-26-2023
  • Haui@discuss.tchncs.de
    ·
    11 months ago

    Thanks for putting in the work. Is there anything we can help you with? From what I understood, the domain is German; is the server in Germany as well? I'm located in Germany and do sysadmin work. Fighting with hosting companies is part of my job. ;) Let me know if I can do anything. Have a good one!

    • Milan@discuss.tchncs.de
      ·
      11 months ago

      Thank you :) Well, I am not sure there was anything to fight over, except maybe some sort of refund ... for now it seems to be fine on the new machine. And yes, I am from Germany; however, I think it's a Hetzner DC in Helsinki.

      • Haui@discuss.tchncs.de
        ·
        11 months ago

        You're very welcome. Hetzner is generally a good host afaik. It does depend on the configuration, I suppose. Are you using a shared VPS or something else? If the storage is guaranteed (as in not custom hardware), they are technically responsible for its condition. A host I'm working with (also located at Hetzner, but in Falkenstein) does two backups a day, which also prevents having to revert far back.

        • Milan@discuss.tchncs.de
          ·
          11 months ago

          At Hetzner it's all dedicated servers: out goes an AX51-NVMe, in comes an AX102. They tried swapping a connector cable to bring the NVMe(s) back to life, and I was wondering whether that could have something to do with the SMART errors logged and the temp zpool errors. In any case, I think the CPU upgrade is very welcome for the Matrix server 😅