There was an outage of an hour while we migrated our database to its own server in the same data centre.

This server has faster CPUs than our current one (3.5 GHz vs 3.0 GHz), and if all goes well we will move the extra CPUs we bought over to the DB server.

The outage was prolonged by a PEBKAC issue: I mistyped an IP address in our new Postgres authentication config, so I kept getting a "permission error".
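
For anyone curious, Postgres host-based authentication lives in pg_hba.conf, and a single wrong address in a line like the one below is enough to cause exactly that kind of permission error. This is a hypothetical example (the database name, user, and IP are made up, not our real config):

```
# pg_hba.conf — hypothetical entry allowing one app server to connect.
# A typo in the ADDRESS column silently rejects the real client.
# TYPE  DATABASE  USER   ADDRESS           METHOD
host    lemmy     lemmy  203.0.113.10/32   scram-sha-256
```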

Migrating the DB to its own server was always going to happen, but it was pushed forward by the recent donations we got from Guest on our Ko-Fi page! Thank you again!

If you see any issues relating to ungodly load times please feel free to let us know!

Cheers,
Tiff

  • Tiff@reddthat.com · 8 months ago

    Hmmmm that isn't great news. While testing it was working great, which is just my luck.

    I can see that there are a lot more spikes now compared to before, and even when SSH'd into the server I can sometimes feel the lag.

    I'll be investigating today as I really don't want to roll back to our single server again. There is probably some fine-tuning that needs to be done on the network stack now.

    • e0qdk@reddthat.com · 8 months ago

      Wishing you the best of luck in your investigation! Hopefully it turns out to be something quick and easy, but having been through the wringer debugging my own projects, I totally get it if it's not.

      • Tiff@reddthat.com · 8 months ago

        How's the lag today? I still see the long load times in the browser... And it's really getting on my nerves. It's definitely a storage issue now.

        I've gone and bought a server with a different hosting company because this is ridiculous. Cue the migration post!

        • Blaze@reddthat.com · 8 months ago

          It is noticeable in the browser, but I guess it's definitely an improvement if we are now up-to-date with LW for posts and comments!

        • e0qdk@reddthat.com · 8 months ago

          Huh. I don't think I got a notice for this reply. So, sorry I didn't respond -- I didn't see it! (Just saw this randomly while checking your profile to see if there was a thread regarding the "next" interface.)

          Haven't checked again thoroughly yet today, but the timings I saw yesterday matched your averages pretty well and I definitely saw a reduction in lag spikes after your other reply. I also did a fair bit of poking around with my browser's network inspector -- and I found a potential issue.

          It looks like mlmym (the "old" interface -- which is the primary one I use) does not set Cache-Control on pages it serves, so you may be getting traffic to your backend for every logged out lurker and scraper hitting that interface instead of letting your CDN handle it. The header is set for images and such, but not the pages themselves. Lemmy's default UI sets it (public, max-age=60 for logged out users, private for logged in users). The "next" interface seems like it sets it to private, no-cache, no-store, max-age=0, must-revalidate for logged out users, which is probably not ideal... I'm not sure yet what the other interfaces are doing since they're JS heavy and I haven't dug into them as much.
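
          The behaviour described above can be sketched as a small helper. This is purely illustrative (the function name and shape are mine, not from any of these codebases), but the header values match what Lemmy's default UI reportedly sends:

```python
def cache_control_for(logged_in: bool, max_age: int = 60) -> str:
    """Return a Cache-Control value mirroring Lemmy's default UI:
    short public caching for anonymous traffic (so a CDN can absorb
    lurkers and scrapers), private for logged-in users."""
    if logged_in:
        return "private"
    return f"public, max-age={max_age}"
```

          An interface that omits the header entirely (like mlmym here) forces the CDN to pass every anonymous page request through to the backend.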

      • Tiff@reddthat.com · 8 months ago

        Alright, I've had a look, and we are now HA. We now have Lemmy frontend and backend instances across two different servers, with a third server being the new database server.

        This has reduced the number of connections through any one server to a more manageable number, which seems to have helped with the lag spikes! From the logs it looks like the lag spikes have disappeared entirely, but I'm not counting my chickens yet.

        As for regular requests: with the added server, on average we are looking at

        • comments: 1-3s
        • users: 2-5s

        Which again is still not ideal, but it's something to work towards.

        One of our servers is also slower at responding in general, so I think we'll decommission it at the end of this month and use the money to buy 1-2 more frontend servers (that's how big it is).