Google's search results have become pretty much useless for anything but like... how-to videos and song lyrics. I know there are better alternatives out there in the year of our lord 2022, but I've been using Google since before it was a verb. Any suggestions?
Also, how difficult is it to build an open-source search engine? I imagine it's not something that a group of hexbear coders could manage, but if I'm wrong... we could really use a search engine with an explicitly left-wing bias.
The bare bones of a search engine are pretty easy. You can make 1998's Google in an afternoon.
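To give a sense of how small the bare bones are, here's a rough sketch of the core piece: an inverted index with AND-style keyword search. The domains and page texts are made up, and it leaves out crawling and any link-based ranking, so treat it as a toy and not a description of what Google actually ran.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each term to {doc_id: number of times the term appears}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index

def search(index, query):
    """Return ids of docs containing every query term, best match first."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(t, {}) for t in terms]
    matches = set(postings[0])
    for p in postings[1:]:
        matches &= set(p)
    # Score is just the summed term counts; ties are broken by doc id.
    scores = {d: sum(p[d] for p in postings) for d in matches}
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical pages, standing in for whatever a crawler would fetch.
docs = {
    "a.example": "Sailor Moon fansite with episode guides",
    "b.example": "Moon phases and tides explained",
    "c.example": "Sailor Moon episode list, Sailor Moon art",
}
index = build_index(docs)
print(search(index, "sailor moon"))  # ['c.example', 'a.example']
```

Everything hard about search (freshness, spam, ranking quality) lives outside this snippet.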
Crawling and indexing can get expensive, especially if you want your results to be up-to-date. Even Google can't afford to keep all the sites up to date all the time, and has to prioritize sites based on which ones update fastest.
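One way to picture that prioritization is a recrawl queue with a fixed fetch budget. This is only a sketch under made-up assumptions: the site names, the per-site change intervals, and the one-fetch-per-hour budget are all hypothetical, and a real crawler would have to estimate change rates from what it observes.

```python
import heapq
from collections import Counter

def schedule_crawls(change_interval_hours, budget, horizon_hours):
    """Simulate a crawler that can make `budget` fetches per hour.

    change_interval_hours: {site: assumed hours between content changes}.
    Returns the list of (hour, site) fetches made within the horizon.
    """
    # Every site starts out due at hour 0. Heap entries are
    # (due time, interval, site), so the most overdue site comes first and
    # faster-changing sites win ties.
    heap = [(0, interval, site) for site, interval in change_interval_hours.items()]
    heapq.heapify(heap)
    fetches = []
    for hour in range(horizon_hours):
        done = 0
        while heap and heap[0][0] <= hour and done < budget:
            _, interval, site = heapq.heappop(heap)
            fetches.append((hour, site))
            # The site becomes due again one change interval after this fetch.
            heapq.heappush(heap, (hour + interval, interval, site))
            done += 1
    return fetches

sites = {"news.example": 1, "blog.example": 4, "archive.example": 24}
fetches = schedule_crawls(sites, budget=1, horizon_hours=6)
print(Counter(site for _, site in fetches))
```

With a budget this tight, the fast-changing site eats most of the fetches and the slow ones go stale, which is the tradeoff described above.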
Having good-quality results is an open question, really. PageRank hasn't actually worked in over a decade due to SEO. Later most of the internet's information moved to walled gardens (Facebook, Pinterest, Instagram, and the ultimate death of hobby information on the internet: Discord) and video (indexing information in videos costs approx a bazillion dollars). Then more recently, AI-generated SEO spam sites cropped up, which are really hard for a computer to tell apart from useful information.
To make a better search engine, I think you'd need some mix of:
custom crawlers to sneak into all the biggest walled gardens (they have anti-crawler defenses though, and I'm not sure how you could pull it off without crowdsourcing it somehow).
a hand-curated list of trusted sites
a collection of bullshit rules to exclude the major SEO spam farms. It'd be an ongoing process. The good news is that it's easier while you're small enough that they aren't specifically targeting you. The bad news is that you really can't open-source your techniques, since the spam farms would just read them and adapt.
[if you somehow have a huge budget] Get video contents into the index using AI captioning and image recognition.
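The curated-list and rules ideas from the list above could be wired into ranking roughly like this. It's a sketch, not a recommendation: the domains, the two rules, the `ad_count` field, and the 2x boost are all invented for illustration.

```python
from urllib.parse import urlparse

TRUSTED = {"wiki.example", "forum.example"}  # hypothetical hand-curated list
BLOCKED = {"seo-farm.example"}               # hypothetical known spam farms

def spam_rules(result):
    """Hypothetical hand-written rules; True means 'looks like SEO spam'."""
    title = result["title"].lower()
    return [
        title.count("best") >= 2,        # keyword-stuffed title
        result.get("ad_count", 0) > 10,  # page is mostly ads
    ]

def rerank(results):
    """Drop blocked or rule-flagged results, boost trusted hosts, sort by score."""
    kept = []
    for r in results:
        host = urlparse(r["url"]).hostname or ""
        if host in BLOCKED or any(spam_rules(r)):
            continue
        boost = 2.0 if host in TRUSTED else 1.0
        kept.append((r["score"] * boost, r["url"]))
    return [url for _, url in sorted(kept, reverse=True)]

results = [
    {"url": "https://seo-farm.example/best", "title": "Best best widgets 2022", "score": 0.9},
    {"url": "https://wiki.example/widgets", "title": "Widgets", "score": 0.5},
    {"url": "https://blog.example/widgets", "title": "My widget notes", "score": 0.7},
    {"url": "https://other.example/top", "title": "Best of the best widgets", "score": 0.8, "ad_count": 3},
]
print(rerank(results))
```

The rule list is the part that would need constant upkeep, and also the part you'd want to keep private.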
This is precisely why forums, mailing lists, and publicly visible membership sites like Mastodon or even Twitter are far, far better than fucking Discord. I hate when a FOSS project does development/support based on Discord. That information becomes so difficult to retrieve and share later.
That's absolutely fascinating. I have a vague notion of what SEO does, but I'd love to hear more about how this breaks PageRank, considering how central that famous algorithm was to Google's monopoly.
PageRank just counts how much a website is linked to by other websites with good PageRank. It makes a ton of sense in the old internet where links were manually curated. It'd find everyone in your web ring, notice that all of you are also linking to the same Sailor Moon fansite, and rank that site highly because it's clearly well-respected in whatever niche this is.
It's easy to game this by just making a bunch of trivial websites that all link to a few of each other plus the site you actually want to boost.
Obviously you then try to exclude things that look like fake web rings, but now you're in an arms race with the SEO goblins.
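Here's a small sketch of both halves: a basic PageRank by power iteration, then the same graph with a link farm bolted on. The page names, the 0.85 damping factor, and the 20-page farm are assumptions chosen for the demo, and this is the simplified textbook version, not whatever Google ran in production.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns {page: score}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share of rank.
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A page with no outbound links spreads its rank over everyone.
                for p in pages:
                    new[p] += damping * rank[page] / n
        rank = new
    return rank

# An honest web ring: three fan pages link to each other and to one fansite.
web = {
    "fan1": ["fan2", "fansite"],
    "fan2": ["fan3", "fansite"],
    "fan3": ["fan1", "fansite"],
    "fansite": [],
    "spam": [],
}
before = pagerank(web)

# The SEO trick: 20 trivial pages that link to each other and to the spam site.
farm = {f"farm{i}": [f"farm{(i + 1) % 20}", "spam"] for i in range(20)}
after = pagerank({**web, **farm})

print(before["fansite"] > before["spam"])  # True
print(after["spam"] > after["fansite"])    # True
```

The farm pages look structurally identical to the fan pages, which is why the algorithm alone can't tell a fake ring from a real one.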
You should use an AI for detecting & excluding SEO spam farms.
Spam detection is extremely well studied to the point that it's often the textbook example of a machine learning classification task.
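For reference, the textbook version looks something like this: a naive Bayes classifier with add-one smoothing, trained on a handful of labeled texts. The training examples here are invented and far too few to mean anything; it's a sketch of the technique, not a working spam filter.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def train(examples):
    """examples: list of (text, label). Returns per-label word and doc counts."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return {"words": word_counts, "docs": doc_counts, "vocab": vocab}

def classify(model, text):
    """Return the label with the highest log-probability for the text."""
    total_docs = sum(model["docs"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, counts in model["words"].items():
        total_words = sum(counts.values())
        score = math.log(model["docs"][label] / total_docs)  # class prior
        for w in tokenize(text):
            if w in model["vocab"]:  # words never seen in training are skipped
                # Add-one smoothing keeps unseen word/label pairs above zero.
                score += math.log((counts[w] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training data, just enough to exercise the code.
examples = [
    ("best cheap widgets buy now best deals", "spam"),
    ("top 10 best widgets you must buy today", "spam"),
    ("buy cheap deals click here now", "spam"),
    ("how I repaired my widget with a soldering iron", "ham"),
    ("notes on widget firmware and the serial protocol", "ham"),
    ("forum thread about repairing old widget firmware", "ham"),
]
model = train(examples)
print(classify(model, "buy the best cheap widgets now"))  # spam
print(classify(model, "repairing widget firmware"))       # ham
```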
SEO spam farms can optimize against AI detection by training their content generators against the AI detectors. Ultimately you need detection algorithms that they don't know about, which are sufficiently different from the ones they do know about. An ensemble approach works best, since it's harder to replicate. Some of the detectors in there may be AI based.
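A bare-bones version of the ensemble idea is just several independent detectors and a majority vote. The three detectors below, their thresholds, and the page fields (`text`, `ad_count`, `outbound_links`) are all hypothetical; a trained classifier could be dropped in as one more voter.

```python
import re

def keyword_stuffing(page):
    """Flag pages where a single word makes up a large share of the text."""
    words = re.findall(r"[a-z]+", page["text"].lower())
    if len(words) < 5:
        return False
    top = max(words.count(w) for w in set(words))
    return top / len(words) > 0.3

def thin_content(page):
    """Flag very short pages that carry several ads."""
    return len(page["text"].split()) < 20 and page.get("ad_count", 0) >= 3

def link_heavy(page):
    """Flag pages with more than one outbound link per two words."""
    words = max(len(page["text"].split()), 1)
    return page.get("outbound_links", 0) / words > 0.5

DETECTORS = [keyword_stuffing, thin_content, link_heavy]

def is_spam(page, detectors=DETECTORS, threshold=0.5):
    """Majority vote: spam if more than `threshold` of the detectors fire."""
    votes = sum(1 for detect in detectors if detect(page))
    return votes / len(detectors) > threshold

spam_page = {
    "text": "widgets widgets widgets buy widgets best widgets",
    "ad_count": 5,
    "outbound_links": 10,
}
honest_page = {
    "text": "I replaced the capacitor on my old widget and wrote up every "
            "step with photos so other people can repair theirs too",
    "ad_count": 0,
    "outbound_links": 2,
}
print(is_spam(spam_page), is_spam(honest_page))  # True False
```

A spammer who learns to dodge one detector still has to beat the others, and they can't tune against the ones they never see.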