In the light of Snowden's latest post: What are your FOSS-AIs?

Peter_Arbeitslos@discuss.tchncs.de · 4 months ago


WalnutLum@lemmy.ml · 4 months ago

There are VERY FEW fully open LLMs. Most are the equivalent of source-available in licensing, and at best they're only partially open source because all they provide you with is the pretrained model.

To be fully open source they need to publish both the model and the training data. The importance is being "fully reproducible" in order to make the model trustworthy.

In that vein there's at least one project that's turning out great so far:

https://www.llm360.ai/

umami_wasabi@lemmy.ml · edit-2 4 months ago

Not just LLMs: all kinds of models are equivalent to freeware, i.e. you get the model itself and the other essential bits for it to work. I wouldn't even call it source-available, as there is no source.

Take Redis as an example. I can still go grab the source and compile a binary that works. This doesn't apply to ML models.

Of course, one can argue the training process isn't deterministic, so even with the exact training corpus it can't create the same model, bit for bit, on multiple runs. However, I would argue the same corpus provides the chance to train a model of similar or equivalent performance. Hence the openness of the training corpus is an absolute necessity for a model to qualify as FOSS.

WalnutLum@lemmy.ml · 4 months ago

I've seen this said multiple times, but I'm not sure where the idea that model training is inherently non-deterministic is coming from. I've trained a few very tiny models deterministically before...

umami_wasabi@lemmy.ml · 4 months ago

Are you sure you can train a model deterministically, down to each bit? Like, feeding the results of two runs into sha256sum will yield the same hash?
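For a toy-scale illustration of the check being asked about here, below is a minimal sketch, not anyone's actual training setup. It assumes a single-threaded, CPU-only run in plain Python where every random draw comes from one seeded generator; the function name `train_and_hash`, the synthetic data, and the hyperparameters are all made up for the example.

```python
import hashlib
import random
import struct


def train_and_hash(seed: int, steps: int = 200) -> str:
    """Train a one-variable linear model with SGD and hash its final weights.

    Hypothetical helper for illustration only.
    """
    # Every source of randomness (data, init, sampling order) uses this one seeded generator.
    rng = random.Random(seed)

    # Synthetic data for y = 3x + 1 with a little noise (made up for this sketch).
    xs = [rng.uniform(-1.0, 1.0) for _ in range(64)]
    data = [(x, 3.0 * x + 1.0 + rng.gauss(0.0, 0.01)) for x in xs]

    w, b = rng.uniform(-0.1, 0.1), 0.0
    lr = 0.1
    for _ in range(steps):
        x, y = rng.choice(data)   # sample order is fixed by the seed
        err = (w * x + b) - y     # gradient of 0.5 * err**2 w.r.t. the prediction
        w -= lr * err * x
        b -= lr * err

    # Serialize the weights as raw little-endian doubles, so the hash covers every bit.
    blob = struct.pack("<2d", w, b)
    return hashlib.sha256(blob).hexdigest()


if __name__ == "__main__":
    first = train_and_hash(seed=0)
    second = train_and_hash(seed=0)
    print(first == second)
```

This only shows that a tiny sequential run can be bit-for-bit repeatable. Large-scale training usually involves parallel floating-point reductions on GPUs, where the order of operations can vary between runs, so getting identical hashes there takes extra care and this sketch does not settle that case.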

Canary9341@lemmy.ml · 4 months ago

The importance is being “fully reproducible” in order to make the model trustworthy.

Well that's a problem, because even with training data that's impossible by design.

WalnutLum@lemmy.ml · 4 months ago

I'm not sure where you get that idea. Model training isn't inherently non-deterministic. Making fully reproducible models is apparently LLM360's entire modus operandi.

umami_wasabi@lemmy.ml · edit-2 4 months ago

What's FOSS-AI? A model everyone can download and use for free? Or, in the OSS spirit, that everything needs to be open and without discrimination of use, i.e. an OSS training data corpus and no AUP attached?

Or do you mean the inference engine running those models?

Peter_Arbeitslos@discuss.tchncs.de · 4 months ago

Everything that is not BigTech. Preferably FOSS, but at least not BigTech: just alternatives to, for example, OpenAI.

umami_wasabi@lemmy.ml · edit-2 4 months ago

So you're including freeware-style free models, not only FOSS ones, as long as they don't come from big tech.

Your choice of models will be quite limited, as the compute resources and training corpus needed to make a viable base model aren't something just anyone can get.

earmuff@lemmy.dbzer0.com · 4 months ago

https://aihorde.net

utopiah@lemmy.ml · edit-2 4 months ago

My documented process: https://fabien.benetou.fr/Content/SelfHostingArtificialIntelligence but honestly I just tinker with this. Most of that isn't useful IMHO, except some pieces, e.g. STT/TTS, from time to time. The LLM aspect itself is too unreliable, and I do like 2 relatively recent papers on the topic, namely:

No "Zero-Shot" Without Exponential Data https://arxiv.org/abs/2404.04125
ChatGPT is bullshit https://link.springer.com/article/10.1007/s10676-024-09775-5

which are respectively saying that the long tail makes it practically impossible to train AI to be correct in rare cases, and that "hallucinations" is a misnomer used for marketing purposes, to be replaced instead by "bullshit": output meant to convince people without caring for veracity.

Still, despite all this criticism, it is a very popular topic, hyped up to be the "future" of computing. Consequently I did want to both try it and help others to do so, rather than imagine that it was restricted to a kind of "elite". I try to keep the page up to date but so far, to be honest, I do it mostly defensively, to be able to genuinely criticize because I did take the time to try, not reject it wholesale.

PS: I do also try the state of the art, both closed and open-source, via APIs, e.g. OpenAI or Mistral, but only for evaluation purposes, not as tools that are part of my daily usage.