This is going to wreck society even more.

Please, for the love of Marx, do not take ChatGPT at its word on anything. It has no intelligence. No ability to sort truth from fiction. It is just a very advanced chatbot that knows how to string words together into sentences and paragraphs.

DO NOT BELIEVE IT. IT DOES NOT CITE SOURCES.


I feel like my HS English/Science teacher begging kids not to use Wikipedia right now.

But even Wikipedia is better than ChatGPT because. Wikipedia. Cites. Sources.

  • supermangoman [he/him, they/them]
    ·
    2 years ago

    From my current understanding, I'm not sure referring to GPT models as "stochastic parrots" is accurate. There is evidence that LLMs build internal "world models," even if these emerge through probabilistic mechanisms: https://thegradient.pub/othello/
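    For anyone curious what the "world model" evidence looks like in practice, the core move in the Othello-GPT work is probing: train a small classifier to read the board state off the model's hidden activations. Here's a toy sketch of that idea in PyTorch - the shapes are invented and random tensors stand in for real activations, so it's an illustration of the technique, not the paper's code:

    ```python
    # Toy sketch of the probing idea from the Othello-GPT work: learn a map from
    # a transformer's hidden state at move t to the board state at move t.
    # Random tensors stand in for real activations/labels; shapes are invented.
    import torch
    import torch.nn as nn

    n_squares = 64      # 8x8 Othello board
    n_states = 3        # empty / black / white, per square
    d_model = 512       # hidden size of the (hypothetical) game-playing GPT

    # Pretend these came from running the trained model over game transcripts:
    hidden = torch.randn(10_000, d_model)                    # one activation per move
    board = torch.randint(0, n_states, (10_000, n_squares))  # true board per move

    probe = nn.Linear(d_model, n_squares * n_states)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

    for _ in range(100):
        logits = probe(hidden).view(-1, n_squares, n_states)
        loss = nn.functional.cross_entropy(logits.transpose(1, 2), board)
        opt.zero_grad(); loss.backward(); opt.step()

    # If a probe like this recovers the board far better than chance, that's the
    # "internal world model" claim. (The paper reports that nonlinear probes did
    # much better than linear ones, so treat this linear version as the simplest
    # possible illustration.)
    ```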

    • BabaIsPissed [he/him]
      ·
      edit-2
      2 years ago

      Let me preface this by saying I'm stupid, I can't even do my own research work well, let alone comment on cutting edge stuff with any degree of confidence:

      I don't think this is incompatible with the concept of the stochastic parrot. Like, by the time "On the dangers of stochastic parrots" came out, it was already known that language models have rich representations of language structure:

      (from the stochastic parrot paper): There are interesting linguistic questions to ask about what exactly BERT, GPT-3 and their kin are learning about linguistic structure from the unsupervised language modeling task, as studied in the emerging field of ‘BERTology’ [...]

      So I don't think we can take this, or the probing/interpretability work later in the paper as a refutation of LLMs as stochastic parrots, because it was never about memorization:

      (From the ICLR paper the blog post is based on): A potential explanation for these results may be that Othello-GPT is simply memorizing all possible transcripts. To test for this possibility, we created a skewed dataset of 20 million games to replace the training set of synthetic dataset[...] Othello-GPT trained on the skewed dataset still yields an error rate of 0.02%. Since Othello-GPT has seen none of these test sequences before, pure sequence memorization cannot explain its performance.

      The concept is useful primarily as a way of delimiting how far this "understanding" really goes:

      (stochastic parrot paper again) If a large LM, endowed with hundreds of billions of parameters and trained on a very large dataset, can manipulate linguistic form well enough to cheat its way through tests meant to require language understanding, have we learned anything of value about how to build machine language understanding or have we been led down the garden path?

      The metaphor of the crow is kind of apt, I think. Like an LLM, it is working only with form, not meaning:

      (blog post) At this point, it seems fair to conclude the crow is relying on more than surface statistics. It evidently has formed a model of the game it has been hearing about, one that humans can understand and even use to steer the crow's behavior. Of course, there's a lot the crow may be missing: what makes a good move, what it means to play a game, that winning makes you happy, that you once made bad moves on purpose to cheer up your friend, and so on.

      (stochastic parrot paper) Furthermore, as Bender and Koller argue from a theoretical perspective, languages are systems of signs, i.e. pairings of form and meaning. But the training data for LMs is only form; they do not have access to meaning. Therefore, claims about model abilities must be carefully characterized.

      Someone smarter please feel free to correct/dunk on me.

      • dat_math [they/them]
        ·
        2 years ago

        A potential explanation for these results may be that Othello-GPT is simply memorizing all possible transcripts. To test for this possibility, we created a skewed dataset of 20 million games to replace the training set of synthetic dataset[…] Othello-GPT trained on the skewed dataset still yields an error rate of 0.02%. Since Othello-GPT has seen none of these test sequences before, pure sequence memorization cannot explain its performance.

        I'm going to maybe dunk on the authors of that ICLR paper (even giving them the benefit of the doubt, they should really know better if they got into ICLR). You can't conclude a lack of memorization from the observation that the model in question maintains training accuracy on the holdout set. I really hope they meant to say that they tested on the skewed dataset and saw that the model maintained performance (without seeing any of the skewed data in training). However, if they simply repeated the training step on the skewed data and saw the same performance, all we know is that the model might have memorized the new training set.
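        To spell out the distinction (toy sketch only - train() and error_rate() are empty stand-ins for whatever the authors actually ran, and the datasets are placeholders):

        ```python
        # Two readings of "trained on the skewed dataset still yields 0.02% error".
        # Stand-ins only; nothing here is the authors' actual pipeline.

        def train(games):
            """Placeholder for the full Othello-GPT training run."""
            return object()

        def error_rate(model, games):
            """Placeholder: fraction of predicted next-moves that are illegal."""
            return 0.0

        original_train, skewed_train, skewed_test = [], [], []

        # Reading 1 (the one I'm hoping for): the skewed games are never trained
        # on, only evaluated on. Good performance here can't be memorization.
        model = train(original_train)
        err_1 = error_rate(model, skewed_test)

        # Reading 2 (the one I'm worried about): the training step is repeated on
        # the skewed data and evaluated on a held-out split from the same
        # distribution. Matching the original error rate here only shows the
        # model *might* have memorized the new training set.
        model = train(skewed_train)
        err_2 = error_rate(model, skewed_test)
        ```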

        I also agree with your conclusions about the scant interpretability results not necessarily refuting the mere stochastic parrot hypotheses.

        • BabaIsPissed [he/him]
          ·
          2 years ago

          really hope they meant to say that they tested on the skewed dataset and saw that the model maintained performance (without seeing any of the skewed data in training).

          Yeah, that's it. I should have provided the full quote, but thought it would make no sense without context, so I abbreviated it. They generate synthetic training and test data separately, and for the training dataset those games could not start with C5.

          A potential explanation for these results may be that Othello-GPT is simply memorizing all possible transcripts. To test for this possibility, we created a skewed dataset of 20 million games to replace the training set of synthetic dataset. At the beginning of every game, there are four possible opening moves: C5, D6, E3 and F4. This means the lowest layer of the game tree (first move) has four nodes (the four possible opening moves). For our skewed dataset, we truncate one of these nodes (C5), which is equivalent to removing a quarter of the whole game tree. Othello-GPT trained on the skewed dataset still yields an error rate of 0.02%. Since Othello-GPT has seen none of these test sequences before, pure sequence memorization cannot explain its performance.

          I don't know much about Othello, so I don't really know if this is a good way of doing it or not. In chess it wouldn't make much sense, but in this game's case the initial move is always relevant to the task of "making a legal move," I think(?). It does seem to make sense for what they want to prove:

          Note that even truncated the game tree may include some board states in the test dataset, since different move sequences can lead to the same board state. However, our goal is to prevent memorization of input data; the network only sees moves, and never sees board state directly.
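          As I read it, the construction boils down to something like this (toy sketch only - random_game() is a fake stand-in that doesn't implement real Othello rules, it's just here to show the train/test asymmetry on the opening move):

          ```python
          # Toy sketch of the skewed-dataset construction as I understand it.
          # random_game() is a placeholder, not a real Othello game generator.
          import random

          OPENINGS = ["C5", "D6", "E3", "F4"]  # the four possible first moves

          def random_game(length=60):
              """Placeholder for the paper's synthetic game generator."""
              return [random.choice(OPENINGS)] + [f"move{i}" for i in range(length - 1)]

          # Training set: the C5 branch of the game tree is truncated entirely
          # (the paper uses 20 million games; tiny numbers here to keep it a toy).
          skewed_train = [g for g in (random_game() for _ in range(20_000))
                          if g[0] != "C5"]

          # Test set: generated separately, so it still contains C5-opening games
          # the model has provably never seen during training.
          test_games = [random_game() for _ in range(2_000)]
          ```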

          Anyway, I don't think it's that weird that the model has an internal representation of the current board state, as that is directly useful for the autoregressive task. Same thing for GPT picking up on syntax, semantic content etc. Still a neat thing to research, but this kind of behavior falls within the stochastic parrot purview in the original paper as I understand it.

          The term amounts to: "Hey parrot, I know you picked up on grammar and can keep the conversation on topic, but I also know what you say means nothing because there's really not a lot going on in that little tiny head of yours" :floppy-parrot:

          • dat_math [they/them]
            ·
            2 years ago

            They generate synthetic training and test data separately, and for the training dataset those games could not start with C5.

            Hmmm, this still feels like not as strong a result as the authors want us to read it as. I'd be much more impressed if they trained on the original training set and observed that the model maintained its original-test-set performance when tested on the skewed test set, but I bet they didn't find that result.

            Anyway, I don’t think it’s that weird that the model has an internal representation of the current board state, as that is directly useful for the autoregressive task. Same thing for GPT picking up on syntax, semantic content etc. Still a neat thing to research, but this kind of behavior falls within the stochastic parrot purview in the original paper as I understand it.

            Totally agree. At least biological parrots learn in the physical world and maybe have some chance of associating the sounds with semantically relevant physical processes.

            Transformers trained after 2022 can't cook, all they know is match queries to values, maximize likelihoods by following error gradients, eat hot chip and lie
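            (For the uninitiated, the "match queries to values" bit really is the whole attention trick - toy sketch below, shapes invented for the example:)

            ```python
            # One head of scaled dot-product attention, i.e. "match queries to values".
            # Shapes are invented for the example; no causal mask, no batching.
            import torch

            d_model, seq_len = 64, 10
            x = torch.randn(seq_len, d_model)      # one embedding per token/move

            W_q = torch.randn(d_model, d_model)
            W_k = torch.randn(d_model, d_model)
            W_v = torch.randn(d_model, d_model)

            q, k, v = x @ W_q, x @ W_k, x @ W_v    # queries, keys, values
            scores = (q @ k.T) / d_model ** 0.5    # how well each query matches each key
            weights = torch.softmax(scores, dim=-1)
            out = weights @ v                      # weighted mix of values
            ```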

          • dat_math [they/them]
            ·
            2 years ago

            Actually, the more I think about the experiment and their conclusions, the worse it gets. They synthesized the skewed dataset by sampling from a distribution that they assumed for both the synthetic training and synthetic testing set, so in a way, they've deliberately engineered the result.

      • supermangoman [he/him, they/them]
        ·
        edit-2
        2 years ago

        Showing a correlation between board states and internal representations on in-distribution data doesn’t disprove that it’s surface statistics at all.

        The article doesn't make such a bold claim. It presents its goal as "exploring" the question, so I'm not sure why the redditor started off with that.

        If they had done an experiment where they change the distribution of the data, like a larger board, or restrict training to a part of the board, and it still works, that would show something.

        Why?

        They then “intervene” on all the later layers to get the result they wanted, proving that “intervening” on a single intermediate representation isn’t enough.

        It seems like this is the core claim of the post - that the author "led the witness," so to speak, to get the desired outcome. Despite using the word "gradient," the redditor doesn't really explain this, but it could be true. It's definitely worth going through the appendices.
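        For reference, the kind of "intervention" being argued over is roughly activation patching: edit a hidden state mid-forward-pass and let everything downstream recompute from the edited state. A toy sketch (made-up model, not Othello-GPT or the paper's code):

        ```python
        # Toy activation intervention: overwrite one layer's output during the
        # forward pass; later layers recompute from the edited hidden state.
        # The model is a made-up stack of linear layers, not Othello-GPT.
        import torch
        import torch.nn as nn

        model = nn.Sequential(*[nn.Linear(16, 16) for _ in range(8)])
        x = torch.randn(1, 16)

        def edit_hidden_state(module, inputs, output):
            # Stand-in for "flip a tile in the model's internal board":
            # here we just zero out part of the hidden state.
            patched = output.clone()
            patched[:, :4] = 0.0
            return patched  # returning a tensor replaces the layer's output

        # Intervene at layer 4 only; layers 5-7 recompute from the edited state.
        hook = model[4].register_forward_hook(edit_hidden_state)
        out_patched = model(x)
        hook.remove()

        # The disputed point is whether an edit like this has to be re-applied at
        # every later layer to change the model's prediction, or whether editing
        # a single intermediate representation is enough.
        ```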

        Appreciate the (semi-anonymous?) critique regardless.