ChatGPT vs. Google, and The Confusion Between Enabling Technologies, Product Features, and Problem Spaces

Posted at —

TL;DR

  • The problem space for search engines is Information Retrieval, including ranking.
  • For those old enough to remember the side-by-side comparison, Google was consistently better at Information Retrieval than its predecessor AltaVista (competitor, if you're generous).
  • PageRank was the novel enabling technology for Google to outperform AltaVista in spite of lacking several advanced features at launch (boolean searches, date ranges, language selection, etc.).
  • Good enabling technologies are economically scalable, and progress the state of the art in a problem space by a meaningful increment.
  • For a user, naming enabling technologies is irrelevant. The fact that some product is built with AI under the hood is as irrelevant as the fact that (early-days) Google was built with PageRank under the hood.
  • The problem space for ChatGPT is Conversational AI. It optimises for a convincing conversational experience. Its enabling technology is Large Language Models.
  • Conversation itself has little overlap with Information Retrieval, yet Large Language Models can be relevant to both spaces.
  • For Conversational AI to address a problem space beyond just conversation itself, I believe more enabling technology will have to be built.

Can ChatGPT Render Google Obsolete?

Several people have pointed out how ChatGPT can produce entirely nonsensical responses, like dreaming up a list of non-existent academic papers, or explaining that ChatGPT itself is based on a GAN (it is not). These examples are brought up by people who contend that ChatGPT can never replace Google. Others then assume that these issues will be corrected, and that once ChatGPT loses its truth-bending problems, it will surely be the Google-killer we all long for.

There is one issue with this debate: Google itself is not at all concerned with truth either. For something to appear in a Google result, it has to be on the internet. No more, no less. Truthfulness notwithstanding. In the same vein, for ChatGPT to utter something, it merely has to be a probable sequence of words. Again, truthfulness notwithstanding. The difference between the two is that for Google something has to be on the internet, while for ChatGPT something just has to sound convincing; neither requirement implies truth. It just means that Google is an Information Retrieval system, while ChatGPT is a Conversational AI system. The problem spaces hardly overlap. It seems unlikely that one will replace the other.
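
To be concrete about "a probable sequence of words": here is a toy next-word generator (a hand-written bigram table, nothing remotely like a real Large Language Model). The point of the sketch is that nothing in the generation loop ever checks whether the output is true; a plausible continuation is the only criterion.

```typescript
// Toy illustration of "a probable sequence of words": a tiny bigram table, not a real LLM.
const next: Record<string, string[]> = {
  "chatgpt": ["is"],
  "is": ["based", "a"],
  "based": ["on"],
  "on": ["a"],
  "a": ["gan", "transformer"],
};

function generate(start: string, length: number): string {
  const words = [start];
  for (let i = 0; i < length; i++) {
    const candidates = next[words[words.length - 1]];
    if (!candidates) break;
    // Pick any likely continuation; plausibility is the only criterion, truth never enters into it.
    words.push(candidates[Math.floor(Math.random() * candidates.length)]);
  }
  return words.join(" ");
}

console.log(generate("chatgpt", 5)); // may well print "chatgpt is based on a gan": fluent, and false
```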

Prior to Google there was a different dominant web search engine: AltaVista (later acquired by Yahoo!). It was not very good, but at the time we did not know any better. Google was the first to produce sub-second, highly relevant results without resorting to specialised advanced query syntax or tedious query fine-tuning. To achieve this, Google applied novel enabling technologies: the PageRank algorithm, and distributed infrastructure running on cost-efficient commodity server hardware (later formalised in systems like MapReduce). AltaVista relied on more traditional ranking methods and ran on high-end server hardware, resulting in sub-par relevance and difficult-to-scale systems. Google did not obsolete AltaVista through excellence in a different problem space, but by providing a meaningful increment in the same problem space. Users care about this meaningful increment, not about PageRank or MapReduce.
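
For a sense of how compact the core idea is, here is a minimal power-iteration sketch of PageRank (illustrative only; the damping factor, iteration count, and toy link graph are my own choices, and the real system computes this over billions of pages with plenty of refinements):

```typescript
// Minimal PageRank via power iteration (an illustrative sketch, not Google's implementation).
// links[i] lists the pages that page i links to.
function pageRank(links: number[][], damping = 0.85, iterations = 50): number[] {
  const n = links.length;
  let rank: number[] = new Array(n).fill(1 / n);

  for (let iter = 0; iter < iterations; iter++) {
    const next: number[] = new Array(n).fill((1 - damping) / n);
    for (let page = 0; page < n; page++) {
      const outLinks = links[page];
      if (outLinks.length === 0) {
        // Dangling page: distribute its rank evenly over all pages.
        for (let j = 0; j < n; j++) next[j] += (damping * rank[page]) / n;
      } else {
        // A page passes its rank, split evenly, to the pages it links to.
        for (const target of outLinks) next[target] += (damping * rank[page]) / outLinks.length;
      }
    }
    rank = next;
  }
  return rank;
}

// Toy graph: page 0 links to 1 and 2, page 1 links to 2, page 2 links back to 0.
console.log(pageRank([[1, 2], [2], [0]]));
```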

Good enabling technologies bring a meaningful increment to what users believe is state of the art, while maintaining viable unit economics for a product.

Prior to ChatGPT there were other conversational systems. You have probably never heard of them; they were terrible. The underlying technology was simply not up to the task. That changed with Large Language Models, the enabling technology for ChatGPT. Google (who run a search engine) made sizeable investments in bringing down the cost of training and evaluating deep learning models in order to train, amongst other things, Large Language Models. The same goes for Microsoft (who invest in OpenAI, and apparently also run a search engine).

This brings us to two other questions:

  1. Are Large Language Models an important enabling technology for web search?
  2. Is Conversational AI a relevant feature for a search engine product?

Looking at what both Google and Microsoft are investing in research, the answer to both questions appears to be "maybe". While OpenAI's ChatGPT is now the public face of Large Language Models, it is not as if Google is not interested. Their own model is called Pathways Language Model (PaLM), and it can do equally impressive things. If there is value in leveraging this in web search, it is near certain that it will be used.

And surely conversational interfaces are not new. You can ask Google, or ask Siri, or ask Alexa, or ask Cortana (that last one belongs to Microsoft; you can Bing that).

So yes, Large Language Models are a thing. Still, in order to surpass Google, you have to beat them at Information Retrieval by a meaningful margin; and Google is annoyingly good at Information Retrieval.

However, ChatGPT is so good at conversations that people seem to desperately want it to be better than specialised products in other problem spaces. Let us look at some more of those…

StackOverflow?

If this thing can point out the problem with my use of the useRef React hook, surely it must be an infinitely knowledgeable StackOverflow, right? Alas, as it turns out, it is so terrible at this that StackOverflow was quick to ban its use altogether.
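
As an aside, the useRef confusion it tends to catch is usually some variant of the hypothetical snippet below (my example, not ChatGPT output): state is stored in a ref, and the component is expected to re-render when it changes, which it never does.

```tsx
import { useRef, useState } from "react";

// Hypothetical example of a common useRef mistake: mutating ref.current
// does not trigger a re-render, so the count shown on screen never updates.
function BrokenCounter() {
  const count = useRef(0);
  return (
    <button onClick={() => { count.current += 1; }}>
      Clicked {count.current} times
    </button>
  );
}

// The idiomatic fix: state that should be reflected in the UI belongs in useState.
function Counter() {
  const [count, setCount] = useState(0);
  return (
    <button onClick={() => setCount((c) => c + 1)}>
      Clicked {count} times
    </button>
  );
}
```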

StackOverflow is a Q&A site. Their problem space is user-generated content. Anyone working in this space can tell you that curation is more expensive than production. Therefore, platforms tend to raise the bar for producing content, and aggressively encourage everyone to join in curation through voting, commenting, and various gamification efforts. Adding an actor that produces plausible-sounding content at effectively no cost would break the already hard-to-maintain balance between generation and curation. It would be the end of StackOverflow.

What if the machine-generated answers are flawless? Then there is no need for curation! That's right. Also, there would be no need to upload any answers to StackOverflow in the first place. Flawless content needs no curation. Unfortunately, a language model that only produces correct answers could never produce anything novel that was not in the training set. For the training set to progress, you need content curation. Mixing user-generated content with machine-generated content earns you hard-to-correct feedback loops at best. At worst, it renders the user-generated content platform useless. You see the problem?

Computer Programming?

Still, based on the training set it has now, it can produce working pieces of code from real-world requirements. Does that not change the profession of computer programming forever? Admittedly, this one is perhaps trickier. To the point that I have to say: I don't know.

Let us say that the problem space of computer programming is mastering abstractions. Sometimes the abstractions are over other abstractions, and ultimately the lowest-level abstractions are over hardware. Based on market rates for computer programming professionals, we can say that mastering more, and lower, layers tends to be valued more highly. For example, there might be a programmer who masters the abstractions ReactJS and TypeScript in order to manipulate the lower-level abstractions of the DOM and the JavaScript runtime. That person might earn around 75k-100k annually. Then there might be another computer programmer working at a business like Google on database query engine optimisations in C, while also having a better-than-trivial understanding of underlying abstractions like assembly language, and a decent understanding of the hardware that executes the instructions once in machine code. That person might earn around 450k annually. (Currency is not relevant to this example.) Both of these people do most of their work in one abstraction, TypeScript and C respectively, yet their concern for underlying (and adjacent) abstractions is very different.

ChatGPT is not concerned with abstractions. It can only go as far as transforming a description into code in the requested language. It produces tokens of "text", not an abstract syntax tree; it does not compile the result to inspect the assembly representation. So long as correctness of the happy flow is the objective, it can code. This could be why it is popular for Advent of Code. As soon as what lies beneath the abstraction starts to matter, all bets are off. Performance-critical code, security-sensitive code, error handling, resilience against distributed failures, defence against adversarial inputs… A lot of what matters in real programs is out of reach, to be honest.
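
To make the happy-flow point concrete, here is a hypothetical sketch of my own (the URL and payload shape are made up): the first function is the kind of code that looks fine in a demo, the second is what the surrounding realities of timeouts, failures, and malformed payloads tend to demand.

```typescript
// Happy-flow version: works in the demo, assumes the network and the payload cooperate.
async function getUserName(id: string): Promise<string> {
  const response = await fetch(`https://api.example.com/users/${id}`);
  const user = await response.json();
  return user.name;
}

// What lies beneath: timeouts, non-200 responses, and payloads that do not match expectations.
async function getUserNameDefensively(id: string): Promise<string | null> {
  try {
    const response = await fetch(`https://api.example.com/users/${id}`, {
      signal: AbortSignal.timeout(5_000), // do not hang forever on a stalled connection
    });
    if (!response.ok) return null; // 404s and 500s are not exceptional, they are routine
    const user: unknown = await response.json();
    if (typeof user === "object" && user !== null && typeof (user as { name?: unknown }).name === "string") {
      return (user as { name: string }).name;
    }
    return null; // the payload did not have the shape we hoped for
  } catch {
    return null; // network failure, timeout, or invalid JSON
  }
}
```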

Into The Future

It is a conversational model after all, right? By having long conversations, we can teach this system new things, right? Wrong. Under the hood it is a transformer. As per the ChatGPT FAQ: "The model is able to reference up to approximately 3000 words (or 4000 tokens) from the current conversation - any information beyond that is not stored." There is no live training. It cannot come up with novel things or make new connections.

With all of that said, ChatGPT is definitely the best Conversational AI system that has a public user interface. Asking it to change writing style, bring up specific arguments, or follow a prescribed line of reasoning, as well as having it produce actual code, is all very impressive.

Right now, when it comes to Conversational AI, Large Language Models are what makes the conversation feel real. But in order for the contents of the conversation to cover a real problem space, I believe that more breakthrough enabling technology will have to be built. Can we make transformers incorporate live data sources in their output? Can we apply them to just the conversational aspects, transforming more abstract, machine-generated input from sources like (curated) knowledge graphs and systems of record? The internet was never much about truth; human-generated content, curation, and information retrieval have always been at the core of online products. Let's stay tuned; a toy sketch of that last idea follows below.
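
For illustration, here is what such a split could look like in a handful of lines. Everything below is made up, and no existing product or API is being described: a lookup against a curated source answers the question, and the language-model part (stubbed out here) would only be asked to phrase that answer.

```typescript
// Hypothetical composition: retrieval answers the question, the language model only phrases it.
// All names and data below are invented for illustration.

interface Fact {
  subject: string;
  predicate: string;
  object: string;
}

// Stand-in for a query against a curated knowledge graph or system of record.
async function lookupFacts(question: string): Promise<Fact[]> {
  const graph: Fact[] = [
    { subject: "PageRank", predicate: "was an enabling technology for", object: "Google" },
  ];
  return graph.filter((f) => question.toLowerCase().includes(f.subject.toLowerCase()));
}

// Stand-in for a call to a language model that is only allowed to rephrase, not invent.
async function phraseAnswer(facts: Fact[], question: string): Promise<string> {
  // A real implementation would prompt an LLM with the facts and the question.
  return facts.map((f) => `${f.subject} ${f.predicate} ${f.object}.`).join(" ");
}

async function answer(question: string): Promise<string> {
  const facts = await lookupFacts(question);
  if (facts.length === 0) {
    return "I don't know."; // no facts, no answer: the model may not improvise
  }
  return phraseAnswer(facts, question);
}

answer("What was PageRank to Google?").then(console.log);
```

The appealing property of this split is that any improvisation is confined to phrasing; whether it can be made to work at web scale is exactly the open question.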