Context: Breakthrough capabilities, but risks of misinformation
So much has been written about ChatGPT since its launch that most of us have probably wondered how AI will enhance or challenge our jobs. A recent BBC story carried this ominous quote – “Workers that don’t work with AI are going to find their skills [become] obsolete … it’s imperative to work with AI to stay employed.” Amidst all these “change or you shall be replaced” warnings, it’s easy to empathize with Steven Schwartz, a New York lawyer who relied on ChatGPT for research, only to realize later that “six of the submitted cases appear to be bogus judicial decisions with bogus quotes and bogus internal citations.”
AI systems such as ChatGPT and Google Bard no doubt possess remarkable capabilities in terms of speed, information processing, natural language understanding, and responsive communication. However, it is far less clear how good they are at research. In this post, we delve into that question and use data to evaluate their performance in the context of the Southeast Asian e-commerce landscape.
Beyond the Surface: Analyzing ChatGPT and Bard’s e-commerce knowledge
Every day, our analysts comb through news websites, company reports, and government publications for new announcements or information relating to e-commerce in Asia.
To gauge the performance of ChatGPT and Bard in research, we conducted an assessment using 50 data points related to the Southeast Asian e-commerce market, including the gross merchandise value (GMV) of different countries and categories. To ensure fairness, all the data points queried were for the year 2020, given ChatGPT’s knowledge cutoff of September 2021.
The results were tagged in three simple buckets:
- Green – reliability; AI was able to generate answers and point to real sources where the data existed
- Yellow – hallucinations*; AI either quoted a wrong data point or invented a source that doesn’t exist
- Gray – honesty; AI acknowledged that it did not know the answer
* Hallucinations refer to the creation of nonexistent sources and information, where the AI fabricates facts and details instead of admitting its lack of knowledge. In our research we saw two types of hallucination: (i) a source was provided but the data could not be found in it; (ii) a data point was fabricated with no source provided at all.
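For concreteness, here is a minimal sketch, in Python, of how such tagging and tallying might be scripted. The names (Tag, Evaluation, share_by_tag) and the sample entries are hypothetical illustrations, not our actual tooling or data; in practice the tagging itself was done manually by analysts, and a script like this only keeps the bookkeeping consistent.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum

class Tag(Enum):
    GREEN = "reliability"      # answer given, real source, data found there
    YELLOW = "hallucination"   # wrong data point or invented source
    GRAY = "honesty"           # model admitted it did not know

@dataclass
class Evaluation:
    question: str   # e.g. "Indonesia e-commerce GMV, 2020"
    model: str      # "ChatGPT" or "Bard"
    tag: Tag

def share_by_tag(evals):
    """Return the share of answers in each bucket, as percentages."""
    counts = Counter(e.tag for e in evals)
    total = len(evals) or 1
    return {tag.value: round(100 * counts.get(tag, 0) / total, 1) for tag in Tag}

# Each queried data point is tagged by hand, then tallied.
evals = [
    Evaluation("Indonesia e-commerce GMV, 2020", "ChatGPT", Tag.GRAY),
    Evaluation("Vietnam e-commerce GMV, 2020", "Bard", Tag.YELLOW),
]
print(share_by_tag(evals))  # {'reliability': 0.0, 'hallucination': 50.0, 'honesty': 50.0}
```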
The ideal color mix would have been green and gray – there are only two possibilities: either the data is available, or it isn’t. And yet, as the chart above highlights, both models produced plenty of manufactured answers.
Truths vs Fiction: Insights reveal reliability gaps and hallucinations galore
What we observed:
- The findings revealed that both ChatGPT and Bard had accuracy rates below 20%, leaving significant room for improvement. The results did, however, improve between May and June, suggesting the teams behind the models are hard at work refining them over time.
- ChatGPT’s honesty improved: it declined to provide an answer in 50% of cases, up from 32% in May, suggesting OpenAI may be taking steps to curb hallucinations. However, reliability trended in the wrong direction, with the share of reliable (green) answers dropping from 14% to 4%.
- As for Bard, it never indicated that it didn’t know an answer, hallucinating more than 80% of the time. Furthermore, in 50% of cases, Bard reported data without citing any source. The warning about accuracy that Bard displays at sign-up is spot on – the sheer volume of fabrication undermines its suitability as a research tool for now.
We’ve highlighted one particularly egregious example of hallucination below. We asked Google Bard to look through YouTube transcripts for any data about Lazada, and it pointed to a video interview with the Lazada Philippines CEO that it claimed was uploaded on 28 May 2023, had over 1,000 views, and had at least 3 comments.
Remarkably, however, all three of these data points, each easily verifiable, were incorrect – the interview had taken place a full year earlier (on 04 April 2022), the video had fewer than 448 views, and it carried only 1 comment!
Decoding the trends: Our theory on the factors behind AI’s performance challenges
In our quest to understand the factors contributing to the observed results, we identified three potential reasons:
- Generative Nature of Language Models: ChatGPT and Bard, being language models, are trained primarily as generative tools and do not possess the ability to differentiate between fact and fiction. Simon Willison, a software developer, explains that large language models rely on statistical probability derived from their training data merely to predict the next word, which can lead to confabulation (a toy sketch of this next-word mechanism follows this list). Benj Edwards’ article on why AI models hallucinate is also a good read.
- Overfitting: Overfitting is a common issue in machine learning, where a model becomes excessively tailored to the training data, making it difficult to generalize to new or unseen data. This phenomenon could contribute to the inconsistency and lack of accuracy observed in AI systems.
- Lack of Contextual Understanding: AI systems currently struggle with contextual understanding, which encompasses elements like common sense, nuanced details, emotions, social dynamics, and human behavior. These limitations hinder their ability to accurately interpret and predict outcomes, particularly in areas that require a comprehensive understanding. Ted Chiang’s New Yorker piece gives a great insight into the limitations of AI.
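To illustrate the first point above, here is a toy sketch of next-word prediction. It is not how ChatGPT or Bard actually work internally; the prompt, candidate words, and probabilities are all invented for illustration. The point is simply that the sampling step optimizes for statistical plausibility, not truth – a confident-sounding but fabricated figure and an honest “I don’t know” are, to the model, just two continuations with different probabilities.

```python
import random

# Toy next-word predictor: the continuation is chosen purely by probability
# learned from training text; whether it is true never enters into the choice.
# All words and probabilities here are invented for illustration.
next_word_probs = {
    "Southeast Asia's e-commerce GMV in 2020 was": {
        "roughly":  0.5,   # leads toward a confident-sounding number
        "reported": 0.3,
        "unknown":  0.2,   # the "honest" continuation is just another candidate
    }
}

def predict_next(prompt: str) -> str:
    """Sample the next word by probability alone."""
    words, weights = zip(*next_word_probs[prompt].items())
    return random.choices(words, weights=weights, k=1)[0]

print(predict_next("Southeast Asia's e-commerce GMV in 2020 was"))
```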
Final word: Generative AI tools can’t replace human researchers just yet
Our research indicates that popular generative AI models suffer from extensive hallucinations, providing confident yet fabricated information. This makes them hard to rely on and forces analysts to exercise plenty of caution and skepticism when using them for e-commerce research.
Although generative AI continues to evolve rapidly, it is not yet a substitute for human analysts and researchers. Just how much longer will that remain the case? On that question, our guess is probably as good as yours.