NIX Solutions: Drift Phenomenon in AI Chatbots

Scientists have identified a “drift” phenomenon that causes degradation in AI chatbots’ performance, particularly in mathematical operations.

Scientists have reported another problem that can plague chatbots built on artificial intelligence platforms: a phenomenon called “drift,” which reflects the degradation of the system’s intellectual abilities.


GPT-3.5 and GPT-4 Performance Comparison

A look at how GPT-3.5 and GPT-4 compare across various tasks, including mathematical operations and question answering.

ChatGPT, which made its debut last year, revolutionized the field of artificial intelligence and even indirectly contributed to the writers’ strike in Hollywood. But a study published by scientists from Stanford University and the University of California, Berkeley points to a new problem for AI: ChatGPT has become worse at performing some elementary mathematical operations. The phenomenon is known as “drift”: an attempt to improve one part of a complex AI model can degrade its performance in other areas. This, the researchers note, greatly complicates the continuous improvement of neural networks.

The researchers reached this conclusion by testing two versions of GPT: 3.5, which is available to everyone for free, and 4, which requires a paid subscription. The chatbots were given an elementary task: to determine whether a given number is prime. A prime number is a natural number greater than 1 that is divisible only by 1 and itself. For a sufficiently large number, primality cannot be judged in one’s head, but a computer can handle the task by brute force: checking divisibility by 2, 3, 5, and so on. The test was based on a sample of 1,000 numbers. In March, the premium GPT-4 gave correct answers 84% of the time, already a dubious result for a computer, but by June the correct-answer rate had fallen to 51%.
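For reference, the brute-force check described above fits in a few lines of Python. This is a minimal sketch of trial division; the sample numbers below are illustrative, not taken from the study.

```python
def is_prime(n: int) -> bool:
    """Primality test by trial division: try divisors 2, 3, 4, ..."""
    if n < 2:
        return False  # 0, 1, and negative numbers are not prime
    d = 2
    while d * d <= n:  # a composite n must have a divisor no larger than sqrt(n)
        if n % d == 0:
            return False
        d += 1
    return True

# Illustrative checks (numbers not from the study):
print(is_prime(7919))  # True  -- 7919 is prime
print(is_prime(7917))  # False -- 7917 = 3 * 2639
```

Unlike a language model, which predicts plausible text, this deterministic procedure always returns the correct answer, which is why accuracy rates of 84% and 51% look so dubious for a computer.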

Overall, GPT-4 showed degradation in six of the eight tasks. GPT-3.5, by contrast, showed progress in six tasks, though in most cases it remained weaker than its more advanced sibling. Many chatbot users have noticed an increase in incorrect answers, and according to the Stanford and Berkeley findings these are not subjective impressions: the degradation is confirmed by empirical data. “When we release new versions of models, our priority is to make the new models smarter across the board. We make efforts to improve the new versions across the whole range of tasks. At the same time, our assessment methodology is imperfect, and we are constantly improving it,” OpenAI commented on the study.

The Unpredictable Journey of AI

Understanding the non-linear progress of AI models and the need for closer observation and evaluation in the field.

There is no question of total degradation of AI models: in a number of tests, the generally less accurate GPT-3.5 improved while GPT-4 got worse. In addition to the math problems, the researchers asked the chatbots to answer 1,500 questions. Whereas in March a GPT-4-based chatbot answered 98% of the questions, in June it answered only 23%, and its responses were often too short: the AI would state that the question was subjective and that it had no opinion of its own.

The Stanford and Berkeley researchers say their study is not a call to abandon AI technologies but rather to observe their dynamics closely, notes NIX Solutions. People are accustomed to thinking of knowledge as a series of problems solved in sequence, each building on the last. With AI, the pattern turns out to be different: a step forward may be matched by a step back, or by a step in some other unpredictable direction. AI services will likely continue to evolve, but their journey will not be a straight line.