NIX Solutions: Apple Questions AI’s Logic Skills

Apple researchers have questioned artificial intelligence's capacity for logical reasoning. Their experiments show that even advanced language models struggle with basic math problems that most people, including children, can solve effortlessly.

The study found that AI responses to math questions vary based on how the problems are worded. More concerning was the discovery that the models’ accuracy declines as the number of problem conditions increases. This suggests that modern large language models (LLMs) do not possess true logical thinking abilities. Instead, they mimic patterns of reasoning found in their training data.

GSM-Symbolic Test and AI Performance

To evaluate AI’s reasoning skills, the Apple team developed a benchmark called GSM-Symbolic. It generates many variants of each math problem from symbolic templates, swapping out names and numbers while keeping the structure intact. Some variants also include seemingly important but ultimately irrelevant details, which add complexity without affecting the underlying logic.
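To illustrate the idea, here is a minimal sketch of how such a template might work. This is not the paper’s actual template code; the wording, names, and numeric ranges are placeholders chosen for the example.

```python
import random

# Minimal sketch of a symbolic problem template. The wording, names, and
# numeric ranges are illustrative placeholders, not Apple's actual templates.
TEMPLATE = (
    "{name} picks {fri} kiwis on Friday, {sat} on Saturday, and twice as "
    "many on Sunday as on Friday.{noise} How many kiwis does {name} have?"
)
DISTRACTOR = " {small} of the kiwis picked on Sunday are smaller than average."

def generate(add_distractor: bool = False) -> tuple[str, int]:
    """Instantiate the template with random values; return (question, answer)."""
    fri = random.randint(10, 60)
    sat = random.randint(10, 60)
    noise = DISTRACTOR.format(small=random.randint(2, 9)) if add_distractor else ""
    question = TEMPLATE.format(
        name=random.choice(["Oliver", "Mia", "Sam"]), fri=fri, sat=sat, noise=noise
    )
    # The distractor changes the wording but never the correct answer.
    return question, fri + sat + 2 * fri

if __name__ == "__main__":
    question, answer = generate(add_distractor=True)
    print(question, "->", answer)
```

Because every variant has a known correct answer, a model’s accuracy can be measured across many rewordings of the same underlying problem, with and without the distractor clause.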

These additional statements significantly confused the AI models. In one test, the performance of state-of-the-art models dropped by up to 65% after a single irrelevant clause was introduced. This shows how small changes in problem wording can derail AI problem-solving.

For example, one task stated: “Oliver picks 44 kiwis on Friday, 58 on Saturday, and twice as many on Sunday as on Friday. Five of the kiwis picked on Sunday are smaller than average. How many kiwis does Oliver have?”

Several models, including o1-mini and Llama-3-8B, mistakenly subtracted the five smaller kiwis from the total, answering 185 instead of the correct 190. The error shows how AI models treat irrelevant details as operative constraints, leading to wrong answers.
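Working the numbers both ways makes the failure explicit:

```python
# Correct reading: the size remark is irrelevant; small kiwis still count.
friday, saturday = 44, 58
sunday = 2 * friday                   # twice Friday's count
correct = friday + saturday + sunday  # 44 + 58 + 88 = 190

# The observed failure mode: subtracting the five "smaller" kiwis.
wrong = correct - 5                   # 185

assert correct == 190 and wrong == 185
```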

Limitations of AI Models and Future Research

Researchers noted that AI systems frequently convert statements into mathematical operations without grasping their actual meaning. For instance, the word “discount” was often treated as a cue to multiply, regardless of context. Although larger models such as Claude and Gemini answered the kiwi problem correctly, the general trend holds: accuracy declines as problems grow more complex, notes NIX Solutions.

Smaller models, with only a few billion parameters, showed the steepest performance drops. Even o1-preview, one of OpenAI’s top models, lost a notable 17.5% in accuracy.

These findings highlight the ongoing challenge of developing AI capable of formal reasoning. Achieving human-like thinking and robust problem-solving remains a key objective in the pursuit of general artificial intelligence. We’ll keep you updated on future advancements in this area.