ars-rss
Apple study exposes deep cracks in LLMs’ “reasoning” capabilities
Irrelevant red herrings lead to “catastrophic” failure of logical inference.
For a while now, companies like OpenAI and Google have been touting advanced “reasoning” capabilities as the next big step in their latest artificial intelligence models. Now, though, a new study from six Apple engineers shows that the mathematical “reasoning” displayed by advanced large language models can be extremely brittle and unreliable in the face of seemingly trivial changes to common benchmark problems.
The fragility highlighted in these new results helps support previous research suggesting that LLMs’ use of probabilistic pattern matching is missing the formal understanding of underlying concepts needed for truly reliable mathematical reasoning capabilities. “Current LLMs are not capable of genuine logical reasoning,” the researchers hypothesize based on these results. “Instead, they attempt to replicate the reasoning steps observed in their training data.”
Mix it up
In “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models”—currently available as a pre-print paper—the six Apple researchers start with GSM8K’s standardized set of over 8,000 grade-school level mathematical word problems, which is often used as a benchmark for modern LLMs’ complex reasoning capabilities. They then take the novel approach of modifying a portion of that testing set to dynamically replace certain names and numbers with new values—so a question about Sophie getting 31 building blocks for her nephew in GSM8K could become a question about Bill getting 19 building blocks for his brother in the new GSM-Symbolic evaluation.
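The substitution idea described above can be illustrated with a minimal sketch. It is not the researchers' code: the template wording, the name and relative lists, the value ranges, and the `make_variant` helper are all assumptions made for illustration. The point it shows is that a question becomes a template, each variant draws fresh names and numbers, and the correct answer is recomputed from the drawn values.

```python
import random

# Hypothetical template in the spirit of the building-blocks example;
# the wording, placeholders, and ranges are assumptions, not the paper's.
TEMPLATE = (
    "{name} gets {blocks} building blocks for {pronoun} {relative}. "
    "{name} also gets {balls} bouncy balls. "
    "How many toys does {name} have in total?"
)

PEOPLE = [("Sophie", "her"), ("Bill", "his"), ("Maya", "her")]
RELATIVES = ["nephew", "brother", "cousin"]


def make_variant(rng):
    """Return one (question, answer) pair with freshly drawn names and numbers."""
    name, pronoun = rng.choice(PEOPLE)
    relative = rng.choice(RELATIVES)
    blocks = rng.randint(5, 40)
    balls = rng.randint(5, 40)
    question = TEMPLATE.format(
        name=name,
        pronoun=pronoun,
        relative=relative,
        blocks=blocks,
        balls=balls,
    )
    # The ground-truth answer is derived from the sampled values,
    # so every variant stays consistent with its own question.
    answer = blocks + balls
    return question, answer


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for _ in range(3):
        question, answer = make_variant(rng)
        print(question, "->", answer)
```

A model that has truly learned the arithmetic should score about the same on every variant, since only the surface details change between them.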
Expert witness used Copilot to make up fake damages, irking judge
Judge calls for a swift end to experts secretly using AI to sway cases.
A New York judge recently called out an expert witness for using Microsoft’s Copilot chatbot to inaccurately estimate damages in a real estate dispute that partly depended on an accurate assessment of damages to win.
In an order Thursday, judge Jonathan Schopf warned that “due to the nature of the rapid evolution of artificial intelligence and its inherent reliability issues” any use of AI should be disclosed before testimony or evidence is admitted in court. Admitting that the court “has no objective understanding as to how Copilot works,” Schopf suggested that the legal system could be disrupted if experts started overly relying on chatbots en masse.
His warning came after an expert witness, Charles Ranson, dubiously used Copilot to cross-check calculations in a dispute over a $485,000 rental property in the Bahamas that had been included in a trust for a deceased man’s son. The court was being asked to assess if the executrix and trustee—the deceased man’s sister—breached her fiduciary duties by delaying the sale of the property while admittedly using it for personal vacations.
Ward Christensen, BBS inventor and architect of our online age, dies at age 78
Christensen kick-started online culture by inspiring thousands of hobbyist communities.
On Friday, Ward Christensen, co-inventor of the computer bulletin board system (BBS), died at age 78 in Rolling Meadows, Illinois. Christensen, along with Randy Suess, created the first BBS in Chicago in 1978, leading to an important cultural era of digital community-building that presaged much of our online world today.
Friends and associates remember Christensen as humble and unassuming, a quiet innovator who never sought the spotlight for his groundbreaking work. Despite creating one of the foundational technologies of the digital age, Christensen maintained a low profile throughout his life, content with his long-standing career at IBM and showing no bitterness or sense of missed opportunity as the Internet age dawned.
“Ward was the quietest, pleasantest, gentlest dude,” said BBS: The Documentary creator Jason Scott in a conversation with Ars Technica. Scott documented Christensen’s work extensively in a 2002 interview for that project. “He was exactly like he looks in his pictures,” he said, “like a groundskeeper who quietly tends the yard.”
People think they already know everything they need to make decisions
When given partial info, most people felt confident they knew all they needed to.
The world is full of people who have excessive confidence in their own abilities. This is famously known as the Dunning-Kruger effect, which describes how people who lack expertise in something will necessarily lack the knowledge needed to recognize their own limits. Now, a different set of researchers has come out with what might be viewed as a corollary to Dunning-Kruger: People have a strong tendency to believe that they always have enough data to make an informed decision, regardless of what information they actually have.
The work, done by Hunter Gehlbach, Carly Robinson, and Angus Fletcher, is based on an experiment in which they intentionally gave people only partial, biased information, finding that people never seemed to consider they might only have a partial picture. “Because people assume they have adequate information, they enter judgment and decision-making processes with less humility and more confidence than they might if they were worrying whether they knew the whole story or not,” they write. The good news? When given the full picture, most people are willing to change their opinions.
Ignorant but confident
The basic setup of the experiment is very straightforward. The researchers developed a scenario where an ongoing water shortage was forcing a school district to consider closing one of its schools and merging its students into another existing school. They then wrote an article that described the situation and contained seven different pieces of information: three that favored merging, three that disfavored it, and one that was neutral. Just over half of the control group that read the full article favored merging the two schools.
Smart gardening firm’s shutdown a reminder of Internet of Things’ fickle nature
Company closing “due to a number of challenges with this business.”
AeroGarden, which sells Wi-Fi-connected indoor gardening systems, is going out of business on January 1. While Scotts Miracle-Gro has continued selling AeroGarden products after announcing the impending shutdown, the future of the devices’ companion app is uncertain.
AeroGarden systems use hydroponics and LED lights to grow indoor gardens without requiring sunlight or soil. The smart gardening system arrived in 2006, and Scotts Miracle-Gro took over complete ownership in 2020. Some AeroGardens work with the iOS and Android apps that connect to the gardens via Wi-Fi and tell users when their plants need water or nutrients. AeroGarden also marketed the app as a way for users to easily monitor multiple AeroGardens and control the amount of light, water, and nutrients they should receive. The app offers gardening tips and can access AeroGarden customer service representatives and AeroGarden communities on Facebook and other social media outlets.
Regarding the reasoning for the company’s closure, AeroGarden’s FAQ page only states:
Rebellion brews underground in Silo S2 trailer
“What if everything you know to be true was just one big lie?”
Apple TV’s dystopian sci-fi drama Silo, based on the trilogy by novelist Hugh Howey, was one of the more refreshing surprises on streaming television in 2023: a twist-filled combination of political thriller and police procedural set in a post-apocalyptic world. We included it in our year-end TV roundup, calling the series “one of the more intriguing shows of the year.” The official trailer recently dropped for S2, and it looks like we can expect another suspenseful season full of surprising revelations.
(Spoilers for S1 below.)
As we wrote in last year’s roundup, Silo is set in a self-sustaining underground city inhabited by a community whose recorded history only goes back 140 years, generations after the silo was built by the founders. Outside is a toxic hellscape that is only visible on big screens in the silo’s topmost level. Inside, 10,000 people live together under a pact: Anyone who says they want to “go out” is immediately granted that wish—cast outside in an environment suit on a one-way trip to clean the cameras. But those who make that choice inevitably die soon after because of the toxic environment.
Lots of PCs are poised to fall off the Windows 10 update cliff one year from today
Windows 10 is by far the most-used version of Windows, and support ends soon.
One year from today, on October 14, 2025, Microsoft will stop releasing security updates for PCs that are still running Windows 10.
Organizations and individuals will still be able to pay for three more years of updates, with prices that go up steadily each year. (Microsoft still hasn’t provided pricing for end users, only saying that it will release pricing info “closer to the October 2025 date.”) But for most PCs running Windows 10, the end of the line is in sight.
Normally, this wouldn’t be a huge deal; the last dregs of support for Windows 7 and Windows 8 dried up in January 2023, and the world didn’t end even though some PCs continue to run those OS versions. But there are three things about the end of Windows 10 support that are slightly different from other recent end-of-life dates:
Musk’s X blocked links to JD Vance dossier after hearing from Trump campaign
Report: Trump campaign “connected with X to prevent the circulation of links.”
Elon Musk’s X blocked links to the JD Vance dossier after hearing directly from the Trump campaign, according to a new report that describes Musk’s extensive efforts to boost Trump’s presidential campaign.
A New York Times article titled “Musk Is Going All In to Elect Trump” said that “the relationship [between Trump and Musk] has proved significant in other ways. After a reporter’s publication of hacked Trump campaign information last month, the campaign connected with X to prevent the circulation of links to the material on the platform, according to two people with knowledge of the events. X eventually blocked links to the material and suspended the reporter’s account.”
We contacted X today and will update this article if it provides comment.
Two comets will be visible in the night skies this month
Halloween visitors from the distant Oort Cloud.
The human mind may find it difficult to conceptualize: a cosmic cloud so colossal it surrounds the Sun and eight planets as it extends trillions of miles into deep space.
The spherical shell known as the Oort Cloud is, for all practical purposes, invisible. Its constituent particles are spread so thinly, and so far from the light of any star, including the Sun, that astronomers simply cannot see the cloud, even though it envelops us like a blanket.
It is also theoretical. Astronomers infer the Oort Cloud is there because it’s the only logical explanation for the arrival of a certain class of comets that sporadically visit our solar system. The cloud, it turns out, is basically a gigantic reservoir that may hold billions of icy celestial bodies.
SpaceX catches returning rocket in mid-air, turning a fanciful idea into reality
“Starships are meant to fly. It sure as hell flew today. So let’s get ready for the next one.”
SpaceX accomplished a groundbreaking engineering feat Sunday, when it launched the fifth test flight of its gigantic Starship rocket, then caught the booster back at the launch pad in Texas with mechanical arms seven minutes later.
This achievement is the first of its kind, and it’s crucial for SpaceX’s vision of rapidly reusing the Starship rocket, enabling human expeditions to the Moon and Mars, routine access to space for mind-bogglingly massive payloads, and novel capabilities that no other company—or country—seems close to attaining.
The test flight began with a thundering liftoff of the 398-foot-tall (121.3-meter) Starship rocket at 7:25 am CDT (12:25 UTC) from SpaceX’s Starbase launch site in South Texas, a few miles north of the US-Mexico border. The rocket’s Super Heavy booster stage fired 33 Raptor engines, generating nearly 17 million pounds of thrust and gulping 20 tons of methane and liquid oxygen propellants per second at full throttle.