How Accurate Is ChatGPT? The Complete 2025 Guide to AI Reliability, Benchmarks, and Domain-Specific Performance

January 24, 2025
Measuring ChatGPT accuracy: Understanding AI reliability in 2025

How accurate is ChatGPT? It is one of the most frequently asked questions about artificial intelligence in 2025. With millions of people using ChatGPT daily for everything from writing emails to diagnosing medical symptoms, understanding the reliability and accuracy of this AI tool has never been more important. This comprehensive guide examines the latest research, benchmark results, and real-world studies to give you a complete picture of ChatGPT accuracy across different domains.

Whether you are a student using ChatGPT for homework help, a professional relying on it for work tasks, or simply curious about AI capabilities, this article will help you understand when you can trust ChatGPT and when you need to double-check its responses. We will cover everything from official benchmark scores to domain-specific accuracy rates, hallucination studies, and expert tips for getting the most accurate results.

Understanding AI Accuracy: What Does It Really Mean?

Before diving into specific numbers, it is essential to understand what accuracy means in the context of large language models (LLMs) like ChatGPT. Unlike traditional software that follows predetermined rules, ChatGPT generates responses based on patterns learned from vast amounts of training data. This means accuracy can vary dramatically depending on the type of question, the domain, and even how the question is phrased.

AI accuracy is typically measured through several dimensions including factual correctness (whether the information provided is true), reasoning accuracy (whether logical conclusions are valid), completeness (whether all relevant information is included), and consistency (whether the AI gives the same answer to similar questions). Different benchmarks test different aspects of these capabilities.

It is also crucial to distinguish between accuracy and confidence. ChatGPT often presents information with high confidence even when it is wrong, a phenomenon that makes its inaccuracies particularly dangerous. According to research from Harvard Kennedy School, this tendency to generate plausible-sounding but false information has led researchers to develop frameworks specifically for studying AI hallucinations.

Overall Accuracy Rates by ChatGPT Model Version

ChatGPT accuracy has improved significantly across different model versions. Understanding these differences is crucial because many users may still be using older versions or free-tier plans that provide less capable models.

GPT-3.5 (Released 2022)

GPT-3.5, the model that powered the original viral ChatGPT release, achieves approximately 70-75% general accuracy according to industry assessments. On the MMLU benchmark, which tests broad knowledge across 57 subjects, GPT-3.5 scored around 70%. For coding tasks measured by HumanEval, it achieved 48.1% accuracy. This version is known to have a hallucination rate of approximately 39.6% when generating scientific references.

GPT-4 (Released March 2023)

GPT-4 represented a massive leap in accuracy, achieving 85-88% general accuracy across various benchmarks. On MMLU, GPT-4 broke through with scores around 86%, a significant improvement over its predecessor. The model reduced hallucination rates to 28.6% for citation accuracy in scientific contexts. GPT-4 also demonstrated remarkable performance on professional exams, scoring in the top 10% on Bar exam simulations.

GPT-4o (Released 2024)

GPT-4o, where the o stands for omni, brought further improvements. According to OpenAI benchmarks, GPT-4o scores 88.7% on MMLU for language understanding, 76.6% on MATH for mathematical problem solving, 53.6% on GPQA for graduate-level reasoning, and an impressive 90.2% on HumanEval for coding tasks. However, studies have found that on the SimpleQA factual benchmark, GPT-4o still hallucinates 61.8% of the time, highlighting the persistent challenge of factual accuracy.

GPT-4.5 (Released Early 2025)

GPT-4.5 made significant strides in reducing hallucinations. According to OpenAI, it cuts hallucination rates by 63% relative to GPT-4o: on the SimpleQA benchmark, GPT-4.5 fabricates information only 19% of the time, compared with 52% for GPT-4o. OpenAI also describes the model as roughly 10 times more compute-efficient, with substantially fewer factual errors.

GPT-5 (Released 2025)

GPT-5 represents the current state of the art, achieving 87-94% accuracy depending on the task domain. OpenAI reports that GPT-5 achieves 94.6% accuracy on AIME 2025 math problems and 87% on MMLU broad knowledge tests, makes roughly 45% fewer factual errors than GPT-4o, and is about 6 times less likely to make up answers. These improvements make GPT-5 significantly more reliable for professional and academic use cases.

Benchmark Results: MMLU, HumanEval, and Beyond

AI researchers use standardized benchmarks to measure and compare model capabilities. Understanding these benchmarks helps contextualize accuracy claims and understand where ChatGPT excels or struggles.

MMLU (Massive Multitask Language Understanding)

MMLU is one of the most important benchmarks for measuring broad knowledge. It tests AI systems across 57 subjects ranging from elementary mathematics to professional law and medicine. Early large models like GPT-3 only managed around 30-40% accuracy on MMLU, while a human expert ensemble could reach about 89%. Over time, models improved dramatically: Chinchilla and PaLM reached the 50-60% range by 2022, GPT-3.5 hit around 70-75%, and GPT-4 broke through with scores around 86%.
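
To make these percentages concrete: MMLU items are four-option multiple-choice questions, and a model's score is simply the fraction it answers correctly. The sketch below shows that scoring logic in Python; the sample question is a made-up item in the MMLU format, not an actual benchmark question.

```python
# Illustrative MMLU-style scoring: four-option multiple choice, graded by letter match.
# The question is a hypothetical example in the MMLU format, not a real benchmark item.
QUESTION = """Which particle carries a negative electric charge?
A. Proton
B. Neutron
C. Electron
D. Photon
Answer with a single letter."""

GOLD = "C"

def grade(model_answer: str, gold: str = GOLD) -> bool:
    # Accept answers like "C", "c.", or "C. Electron".
    return model_answer.strip().upper().startswith(gold)

predictions = ["C", "c. Electron", "B"]    # e.g. answers collected from a model
accuracy = sum(grade(p) for p in predictions) / len(predictions)
print(f"{accuracy:.1%}")                   # 66.7% -- reported as percent correct
```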

According to the Stanford AI Index 2025 Report, at the end of 2023, performance gaps on MMLU between leading American and Chinese models were 17.5 percentage points. By the end of 2024, these differences had narrowed substantially to just 0.3 percentage points, indicating rapid global convergence in AI capabilities.

HumanEval (Coding Benchmark)

HumanEval measures how well LLMs can generate code by testing their ability to understand programming tasks and produce syntactically correct and functionally accurate code. The dataset includes 164 programming tasks with unit tests that automatically verify solutions. A model's solution must pass all provided test cases for a given problem to be considered correct.
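
As a rough illustration (not the official evaluation harness), a HumanEval-style check amounts to executing a model-generated completion against the task's unit tests and counting the problem as solved only if every assertion passes. The task below is HumanEval's first problem, with its tests reduced to a few hand-written assertions:

```python
# Sketch of a HumanEval-style check: run the generated code, then apply the unit tests.
CANDIDATE_SOLUTION = """
def has_close_elements(numbers, threshold):
    return any(
        abs(a - b) < threshold
        for i, a in enumerate(numbers)
        for b in numbers[i + 1:]
    )
"""

def passes_unit_tests(candidate_code: str) -> bool:
    namespace = {}
    exec(candidate_code, namespace)              # load the generated function
    f = namespace["has_close_elements"]
    try:
        assert f([1.0, 2.0, 3.0], 0.5) is False
        assert f([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) is True
        assert f([1.0, 2.0, 5.9, 4.0, 5.0], 0.95) is True
    except Exception:                            # any failure or crash means "not solved"
        return False
    return True

print(passes_unit_tests(CANDIDATE_SOLUTION))     # True -> this task counts as solved
```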

Performance on HumanEval has improved dramatically: GPT-3.5 achieved 48.1%, GPT-4 reached 67.0%, and GPT-4o now scores 90.2%. For comparison, Claude 3.5 Sonnet achieved 92% on HumanEval, slightly edging out GPT-4o. However, researchers have found that when using HumanEval+, an extended version with 80 times more test cases, pass rates drop by 19.3-28.9%, suggesting that standard benchmarks may overestimate real-world coding accuracy.

GPQA (Graduate-Level Reasoning)

GPQA Diamond tests graduate-level reasoning in physics, chemistry, and biology. GPT-4o achieves 53.6% on this challenging benchmark, while the latest GPT-5.2 Pro reaches 93.2%. This benchmark is particularly important because it tests the kind of advanced reasoning required for scientific research and professional applications.

Emerging Benchmarks

The saturation of traditional benchmarks like MMLU, GSM8K, and HumanEval has pushed researchers to develop more challenging evaluations. Notable among these newer benchmarks are Humanity's Last Exam, a rigorous academic test on which the top system scores just 8.80%; FrontierMath, a complex mathematics benchmark on which AI systems solve only 2% of problems; and BigCodeBench, a coding benchmark where AI systems achieve a 35.5% success rate compared with the human standard of 97%. These benchmarks reveal that despite impressive progress, significant gaps remain between AI and human expert performance.

Domain-Specific Accuracy: Math and Reasoning

Mathematics has historically been one of ChatGPT's weaker areas, though recent versions have shown substantial improvement. Understanding the nuances of math accuracy is crucial for anyone relying on ChatGPT for calculations or mathematical reasoning.

Historical Math Performance

According to a NeurIPS 2023 study on Mathematical Capabilities of ChatGPT, the January 2023 version of ChatGPT achieved a perfect score on only 29% of random samples from the MATH dataset, compared to Minerva's best model, which achieved 50%. Initial tests indicated that its performance was significantly below the roughly 60% accuracy of state-of-the-art math word problem solvers.

Researchers identified that one of ChatGPT's main limitations is its difficulty with multistep logical inference. The study found that ChatGPT and GPT-4 are most successful as mathematical assistants for querying facts, acting as mathematical search engines and knowledge base interfaces. GPT-4 can additionally handle undergraduate-level mathematics but fails at graduate-level difficulty.

Why LLMs Struggle with Math

Tokenization contributes to ChatGPT's math limitations. Dividing data into chunks helps AI encode information densely, but because tokenizers do not really know what numbers are, they frequently break up the relationships between digits. Testing has shown that OpenAI's o1 model handles multiplication of numbers up to 9 digits by 9 digits with decent accuracy, while GPT-4o struggles beyond 4 digits by 4 digits.
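
You can see this effect directly with OpenAI's open-source tiktoken tokenizer. The snippet below assumes the tiktoken package is installed; exact splits vary by tokenizer, but long numbers are generally broken into short digit chunks rather than kept whole:

```python
# Illustrative: inspect how a GPT-4-era tokenizer splits a multiplication prompt.
# Requires: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")       # encoding used by GPT-4 / GPT-3.5-turbo
prompt = "123456789 * 987654321 ="
token_ids = enc.encode(prompt)

# Each long number is typically split into several short digit chunks, so the model
# never sees the full place-value structure of either operand as a single unit.
print([enc.decode([t]) for t in token_ids])
```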

Performance Variability

A concerning finding from Stanford University research showed that ChatGPT performed worse on certain tasks in June than its March version had. The study found wild fluctuations, called drift, in the technology's ability to perform certain tasks. One example showed ChatGPT accurately answering a simple math question (identifying whether a number is prime) 98% of the time, then dropping to just 2% a few months later. This performance drift highlights the importance of testing accuracy regularly rather than assuming consistent performance.
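
One practical response to drift is to keep a small, frozen probe set and re-run it on a schedule, logging the score each time so a sudden drop becomes visible. The sketch below assumes the OpenAI Python SDK (v1+) and an API key in the environment; the probe questions, expected answers, and model name are placeholders to adapt to your own workload:

```python
# Minimal drift spot-check sketch (assumptions: openai>=1.0 installed, OPENAI_API_KEY set,
# and that simple substring matching is good enough for these placeholder probes).
from datetime import date
from openai import OpenAI

client = OpenAI()

FROZEN_PROBES = [  # keep this set fixed so scores are comparable across runs
    ("Is 17077 a prime number? Answer yes or no.", "yes"),
    ("What is 127 * 43? Answer with the number only.", "5461"),
]

def spot_check(model: str = "gpt-4o") -> float:
    correct = 0
    for question, expected in FROZEN_PROBES:
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": question}],
        )
        answer = reply.choices[0].message.content.strip().lower()
        correct += expected in answer
    return correct / len(FROZEN_PROBES)

print(date.today(), spot_check())  # append to a log; a sharp drop suggests drift
```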

Geometry and Measurement Challenges

Research on NAEP mathematics problem solving found that the ChatGPT-4 and ChatGPT-4o models excel in computational tasks and procedural logic, as seen in their adept handling of algebra and number properties. However, the models performed notably worse in geometry and measurement, with odds ratios of 0.18 for geometry and 0.21 for measurement relative to algebra questions.

Recent Improvements

The latest models show dramatic improvement. GPT-5 achieves 94.6% accuracy on AIME 2025 math problems. On FrontierMath Tier 1-3, an evaluation of expert-level mathematics, GPT-5.2 Thinking set a new state of the art, solving 40.3% of problems. While this represents significant progress, it still means the AI fails on nearly 60% of expert-level math problems.

Domain-Specific Accuracy: Coding and Programming

Coding is one of ChatGPT's strongest domains, with GPT-4o achieving 90.2% on HumanEval. However, real-world programming involves much more complexity than benchmark tasks.

Benchmark Performance

On the HumanEval benchmark, ChatGPT performance has improved steadily: GPT-3.5 achieved 48.1%, GPT-4 reached 67.0%, and GPT-4o scores 90.2%. This makes coding one of the areas where ChatGPT approaches human-level performance on standardized tests.

Real-World Software Engineering

The SWE-bench benchmark tests AI on real-world software engineering tasks drawn from actual GitHub issues. GPT-4o achieves approximately 49% on SWE-bench when combined with OpenAI o1. GPT-5.2 Thinking sets a new state of the art of 55.6% on SWE-Bench Pro, which tests four programming languages rather than just Python. On SWE-bench Verified, GPT-5.2 Thinking scores 80%, which translates into a model that can more reliably debug production code, implement feature requests, refactor large codebases, and ship fixes end to end.

Benchmark Limitations

Researchers have found that standard benchmarks may overestimate real-world coding ability. EvalPlus extends the HumanEval test cases by roughly 80x, and when performance is evaluated on HumanEvalNext, an improved version of the benchmark, there is a substantial decline in pass@1 accuracy: across ten state-of-the-art open-source code models, the average pass@1 score decreases by 31.2%, with a median drop of 26.0%. This suggests that ChatGPT's apparent 90% coding accuracy may be closer to 60-70% under more rigorous testing.
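
For context, pass@1 is the probability that a single sampled completion passes all of a problem's unit tests; the HumanEval paper's unbiased estimator generalizes this to pass@k when n samples are drawn per problem and c of them pass. A short sketch of that calculation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples per problem, c of which pass all tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical numbers: 200 samples on one problem, 120 pass -> pass@1 is simply c/n.
print(pass_at_k(200, 120, 1))    # 0.6
print(pass_at_k(200, 120, 10))   # chance that at least one of 10 samples would pass
```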

Common Failure Modes

ChatGPT coding failures typically fall into several categories: syntax errors in less common programming languages, incorrect handling of edge cases, failure to understand complex system architecture, and generation of code that works for simple examples but fails at scale. Users should always test generated code thoroughly before deploying it in production environments.

Domain-Specific Accuracy: Medical and Health

Medical accuracy is perhaps the most critical domain for ChatGPT evaluation, given the potential for harm if incorrect medical information leads to poor health decisions. Research in this area has been extensive and revealing.

Overall Clinical Decision Making

Research from Mass General Brigham found that ChatGPT was about 72 percent accurate in overall clinical decision making, from coming up with possible diagnoses to making final diagnoses and care management decisions. The study showed ChatGPT was best at making final diagnoses, where it was 77 percent accurate. It was lowest-performing in making differential diagnoses at only 60 percent accuracy, and 68 percent accurate in clinical management decisions such as determining medications.

ChatGPT vs. Physicians

A University of Virginia study found surprising results: physicians using ChatGPT Plus had a median diagnostic accuracy of 76.3%, while ChatGPT Plus alone achieved a median diagnostic accuracy of more than 92%. However, many physicians did not agree with or factor in the model's diagnostic predictions, though those with ChatGPT access did complete their assessments more than a minute faster on average.

In an emergency department setting, a head-to-head comparison published in PMC showed that while GPT-3.5 achieved the same diagnostic accuracy as ED resident physicians, GPT-4 actually surpassed the physicians. A separate study found that ChatGPT outperformed human doctors 64% vs 60.2% on neurology assessments.

Performance on Medical Exams

Studies comparing ChatGPT versions showed accuracy on USMLE Step 2 CK practice questions improving from 47.7% with ChatGPT 3.5 to 87.2% with ChatGPT 4.0. This dramatic improvement demonstrates the rapid advancement in medical knowledge capabilities.

Common vs. Rare Conditions

A critical finding from medical research shows that ChatGPT accuracy varies dramatically based on condition prevalence. Physicians rated ChatGPT responses as completely or mostly correct 84.8% of the time overall. However, for common conditions, accuracy was 86.6%, while for rare disorders, accuracy dropped to just 16.6%. ChatGPT 4 solved all common cases within 2 suggested diagnoses, while rare disease conditions required 8 or more suggestions to solve 90% of cases.

Specific Medical Capabilities

Research found ChatGPT showed high accuracy in identifying disease terms (88-97%), drug names (90-91%), and genetic information (88-98%). However, symptom identification scored lower at 49-61%. When asked for specific accession numbers from genetic databases, ChatGPT made them up entirely, demonstrating the hallucination problem in specialized medical contexts.

Expert Recommendations

Researchers consistently conclude that ChatGPT remains best used to augment, rather than replace, human physicians. Studies emphasize proceeding with caution when using ChatGPT as a diagnostic tool and ensuring it is used responsibly, as high relevance combined with relatively low accuracy can present important information that may be misleading.

Domain-Specific Accuracy: Legal

ChatGPT's legal accuracy has been a subject of significant debate, particularly following OpenAI's claims about GPT-4's bar exam performance.

Bar Exam Performance Controversy

GPT-4 achieved a score of 298 out of 400 on the US Uniform Bar Examination (MBE + MEE + MPT), which OpenAI claimed placed it within the top 10% of human test takers. However, this claim has been challenged by researchers, most prominently by Eric Martinez in his article "Re-evaluating GPT-4's bar exam performance," published in Artificial Intelligence and Law (2024).

When the sample was limited to those who passed the bar, the model's percentile dropped significantly. On the aggregate UBE score, GPT-4 scored in approximately the 45th percentile. For the MBE section, GPT-4 scored in approximately the 69th percentile, whereas for the MEE + MPT essay sections, it scored in approximately the 15th percentile.

Performance by Subject Area

On the MBE, GPT-4 significantly outperforms both human test-takers and prior models, demonstrating a 26% increase over ChatGPT and beating humans in five of seven subject areas. GPT-4 achieves a nearly 40% increase over ChatGPT in Contracts and a more than 35% raw increase in Evidence. Civil Procedure is the worst subject for GPT-4, ChatGPT, and human test-takers alike.

Practical Legal Limitations

In practical legal tasks, ChatGPT shows significant limitations. One study on law school exams found that ChatGPT struggled to spot issues, failed to go into sufficient detail when applying legal rules to hypothetical facts, and misunderstood some legal terms. The results were not accurate enough to deliver legal information directly to non-experts.

Legal practitioners have found LLMs to be very good at summarization and outlining, but less adept at legal analysis, where they frequently hallucinate legal conclusions. As one expert noted: "In the legal context, you need to know when it is right or wrong. You need to double-check everything that it does."

Essay Writing Weakness

Research found that GPT-4 particularly struggled with essay writing compared to practicing lawyers, indicating that large language models have difficulty with tasks that more closely resemble what a lawyer does on a daily basis. This is concerning because legal work often requires nuanced analysis and persuasive writing rather than multiple-choice answers.

Domain-Specific Accuracy: Science

Scientific accuracy is crucial for students, researchers, and professionals who rely on ChatGPT for scientific information and problem-solving.

Physics Performance

Research found that ChatGPT (GPT-4) could successfully solve 62.5% of well-specified physics problems, but accuracy drops dramatically to just 8.3% for under-specified problems. Analysis revealed three distinct failure modes: failure to construct accurate models of the physical world, failure to make reasonable assumptions about missing data, and calculation errors.

An earlier study found that ChatGPT based on GPT-3 could narrowly pass a calculus-based college-level introductory physics course, solving 60% (18 of 30) of the Force Concept Inventory items. Researchers concluded that GPT-4 falls short in providing creative solutions to problems in physics and likely other subjects.

General Science and Engineering

A study from Delft University of Technology collected 594 questions from 198 faculty members across five faculties. The results showed that answers from ChatGPT are, on average, perceived as mostly correct. However, the rating of ChatGPT answers significantly decreases as the educational level of the question increases and when evaluating skills beyond scientific knowledge.

Standardized Science Tests

GPT-4 has performed well on standardized tests such as AP Biology, Chemistry, Environmental Science, and Physics Exams. This suggests that for established, well-documented scientific knowledge, ChatGPT performs reasonably well. Problems arise with cutting-edge research, specialized topics, and questions requiring creative problem-solving.

Cell Biology and Biochemistry

Research evaluated ChatGPT-generated responses for scientific questions related to biochemistry, specifically cell energy metabolism. Faculty members from chemistry and biology departments assessed the responses using university-level textbooks as references. While general concepts were often correctly explained, detailed mechanisms and specific pathway interactions showed more errors.

Domain-Specific Accuracy: History and Facts

Historical and factual accuracy presents unique challenges for ChatGPT, as it requires both accurate recall and proper contextualization of information.

Historical Accuracy Studies

A 2024 study published in Frontiers in Artificial Intelligence tested ChatGPT-4's ability to provide accurate information about the origins and evolution of SWOT analysis, examining it for historical accuracy and hallucinations. Researchers found that inaccuracies ranged from minor factual errors to more serious hallucinations that deviate from the evidence in scholarly publications.

Interestingly, ChatGPT-4 also produced historically accurate facts spontaneously, without being directly prompted to. The researchers' interpretation is that ChatGPT is largely trained on freely available websites and only to a very limited extent on scholarly publications, especially those behind paywalls.

Pattern of Historical Errors

While ChatGPT-4 demonstrates a high level of proficiency in describing and outlining general concepts, there are notable discrepancies when it details the origins and evolution of those concepts. This pattern suggests ChatGPT is better at summarizing well-known information than at tracing the development of ideas or attributing discoveries to specific individuals and dates.

Domain-Specific Accuracy: Current Events
