We evaluated LLMs on 2,788 expert-curated chemistry questions spanning undergraduate and graduate topics, and provide a human baseline from 19 chemistry experts (mostly at the MS/PhD level). Leading LLMs (e.g., o1, Claude 3.5) significantly outperformed the human experts on average
Performance varies by topic:
- Strong: General chemistry, calculations
- Weak: Safety/toxicity, analytical chemistry
Notable: Models excel at textbook-style problems but struggle with questions that require mapping textual representations of molecules (e.g., SMILES strings) to reasoning about their 2D/3D structures.
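To make that structure-reasoning gap concrete: such a question may give a molecule only as a SMILES string and ask for a property that requires reconstructing its 2D/3D structure. A minimal sketch using RDKit (the molecule and task here are illustrative examples, not items from the benchmark):

```python
# Illustrative only: a structure-reasoning check with RDKit.
# The SMILES string and task below are hypothetical examples,
# not benchmark items.
from rdkit import Chem

smiles = "C[C@H](N)C(=O)O"  # L-alanine, given to the model as text only
mol = Chem.MolFromSmiles(smiles)

# Answering "how many stereocenters?" requires mapping the linear
# SMILES string onto the molecule's actual 2D/3D structure.
stereocenters = Chem.FindMolChiralCenters(mol, includeUnassigned=True)
print(len(stereocenters))  # -> 1
```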
Interesting observations:
- Tool-augmented systems (with literature search) didn't improve performance much
- Open-source models (e.g., Llama-3.1-405B) are approaching closed-source performance
- Models provide unreliable confidence estimates (see the calibration sketch after this list)
- Performance correlates with model size but not with molecular complexity
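One common way to quantify unreliable confidence estimates is expected calibration error (ECE): the gap between a model's stated confidence and its actual accuracy, averaged over confidence bins. A minimal sketch under assumed inputs (verbalized confidences in [0, 1] plus per-question correctness; this is not the paper's exact analysis code):

```python
# Minimal expected-calibration-error (ECE) sketch. Inputs are assumed:
# confidence[i] is a model's self-reported confidence on question i,
# correct[i] is whether its answer was right. Not the paper's code.
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Assign each question to a confidence bin (1.0 goes in the top bin).
    bin_idx = np.minimum((confidence * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            # Gap between mean confidence and mean accuracy in this bin,
            # weighted by the fraction of questions falling in it.
            gap = abs(confidence[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

# Example: an overconfident model yields a large ECE.
print(expected_calibration_error([0.9, 0.95, 0.8, 0.99], [1, 0, 0, 0]))
```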
The results suggest both impressive capabilities and clear limitations of current LLMs in chemistry, with implications for chemistry education and scientific tooling.
Code/Data: https://github.com/lamalab-org/chem-bench
Interactive results: https://chembench.org
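For anyone who wants to score a model against such a question set, here is a hypothetical end-to-end loop. The question format and the ask_model() stub are assumptions for illustration, not the chem-bench API; the repo above documents the real data layout and tooling:

```python
# Hypothetical scoring loop. The question format and ask_model() stub
# are illustrative assumptions, not the chem-bench API.
questions = [
    {"question": "How many stereocenters does C[C@H](N)C(=O)O have?",
     "answer": "1"},
]

def ask_model(prompt: str) -> str:
    # Stand-in for an actual LLM call; replace with your model client.
    return "1"

n_correct = sum(
    ask_model(q["question"]).strip().lower() == q["answer"].strip().lower()
    for q in questions
)
print(f"accuracy: {n_correct / len(questions):.2%}")
```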