We evaluated LLMs on 2,788 expert-curated chemistry questions spanning undergraduate and graduate topics, and provide a human baseline from 19 chemistry experts (mostly at the MS/PhD level). Leading LLMs (e.g., o1, Claude 3.5) significantly outperformed the human experts on average
Performance varies by topic:
- Strong: General chemistry, calculations
- Weak: Safety/toxicity, analytical chemistry
Notable: Models excel at textbook-style problems but struggle with questions that require mapping textual representations of molecules (e.g., SMILES strings) to reasoning about their 2D/3D structures.
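To make that structure-reasoning gap concrete: such a question may give a molecule only as a SMILES string and ask for a property that requires reconstructing its 2D/3D structure. A minimal sketch using RDKit (the molecule and task here are illustrative examples, not items from the benchmark):

```python
# Illustrative only: a structure-reasoning check with RDKit.
# The SMILES string and task below are hypothetical examples,
# not benchmark items.
from rdkit import Chem

smiles = "C[C@H](N)C(=O)O"  # L-alanine, given to the model as text only
mol = Chem.MolFromSmiles(smiles)

# Answering "how many stereocenters?" requires mapping the linear
# SMILES string onto the molecule's actual 2D/3D structure.
stereocenters = Chem.FindMolChiralCenters(mol, includeUnassigned=True)
print(len(stereocenters))  # -> 1
```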
Interesting observations:
- Tool-augmented systems (with literature search) didn't improve performance much
- Open-source models (e.g., Llama-3.1-405B) are approaching closed-source performance
- Models provide unreliable confidence estimates (see the calibration sketch after this list)
- Performance correlates with model size but not with molecular complexity
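One common way to quantify unreliable confidence estimates is expected calibration error (ECE): the gap between a model's stated confidence and its actual accuracy, averaged over confidence bins. A minimal sketch under assumed inputs (verbalized confidences in [0, 1] plus per-question correctness; this is not the paper's exact analysis code):

```python
# Minimal expected-calibration-error (ECE) sketch. Inputs are assumed:
# confidence[i] is a model's self-reported confidence on question i,
# correct[i] is whether its answer was right. Not the paper's code.
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Assign each question to a confidence bin (1.0 goes in the top bin).
    bin_idx = np.minimum((confidence * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            # Gap between mean confidence and mean accuracy in this bin,
            # weighted by the fraction of questions falling in it.
            gap = abs(confidence[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

# Example: an overconfident model yields a large ECE.
print(expected_calibration_error([0.9, 0.95, 0.8, 0.99], [1, 0, 0, 0]))
```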
The results suggest both impressive capabilities and clear limitations of current LLMs in chemistry, with implications for chemistry education and scientific tooling.
Code/Data: https://github.com/lamalab-org/chem-bench
Interactive results: https://chembench.org
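For anyone who wants to score a model against such a question set, here is a hypothetical end-to-end loop. The question format and the ask_model() stub are assumptions for illustration, not the chem-bench API; the repo above documents the real data layout and tooling:

```python
# Hypothetical scoring loop. The question format and ask_model() stub
# are illustrative assumptions, not the chem-bench API.
questions = [
    {"question": "How many stereocenters does C[C@H](N)C(=O)O have?",
     "answer": "1"},
]

def ask_model(prompt: str) -> str:
    # Stand-in for an actual LLM call; replace with your model client.
    return "1"

n_correct = sum(
    ask_model(q["question"]).strip().lower() == q["answer"].strip().lower()
    for q in questions
)
print(f"accuracy: {n_correct / len(questions):.2%}")
```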