Benchmarking AI Capabilities via Virtual Physical-Law Discovery
This article is the result of my participation in BlueDot Impact’s AI Safety Fundamentals Writing Intensive.
Introduction
As modern LLMs and multimodal models grow more sophisticated, conventional benchmarks often fail to capture the depth of their scientific reasoning skills. Common tests (like question-answer datasets or reading comprehension) may feel limited because we must already know the “right answers” in advance. Designing truly novel challenges — ones we ourselves haven’t solved — is difficult [1,2]. At the same time, ensuring that AI capabilities benchmarks are representative and robust is crucial for AI safety [3].
A compelling way to address these issues is to create virtual physical environments governed by known but arbitrarily complex sets of “physical laws.” Within these artificially constructed universes, AI systems must run experiments, gather data, and uncover the hidden laws on their own. Because we define the simulation’s ground truths, we already know the correct laws. Hence, we can easily measure whether the AI’s discovered law matches the truth.
This idea bypasses a central difficulty in AI benchmarking, namely how to evaluate answers when humans cannot solve the problem themselves, while keeping everything inside an easily measurable, controlled toy environment. By simulating universal rules from the start, we can produce challenges as simple or as bizarre as we wish. And in doing so, we push AI systems to act like creative scientists: forming hypotheses, running experiments, and revising theories based on evidence.
There are several research papers investigating the abilities of modern AI systems to conduct scientific research autonomously (and corresponding benchmarks), including specific instances like AI Scientist [4], CRESt Copilot [5], MLE-bench [6], and general reviews and programs [7,8,9].
However, there is a substantial difference between the program I propose and existing research on AI’s capacity to do science: the latter investigates the ability of LLMs to assist in “real-world” scientific projects, testing them on concrete existing problems. While that approach is highly relevant in practice, it is less useful for benchmarking, because we cannot controllably scale and adjust the scope and difficulty of the problems, and because we often need to know the ground truth in advance (by doing the science ourselves, which is hard and costly). My approach should be seen as “new evals via science-like tasks” rather than “testing the ability of AIs to solve real-world scientific problems”. The two domains may overlap significantly, but the distinction should be emphasized up front.
Why New Benchmarks Are Needed
- Modern LLMs and other AI systems are rapidly surpassing classic and even very recent benchmarks, such as [10], [11], and [12]. Tasks once seen as “hard” (e.g., logical reasoning, coding, undergraduate mathematics) are becoming trivial for state-of-the-art models.
- Curating high-quality datasets and tasks is expensive and time-consuming. Worse, many current benchmarks rely on fixed sets of questions, which LLMs can “memorize” during training.
- The next frontiers (like scientific discovery) are inherently open-ended. Building an AI that can independently derive new knowledge — like Newton’s laws — requires tasks that push the limits of discovery and reasoning.
- Typically, to confirm an AI’s solution, humans themselves must know the correct answer. But for extremely complex or novel problems, humans might not know the solution a priori. Instead, if we define a fully known simulation environment, we automatically have the ground truth accessible — no complicated, real-world, trial-and-error approach is needed.
Overview of the Virtual Physical Environments
a) Base-Level Environment: Newtonian Mechanics
Imagine the simplest possible environment, a digital world governed by classical Newtonian mechanics. Suppose there are only one or two bodies experiencing a force — like gravity. The AI can apply various test “pushes,” observe the resulting motion, and collect data on velocity, position, and acceleration. Eventually, it should infer something akin to F=ma (Force = Mass × Acceleration) and the law of gravitation.
This approach replicates a basic physics lab experiment, but entirely in a simulator. By systematically applying known forces, collecting data, and analyzing it, an AI can try to guess the law of gravity, for instance. Because we wrote the simulation code, we know exactly what the correct formula for force is. The AI is tested on how closely it matches this known law.
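To make this concrete, here is a minimal sketch of such a one-body world (the class name, constants, and interface are purely illustrative assumptions, not an existing framework): the agent applies a range of test pushes, records the resulting accelerations, and recovers the hidden constants by linear regression.

```python
import numpy as np

# Minimal sketch of the base-level environment: a single point mass in 1D,
# subject to a hidden constant force plus whatever test push the agent applies.
class PointMassWorld:
    def __init__(self, mass=2.0, hidden_force=-19.6, dt=0.01):
        self.mass = mass                   # hidden from the agent
        self.hidden_force = hidden_force   # e.g. gravity acting on the body
        self.dt = dt
        self.x, self.v = 0.0, 0.0

    def step(self, applied_force=0.0):
        """Advance one Euler step and return the observable state."""
        a = (self.hidden_force + applied_force) / self.mass  # ground truth: a = F / m
        self.v += a * self.dt
        self.x += self.v * self.dt
        return {"x": self.x, "v": self.v, "a": a}

# Toy "experiment": apply a range of test pushes, record the accelerations,
# and recover the hidden mass and force by fitting a straight line.
world = PointMassWorld()
pushes = np.linspace(-10.0, 10.0, 21)
accels = np.array([world.step(applied_force=f)["a"] for f in pushes])
slope, intercept = np.polyfit(pushes, accels, 1)
print(f"estimated mass ≈ {1 / slope:.2f}, hidden force ≈ {intercept / slope:.2f}")
```

Because the hidden force here is constant, a single linear fit already recovers both the mass and the force; the stranger laws of the next subsection would require correspondingly more elaborate experiment design.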
b) Increasing Complexity and Weirdness
Next, we can increase complexity. Instead of sticking to inverse-square laws, we can define bizarre functional forms: polynomial potentials, exponential effects, cyclical interactions, or entirely novel relationships between position, velocity, acceleration, and force. Here are a few elementary creative extensions:
- Strange Force Laws
Instead of gravity scaling like the inverse-square law, try a force that depends on the velocity of the body or on the time of day within the simulation. Maybe it’s F = a⋅t⋅v, where the coefficient a itself depends on a hidden dimension.
- Modified Equations of Motion
We could define a more exotic relationship between force and acceleration — one that’s not linear, or that depends on a hidden parameter revealed only under certain conditions (like crossing a “wormhole region” in the simulation).
- High-Dimensional Spaces
Instead of a normal 2D or 3D environment, push the system into 4D or higher dimensions — an environment that humans struggle to visualize. The AI would gather numerical data, observe how bodies move in these higher dimensions, and attempt to formulate the underlying laws.
- Combinatorial Laws
Introduce multiple forces with complicated superposition rules — such as non-linear additions or force components that interact in an entangled manner — so that identifying each piece separately becomes challenging.
As we keep adding more “weird” layers, the environment transitions from a near-intuitive system to something that defies human intuition but is still entirely deterministic within the simulator. The AI’s challenge is to discover these hidden truths via virtual experiments. Such virtual laws can easily be made arbitrarily complex, and an automated framework that generates them at different complexity levels can be developed.
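A rough sketch of what such a generator could look like, assuming a small invented grammar of operations over position, velocity, and time, with a depth parameter controlling complexity:

```python
import random
import numpy as np

# Hypothetical law generator: compose a symbolic force law F(x, v, t) from a
# small grammar of unary and binary operations; depth controls complexity.
UNARY = ["np.sin({})", "np.exp(-abs({}))", "({})**2"]
BINARY = ["({} + {})", "({} * {})"]
LEAVES = ["x", "v", "t", "1.0"]

def random_law(depth):
    """Return a Python expression string for a randomly composed force law."""
    if depth == 0:
        return random.choice(LEAVES)
    if random.random() < 0.5:
        return random.choice(UNARY).format(random_law(depth - 1))
    return random.choice(BINARY).format(random_law(depth - 1), random_law(depth - 1))

law_src = random_law(depth=3)                # e.g. "(np.sin((x * t)) + (v)**2)"
force = eval(f"lambda x, v, t: {law_src}")   # ground-truth law, hidden from the AI
print(law_src, force(1.0, 2.0, 0.5))
```

The generated expression serves both as the ground truth driving the simulator and as the reference answer against which the AI’s discovered formula is later scored.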
Methodology for AI Discovery of Virtual Laws
1. Designing the Simulator
At the core is a Python library (or any other environment) that can easily define:
- A set of entities (masses, charges, “field” sources, etc.).
- The geometry or dimensionality of the space.
- The functional forms of the laws (e.g., potential fields, forces, equations of motion, all of which can be expressed in differential-equation notation).
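A minimal sketch of such a core, assuming invented function and parameter names rather than any existing library: a universe is described by its entities’ masses, the dimensionality of space, and a force-law callable, and the resulting equations of motion are integrated with SciPy.

```python
import numpy as np
from scipy.integrate import solve_ivp

def make_environment(masses, dim, force_law):
    """Build a runnable universe from masses, dimensionality, and a force law."""
    masses = np.asarray(masses, dtype=float)

    def rhs(t, state):
        # State layout: all positions first, then all velocities.
        n = len(masses)
        pos = state[: n * dim].reshape(n, dim)
        vel = state[n * dim :].reshape(n, dim)
        acc = force_law(t, pos, vel) / masses[:, None]   # a_i = F_i / m_i
        return np.concatenate([vel.ravel(), acc.ravel()])

    def run(pos0, vel0, t_span, n_samples=100):
        """Integrate the system and return sampled trajectory data."""
        state0 = np.concatenate([np.ravel(pos0), np.ravel(vel0)])
        ts = np.linspace(*t_span, n_samples)
        sol = solve_ivp(rhs, t_span, state0, t_eval=ts)
        return ts, sol.y.T

    return run

# Example ground truth: two bodies in 2D with an inverse-square attraction.
def gravity(t, pos, vel, G=1.0):
    r = pos[1] - pos[0]
    f = G * r / np.linalg.norm(r) ** 3
    return np.stack([f, -f])        # equal and opposite forces

run = make_environment(masses=[1.0, 1.0], dim=2, force_law=gravity)
ts, traj = run(pos0=[[0.0, 0.0], [1.0, 0.0]],
               vel0=[[0.0, 0.0], [0.0, 1.0]], t_span=(0.0, 5.0))
```

Swapping force_law for any of the “weird” laws above changes the universe without touching the rest of the machinery.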
2. Experiment Interface
The AI receives an interface that lets it:
- Query the environment for the current positions, velocities, accelerations, etc.
- Apply certain actions (e.g., exerting a test force, or placing a new object).
- Run the simulation for discrete or continuous time steps to collect observational data.
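A hypothetical agent-facing wrapper around the toy world sketched earlier might look as follows (the method names are my own choice); it exposes only observation and action, keeping the ground-truth law hidden from the agent.

```python
# Thin experiment interface over an environment object that exposes a
# step(applied_force=...) method, such as the PointMassWorld sketched above.
class ExperimentInterface:
    def __init__(self, world):
        self._world = world
        self._pending_force = 0.0

    def observe(self):
        """Return the currently observable quantities."""
        return {"x": self._world.x, "v": self._world.v}

    def act(self, test_force):
        """Schedule a test force to be applied on subsequent steps."""
        self._pending_force = test_force

    def advance(self, n_steps=1):
        """Run the simulation forward and return the collected observations."""
        return [self._world.step(applied_force=self._pending_force)
                for _ in range(n_steps)]
```

The agent then interleaves act and advance calls and logs the returned observations as its experimental record.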
3. Data Analysis Tools
The AI is provided with standard scientific data analysis techniques — like numerical differentiation, regression, or even symbolic manipulation methods — to hypothesize formulas. Essentially, the AI acts like a scientist, formulating a guess such as, “I think the force is proportional to 1/r² with some constant,” and then refining or rejecting that hypothesis based on new experiment results.
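For example, the hypothesis “the force follows a power law in the distance” can be tested with ordinary regression; here is a toy sketch on synthetic data invented purely for illustration.

```python
import numpy as np

# The "AI scientist" conjectures F = C * r**p and fits (C, p) to noisy
# experimental samples by linear regression in log-log space.
r = np.linspace(0.5, 5.0, 40)
F = 3.0 / r**2 + np.random.normal(scale=0.01, size=r.shape)   # hidden ground truth

p, logC = np.polyfit(np.log(r), np.log(F), 1)
print(f"hypothesis: F ≈ {np.exp(logC):.2f} * r^{p:.2f}")      # ≈ 3.00 * r^-2.00
```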
4. Evaluation Metrics
Because we know the underlying ground truth, scoring can be done by comparing the AI’s discovered formulas to the actual ones. Various scoring techniques are possible:
- Formula Similarity: Does the discovered law closely match the symbolic expression of the truth?
- Predictive Accuracy: Does the law predict motion or final positions accurately across new tests not used in training?
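Both metrics are cheap to compute once the ground truth is known. A minimal sketch, where the specific scoring choices are my own assumptions, using SymPy for symbolic equivalence and NumPy for held-out predictive error:

```python
import numpy as np
import sympy as sp

r, G, m1, m2 = sp.symbols("r G m1 m2", positive=True)
truth      = G * m1 * m2 / r**2          # law written into the simulator
discovered = m1 * m2 * G * r**-2         # law reported by the AI

# 1) Formula similarity: symbolic equivalence up to rearrangement.
formula_match = sp.simplify(truth - discovered) == 0

# 2) Predictive accuracy: error of the discovered law on held-out test points.
f_true = sp.lambdify((r, G, m1, m2), truth, "numpy")
f_disc = sp.lambdify((r, G, m1, m2), discovered, "numpy")
r_test = np.linspace(1.0, 10.0, 50)
rmse = np.sqrt(np.mean((f_true(r_test, 1, 1, 1) - f_disc(r_test, 1, 1, 1)) ** 2))
print(formula_match, rmse)
```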
Why This Approach Is Powerful
We can create infinitely many “universe variants,” controlling the difficulty level. Minimal changes in the force equations can produce large shifts in challenge complexity. Unlike in real-world discovery, we know the fundamental rules, so we can straightforwardly check whether the law an AI finds is correct.
Moreover, traditional AI tasks often involve classification or short-answer Q&A. Here, the AI must create a consistent scientific theory, an intrinsically open-ended challenge mirroring human-like scientific exploration. Discovering new laws from data requires more than memorization; it demands an ability to hypothesize, test, and refine — essentially replicating the scientific method.
Finally, because the environment can generate its own data, minimal human intervention is required. An AI can run unlimited experiments to incrementally improve its model of the world.
Challenges and Considerations
- Extremely complicated laws might require substantial computational resources or sophisticated solvers to run. Balancing realism, complexity, and speed is key. Note, however, that we are testing complex laws in simple environments, not complex applications of simple laws (as would be the case with fluid dynamics or galaxy formation), which generally requires far fewer resources, especially compared to running a modern LLM.
- Symbolic representation (e.g., using symbolic math engines) can be tricky. AI models must be guided to produce valid, interpretable equations, rather than just black-box neural networks that approximate the effect. We aim to obtain simple conceptual representations from the AI, not just fits of the data to high-dimensional curves.
- Deciding how close an AI’s discovered formula is to the “real” formula can be nuanced. We might allow for physically equivalent forms (e.g., factoring out constants) or approximate equivalences in certain domains.
- If an AI only has a limited budget of experiments, it might overfit that data rather than find the general rule. Strategies like cross-validation within the environment or randomized held-out scenarios can help, as sketched below.
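Here is an illustrative sketch of that validation step, with the toy dynamics invented for the example: a candidate law is scored by its trajectory error on randomized initial conditions it has never seen.

```python
import numpy as np

rng = np.random.default_rng(0)
DT, STEPS = 0.01, 200

def rollout(accel_law, x0, v0):
    """Integrate x'' = accel_law(x) from (x0, v0) and return the positions."""
    xs, x, v = [], x0, v0
    for _ in range(STEPS):
        v += accel_law(x) * DT
        x += v * DT
        xs.append(x)
    return np.array(xs)

true_law      = lambda x: -2.0 * x**3    # hidden ground truth (nonlinear spring)
candidate_law = lambda x: -2.1 * x**3    # the AI's current guess

held_out = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(10)]
errors = [np.mean((rollout(candidate_law, x0, v0) - rollout(true_law, x0, v0)) ** 2)
          for x0, v0 in held_out]
print(f"held-out trajectory MSE: {np.mean(errors):.5f}")
```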
Conclusion
Building virtual physical environments with adjustable, “weird” rules offers a promising new benchmark for testing AI’s capacity for scientific discovery. By simulating unknown physical laws, we empower AI systems to do exactly what real scientists do: propose hypotheses, perform experiments, analyze results, and refine theories. The result is an infinitely extensible, open-ended challenge that not only helps us measure AI’s growth but also points the way toward a future where AI operates as a genuine collaborator in scientific exploration.
As AI continues to mature, these simulated worlds can become more sophisticated — adding diverse phenomena, complicated interdependencies, and higher dimensions. Alongside better simulation interfaces and more powerful symbolic tools, we could see LLMs or related AI frameworks deriving creative, even unexpected, explanations for the laws we embed. Ultimately, this approach to benchmarking will help ensure that AI is evaluated on its ability to reason, discover, and create new knowledge — core facets of true intelligence.
References
- [1] Burden, J. (2024). Evaluating AI evaluation: Perils and prospects. arXiv preprint arXiv:2407.09221.
- [2] Oveisi, S., Gholamrezaie, F., Qajari, N., Moein, M. S., & Goodarzi, M. (2024). Review of Artificial Intelligence-Based Systems: Evaluation, Standards, and Methods. Advances in the Standards & Applied Sciences, 2(2), 4–29.
- [3] Xia, B., Lu, Q., Zhu, L., & Xing, Z. (2024, July). An AI system evaluation framework for advancing AI safety: Terminology, taxonomy, lifecycle mapping. In Proceedings of the 1st ACM International Conference on AI-Powered Software (pp. 74–78).
- [4] Lu, C., Lu, C., Lange, R. T., Foerster, J., Clune, J., & Ha, D. (2024). The AI Scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292.
- [5] Ren, Z., Zhang, Z., Tian, Y., & Li, J. (2023). CRESt–Copilot for real-world experimental scientist.
- [6] Chan, J. S., Chowdhury, N., Jaffe, O., Aung, J., Sherburn, D., Mays, E., … & Mądry, A. (2024). MLE-bench: Evaluating machine learning agents on machine learning engineering. arXiv preprint arXiv:2410.07095.
- [7] Wang, H., Fu, T., Du, Y., Gao, W., Huang, K., Liu, Z., … & Zitnik, M. (2023). Scientific discovery in the age of artificial intelligence. Nature, 620(7972), 47–60.
- [8] Reddy, C. K., & Shojaee, P. (2024). Towards Scientific Discovery with Generative AI: Progress, Opportunities, and Challenges. arXiv preprint arXiv:2412.11427.
- [9] Glickman, M., & Zhang, Y. (2024). AI and generative AI for research discovery and summarization. Harvard Data Science Review, 6(2).
- [10] Chollet, F. (2019). On the measure of intelligence. arXiv preprint arXiv:1911.01547.
- [11] Wang, A. (2018). GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
- [12] Zheng, T., Bai, J., Wang, Y., Fang, T., Guo, Y., Yim, Y., & Song, Y. (2024). CLR-Fact: Evaluating the Complex Logical Reasoning Capability of Large Language Models over Factual Knowledge. arXiv preprint arXiv:2407.20564.