New Edge Times

Small Models, Stubborn Physics: A Conversation With Boris Kriuk

by New Edge Times Report
May 4, 2026
in Tech

Boris Kriuk

In recent years, Boris Kriuk has quietly assembled one of the more unusual track records in applied machine learning. He has released the largest open-source seismic catalog ever published, with 2.8 million earthquake events spanning 30 years, along with a 2.9-million-observation pan-Arctic permafrost dataset and a 13-month wildfire dataset covering more than 26,000 incidents, and he has validated his turbulence network against 340 high-fidelity flight simulations. And yet his neural networks keep getting smaller. We sat down with him to ask why.

Q: Let’s start simple. You’ve published four papers in the last eighteen months on what you call “physics-structured” AI. To someone outside the field, what does that actually mean?

Boris Kriuk: It means I got tired of watching people throw GPUs at problems that physicists already partially solved a century ago. When an aircraft hits turbulence at Mach 8, the air doesn’t care about your transformer architecture. It obeys Kolmogorov’s energy cascade from 1941. So why are people training models with millions of parameters to “discover” laws we already have?

Take PSTNet, our turbulence model. The entire neural network has 552 parameters. Total. It fits in 2.5 kilobytes. A single emoji is bigger than my model. And it beats a deep network with twelve times more parameters and a gradient-boosted ensemble with sixteen times more. Why? Because we hard-coded the physics. Not as a suggestion in the loss function. As architecture.
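To make "physics as architecture, not as a suggestion in the loss function" concrete, here is a minimal, hypothetical sketch (not PSTNet's actual code): the network is only allowed to predict one physical scalar, the dissipation rate, and Kolmogorov's 1941 inertial-range law E(k) = C·ε^(2/3)·k^(−5/3) is applied analytically in the forward pass, so no output can violate the cascade.

```python
import numpy as np

# Hypothetical sketch, not PSTNet's implementation: the learnable part of the
# model would output only epsilon (the dissipation rate); the spectrum itself
# is computed from Kolmogorov's 1941 law and therefore obeys it exactly.
C_KOLMOGOROV = 1.5  # canonical Kolmogorov constant, roughly 1.5 in the literature

def spectrum_layer(eps: float, k: np.ndarray) -> np.ndarray:
    """Hard-constrained output: every prediction follows the -5/3 cascade."""
    return C_KOLMOGOROV * eps ** (2.0 / 3.0) * k ** (-5.0 / 3.0)

k = np.logspace(0, 3, 50)        # wavenumbers spanning the inertial range
E = spectrum_layer(eps=0.1, k=k)

# On a log-log plot the slope is -5/3 by construction, whatever eps the
# network predicts:
slope = np.polyfit(np.log(k), np.log(E), 1)[0]
```

Because the exponent lives in the architecture rather than the loss, checking the slope is a tautology; that is the difference between a hard constraint and a soft penalty.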

Q: That sounds modest, but you seem almost angry about it.

Kriuk: To some extent, yes. Because the field has confused size with intelligence. The orthodoxy says: more parameters, more data, more compute. I find that lazy. If you actually understand what you’re modeling, you don’t need to brute-force it. The Monin-Obukhov similarity theory has been describing atmospheric boundary layers since 1954. The Gutenberg-Richter law has predicted earthquake magnitudes since 1944. These aren’t outdated. They’re correct. They’re just inconvenient because you can’t write a press release about discovering them with deep learning.

Q: Can you give a layman a sense of what these old laws actually say?

Kriuk: Sure. Gutenberg-Richter says that for every magnitude 7 earthquake, you’ll see roughly ten magnitude 6 earthquakes, a hundred magnitude 5 earthquakes, and so on. It’s a logarithmic relationship. Same with Omori-Utsu for aftershocks: their rate decays predictably after a mainshock. Kolmogorov’s cascade describes how kinetic energy in turbulent fluids transfers from large eddies down to small ones at a precise mathematical rate. These are not approximations or rules of thumb. They’re some of the most rigorously validated empirical laws in geophysics and fluid dynamics.

The point is, if you build a neural network that ignores them, you’re either going to rediscover them poorly, or violate them silently and not notice. Both outcomes are bad. So why not bake them in from the start?
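The two scaling laws named above fit in a few lines. The parameter values below are illustrative placeholders, not fitted constants from any catalog:

```python
# Gutenberg-Richter: N(>=M) = 10**(a - b*M). With b = 1, stepping one unit
# down in magnitude multiplies the expected event count by ten.
def gr_count(a: float, b: float, magnitude: float) -> float:
    return 10.0 ** (a - b * magnitude)

# Omori-Utsu: aftershock rate n(t) = K / (c + t)**p decays predictably
# with time t after the mainshock.
def omori_rate(K: float, c: float, p: float, t_days: float) -> float:
    return K / (c + t_days) ** p

n7 = gr_count(a=8.0, b=1.0, magnitude=7.0)   # expected events with M >= 7
n6 = gr_count(a=8.0, b=1.0, magnitude=6.0)   # tenfold more with M >= 6
```

With b = 1 the ratio n6 / n7 is exactly 10, which is the "ten magnitude 6 quakes for every magnitude 7" statement in numeric form.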

Q: Your POSEIDON paper applies a similar idea to earthquakes, and you released a 2.8 million event catalog with it. Why earthquakes, and why at that scale?

Kriuk: The dataset had to be that large because the events that matter — the tsunami-generating ones — make up about 1.14% of the catalog. You can’t learn rare phenomena from small data. That’s not a deep learning principle, it’s basic statistics.

But scale alone is meaningless. What I noticed was that nobody was treating physical laws as learnable parameters. Everyone treated them as validation metrics. The typical workflow was: train a black box, then check if it accidentally agrees with the Gutenberg-Richter b-value. That's backwards. We embedded the b-value as a constrained parameter, bounded between 0.7 and 1.3, and let the network discover the right value while simultaneously predicting aftershocks, tsunamis, and foreshocks. It converged to 0.752, within the established seismological range, without us forcing it directly.

That’s the real result. Not that AI rediscovered physics, but that physics made the AI smarter.

Q: Let’s talk about the permafrost work. You analyzed 2.9 million observations across 171,605 Arctic locations. Some people might call this “climate research.” You seem reluctant to.

Kriuk: (laughs) I’d call it infrastructure research. There’s over a hundred billion dollars of pipelines, railways, and buildings sitting on ground that’s getting softer. I don’t particularly care whether you blame humans, the sun, volcanic outgassing, or Poseidon himself. The ground is moving. Engineers need numbers. That’s what we built.

What I refuse to do is what a lot of “Climate AI” does, which is generate scary maps to win conference papers and policy points. Our framework gives uncertainty estimates. It tells you where the model is unsure. That’s heretical in a field where everyone wants to project confidence to justify funding.

Q: That’s an unusually skeptical position for someone working in this space.

Kriuk: I’m not skeptical of physics. I’m skeptical of theater. There’s a difference between a thermodynamics problem and a moral crusade, and most climate machine learning conflates them. When you train a model on historical data and then run it under RCP8.5, which is essentially worst-case fossil fuel acceleration, your model is extrapolating beyond its training distribution. It’s hallucinating. That’s not a small detail. That’s the entire methodological problem.

Pure data-driven models cannot extrapolate climate. They can only interpolate weather. People who don’t understand the difference are giving policymakers garbage and calling it science. So in our hybrid approach, we constrain the ML predictions with physical adjustment factors. Minus 10 percentage points of permafrost per degree Celsius near the freezing threshold. That’s not from a neural network. That’s from soil thermodynamics. The neural network refines, it doesn’t invent.
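The hybrid step described above can be sketched in a few lines. The clamping and the function shape are our assumptions; only the minus-10-percentage-points-per-degree figure comes from the interview.

```python
# Hypothetical sketch of a physics-constrained refinement: an ML estimate of
# permafrost fraction is corrected by a fixed thermodynamic factor near the
# 0 degree C threshold, then clamped to a valid fraction.
PP_PER_DEGREE = -0.10  # -10 percentage points of permafrost per +1 deg C

def hybrid_permafrost(ml_fraction: float, warming_deg_c: float) -> float:
    """Physics refines the ML output; it does not let it invent."""
    adjusted = ml_fraction + PP_PER_DEGREE * warming_deg_c
    return min(1.0, max(0.0, adjusted))

# A site the ML model scores at 0.80 permafrost fraction, under +2 C warming,
# is pulled down to roughly 0.60 by the thermodynamic adjustment.
refined = hybrid_permafrost(0.80, 2.0)
```

The ML model supplies the local estimate; the adjustment factor, fixed by soil thermodynamics rather than learned, bounds how that estimate can respond to warming.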

Q: A skeptic might push back here and say: aren’t you just using physics to justify whatever the model already wanted to do?

Kriuk: Fair question. The honest answer is that we have rigorous spatiotemporal cross-validation to test exactly that. We hold out entire geographic regions and entire years from training. The model has never seen them. If it still predicts well on those held-out blocks, the relationships are real, not memorized. Most published permafrost ML papers don’t do this. They use random train-test splits on spatially autocorrelated data, which means the test set is basically the training set with different labels. They get inflated metrics and nobody catches it because reviewers don’t check.

When we apply proper validation, our R-squared still hits 0.98. That’s not because we’re doing magic. It’s because the underlying physics is strong enough that even with rigorous splitting, the model finds the signal.
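The difference between a random split and a spatiotemporal block split can be shown on synthetic data. The grid, band, and year choices below are ours, not the paper's; the point is only that the held-out block shares no location or year with training.

```python
import numpy as np

# Synthetic Arctic observations: a latitude and a year per sample.
rng = np.random.default_rng(0)
n = 1000
lat = rng.uniform(60.0, 80.0, n)          # degrees north
year = rng.integers(1990, 2020, n)        # observation year

# Block validation: hold out an entire latitude band AND an entire year.
# A random split would scatter near-duplicate neighbors across both sets;
# this split cannot.
test_mask = ((lat >= 70.0) & (lat < 72.0)) | (year == 2015)
train_mask = ~test_mask
```

A model trained on `train_mask` has literally never seen the held-out region or year, so good test metrics reflect transferable relationships rather than spatial autocorrelation.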

Q: You also worked on Eurasian wildfires, where you released a thirteen-month dataset covering over 26,000 fire incidents. What surprised you?

Kriuk: That solar radiation matters more than temperature for ignition. Everyone’s screaming about temperature, but our Random Forest analysis ranked solar radiation as the single most important predictor at about 24 percent of the model’s decision-making. Temperature was second. Wind speed was third. And precipitation was dead last in importance.

That’s interesting because it tells you something the popular narrative misses: fires are an energy balance problem, not a thermometer problem. Drying matters more than heating. A region can be cool and burning. A region can be hot and stable. The data doesn’t care about the headline.

Q: Did anything else surprise you about the fire data?

Kriuk: Two things. First, most fires we recorded started under 70 to 80 percent relative humidity, not in dry, crisp conditions. That's because many ignitions happen in the morning or evening when humidity hasn't dropped yet, or in swampy and forested areas where ambient humidity stays elevated even during active fires. The "tinderbox" image people have in their heads is wrong for a large class of real-world ignitions.

Second, the model couldn’t distinguish uncontrolled burns from forest fires at all. Zero accuracy on that class. And I think that’s actually informative. It means uncontrolled burns don’t have a unique meteorological signature. They’re forest fires that got out of hand. The boundary is operational, not physical. Machine learning correctly told us the categories overlap, and that’s a more useful finding than a fake number.

Q: One thing that runs through all your papers is this insistence on minimal models. Your turbulence network has 552 parameters. Even your seismic and permafrost frameworks are computationally cheap relative to the dataset sizes. Why does this matter outside of academic bragging rights?

Kriuk: Because deployment is reality. A pilot flying over the Arctic doesn’t have a data center in the cockpit. PSTNet runs on a Cortex-M7 microcontroller in under 12 microseconds. That means it can sit inside the actual guidance computer. A 9,000-parameter ensemble can’t. A transformer can’t. It’s deployable physics.

People mistake academic papers for solutions. A model that achieves state-of-the-art on a benchmark and then requires an H100 GPU to run is, for most operational purposes, useless. It’s a museum piece. I build things that fit in 2.5 kilobytes because the world that needs them — Arctic infrastructure, embedded avionics, oceanic flight corridors without ground stations — doesn’t have GPUs.
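The "fits in 2.5 kilobytes" claim survives a back-of-envelope check, assuming 32-bit floats for the weights (the storage format is our assumption, not stated in the interview):

```python
# Memory footprint of a 552-parameter network, assuming fp32 weights.
params = 552
bytes_fp32 = params * 4            # 4 bytes per 32-bit float -> 2208 bytes
kib = bytes_fp32 / 1024            # about 2.16 KiB, under the quoted 2.5 KB
```

At roughly 2.2 KB of weights, the model leaves room even in the small SRAM budgets typical of embedded flight computers, which is what makes in-cockpit deployment plausible.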

Q: There’s another aspect to the small-model approach: interpretability. Can you talk about that?

Kriuk: Yes, and this is something the field undersells. When PSTNet’s gating network learned to route inputs through four expert sub-networks, we never told it what the experts should specialize in. We just gave it the data and the architecture. It independently recovered the four classical atmospheric stability regimes that boundary-layer meteorologists have used for decades: convective, neutral, stable, and stratospheric. The transitions even happened at the right Richardson numbers and altitudes.

That’s not a coincidence. That’s the architecture forcing the model to organize itself along physically meaningful axes. And the consequence is that when the model makes a prediction, you can ask it why, and the answer is intelligible to a domain expert. With a billion-parameter transformer, that question has no real answer. You get a vibe at best.

Q: What do you think your field fundamentally gets wrong?

Kriuk: Three things. First, the obsession with scale of parameters. Scale of data is fine. I built million-event datasets for exactly that reason. But most problems do not need bigger models. They need correct structure. Second, the worship of the loss function. Encoding physics in the loss is a soft suggestion the model can ignore when convenient. Encode it in the architecture and it becomes a hard constraint. Big difference. Third, and this is the unpopular one, climate machine learning has confused activism with rigor. If your validation strategy uses random train-test splits on spatially autocorrelated data, your impressive R-squared is fiction. I see this constantly. It would not survive a methods seminar in geophysics from 1985.

Q: Final question. Where does this go next?

Kriuk: I want to make models that are even smaller, even faster, and even more boring. The future of useful AI isn’t chatbots writing poetry. It’s networks of 500 parameters that fit inside microcontrollers, embedded with the laws of physics, running on the wing of an aircraft or under the foundation of a pipeline, telling engineers something true. That’s the work.

Everything else is just compute-burning that flatters its authors.

 

© 2025 New Edge Times or its affiliated companies. All rights reserved.
