Claude 3.7 is here. Is it good?

Anthropic has finally released a new Claude model, one that can run in both thinking and non-thinking modes.

The general sentiment on Hacker News is positive.

I, for one, plugged it into nicechat to test it out, and it's OK. My subjective feeling is that it's good, but o1 and o3 are still better when it comes to following instructions.

Benchmarks

The naming indicates rather incremental progress (it's not Claude 4, after all), and this is also reflected in the benchmarks. Nevertheless, Claude 3.7 claims new SOTA results in several areas (coding and "agentic" usage).

The benchmarks claim better programming capabilities (62% non-thinking / 70% thinking on SWE-bench) and better agentic use (81% vs 73% for o1; no o3 mentioned?). Other metrics seem more or less on par with other SOTA models.

As a side note, it was quite surprising to see Grok 3 leading in graduate-level reasoning, scoring even above o3-mini (high).

Aider (a command-line coding assistant) put the new Claude in the top spot of its own leaderboard (with R1 + Claude 3.5 now coming second).

Global context

In my humble opinion, Claude 3.7 is another hint that we are not actually in an exponential growth phase. I know, I know: it's a powerful model, and real progress.

I mean, when they call it 3.7, you know beforehand it's going to be incremental.

GPT-4 was released almost two years ago, and AI companies have still failed to deliver anything that feels as big as the GPT-3-to-GPT-4 jump. The progress is there, but it's getting slower.

We are still waiting for more SOTA models to drop, maybe even this week (R2? GPT-4.5?), so I'm not going to make any quick judgements yet.

#Public #AI #LLMs
