> Now, instead of asking the model to respond in JSON we use XML tags to separate the start and end of the contemplation phase and the final answer.
I suspect the author wisely avoided function-calling/JSON since it doesn't guarantee sequence: nothing forces a "reasoning" field to be generated before an "answer" field, whereas with inline tags the contemplation tokens necessarily come first. This, and a few other frailties, makes me reach for XML-like markup in almost all of my LLM API calls.
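Concretely, the shape the article lands on looks something like this (tag names echo its contemplation/answer phases; the content is illustrative):

```xml
<contemplation>
The user is asking about X. Edge cases worth checking first: Y, Z…
</contemplation>
<answer>
Handle Y and Z explicitly, then do X.
</answer>
```

The model can't help but emit the contemplation before the answer, because that's simply the order of the tokens.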
Markup languages like XML and HTML lend themselves beautifully to this task. They are stream-friendly, semantically enriched, and leniently parseable (HTML was designed in part for fallible humans to write and for browsers to incrementally render), and, by nature of being "markup", they complement the autoregressive way LLMs generate text.

One assumes as well that tonnes of prose appears in HTML in training corpora, much less so in JSON, which is mostly used for transactional data and RPC-like payloads; that must surely bias JSON completions toward stiffer, more robotic phrasing.

FWIW I ended up creating a library (github.com/padolsey/xmllm) to help me get structured data from LLMs using XML (read through the forgiving eyes of an HTML parser), so that I never have to rely on provider-specific tool/function-calling abstractions. Even tiny models like Qwen2.5 and Ministral 3B have pretty superb (x|ht)ml compliance; their JSON compliance is far shakier.
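To make the lenient-parsing point concrete, here's a minimal sketch of the idea. This is not xmllm's actual API; it just uses htmlparser2 (a forgiving, stream-oriented HTML parser) and assumes the contemplation/answer tag names from the quoted article:

```typescript
import { Parser } from "htmlparser2";

// Incrementally extract <answer> text from a streamed completion,
// ignoring whatever appears inside <contemplation>.
function makeAnswerParser(onAnswerText: (chunk: string) => void): Parser {
  let inAnswer = false;
  return new Parser({
    onopentag(name) {
      if (name === "answer") inAnswer = true;
    },
    ontext(text) {
      if (inAnswer) onAnswerText(text);
    },
    onclosetag(name) {
      if (name === "answer") inAnswer = false;
    },
  });
}

// Usage: feed raw chunks straight from the API stream.
const parser = makeAnswerParser((text) => process.stdout.write(text));
parser.write("<contemplation>hmm, edge cases…</contemplation><ans");
parser.write("wer>Use a streaming parser.</answer>"); // a tag split across chunks still parses
parser.end();
```

Because the parser never throws on malformed or truncated input, a completion that gets cut off mid-tag still yields everything up to the cut, which is exactly the failure mode you want when reading a token stream.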