Background: The evaluation of large language models (LLMs) in medicine has shifted from knowledge-based testing to practice-based assessment, an evolution in how we measure ...