BeyondBench: Contamination-Resistant Evaluation of Reasoning in Language Models
pip install beyondbench
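
To confirm the install succeeded, one option is to query the installed distribution metadata with the Python standard library. This is a minimal sketch, not part of BeyondBench itself: the helper name `installed_version` is hypothetical, and it assumes only that the distribution is registered under the same name used in the `pip install` command above. It does not import or call any BeyondBench API.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str = "beyondbench"):
    """Return the installed version string of dist_name, or None if it is absent.

    Hypothetical helper for illustration; it reads pip's distribution
    metadata only and never imports the package.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("beyondbench is not installed in this environment")
    else:
        print(f"beyondbench {found} is installed")
```

Checking metadata rather than importing the package keeps the check independent of the package's module layout, which the text above does not specify.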