- Add normalization of messages before API calls
- Implement token projection and enforce budget for 16k/32k windows
- Summarize only tool-call request arguments (not responses) when over budget
- Optionally elide redundant code blocks in old assistant messages as last-resort trimming
- Default the small-model limit to 16k tokens and the large-model limit to 32k; reserve space for response tokens
- Keep core behavior and tool execution unchanged
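
The trimming order described above could look roughly like the sketch below. It is illustrative only and not the code in this commit: `project_tokens`, `fit_to_budget`, `RESPONSE_RESERVE`, and the 4-characters-per-token estimate are all assumptions, plain truncation stands in for summarization, and every code block in older assistant messages is elided rather than only redundant ones. Messages are assumed to follow the OpenAI chat format.

```python
import copy
import json
import re

WINDOW_LIMITS = {"small": 16_000, "large": 32_000}  # defaults named in this change
RESPONSE_RESERVE = 2_000  # hypothetical number of tokens held back for the reply
CODE_BLOCK = re.compile(r"```.*?```", re.DOTALL)


def project_tokens(messages):
    """Rough projection assuming ~4 characters per token (not a real tokenizer)."""
    return sum(len(json.dumps(m)) for m in messages) // 4


def fit_to_budget(messages, model="small", max_arg_chars=200):
    """Return a trimmed copy of messages; the input list is left untouched."""
    budget = WINDOW_LIMITS[model] - RESPONSE_RESERVE
    trimmed = copy.deepcopy(messages)
    if project_tokens(trimmed) <= budget:
        return trimmed

    # Step 1: shorten tool-call request arguments only; tool responses stay intact.
    for msg in trimmed:
        for call in msg.get("tool_calls") or []:
            args = call["function"]["arguments"]
            if len(args) > max_arg_chars:
                call["function"]["arguments"] = args[:max_arg_chars] + "...[truncated]"
    if project_tokens(trimmed) <= budget:
        return trimmed

    # Step 2 (last resort): elide code blocks in all but the latest assistant message.
    assistant_idx = [i for i, m in enumerate(trimmed) if m.get("role") == "assistant"]
    for i in assistant_idx[:-1]:
        content = trimmed[i].get("content")
        if isinstance(content, str):
            trimmed[i]["content"] = CODE_BLOCK.sub("[code elided]", content)
    return trimmed
```

Working on a deep copy keeps the caller's history intact, so tool execution and the stored conversation are unaffected by whatever is sent to the API.
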
This commit adds a `--use-large-model` command-line flag to `openai_compatible_inference_bot.py`. When the flag is passed, the bot initializes and uses the large model (as configured via environment variables) by default instead of the small model. This makes it easier to test and deploy the large model from the command line.
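
A flag like this is typically wired up with `argparse`; the sketch below shows one way it might be done. The helper names and the environment variable names `SMALL_MODEL_NAME` and `LARGE_MODEL_NAME` are hypothetical, since the description does not name the actual variables.

```python
import argparse
import os


def build_parser():
    """Build a parser exposing only the flag described in this commit."""
    parser = argparse.ArgumentParser(prog="openai_compatible_inference_bot.py")
    parser.add_argument(
        "--use-large-model",
        action="store_true",  # defaults to False, so the small model stays the default
        help="Initialize and use the large model by default instead of the small one.",
    )
    return parser


def select_model(args, env=os.environ):
    """Pick the model name from the environment based on the parsed flag."""
    # SMALL_MODEL_NAME / LARGE_MODEL_NAME are assumed names, not the bot's real config.
    key = "LARGE_MODEL_NAME" if args.use_large_model else "SMALL_MODEL_NAME"
    return env.get(key)
```

With this shape, `python openai_compatible_inference_bot.py --use-large-model` would select the large model, and omitting the flag keeps the existing small-model default.
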
Fixes #224