A distilled, refactored, and packaged variant of the inference tool from the MLX project. It runs on macOS under Python, and its key feature is making it easy to manipulate the inference process.
For example, it lets you make DeepSeek Llama 8B 'think' for longer by suppressing its thought-closing tokens, or cap its thinking process at a fixed number of tokens, in just a few lines of code. This helps the smaller model solve problems it gets wrong out of the box.
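As a minimal sketch of the idea, the snippet below shows a logits processor that forbids the thought-closing token until a minimum number of 'thinking' tokens have been generated, and forces it once a maximum budget is reached. The token id and budget values are illustrative assumptions, not the model's real vocabulary or this package's actual API.

```python
import math

# Assumed values for illustration only: the real '</think>' token id and
# sensible budgets depend on the model's tokenizer and the task.
END_THINK_ID = 128014        # hypothetical id of the '</think>' token
MIN_THINK, MAX_THINK = 64, 512  # thinking-token budget (min, max)

def think_budget_processor(generated_ids, logits):
    """Adjust raw logits based on how long the model has been 'thinking'.

    generated_ids: list of token ids produced so far
    logits: list of per-token logits for the next step (modified and returned)
    """
    if END_THINK_ID in generated_ids:
        return logits  # thinking already closed; leave logits untouched
    n = len(generated_ids)
    if n < MIN_THINK:
        # Too early: suppress the closing token so the model keeps thinking.
        logits[END_THINK_ID] = -math.inf
    elif n >= MAX_THINK:
        # Budget exhausted: mask everything else to force the closing token.
        logits = [-math.inf] * len(logits)
        logits[END_THINK_ID] = 0.0
    return logits
```

A processor like this would be applied at each decoding step, between the model's forward pass and sampling; the same pattern extends to other interventions, such as steering the model toward or away from arbitrary tokens.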