Best CodeLlama Copilot for M1 Mac

I’ve been experimenting with LLMs, particularly those with coding-assistant capabilities such as CodeLlama from Meta. In this post, I’ll share the best open-source options for running a Copilot-like coding assistant locally on an M1/M2 Mac.

The language model I’ve been using is Phind-CodeLlama-34B-v2, specifically the GGUF quantised version from TheBloke (the “Q4_K_M” version has a good balance of speed and quality). This is a fine-tuned version of the original CodeLlama 34B by Phind, and from my experiments it’s both faster at inference and yields better results than the standard model.
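The Q4_K_M quant averages roughly 4.8 bits per weight (an approximate figure, not an exact spec), which puts a 34B model at around 20 GB on disk and in memory. A quick back-of-the-envelope sketch:

```python
# Back-of-the-envelope size estimate for a quantised model.
# bits_per_weight ~= 4.8 for Q4_K_M is an approximation.
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough model size in GB: parameters x bits per weight, in bytes."""
    return n_params * bits_per_weight / 8 / 1e9

print(round(model_size_gb(34e9, 4.8), 1))  # roughly 20 GB for a 34B model
```

This is why a 32 GB machine handles the 34B quant comfortably while the full-precision weights (around 68 GB at fp16) would not fit.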

Running the model locally

My preferred way of running the model locally has been through LM Studio. This app lets you download models directly from HuggingFace, configure parameters, chat through a nice UI, or run a local server with an OpenAI-compatible API.

Here is the config I’m using with Phind-CodeLlama-34B-v2:

// ~/.cache/lm-studio/config-presets/phind_codellama.preset.json
{
  "name": "Phind CodeLlama v2",
  "load_params": {
    "rope_freq_base": 1000000,
    "n_ctx": 4096, // Sets the max context to 4096 tokens
    "n_gpu_layers": 1, // Enable inference using the M1/M2 GPU
    "use_mlock": false // Don't lock the model into RAM (mlock)
  },
  "inference_params": {
    "input_prefix": "### User Message",
    "input_suffix": "### Assistant",
    "antiprompt": [
      "### User Message"
    ],
    "pre_prompt": "### System Prompt\nBelow is an instruction that describes a task. Write a response that appropriately completes the request.\n\n"
  }
}
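Once the local server is started, any OpenAI-compatible client can talk to it. Below is a minimal sketch using only the standard library; the base URL is an assumption following LM Studio's defaults, so use whatever address the app shows when you start the server:

```python
# Minimal client for LM Studio's OpenAI-compatible local server.
# BASE_URL is an assumed default; use the one LM Studio displays.
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"

def build_payload(prompt: str, temperature: float = 0.2) -> dict:
    """Build an OpenAI-style chat-completions request body."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def extract_reply(response: dict) -> str:
    """Pull the assistant's text out of a chat-completions response."""
    return response["choices"][0]["message"]["content"]

def ask(prompt: str) -> str:
    """POST a prompt to the local server and return the model's reply."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_reply(json.load(resp))
```

This is handy for sanity-checking that the server is up before wiring it into an editor extension.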

VSCode Integration

To get the best coding experience, we need a way to query the model directly from the IDE we are working in. I’ll focus on VSCode, as it’s the one I’ve been using.

I found two good options for integrating a locally running LLM with VSCode: Continue and Wingman. Continue is more similar to Copilot and offers a nice, intuitive interface, while Wingman is simpler but provides a lot of pre-configured prompts.

Continue is compatible with several model providers as described in their docs. I’ve found the default configuration to not be suitable for running a model locally, so I’ve tweaked it to suit my needs:

  • disable telemetry to remove the requirement for internet connection,
  • use QueuedLLM to ensure only one request is sent to the model at a time,
  • use GGML to connect Continue to the OpenAI compatible API in LM Studio,
  • disable summaries to reduce the number of requests made to the model.

This is the configuration I’m using with Continue:

# ~/.continue/
from continuedev.src.continuedev.core.config import ContinueConfig
from continuedev.src.continuedev.core.models import Models
from continuedev.src.continuedev.libs.llm.ggml import GGML
from continuedev.src.continuedev.libs.llm.queued import QueuedLLM

config = ContinueConfig(
    allow_anonymous_telemetry=False,  # no internet connection required
    disable_summaries=True,  # fewer requests made to the model
    models=Models(
        default=QueuedLLM(  # only one request at a time
            llm=GGML(  # LM Studio's OpenAI-compatible API
                context_length=4096, server_url="http://localhost:8080"
            )
        )
    ),
)

With Continue you can also select some code in the editor to include it as context and then ask a question about it, which allows the model to offer better responses to your needs.


Wingman offers a simpler yet very capable interface. Instead of a chat-based UI, there are multiple commands (with pre-built prompts) for specific tasks. Selecting a command runs the inference on the model and presents the response. Follow-up questions can be asked, and depending on the command selected you can include context from the current file too.

Configuring Wingman is quite simple: all you have to do is go to the extension settings and change two values, pointing the base URL at the LM Studio local server:

  • Wingman > Openai: Api Base Url

  • Wingman > Openai: Model


Using either extension we get a locally running code assistant with high-quality output and quite decent inference times (if using an M1 Pro/Max chip with 32GB RAM). It is quite incredible what these models can achieve, and the future looks very promising for even better results. It is amazing what we can achieve with free and open-source projects, and with great stability too! If you have a decent setup, you can definitely use this instead of commercial alternatives like GitHub Copilot, Amazon CodeWhisperer or Tabnine. And with new fine-tuned models coming out every day, this will only improve in both speed and output quality.

Header image generated by BluePencil XL powered by Stable Diffusion XL AI.

Edit (2023-09-15): Removed the TogetherLLM requirement by using the disable_summaries and QueuedLLM options in Continue.