Building llama.cpp on Windows

Installing Visual Studio

Set up the C++ build environment.

https://visualstudio.microsoft.com/ja/vs/community/

Tab Workload: Desktop-development with C++
Tab Components (select quickly via search): C++-_CMake_ Tools for Windows, _Git_ for Windows, C++-_Clang_ Compiler for Windows, MS-Build Support for LLVM-Toolset (clang)
Please remember to always use a Developer Command Prompt / PowerShell for VS2022 for git, build, test

Installing CUDA

Getting the source

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

Installing CMake

winget install CMake

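Restart PowerShell so that the newly installed cmake is picked up on PATH.

Build

Windows does not provide the libcurl development library by default, so curl support is turned off with -DLLAMA_CURL=OFF. To build for CPU only, without CUDA, remove -DGGML_CUDA=ON.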

cmake -B build -DLLAMA_CURL=OFF -DGGML_CUDA=ON
cmake --build build --config Release
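
The choice between the CUDA and CPU-only builds comes down to a single flag. A minimal Python sketch of that decision follows; the function names `cmake_configure_args` and `cmake_build_args` are hypothetical helpers introduced here for illustration, and only the flags already shown above are used.

```python
from __future__ import annotations


def cmake_configure_args(use_cuda: bool, build_dir: str = "build") -> list[str]:
    """Return the argv for the configure step.

    curl support is always switched off, as in the command above;
    -DGGML_CUDA=ON is appended only when a CUDA build is requested.
    """
    args = ["cmake", "-B", build_dir, "-DLLAMA_CURL=OFF"]
    if use_cuda:
        args.append("-DGGML_CUDA=ON")
    return args


def cmake_build_args(build_dir: str = "build", config: str = "Release") -> list[str]:
    """Return the argv for the build step (Release configuration by default)."""
    return ["cmake", "--build", build_dir, "--config", config]
```

The returned lists can be passed to `subprocess.run(..., check=True)` from a Developer PowerShell for VS2022 session, which keeps the two variants of the build in one script.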

Installing the Python tools

python -m venv venv
./venv/Scripts/activate
pip install -r requirements.txt

Downloading the model

Installing Git LFS

winget install -e --id GitHub.GitLFS
git lfs install

Cloning from Hugging Face

Pass the full repository URL to git clone, not the "username/project-name" shorthand commonly used on Hugging Face.

git clone <full repository URL>
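
As a rough illustration of the difference between the shorthand and the full URL, the sketch below expands a "username/project-name" identifier into a clone URL. The helper name `hf_clone_url` and the example identifier `some-user/some-model` are hypothetical; the only assumption about Hugging Face is that repositories live under https://huggingface.co/.

```python
HF_BASE_URL = "https://huggingface.co/"


def hf_clone_url(repo: str) -> str:
    """Expand 'username/project-name' into a full URL usable with git clone.

    A value that is already a full URL is returned unchanged.
    """
    if repo.startswith(("https://", "http://")):
        return repo
    repo = repo.strip("/")
    parts = repo.split("/")
    # The shorthand must consist of exactly two non-empty parts.
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected 'username/project-name', got {repo!r}")
    return HF_BASE_URL + repo
```

For example, `hf_clone_url("some-user/some-model")` yields the string to place after `git clone`.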

Converting the model to fp16/bf16/Q8_0

If the model is not already in fp16/bf16/Q8_0, or is split across several files such as model-00001-of-00002.safetensors rather than merged into one, it has to be converted.

Run the commands below in PowerShell.

$HF_MODEL_PATH = "/path/to/Llama-3.2-3B-Instruct"
$OUTPUT_BF16_GGUF = "Llama-3.2-3B-Instruct-BF16.gguf"
python convert_hf_to_gguf.py "$HF_MODEL_PATH" --outtype bf16 --outfile "$OUTPUT_BF16_GGUF"
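
To decide whether a cloned model directory contains split weights, the file names can be checked against the shard naming seen in model-00001-of-00002.safetensors. The sketch below is an assumption-laden illustration: the five-digit "NNNNN-of-NNNNN" pattern is inferred from that single example name, and `shard_info` and `is_sharded` are hypothetical helper names.

```python
from __future__ import annotations

import re

# Pattern inferred from the example name model-00001-of-00002.safetensors.
SHARD_PATTERN = re.compile(r"^.+-(\d{5})-of-(\d{5})\.safetensors$")


def shard_info(filename: str) -> tuple[int, int] | None:
    """Return (shard index, shard count) for a shard file name, else None."""
    match = SHARD_PATTERN.match(filename)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_sharded(filenames: list[str]) -> bool:
    """True if any of the given file names looks like one shard of a split model."""
    return any(shard_info(name) is not None for name in filenames)
```

In practice the list of names could come from `os.listdir` on the cloned model directory.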

Quantizing FP16/bf16 to Q4_K_M

Run the commands below in PowerShell. To use a different quantization type, replace Q4_K_M with, for example, Q6_K.

$QUANTIZE_TOOL = "./build/bin/Release/llama-quantize.exe"
$OUTPUT_BF16_GGUF = "Llama-3.2-3B-Instruct-BF16.gguf"
$OUTPUT_Q4_K_M_GGUF = "Llama-3.2-3B-Instruct-Q4_K_M.gguf"
& "$QUANTIZE_TOOL" "$OUTPUT_BF16_GGUF" "$OUTPUT_Q4_K_M_GGUF" Q4_K_M
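
When several quantization types are tried, deriving the output name from the BF16 file name avoids editing two places by hand. The sketch below assumes the "-BF16.gguf" naming used in these notes; `quantized_name` and `quantize_argv` are hypothetical helpers, and the argument order (input, output, type) simply mirrors the command above.

```python
from __future__ import annotations

BF16_SUFFIX = "-BF16.gguf"


def quantized_name(bf16_gguf: str, quant_type: str) -> str:
    """Swap the '-BF16.gguf' suffix for '-<quant_type>.gguf'."""
    if not bf16_gguf.endswith(BF16_SUFFIX):
        raise ValueError(f"expected a name ending in {BF16_SUFFIX!r}: {bf16_gguf!r}")
    return bf16_gguf[: -len(BF16_SUFFIX)] + f"-{quant_type}.gguf"


def quantize_argv(tool: str, bf16_gguf: str, quant_type: str) -> list[str]:
    """Build the argv for the quantize tool: input, output, quantization type."""
    return [tool, bf16_gguf, quantized_name(bf16_gguf, quant_type), quant_type]
```

The resulting list can be handed to `subprocess.run(..., check=True)`, so switching the type only requires changing one argument.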