不用GPU也能加载模型

Alt text

为什么不用GPU ?

我想应该不言而喻了，就拿阿里云来说吧v100 12核92GB 200Gb硬盘一年10.3万，普通开发者或者中小型公司，难以承受。因为目前AI只是尝鲜，还不能对工作带来非常确定的帮助。所以在预算支出不会太多。于是就有了llama.cpp 它是把模型转换格式后，可以在没有GPU的情形下运行。虽然速度慢了点，但也能运行。保障我们继续低成本运行模型。

为符合本土化情景，我们加载清华的chatglm2

它对中文支持较好些，

假设你已经下载好了模型chatglm2模型

将 ChatGLM.cpp 存储库克隆到本地计算机中：

git clone --recursive https://github.com/li-plus/chatglm.cpp.git && cd chatglm.cpp

转换模型

用于将 ChatGLM-6B 或 ChatGLM2-6B 转换为量化的 GGML 格式。例如，要将 fp16 原始模型转换为q4_0（量化的 int4）GGML 模型，请运行：convert.py

如果需要转换时带上lora的权重(可以理解为你微调的模型checkpoint) 可以跟参数-l /path/lora/checkpoint

python3 chatglm_cpp/convert.py -i THUDM/chatglm-6b -t q4_0 -o chatglm-ggml.bin

编译和运行就是加载模型的二进制程序

cmake -B build
cmake --build build -j --config Release

Now you may chat with the quantized ChatGLM-6B model by running:

./build/bin/main -m chatglm-ggml.bin -p 你好
# 你好！我是人工智能助手 ChatGLM-6B，很高兴见到你，欢迎问我任何问题。

进行交互聊天模式

./build/bin/main -m chatglm-ggml.bin -i

Python 绑定方便用fastapi开发接口

pip install -U chatglm-cpp

You may also install from source. Add the corresponding CMAKE_ARGS for acceleration.

# install from the latest source hosted on GitHub
pip install git+https://github.com/li-plus/chatglm.cpp.git@main
# or install from your local source after git cloning the repo
pip install .

Using pre-converted ggml models

>>> import chatglm_cpp
>>> 
>>> pipeline = chatglm_cpp.Pipeline("../chatglm-ggml.bin")
>>> pipeline.chat(["你好"])
'你好！我是人工智能助手 ChatGLM-6B，很高兴见到你，欢迎问我任何问题。'

网页模式加载，如上图所示那样:

python3 web_demo.py -m ../chatglm-ggml.bin

这样就可以低成本运行文本模型了,如果需要训练模型时，就按量付费购买GPU的服务器，训练完转换模型后再退掉GPU服务器，用普通服务器即可。