Manually Compiling PyTorch + Launching LLaMA-Factory

Manually compiling PyTorch

https://blog.csdn.net/L1481333167/article/details/137919464

Switch to the conda environment:

conda activate torch_new_env

Pull the source:

git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
# if you are updating an existing checkout
git submodule sync
git submodule update --init --recursive --jobs 0

Compile PyTorch:

export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
DEBUG=1 USE_DISTRIBUTED=1 USE_MKLDNN=1 USE_CUDA=1 BUILD_TEST=1 USE_FBGEMM=1 USE_NNPACK=1 USE_QNNPACK=1 USE_XNNPACK=1 USE_NUMPY=1 python setup.py develop

Clean out the build artifacts:

python setup.py clean

Hit this error while getting LLaMA-Factory running:

libstdc++.so.6: version `GLIBCXX_3.4.30' not found

Check that GLIBCXX_3.4.30 actually exists in the system libstdc++:

strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX
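If `strings` (from binutils) is not installed, the same check can be done with grep alone. A small sketch; the library path is the Ubuntu/Debian default and is an assumption to adjust for your distro:

```shell
# List the GLIBCXX symbol versions embedded in libstdc++ without binutils.
# Path is the Ubuntu/Debian default; adjust for your system.
lib=/usr/lib/x86_64-linux-gnu/libstdc++.so.6
if [ -e "$lib" ]; then
    # -a treats the binary as text, -o prints each match on its own line
    grep -ao 'GLIBCXX_3\.4\.[0-9]*' "$lib" | sort -Vu
fi
```

The last line of the output is the newest GLIBCXX version the library provides.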

Check whether a symlink has already been created:

ls -l /usr/lib64/libstdc++.so*
ln -sf /usr/lib64/libstdc++.so.6.0.32 /usr/lib64/libstdc++.so.6

Here I had not linked it to the copy inside the anaconda3 virtual environment:

(torch_new_env) dell@dell-Precision-7920-Tower:~/sdb/LLaMA-Factory$ ls -l /usr/lib64/libstdc++.so*
lrwxrwxrwx 1 root root 30 Nov 27 20:47 /usr/lib64/libstdc++.so.6 -> /usr/lib64/libstdc++.so.6.0.30

Presumably the dynamic loader prefers libraries on the system path (/usr/lib64/) over those in the Anaconda environment. That means when a program is compiled or run inside the conda environment, the system-installed libstdc++.so is resolved first, not the copy inside the virtual environment.

To make the environment use the system libstdc++.so.6, link them together (I did not do this here):

ln -sf /usr/lib/x86_64-linux-gnu/libstdc++.so.6 /home/dell/anaconda3/envs/pycode/bin/../lib/libstdc++.so.6

The correspondence between libstdc++.so.6 symbol versions and GCC releases is very important:

GLIBCXX_3.4.30 requires GCC 12.1.0
GLIBCXX_3.4.31 requires GCC 13.1.0

Here I used gcc 9 with GLIBCXX_3.4.30 and it still worked. That is expected: the GLIBCXX versions come from the libstdc++.so.6 runtime that gets loaded, and libstdc++ is backward compatible, so a newer runtime (shipped with GCC 12+) can serve binaries built by an older compiler.
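Given the table above, a small guard script can confirm up front whether the libstdc++ about to be loaded carries the version a wheel needs. A sketch; the REQUIRED value and the library path are assumptions to adapt:

```shell
# Sketch: verify a libstdc++ build provides the symbol version we need.
REQUIRED="GLIBCXX_3.4.30"
LIB="${LIBSTDCXX:-/usr/lib/x86_64-linux-gnu/libstdc++.so.6}"
if [ -e "$LIB" ] && grep -aq "${REQUIRED}\b" "$LIB"; then
    echo "$LIB provides $REQUIRED"
else
    echo "$LIB is missing $REQUIRED (or was not found)"
fi
```

Running this before launching training turns the cryptic loader error into a clear diagnostic.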

At the time I also downgraded scipy:

https://zhuanlan.zhihu.com/p/637165718?utm_medium=social&utm_psn=1842275695461552129&utm_source=wechat_session

conda install scipy=1.13

Launch LLaMA-Factory:

export FORCE_TORCHRUN=1

CUDA_VISIBLE_DEVICES=1 llamafactory-cli train \
    --stage sft \
    --do_train \
    --model_name_or_path /home/dell/sdb/.cache/Qwen2-0___5B-Instruct \
    --dataset identity \
    --dataset_dir ./data \
    --template qwen \
    --finetuning_type freeze \
    --output_dir /home/dell/sdb/saves/Qwen2-0___5B-Instruct/freeze/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 50 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 50 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 5.0 \
    --max_samples 1000 \
    --val_size 0.1 \
    --plot_loss \
    --fp16 \
    --deepspeed examples/deepspeed/ds_z3_offload_config.json

  • Debugging with gdb + pdb

CUDA_VISIBLE_DEVICES=0 gdb --args python3 -m pdb $(which llamafactory-cli) train \
    --stage sft \
    --do_train \
    --model_name_or_path /home/dell/sdb/.cache/Qwen2-0___5B-Instruct \
    --dataset identity \
    --dataset_dir ./data \
    --template qwen \
    --finetuning_type freeze \
    --output_dir /home/dell/sdb/saves/Qwen2-0___5B-Instruct/freeze/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 50 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 50 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 5.0 \
    --max_samples 1000 \
    --val_size 0.1 \
    --plot_loss \
    --fp16 \
    --deepspeed examples/deepspeed/ds_z3_offload_config.json
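Once gdb launches the process, a typical workflow (prompts illustrative, not a verbatim session) is to let pdb handle the Python level and drop back to gdb for native frames:

```
(gdb) run              # starts python3 -m pdb, which stops at the (Pdb) prompt
(Pdb) continue         # run the training entry point under pdb
  ... press Ctrl-C during a hang or after a crash to fall back into gdb ...
(gdb) bt               # native (C++/CUDA) backtrace of the current thread
(gdb) info threads     # list all threads, useful with DeepSpeed workers
```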

Some other errors:

CMake error output:
CMake Error at third_party/protobuf/cmake/protoc.cmake:11 (add_executable):
The install of the protoc target requires changing an RPATH from the build
tree, but this is not supported with the Ninja generator unless on an
ELF-based or XCOFF-based platform. The CMAKE_BUILD_WITH_INSTALL_RPATH
variable may be set to avoid this relinking step.
Call Stack (most recent call first):
third_party/protobuf/cmake/CMakeLists.txt:258 (include)


CMake Error at third_party/tensorpipe/third_party/libuv/CMakeLists.txt:338 (add_library):
The install of the uv target requires changing an RPATH from the build
tree, but this is not supported with the Ninja generator unless on an
ELF-based or XCOFF-based platform. The CMAKE_BUILD_WITH_INSTALL_RPATH
variable may be set to avoid this relinking step.


CMake Error at third_party/tensorpipe/third_party/libuv/CMakeLists.txt:338 (add_library):
The install of the uv target requires changing an RPATH from the build
tree, but this is not supported with the Ninja generator unless on an
ELF-based or XCOFF-based platform. The CMAKE_BUILD_WITH_INSTALL_RPATH
variable may be set to avoid this relinking step.


CMake Error at third_party/foxi/CMakeLists.txt:97 (add_library):
The install of the foxi_dummy target requires changing an RPATH from the
build tree, but this is not supported with the Ninja generator unless on an
ELF-based or XCOFF-based platform. The CMAKE_BUILD_WITH_INSTALL_RPATH
variable may be set to avoid this relinking step.


CMake Error at third_party/foxi/CMakeLists.txt:97 (add_library):
The install of the foxi_dummy target requires changing an RPATH from the
build tree, but this is not supported with the Ninja generator unless on an
ELF-based or XCOFF-based platform. The CMAKE_BUILD_WITH_INSTALL_RPATH
variable may be set to avoid this relinking step.


CMake Error at third_party/foxi/CMakeLists.txt:70 (add_library):
The install of the foxi_wrapper target requires changing an RPATH from the
build tree, but this is not supported with the Ninja generator unless on an
ELF-based or XCOFF-based platform. The CMAKE_BUILD_WITH_INSTALL_RPATH
variable may be set to avoid this relinking step.
  • Fix:
    In the terminal, use export to set CMAKE_BUILD_WITH_INSTALL_RPATH to TRUE, then rerun CMake.
  • With this set, targets are built with their install RPATH from the start, so CMake never needs the relinking step the errors complain about. (PyTorch's setup.py forwards CMAKE_* environment variables on to CMake, which is why a plain export is enough here.)

export CMAKE_BUILD_WITH_INSTALL_RPATH=TRUE

The relationship between RPATH and LD_LIBRARY_PATH

RPATH is embedded into the executable at link time, while LD_LIBRARY_PATH is an environment variable that supplies extra library search paths at run time.
The legacy DT_RPATH entry has higher priority than LD_LIBRARY_PATH: if an executable carries one, the loader searches those paths first and only then consults LD_LIBRARY_PATH. (With the newer DT_RUNPATH entry, which most modern linkers emit by default, the order flips and LD_LIBRARY_PATH is searched first.)

Fixing the gcc/g++ symlinks:

  • First check the existing symlinks:
ls -l /usr/bin/gcc
ls -l /usr/bin/g++
  • Remove the old symlinks:
sudo rm /usr/bin/gcc
sudo rm /usr/bin/g++
  • Create the new ones:
sudo ln -s /usr/bin/gcc-9 /usr/bin/gcc
sudo ln -s /usr/bin/g++-9 /usr/bin/g++
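The same three steps can be rehearsed in a scratch directory first, which is safe to run without sudo; the stand-in files below are assumptions, swap in /usr/bin and the real compilers once the links look right:

```shell
# Rehearse the relink in a scratch dir before touching /usr/bin.
bin=$(mktemp -d)
touch "$bin/gcc-9" "$bin/g++-9"     # stand-ins for the real compiler binaries
ln -sf "$bin/gcc-9" "$bin/gcc"      # -f replaces any existing link, like the sudo steps above
ln -sf "$bin/g++-9" "$bin/g++"
ls -l "$bin/gcc" "$bin/g++"         # verify both links point at the -9 binaries
```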
  • 11.28: recording progress as of this date
    (screenshot not included)

Manually Compiling PyTorch + Launching LLaMA-Factory
http://sjx.com/2024/11/28/手动编译pytorch-启动LLama-fatory/
Author: sjx
Published: November 28, 2024
License