Manually Compiling PyTorch + Launching LLaMA-Factory

Manually compiling PyTorch

https://blog.csdn.net/L1481333167/article/details/137919464

Switch to the conda environment:

conda activate torch_new_env

Pull the source:

git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
# if you are updating an existing checkout
git submodule sync
git submodule update --init --recursive --jobs 0

Compile PyTorch:

export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
DEBUG=1 USE_DISTRIBUTED=1 USE_MKLDNN=1 USE_CUDA=1 BUILD_TEST=1 USE_FBGEMM=1 USE_NNPACK=1 USE_QNNPACK=1 USE_XNNPACK=1 USE_NUMPY=1 python setup.py develop

Clean out the build artifacts:

python setup.py clean

Hit this error while getting LLaMA-Factory running:

libstdc++.so.6: version `GLIBCXX_3.4.30' not found

Check that GLIBCXX_3.4.30 actually exists in the system libstdc++:

strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX
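If `strings` (from binutils) is not installed, the same check can be done with grep alone. A small sketch; the library path is the Ubuntu/Debian default and is an assumption to adjust for your distro:

```shell
# List the GLIBCXX symbol versions embedded in libstdc++ without binutils.
# Path is the Ubuntu/Debian default; adjust for your system.
lib=/usr/lib/x86_64-linux-gnu/libstdc++.so.6
if [ -e "$lib" ]; then
    # -a treats the binary as text, -o prints each match on its own line
    grep -ao 'GLIBCXX_3\.4\.[0-9]*' "$lib" | sort -Vu
fi
```

The last line of the output is the newest GLIBCXX version the library provides.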

Check whether a symlink has already been created:

ls -l /usr/lib64/libstdc++.so*
ln -sf /usr/lib64/libstdc++.so.6.0.32 /usr/lib64/libstdc++.so.6

Here I had not linked it to the copy inside the anaconda3 virtual environment:

(torch_new_env) dell@dell-Precision-7920-Tower:~/sdb/LLaMA-Factory$ ls -l /usr/lib64/libstdc++.so*
lrwxrwxrwx 1 root root 30 Nov 27 20:47 /usr/lib64/libstdc++.so.6 -> /usr/lib64/libstdc++.so.6.0.30

Presumably the dynamic loader prefers libraries on the system path (/usr/lib64/) over those in the Anaconda environment. That means when a program is compiled or run inside the conda environment, the system-installed libstdc++.so is resolved first, not the copy inside the virtual environment.

To make the environment use the system libstdc++.so.6, link them together (I did not do this here):

ln -sf /usr/lib/x86_64-linux-gnu/libstdc++.so.6 /home/dell/anaconda3/envs/pycode/bin/../lib/libstdc++.so.6

The correspondence between libstdc++.so.6 symbol versions and GCC releases is very important:

GLIBCXX_3.4.30 requires GCC 12.1.0
GLIBCXX_3.4.31 requires GCC 13.1.0

Here I used gcc 9 with GLIBCXX_3.4.30 and it still worked. That is expected: the GLIBCXX versions come from the libstdc++.so.6 runtime that gets loaded, and libstdc++ is backward compatible, so a newer runtime (shipped with GCC 12+) can serve binaries built by an older compiler.
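Given the table above, a small guard script can confirm up front whether the libstdc++ about to be loaded carries the version a wheel needs. A sketch; the REQUIRED value and the library path are assumptions to adapt:

```shell
# Sketch: verify a libstdc++ build provides the symbol version we need.
REQUIRED="GLIBCXX_3.4.30"
LIB="${LIBSTDCXX:-/usr/lib/x86_64-linux-gnu/libstdc++.so.6}"
if [ -e "$LIB" ] && grep -aq "${REQUIRED}\b" "$LIB"; then
    echo "$LIB provides $REQUIRED"
else
    echo "$LIB is missing $REQUIRED (or was not found)"
fi
```

Running this before launching training turns the cryptic loader error into a clear diagnostic.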

At the time I also downgraded scipy:

https://zhuanlan.zhihu.com/p/637165718?utm_medium=social&utm_psn=1842275695461552129&utm_source=wechat_session

conda install scipy=1.13

Launch LLaMA-Factory:

export FORCE_TORCHRUN=1

CUDA_VISIBLE_DEVICES=1 llamafactory-cli train \
    --stage sft \
    --do_train \
    --model_name_or_path /home/dell/sdb/.cache/Qwen2-0___5B-Instruct \
    --dataset identity \
    --dataset_dir ./data \
    --template qwen \
    --finetuning_type freeze \
    --output_dir /home/dell/sdb/saves/Qwen2-0___5B-Instruct/freeze/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 50 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 50 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 5.0 \
    --max_samples 1000 \
    --val_size 0.1 \
    --plot_loss \
    --fp16 \
    --deepspeed examples/deepspeed/ds_z3_offload_config.json

  • Debugging with gdb + pdb

CUDA_VISIBLE_DEVICES=0 gdb --args python3 -m pdb $(which llamafactory-cli) train \
    --stage sft \
    --do_train \
    --model_name_or_path /home/dell/sdb/.cache/Qwen2-0___5B-Instruct \
    --dataset identity \
    --dataset_dir ./data \
    --template qwen \
    --finetuning_type freeze \
    --output_dir /home/dell/sdb/saves/Qwen2-0___5B-Instruct/freeze/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 50 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 50 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 5.0 \
    --max_samples 1000 \
    --val_size 0.1 \
    --plot_loss \
    --fp16 \
    --deepspeed examples/deepspeed/ds_z3_offload_config.json
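Once gdb launches the process, a typical workflow (prompts illustrative, not a verbatim session) is to let pdb handle the Python level and drop back to gdb for native frames:

```
(gdb) run              # starts python3 -m pdb, which stops at the (Pdb) prompt
(Pdb) continue         # run the training entry point under pdb
  ... press Ctrl-C during a hang or after a crash to fall back into gdb ...
(gdb) bt               # native (C++/CUDA) backtrace of the current thread
(gdb) info threads     # list all threads, useful with DeepSpeed workers
```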

Some other errors:

CMake error output:
CMake Error at third_party/protobuf/cmake/protoc.cmake:11 (add_executable):
The install of the protoc target requires changing an RPATH from the build
tree, but this is not supported with the Ninja generator unless on an
ELF-based or XCOFF-based platform. The CMAKE_BUILD_WITH_INSTALL_RPATH
variable may be set to avoid this relinking step.
Call Stack (most recent call first):
third_party/protobuf/cmake/CMakeLists.txt:258 (include)


CMake Error at third_party/tensorpipe/third_party/libuv/CMakeLists.txt:338 (add_library):
The install of the uv target requires changing an RPATH from the build
tree, but this is not supported with the Ninja generator unless on an
ELF-based or XCOFF-based platform. The CMAKE_BUILD_WITH_INSTALL_RPATH
variable may be set to avoid this relinking step.


CMake Error at third_party/tensorpipe/third_party/libuv/CMakeLists.txt:338 (add_library):
The install of the uv target requires changing an RPATH from the build
tree, but this is not supported with the Ninja generator unless on an
ELF-based or XCOFF-based platform. The CMAKE_BUILD_WITH_INSTALL_RPATH
variable may be set to avoid this relinking step.


CMake Error at third_party/foxi/CMakeLists.txt:97 (add_library):
The install of the foxi_dummy target requires changing an RPATH from the
build tree, but this is not supported with the Ninja generator unless on an
ELF-based or XCOFF-based platform. The CMAKE_BUILD_WITH_INSTALL_RPATH
variable may be set to avoid this relinking step.


CMake Error at third_party/foxi/CMakeLists.txt:97 (add_library):
The install of the foxi_dummy target requires changing an RPATH from the
build tree, but this is not supported with the Ninja generator unless on an
ELF-based or XCOFF-based platform. The CMAKE_BUILD_WITH_INSTALL_RPATH
variable may be set to avoid this relinking step.


CMake Error at third_party/foxi/CMakeLists.txt:70 (add_library):
The install of the foxi_wrapper target requires changing an RPATH from the
build tree, but this is not supported with the Ninja generator unless on an
ELF-based or XCOFF-based platform. The CMAKE_BUILD_WITH_INSTALL_RPATH
variable may be set to avoid this relinking step.
  • Fix:
    In the terminal, use export to set CMAKE_BUILD_WITH_INSTALL_RPATH to TRUE, then rerun CMake.
  • With this set, targets are built with their install RPATH from the start, so CMake never needs the relinking step the errors complain about. (PyTorch's setup.py forwards CMAKE_* environment variables on to CMake, which is why a plain export is enough here.)

export CMAKE_BUILD_WITH_INSTALL_RPATH=TRUE

The relationship between RPATH and LD_LIBRARY_PATH

RPATH is embedded into the executable at link time, while LD_LIBRARY_PATH is an environment variable that supplies extra library search paths at run time.
The legacy DT_RPATH entry has higher priority than LD_LIBRARY_PATH: if an executable carries one, the loader searches those paths first and only then consults LD_LIBRARY_PATH. (With the newer DT_RUNPATH entry, which most modern linkers emit by default, the order flips and LD_LIBRARY_PATH is searched first.)

Fixing the gcc/g++ symlinks:

  • First check the existing symlinks:
ls -l /usr/bin/gcc
ls -l /usr/bin/g++
  • Remove the old symlinks:
sudo rm /usr/bin/gcc
sudo rm /usr/bin/g++
  • Create the new ones:
sudo ln -s /usr/bin/gcc-9 /usr/bin/gcc
sudo ln -s /usr/bin/g++-9 /usr/bin/g++
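The same three steps can be rehearsed in a scratch directory first, which is safe to run without sudo; the stand-in files below are assumptions, swap in /usr/bin and the real compilers once the links look right:

```shell
# Rehearse the relink in a scratch dir before touching /usr/bin.
bin=$(mktemp -d)
touch "$bin/gcc-9" "$bin/g++-9"     # stand-ins for the real compiler binaries
ln -sf "$bin/gcc-9" "$bin/gcc"      # -f replaces any existing link, like the sudo steps above
ln -sf "$bin/g++-9" "$bin/g++"
ls -l "$bin/gcc" "$bin/g++"         # verify both links point at the -9 binaries
```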
  • 11.28: recording progress as of this date
    (screenshot not included)

Manually Compiling PyTorch + Launching LLaMA-Factory
http://sjx.com/2024/11/28/手动编译pytorch-启动LLama-fatory/
Author: sjx
Published: November 28, 2024
License