Installing apex: RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries. Pytorch binaries were compiled with Cuda 12.1. In some cases, a minor-version mismatch will not cause later errors: https://github.com/NVIDIA/apex/pull/323#discussion_r287021798. You can try commenting out this check (at your own risk).
Workaround: comment out the CUDA version check in apex's setup.py.
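Before commenting out the check, it may help to see how far apart the two CUDA versions actually are, since the error message only suggests that a minor-version mismatch may be harmless. The following is a minimal sketch, not part of apex: `parse_nvcc_release` and `compare_cuda` are hypothetical helper names, and it assumes the `nvcc --version` output contains the usual `release X.Y` text.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Return (major, minor) parsed from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def compare_cuda(nvcc_version, torch_cuda):
    """Classify the gap between nvcc's (major, minor) and a string such as '12.1'."""
    torch_version = tuple(int(part) for part in torch_cuda.split(".")[:2])
    if nvcc_version == torch_version:
        return "match"
    if nvcc_version[0] == torch_version[0]:
        return "minor mismatch"
    return "major mismatch"
```

In practice the first argument would come from running `nvcc --version` and the second from `torch.version.cuda`. A "major mismatch" result suggests installing a matching CUDA toolkit rather than disabling the check.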
File "/tmp/pip-build-env-q_6ryjrc/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 311, in run_setup
    exec(code, locals())
File "<string>", line 5, in <module>
ModuleNotFoundError: No module named 'packaging'
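The path in this traceback (/tmp/pip-build-env-...) indicates that setup.py ran inside pip's isolated build environment, where `packaging` was not available. A possible remedy, stated here as an assumption rather than something these notes verified, is to install `packaging` into the active environment and build with pip's `--no-build-isolation` flag. The sketch below only checks whether the modules are importable by the current interpreter; `missing_modules` is a hypothetical helper name.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that this interpreter cannot import."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

For example, `missing_modules(["packaging", "torch"])` returning an empty list means both are visible to this interpreter.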
/opt/tensorflow/lib/python3.10/site-packages/torch/include/c10/util/C++17.h:13:2: error: #error "You're trying to build PyTorch with a too old version of GCC. We need GCC 9 or later."
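The message above requires GCC 9 or later, so it is worth checking the compiler on PATH before retrying the build. This is a minimal sketch with hypothetical helper names. It assumes the compiler is invoked as `gcc` and relies on GCC's `-dumpversion` option, which prints a version such as `7.3.1` or `11`.

```python
import shutil
import subprocess


def gcc_major(version_text):
    """Extract the major version from text such as '7.3.1' or '11'."""
    return int(version_text.strip().split(".")[0])


def gcc_is_new_enough(version_text, minimum=9):
    """True if the major version meets the minimum that the error message names."""
    return gcc_major(version_text) >= minimum


def installed_gcc_version(compiler="gcc"):
    """Return the output of `<compiler> -dumpversion`, or None if unavailable."""
    path = shutil.which(compiler)
    if path is None:
        return None
    result = subprocess.run([path, "-dumpversion"], capture_output=True, text=True)
    return result.stdout.strip() or None
```

If `installed_gcc_version()` returns a version and `gcc_is_new_enough` returns False for it, a newer GCC has to come first on PATH (or be selected through the `CC` and `CXX` environment variables) before the extension will compile.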
torch version mismatch: 1.2.1
During training:
File "/home/ec2-user/gpt-neox/megatron/data/gpt2_dataset.py", line 189, in _build_index_mappings
    from megatron.data import helpers
ImportError: cannot import name 'helpers' from 'megatron.data' (/home/ec2-user/gpt-neox/megatron/data/__init__.py)
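The cause is that the helpers extension was not compiled correctly (see https://github.com/pytorch/pytorch/issues/120020). The Makefile in /home/ec2-user/gpt-neox/megatron/data calls python3-config, which here does not belong to the current python3.10 interpreter. Edit the Makefile (vim Makefile), change python3-config to python-config, and rebuild so that helpers is compiled for the current python3.10 interpreter.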
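One way to confirm which config tool targets the running interpreter is to compare extension suffixes. The sketch below assumes the Makefile derives the compiled module's file suffix from `<tool> --extension-suffix`, which is an assumption about its content. The function names are hypothetical. A tool whose suffix differs from the interpreter's `EXT_SUFFIX` produces a file that `import helpers` will not find.

```python
import shutil
import subprocess
import sysconfig


def current_ext_suffix():
    """Suffix the running interpreter expects, e.g. '.cpython-310-x86_64-linux-gnu.so'."""
    return sysconfig.get_config_var("EXT_SUFFIX")


def config_ext_suffix(tool):
    """Suffix reported by a python-config style tool, or None if it is unavailable."""
    path = shutil.which(tool)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--extension-suffix"], capture_output=True, text=True
    )
    return result.stdout.strip() or None


def matches_current_interpreter(tool):
    """True if the tool would build extensions for the interpreter running this code."""
    return config_ext_suffix(tool) == current_ext_suffix()
```

Run with the same python3.10 that launches training, `matches_current_interpreter("python3-config")` and `matches_current_interpreter("python-config")` show which of the two names the Makefile should use on a given machine.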
