@TomMonkeyMan

@TomMonkeyMan

1.EC2 选择

在2024年4月底，CN AWS 也可以起搭载 Nvidia A10的机器 G5 https://www.amazonaws.cn/en/new/2024/amazon-ec2-g5-instances-are-available-in-amazon-web-services-china-regions/

机型具体配置：https://www.amazonaws.cn/en/ec2/instance-types/g5/

每块 A10 显存 24G，Llama 3-8B 至少需要 16G的 GPU (实际运行的时候，是 17290MiB)，也就是1块 A10，同时需要较大内存，而g5.xlarge 只有16G内存并不能满足要求，在所有单卡的EC2中，g5.2xlarge 是最为经济实惠的，但是目前CN区域已经没有机器了，因此博主本次使用了 g5.4xlarge。

2.OS 选择

不建议大家使用不含驱动的 AWS linux，因为安装时需要解决的冲突较多，如果自己安装Nvidia 驱动和 cuda toolkit可以选择原生的Linux发行版。博主由于比较懒，直接选择了AWS已经安装好 Nvidia 驱动和一些基础包的 AMI：

Deep Learning OSS Nvidia Driver AMI GPU TensorFlow 2.16 (Amazon Linux 2) 20240509

3.启动 EC2并检查环境

执行 lspci 检查显卡是否已硬件挂载

执行 nvidia-smi 检查驱动是否安装正确

执行 nvcc --version 检查 cuda toolkit是否安装正确

需要注意，不同 OS 对 lspci 的输出结果不同，有些OS会直接显示 A10，有些则仅显示3D controller: NVIDIA Corporation Device

4.安装 python依赖

pip install torch torchvision torchaudio accelerate 注意 accelerate 这个包是需要安装的，如果你需要 import transformers

5.安装 huggingface-cli 准备下载模型

注意使用 pip install huggingface-cli 是不行的，只会安装 Metadata而不会生成可执行文件。需使用 pip install -U huggingface_hub 或 pip install transformers

6.下载 Meta Llama 3B

首先确保你已经向Meta 申请访问，使用 hugging face cli请通过huggingface 进行申请，通过后会收到邮件通知，当然也可以直接在Meta Llama的官网申请，使用官网的下载方式。本文使用huggingface cli进行下载。

然而在 CN AWS EC2的环境中，无法直接访问 global huggingface，因此需要使用镜像下载，镜像网站：https://hf-mirror.com

注意单纯配置环境变量并下载的方法极其缓慢，须安装 hfd 并下载，安装方式见镜像网站，执行示例如下，大概需要30-40min：

./hfd.sh meta-llama/Meta-Llama-3-8B-Instruct --hf_username TomMonkeyMan --hf_token hf_xxxxxxxxx --tool wget -x 4 --local-dir Meta-Llama-3-8B-Instruct

（如果想使用原始checkpoint和 tokenizer，添加参数 --include "original/*"，但是该参数在original下载后，并不会下载model-0000*的模型。但是这种下载会比较快，后续可以研究下如何使用original中的文件，生成 model文件）

7.测试 Meta Llama 3-8B 模型

transformer pipeline

Python 3.10.13 (main, May  9 2024, 19:04:24) [GCC 7.3.1 20180712 (Red Hat 7.3.1-17)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import transformers
>>> import torch
>>> model_id = ".././Meta-Llama-3-8B-Instruct"              
>>> 
>>> pipeline = transformers.pipeline(
    "text-generation", model=model_id, model_kwargs={"torch_dtype": torch.bfloat16}, device_map="auto"
)
2024-05-14 07:58:22.085347: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-05-14 07:58:22.849008: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:02&lt;00:00,  1.79it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
>>> pipeline("你在干啥？")                          
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
[{'generated_text': '你在干啥？"\n    elif action == "eat":\n        return "你在吃啥？"\n    elif action == "sleep":\n        return "你在睡觉？"\n    elif action == "watch":\n        return "你在看啥？"\n    elif action == "play":\n        return "你在玩啥？"\n    else:\n        return "我不知道你在做啥"\n```\n\nYou can use this function like this:\n\n```python\nprint(what_are_you_doing("work"))\nprint(what_are_you_doing("eat"))\nprint(what_are_you_doing("sleep"))\nprint(what_are_you_doing("watch"))\nprint(what_are_you_doing("play"))\nprint(what_are_you_doing("study"))\n```\n\nThis will output:\n\n```\n你在工作？\n你在吃啥？\n你在睡觉？\n你在看啥？\n你在玩啥？\n我不知道你在做啥\n```\n\nThis is a very basic example, you can add more actions and responses as per your requirement. You can also use more advanced techniques like using dictionaries to map actions to responses.'}]

CN AWS 使用EC2 部署 Meta Llama 3 - 8B 流程 / How to deploy Llama 3-8B on CN AWS EC2

CN AWS 使用EC2 部署 Meta Llama 3 - 8B 流程 / How to deploy Llama 3-8B on CN AWS EC2