Deploying the Llama-3.1-8B-Instruct Model Serverlessly on Modal
Date: September 21, 2024
What is Modal?
Modal is a serverless cloud platform that lets developers run and deploy code in the cloud without managing complex infrastructure.
Any code can be executed remotely within seconds, scaled out to thousands of containers, and given straightforward access to GPUs.
The platform runs compute-intensive workloads such as AI/ML jobs, batch processing, and web endpoints, and it is mainly used to train and deploy AI/ML models.
Deploying the Llama-3.1-8B-Instruct Model Serverlessly on Modal
In this post, we will walk through how to deploy the Llama-3.1-8B-Instruct model serverlessly on Modal.
References
- Run an OpenAI-Compatible vLLM Server (modal.com)
- modal-labs/modal-examples: 06_gpu_and_ml/llm-serving (github.com)
Installing the Modal library
- Create a Python project with a virtual environment.
python3 -m venv .venv
source .venv/bin/activate
- Install the Modal library and run its setup command.
pip install modal
python3 -m modal setup
- When the API token page opens in your browser, click the Authorize button to generate a token.
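To confirm that the token works, you can run a throwaway function remotely. The sketch below is a hypothetical hello_modal.py (not part of this post's deployment code): if modal run prints the greeting, the CLI is authenticated and Modal can schedule containers for you.
# hello_modal.py -- quick sanity check that `python3 -m modal setup` worked (hypothetical example)
import modal

app = modal.App("hello-modal")

@app.function()  # this function runs inside a Modal container, not locally
def hello() -> str:
    return "Hello from a Modal container!"

@app.local_entrypoint()  # runs on your machine when you call `modal run hello_modal.py`
def main():
    # .remote() ships the call to Modal's cloud and returns the result
    print(hello.remote())
Run it with modal run hello_modal.py; if the greeting is printed, the setup succeeded.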
Downloading the Llama-3.1-8B-Instruct model
Next, we will download the Llama-3.1-8B-Instruct model into Modal storage (a Volume).
Infrastructure in Modal is managed as code (IaC, Infrastructure as Code): the Volume and container image used below are declared directly in Python.
- Create a download_llama.py file in the project directory.
# ---
# args: ["--force-download"]
# ---
import modal
MODELS_DIR = "/Llama-3.1-8B-Instruct"
DEFAULT_NAME = "meta-llama/Meta-Llama-3.1-8B-Instruct"
DEFAULT_REVISION = "8c22764a7e3675c50d4c7c9a4edb474456022b16"
volume = modal.Volume.from_name("Llama-3.1-8B-Instruct", create_if_missing=True)
image = (
modal.Image.debian_slim(python_version="3.10")
.pip_install(
[
"huggingface_hub", # download models from the Hugging Face Hub
"hf-transfer", # download models faster with Rust
]
)
.env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
)
MINUTES = 60
HOURS = 60 * MINUTES
app = modal.App(
image=image, secrets=[modal.Secret.from_name("huggingface-secret")]
)
@app.function(volumes={MODELS_DIR: volume}, timeout=4 * HOURS)
def download_model(model_name, model_revision, force_download=False):
from huggingface_hub import snapshot_download
volume.reload()
snapshot_download(
model_name,
local_dir=MODELS_DIR + "/" + model_name,
ignore_patterns=[
"*.pt",
"*.bin",
"*.pth",
"original/*",
], # Ensure safetensors
revision=model_revision,
force_download=force_download,
)
volume.commit()
@app.local_entrypoint()
def main(
model_name: str = DEFAULT_NAME,
model_revision: str = DEFAULT_REVISION,
force_download: bool = False,
):
download_model.remote(model_name, model_revision, force_download)
- Run the following command to download the Llama-3.1-8B-Instruct model.
modal run download_llama.py
- When the download finishes, you can confirm in Modal's Storage view that the Llama-3.1-8B-Instruct model has been saved to the Volume.
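Besides checking the dashboard, you can also inspect the Volume from code. The sketch below is a hypothetical helper (check_volume.py, not part of the reference example) that mounts the same Volume and walks it with the standard library.
# check_volume.py -- list the downloaded weights inside the Volume (hypothetical helper)
import modal

MODELS_DIR = "/Llama-3.1-8B-Instruct"

volume = modal.Volume.from_name("Llama-3.1-8B-Instruct", create_if_missing=False)
app = modal.App("check-llama-volume")

@app.function(volumes={MODELS_DIR: volume})
def list_files():
    import os

    # walk the mounted Volume and print each file with its size
    for root, _dirs, files in os.walk(MODELS_DIR):
        for name in files:
            path = os.path.join(root, name)
            print(f"{os.path.getsize(path):>12} bytes  {path}")

@app.local_entrypoint()
def main():
    list_files.remote()
Run it with modal run check_volume.py; you should see the safetensors shards, tokenizer files, and config under /Llama-3.1-8B-Instruct/meta-llama/Meta-Llama-3.1-8B-Instruct.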
Deploying the Llama-3.1-8B-Instruct model serverlessly
Now let's deploy the Llama-3.1-8B-Instruct model on Modal as a serverless endpoint.
- Create a vllm_inference.py file in the project directory.
# ---
# deploy: true
# cmd: ["modal", "serve", "06_gpu_and_ml/llm-serving/vllm_inference.py"]
# pytest: false
# ---
# # Run an OpenAI-Compatible vLLM Server
#
# LLMs do more than just model language: they chat, they produce JSON and XML, they run code, and more.
# This has complicated their interface far beyond "text-in, text-out".
# OpenAI's API has emerged as a standard for that interface,
# and it is supported by open source LLM serving frameworks like [vLLM](https://docs.vllm.ai/en/latest/).
#
# In this example, we show how to run a vLLM server in OpenAI-compatible mode on Modal.
#
# Note that the vLLM server is a FastAPI app, which can be configured and extended just like any other.
# Here, we use it to add simple authentication middleware, following the
# [implementation in the vLLM repository](https://github.com/vllm-project/vllm/blob/v0.5.3post1/vllm/entrypoints/openai/api_server.py).
#
# Our examples repository also includes scripts for running clients and load-testing for OpenAI-compatible APIs
# [here](https://github.com/modal-labs/modal-examples/tree/main/06_gpu_and_ml/llm-serving/openai_compatible).
#
# You can find a video walkthrough of this example and the related scripts on the Modal YouTube channel
# [here](https://www.youtube.com/watch?v=QmY_7ePR1hM).
#
# ## Set up the container image
#
# Our first order of business is to define the environment our server will run in:
# the [container `Image`](https://modal.com/docs/guide/custom-container).
# vLLM can be installed with `pip`.
import modal
vllm_image = modal.Image.debian_slim(python_version="3.10").pip_install(
"vllm==0.5.3post1"
)
# ## Download the model weights
#
# We'll be running a pretrained foundation model -- Meta's LLaMA 3.1 8B
# in the Instruct variant that's trained to chat and follow instructions.
MODELS_DIR = "/Llama-3.1-8B-Instruct"
MODEL_NAME = "meta-llama/Meta-Llama-3.1-8B-Instruct"
MODEL_REVISION = "8c22764a7e3675c50d4c7c9a4edb474456022b16"
# We need to make the weights of that model available to our Modal Functions.
#
# So to follow along with this example, you'll need to download those weights
# onto a Modal Volume by running another script from the
# [examples repository](https://github.com/modal-labs/modal-examples).
try:
volume = modal.Volume.lookup("Llama-3.1-8B-Instruct", create_if_missing=False)
except modal.exception.NotFoundError:
raise Exception("Download models first with modal run download_llama.py")
# ## Build a vLLM engine and serve it
#
# vLLM's OpenAI-compatible server is exposed as a [FastAPI](https://fastapi.tiangolo.com/) router.
#
# FastAPI is a Python web framework that implements the [ASGI standard](https://en.wikipedia.org/wiki/Asynchronous_Server_Gateway_Interface),
# much like [Flask](https://en.wikipedia.org/wiki/Flask_(web_framework)) is a Python web framework
# that implements the [WSGI standard](https://en.wikipedia.org/wiki/Web_Server_Gateway_Interface).
#
# Modal offers [first-class support for ASGI (and WSGI) apps](https://modal.com/docs/guide/webhooks). We just need to decorate a function that returns the app
# with `@modal.asgi_app()` (or `@modal.wsgi_app()`) and then add it to the Modal app with the `app.function` decorator.
#
# The function below first imports the FastAPI router from the vLLM library, then adds authentication compatible with OpenAI client libraries. You might also add more routes here.
#
# Then, the function creates an `AsyncLLMEngine`, the core of the vLLM server. It's responsible for loading the model, running inference, and serving responses.
#
# After attaching that engine to the FastAPI app via the `api_server` module of the vLLM library, we return the FastAPI app
# so it can be served on Modal.
app = modal.App("Llama-3.1-8B-Instruct-App")
N_GPU = 1 # tip: for best results, first upgrade to more powerful GPUs, and only then increase GPU count
TOKEN = "super-secret-token" # auth token. for production use, replace with a modal.Secret
MINUTES = 60 # seconds
HOURS = 60 * MINUTES
@app.function(
image=vllm_image,
gpu=modal.gpu.A100(count=N_GPU, size="40GB"),
container_idle_timeout=5 * MINUTES,
timeout=24 * HOURS,
allow_concurrent_inputs=100,
volumes={MODELS_DIR: volume},
)
@modal.asgi_app()
def serve():
import fastapi
import vllm.entrypoints.openai.api_server as api_server
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.entrypoints.logger import RequestLogger
from vllm.entrypoints.openai.serving_chat import OpenAIServingChat
from vllm.entrypoints.openai.serving_completion import (
OpenAIServingCompletion,
)
from vllm.usage.usage_lib import UsageContext
volume.reload() # ensure we have the latest version of the weights
# create a fastAPI app that uses vLLM's OpenAI-compatible router
web_app = fastapi.FastAPI(
title=f"OpenAI-compatible {MODEL_NAME} server",
description="Run an OpenAI-compatible LLM server with vLLM on modal.com",
version="0.0.1",
docs_url="/docs",
)
# security: CORS middleware for external requests
http_bearer = fastapi.security.HTTPBearer(
scheme_name="Bearer Token",
description="See code for authentication details.",
)
web_app.add_middleware(
fastapi.middleware.cors.CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
# security: inject dependency on authed routes
async def is_authenticated(api_key: str = fastapi.Security(http_bearer)):
if api_key.credentials != TOKEN:
raise fastapi.HTTPException(
status_code=fastapi.status.HTTP_401_UNAUTHORIZED,
detail="Invalid authentication credentials",
)
return {"username": "authenticated_user"}
router = fastapi.APIRouter(dependencies=[fastapi.Depends(is_authenticated)])
# wrap vllm's router in auth router
router.include_router(api_server.router)
# add authed vllm to our fastAPI app
web_app.include_router(router)
engine_args = AsyncEngineArgs(
model=MODELS_DIR + "/" + MODEL_NAME,
tensor_parallel_size=N_GPU,
gpu_memory_utilization=0.90,
max_model_len=8096,
enforce_eager=False, # capture the graph for faster inference, but slower cold starts (30s > 20s)
)
engine = AsyncLLMEngine.from_engine_args(
engine_args, usage_context=UsageContext.OPENAI_API_SERVER
)
model_config = get_model_config(engine)
request_logger = RequestLogger(max_log_len=2048)
api_server.openai_serving_chat = OpenAIServingChat(
engine,
model_config=model_config,
served_model_names=[MODEL_NAME],
chat_template=None,
response_role="assistant",
lora_modules=[],
prompt_adapters=[],
request_logger=request_logger,
)
api_server.openai_serving_completion = OpenAIServingCompletion(
engine,
model_config=model_config,
served_model_names=[MODEL_NAME],
lora_modules=[],
prompt_adapters=[],
request_logger=request_logger,
)
return web_app
# ## Deploy the server
#
# To deploy the API on Modal, just run
# ```bash
# modal deploy vllm_inference.py
# ```
#
# This will create a new app on Modal, build the container image for it, and deploy.
#
# ## Interact with the server
#
# Once it is deployed, you'll see a URL appear in the command line,
# something like `https://your-workspace-name--example-vllm-openai-compatible-serve.modal.run`.
#
# You can find [interactive Swagger UI docs](https://swagger.io/tools/swagger-ui/)
# at the `/docs` route of that URL, i.e. `https://your-workspace-name--example-vllm-openai-compatible-serve.modal.run/docs`.
# These docs describe each route and indicate the expected input and output
# and translate requests into `curl` commands. They also demonstrate authentication.
#
# For simple routes like `/health`, which checks whether the server is responding,
# you can even send a request directly from the docs.
#
# To interact with the API programmatically, you can use the Python `openai` library.
#
# See the `client.py` script in the examples repository
# [here](https://github.com/modal-labs/modal-examples/tree/main/06_gpu_and_ml/llm-serving/openai_compatible)
# to take it for a spin:
#
# ```bash
# # pip install openai==1.13.3
# python openai_compatible/client.py
# ```
#
# We also include a basic example of a load-testing setup using
# `locust` in the `load_test.py` script [here](https://github.com/modal-labs/modal-examples/tree/main/06_gpu_and_ml/llm-serving/openai_compatible):
#
# ```bash
# modal run openai_compatible/load_test.py
# ```
#
# ## Addenda
#
# The rest of the code in this example is utility code.
def get_model_config(engine):
import asyncio
try: # adapted from vLLM source -- https://github.com/vllm-project/vllm/blob/507ef787d85dec24490069ffceacbd6b161f4f72/vllm/entrypoints/openai/api_server.py#L235C1-L247C1
event_loop = asyncio.get_running_loop()
except RuntimeError:
event_loop = None
if event_loop is not None and event_loop.is_running():
# If the current process is instantiated by Ray Serve,
# there is already a running event loop
model_config = event_loop.run_until_complete(engine.get_model_config())
else:
# When using single vLLM without engine_use_ray
model_config = asyncio.run(engine.get_model_config())
return model_config
In the @app.function() decorator above, you can configure the GPU type and count used for the deployment, the container idle timeout, the number of concurrent requests per container, and other options.
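As a concrete illustration, the variant below is a hedged sketch (not part of the reference example): it requests the 80 GB A100, keeps one container warm so the first request avoids a cold start, and lowers the per-container concurrency. The parameter names follow the decorator shown above, and keep_warm is assumed to be available in the installed Modal version.
# A hypothetical variation on the @app.function() decorator above -- tune to your quota and budget.
@app.function(
    image=vllm_image,
    gpu=modal.gpu.A100(count=N_GPU, size="80GB"),  # 80 GB A100 for more KV-cache headroom
    keep_warm=1,  # keep one container running to avoid cold starts (billed while idle)
    container_idle_timeout=2 * MINUTES,  # scale idle containers down sooner
    timeout=24 * HOURS,
    allow_concurrent_inputs=32,  # cap concurrent requests handled per container
    volumes={MODELS_DIR: volume},
)
@modal.asgi_app()
def serve():
    ...  # body identical to the serve() function above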
- Run the following command to deploy the Llama-3.1-8B-Instruct model.
modal deploy vllm_inference.py
- Once the deployment completes, you can see the deployed app in the Modal dashboard, as shown below.
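Before writing a client, you can make sure the server responds by calling the /health route exposed by vLLM's router. The sketch below is a hypothetical helper (health_check.py): the URL is a placeholder to replace with the one printed by modal deploy, and the request carries the bearer token because the route is wrapped by the authentication dependency added in serve().
# health_check.py -- ping the deployed server's /health route (hypothetical helper)
import urllib.request

# replace with the URL printed by `modal deploy` for your workspace
BASE_URL = "https://your-workspace-name--llama-3-1-8b-instruct-app-serve.modal.run"
TOKEN = "super-secret-token"  # must match TOKEN in vllm_inference.py

request = urllib.request.Request(
    BASE_URL + "/health",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
with urllib.request.urlopen(request) as response:
    # a 200 status means the container is up and the vLLM engine is serving
    print("status:", response.status)
Running python3 health_check.py should print status: 200 once the weights have loaded; the very first call may take a while because of the cold start.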
Sending API requests to the deployed server
Let's send API requests to the deployed server using a simple client.py script.
- First, install the OpenAI library.
pip install openai
- Write the following code in a client.py file.
"""This simple script shows how to interact with an OpenAI-compatible server from a client."""
import argparse
import modal
from openai import OpenAI
class Colors:
"""ANSI color codes"""
GREEN = "\033[0;32m"
RED = "\033[0;31m"
BLUE = "\033[0;34m"
GRAY = "\033[0;90m"
BOLD = "\033[1m"
END = "\033[0m"
def get_completion(client, model_id, messages, args):
completion_args = {
"model": model_id,
"messages": messages,
"frequency_penalty": args.frequency_penalty,
"max_tokens": args.max_tokens,
"n": args.n,
"presence_penalty": args.presence_penalty,
"seed": args.seed,
"stop": args.stop,
"stream": args.stream,
"temperature": args.temperature,
"top_p": args.top_p,
}
completion_args = {
k: v for k, v in completion_args.items() if v is not None
}
try:
response = client.chat.completions.create(**completion_args)
return response
except Exception as e:
print(Colors.RED, f"Error during API call: {e}", Colors.END, sep="")
return None
def main():
parser = argparse.ArgumentParser(description="OpenAI Client CLI")
parser.add_argument(
"--model",
type=str,
default=None,
help="The model to use for completion, defaults to the first available model",
)
parser.add_argument(
"--workspace",
type=str,
default=None,
help="The workspace where the LLM server app is hosted, defaults to your current Modal workspace",
)
parser.add_argument(
"--app-name",
type=str,
default="Llama-3.1-8B-Instruct-App",
help="A Modal App serving an OpenAI-compatible API",
)
parser.add_argument(
"--function-name",
type=str,
default="serve",
help="A Modal Function serving an OpenAI-compatible API. Append `-dev` to use a `modal serve`d Function.",
)
parser.add_argument(
"--api-key",
type=str,
default="super-secret-token",
help="The API key to use for authentication, set in your api.py",
)
# Completion parameters
parser.add_argument("--max-tokens", type=int, default=None)
parser.add_argument("--temperature", type=float, default=0.7)
parser.add_argument("--top-p", type=float, default=0.9)
parser.add_argument("--top-k", type=int, default=0)
parser.add_argument("--frequency-penalty", type=float, default=0)
parser.add_argument("--presence-penalty", type=float, default=0)
parser.add_argument(
"--n",
type=int,
default=1,
help="Number of completions to generate. Streaming and chat mode only support n=1.",
)
parser.add_argument("--stop", type=str, default=None)
parser.add_argument("--seed", type=int, default=None)
# Prompting
parser.add_argument(
"--prompt",
type=str,
default="Compose a limerick about baboons and racoons.",
help="The user prompt for the chat completion",
)
parser.add_argument(
"--system-prompt",
type=str,
default="You are a poetic assistant, skilled in writing satirical doggerel with creative flair.",
help="The system prompt for the chat completion",
)
# UI options
parser.add_argument(
"--no-stream",
dest="stream",
action="store_false",
help="Disable streaming of response chunks",
)
parser.add_argument(
"--chat", action="store_true", help="Enable interactive chat mode"
)
args = parser.parse_args()
client = OpenAI(api_key=args.api_key)
workspace = args.workspace or modal.config._profile
client.base_url = f"https://{workspace}--{args.app_name}-{args.function_name}.modal.run/v1"
if args.model:
model_id = args.model
print(
Colors.BOLD,
f"🧠: Using model {model_id}. This may trigger a model load on first call!",
Colors.END,
sep="",
)
else:
print(
Colors.BOLD,
f"🔎: Looking up available models on server at {client.base_url}. This may trigger a model load!",
Colors.END,
sep="",
)
model = client.models.list().data[0]
model_id = model.id
print(
Colors.BOLD,
f"🧠: Using {model_id}",
Colors.END,
sep="",
)
messages = [
{
"role": "system",
"content": args.system_prompt,
}
]
print(
Colors.BOLD
+ "🧠: Using system prompt: "
+ args.system_prompt
+ Colors.END
)
if args.chat:
print(
Colors.GREEN
+ Colors.BOLD
+ "\nEntering chat mode. Type 'bye' to end the conversation."
+ Colors.END
)
while True:
user_input = input("\nYou: ")
if user_input.lower() in ["bye"]:
break
MAX_HISTORY = 10
if len(messages) > MAX_HISTORY:
messages = messages[:1] + messages[-MAX_HISTORY + 1 :]
messages.append({"role": "user", "content": user_input})
response = get_completion(client, model_id, messages, args)
if response:
if args.stream:
# only stream assuming n=1
print(Colors.BLUE + "\n🤖: ", end="")
assistant_message = ""
for chunk in response:
if chunk.choices[0].delta.content:
content = chunk.choices[0].delta.content
print(content, end="")
assistant_message += content
print(Colors.END)
else:
assistant_message = response.choices[0].message.content
print(
Colors.BLUE + "\n🤖:" + assistant_message + Colors.END,
sep="",
)
messages.append(
{"role": "assistant", "content": assistant_message}
)
else:
messages.append({"role": "user", "content": args.prompt})
print(Colors.GREEN + f"\nYou: {args.prompt}" + Colors.END)
response = get_completion(client, model_id, messages, args)
if response:
if args.stream:
print(Colors.BLUE + "\n🤖:", end="")
for chunk in response:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="")
print(Colors.END)
else:
# only case where multiple completions are returned
for i, response in enumerate(response.choices):
print(
Colors.BLUE
+ f"\n🤖 Choice {i+1}:{response.message.content}"
+ Colors.END,
sep="",
)
if __name__ == "__main__":
main()
- Run the script with the following command.
python3 client.py
- The first request takes around 30 to 40 seconds because of the cold start; after that, responses come back in roughly 2 seconds.
🧠: Using meta-llama/Meta-Llama-3.1-8B-Instruct
🧠: Using system prompt: You are a poetic assistant, skilled in writing satirical doggerel with creative flair.
You: Compose a limerick about baboons and racoons.
🤖:There once were baboons with a flair,
Whose raccoon friends came with care,
Together they'd play,
On a sunny day,
Their mischief beyond compare.
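For one-off calls you do not need the full CLI client; the same endpoint can be reached with a few lines of the openai library. The snippet below is a minimal sketch under the same assumptions as above (a placeholder URL to replace with the one printed by modal deploy, and the hard-coded super-secret-token from vllm_inference.py).
# minimal_client.py -- one-off chat completion against the deployed server (hypothetical helper)
from openai import OpenAI

client = OpenAI(
    api_key="super-secret-token",  # must match TOKEN in vllm_inference.py
    # replace with the URL printed by `modal deploy`, plus the /v1 prefix
    base_url="https://your-workspace-name--llama-3-1-8b-instruct-app-serve.modal.run/v1",
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize what Modal is in one sentence."},
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
As with client.py, the first call may take tens of seconds while a container cold-starts; subsequent calls return much faster.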