Viewing LLM responses as streamed Markdown in Jupyter

1. Background

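There are times when you want to test a chatbot built on an LLM API such as OpenAI or Bedrock, for example to tune prompts and parameters. You could build a chat screen with streamlit for this, but sometimes running the tests in Jupyter Notebook (or JupyterLab) is a better fit. In that setting I wanted to receive answers as a stream, that is, with the answer appearing token by token as in ChatGPT or Copilot.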

Of course, calling `print(..., end='')` on each token lets you watch the output in real time, but I wanted to see the answer rendered as Markdown.
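For comparison, the plain-text approach can be sketched as follows. The token list is a made-up stand-in for an LLM stream, and `stream_print` is a hypothetical helper name, not part of any library. The Markdown markers are printed as raw characters, which is the limitation that motivated the rest of this post.

```python
import time


def stream_print(tokens, delay=0.0):
    """Print tokens as they arrive, without newlines, and return the full text."""
    parts = []
    for token in tokens:
        print(token, end="", flush=True)  # flush so each token shows up immediately
        parts.append(token)
        time.sleep(delay)  # simulated latency between tokens
    print()  # finish the line
    return "".join(parts)


# Hypothetical tokens standing in for a streamed LLM answer
text = stream_print(["**Hello**", ", ", "world", "!"])
```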

Wrapping it in a chatbot class so that earlier turns carry over is a bonus.

2. Implementing a chatbot class with the OpenAI API

Below is the complete code. I will walk through it piece by piece.

from openai import OpenAI
from collections import deque
import tiktoken
from IPython.display import Markdown, DisplayHandle

def num_tokens_from_messages(messages, model="gpt-3.5-turbo-0613"):
    "Return the number of tokens used by a list of messages."
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # print("Warning: model not found. Using cl100k_base encoding.")
        encoding = tiktoken.get_encoding("cl100k_base")
    if model in {
        "gpt-3.5-turbo-0613",
        "gpt-3.5-turbo-16k-0613",
        "gpt-4-0314",
        "gpt-4-32k-0314",
        "gpt-4-0613",
        "gpt-4-32k-0613",
        }:
        tokens_per_message = 3
        tokens_per_name = 1
    elif model == "gpt-3.5-turbo-0301":
        tokens_per_message = 4  # every message follows <|start|>{role/name}\n{content}<|end|>\n
        tokens_per_name = -1  # if there's a name, the role is omitted
    elif "gpt-3.5-turbo" in model:
        return num_tokens_from_messages(messages, model="gpt-3.5-turbo-0613")
    elif "gpt-4" in model:
        return num_tokens_from_messages(messages, model="gpt-4-0613")
    else:
        raise NotImplementedError(
            f"num_tokens_from_messages() is not implemented for model {model}. See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens."
        )
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
    return num_tokens


class Chatbot:
    def __init__(self, api_key, max_history=10, gpt_model='gpt-4o', max_tokens=4096, system_msg="You are an assistant."):
        # Initialize the OpenAI API client
        self.client = OpenAI(api_key=api_key)
        self.gpt_model = gpt_model
        self.max_tokens = max_tokens
        # deque that holds the conversation history
        self.chat_history = deque(maxlen=max_history)
        self.system_msg = [{"role":"system", "content":system_msg}]

        
    def adjust_max_tokens(self):
        messages = self.system_msg + list(self.chat_history)
        num_tokens = num_tokens_from_messages(messages, model=self.gpt_model)

        # Trim the history based on the number of tokens already in the messages
        if self.gpt_model == 'gpt-3.5-turbo':
            # Reserve room for max_tokens of output
            while num_tokens > 4096 - self.max_tokens and self.chat_history:
                self.chat_history.popleft()
                num_tokens = num_tokens_from_messages(self.system_msg + list(self.chat_history), model=self.gpt_model)
        elif self.gpt_model == 'gpt-4':
            while num_tokens > 8192 - self.max_tokens and self.chat_history:
                self.chat_history.popleft()
                num_tokens = num_tokens_from_messages(self.system_msg + list(self.chat_history), model=self.gpt_model)
        elif self.gpt_model in ['gpt-4-turbo', 'gpt-4o']:
            # Keep the total under 50,000 tokens (lowered for cost and speed)
            # Raise to 128000 if needed
            max_tokens = 50000
            while num_tokens > max_tokens and self.chat_history:
                self.chat_history.popleft()
                num_tokens = num_tokens_from_messages(self.system_msg + list(self.chat_history), model=self.gpt_model)
                
                
                
    def ask(self, question):
        # Append the new question to the deque
        self._add_to_history({"role":"user", "content":question})
        self.adjust_max_tokens()  # trim the history to fit the token budget

        # Build the prompt from the system message and the previous turns
        prompt = self.system_msg + list(self.chat_history)

        dh = DisplayHandle()  # create a DisplayHandle instance
        dh.display(Markdown(''))
        
        # API request
        answer = ''
        stream = self.client.chat.completions.create(
            model=self.gpt_model,
            max_tokens=self.max_tokens,
            messages=prompt,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
            stream=True
        )

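        # Consume the stream and re-render the accumulated answer as Markdown
        for chunk in stream:
            if not chunk.choices:  # skip chunks without choices (e.g. a usage-only chunk)
                continue
            resp = chunk.choices[0].delta.content
            if resp is not None:
                answer += resp
                dh.update(Markdown(answer))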
        
        # Add the answer to the conversation history
        self._add_to_history({"role":"assistant", "content":answer})
        
        return answer

    def _add_to_history(self, message):
        self.chat_history.append(message)

The `num_tokens_from_messages` function is taken from the OpenAI Cookbook repository on GitHub.

2-1. Defining the chatbot class

class Chatbot:
    def __init__(self, api_key, max_history=10, gpt_model='gpt-4o', max_tokens=4096, system_msg="You are an assistant."):
        # Initialize the OpenAI API client
        self.client = OpenAI(api_key=api_key)
        self.gpt_model = gpt_model
        self.max_tokens = max_tokens
        # deque that holds the conversation history
        self.chat_history = deque(maxlen=max_history)
        self.system_msg = [{"role":"system", "content":system_msg}]

- `api_key`: OpenAI API key

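- `max_history`: maximum number of messages (user + LLM) to keep in the history

If the conversation passed to the API is long, (1) it can hit the token limit, (2) the answer takes longer to arrive, and (3) the cost goes up, so I put a cap on it. These constraints have at least become less of a burden since the GPT-4o model was released.

- `gpt_model`: GPT model to use

- `max_tokens`: value passed as `max_tokens` to the OpenAI API (the maximum number of tokens in the answer)

- `system_msg`: the chatbot's role (prompt engineering)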

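Here the history is kept in memory with a `deque`, but at the service level the conversation history would typically be stored in a database such as MongoDB (along with the conversation id, user key, timestamp, and so on).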
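As a small illustration of how `deque(maxlen=...)` bounds the history, the sketch below appends five made-up messages to a deque capped at three. The messages and the cap are hypothetical values chosen for the example. Because the cap counts individual messages rather than question/answer pairs, `max_history=10` keeps roughly the last five exchanges.

```python
from collections import deque

# Hypothetical history capped at 3 messages
history = deque(maxlen=3)
for i in range(5):
    history.append({"role": "user", "content": f"message {i}"})

# Only the 3 most recent messages remain; older ones were dropped from the left
remaining = [m["content"] for m in history]
print(remaining)
```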

2-2. Adjusting the conversation length

    def adjust_max_tokens(self):
        messages = self.system_msg + list(self.chat_history)
        num_tokens = num_tokens_from_messages(messages, model=self.gpt_model)

        # Trim the history based on the number of tokens already in the messages
        if self.gpt_model == 'gpt-3.5-turbo':
            # Reserve room for max_tokens of output
            while num_tokens > 4096 - self.max_tokens and self.chat_history:
                self.chat_history.popleft()
                num_tokens = num_tokens_from_messages(self.system_msg + list(self.chat_history), model=self.gpt_model)
        elif self.gpt_model == 'gpt-4':
            while num_tokens > 8192 - self.max_tokens and self.chat_history:
                self.chat_history.popleft()
                num_tokens = num_tokens_from_messages(self.system_msg + list(self.chat_history), model=self.gpt_model)
        elif self.gpt_model in ['gpt-4-turbo', 'gpt-4o']:
            # Keep the total under 50,000 tokens (lowered for cost and speed)
            # Raise to 128000 if needed
            max_tokens = 50000
            while num_tokens > max_tokens and self.chat_history:
                self.chat_history.popleft()
                num_tokens = num_tokens_from_messages(self.system_msg + list(self.chat_history), model=self.gpt_model)

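`max_history` already limits the length of the conversation, but if the input is too long the OpenAI API returns an error saying the maximum number of tokens was exceeded, so this method is meant to prevent that.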

▼ Error when the gpt-4o model is given the input message 'hi' with `max_tokens=999999`

BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 128000 tokens. However, you requested 1000007 tokens (8 in the messages, 999999 in the completion). Please reduce the length of the messages or completion.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}

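The error message shows that input (messages) and output (completion) together can use up to 128000 tokens. So the chat history is trimmed so that the output length the user asked for (`max_tokens`) is guaranteed and whatever remains is used for the input. Note that for gpt-4-turbo and gpt-4o the code takes a simpler route and caps the input at a fixed 50,000 tokens instead.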

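For example, suppose `max_tokens` is 2048, the model is gpt-3.5-turbo, and the messages have token counts [1000, 1000, 1500, 500]. The context limit of gpt-3.5-turbo (in the older version) is 4096 tokens and the user wants 2048 output tokens, so the input can be at most 2048 tokens. The oldest messages are dropped one at a time until the total is 2048 or less, leaving only [1500, 500].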
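The arithmetic of this example can be checked with a minimal sketch. It is a simplification: `trim_history` is a hypothetical helper that works on bare token counts, so it ignores the system message and the per-message overhead that `num_tokens_from_messages` adds in the real class.

```python
from collections import deque


def trim_history(token_counts, context_limit, max_tokens):
    """Drop the oldest entries until the total fits in context_limit - max_tokens."""
    history = deque(token_counts)
    budget = context_limit - max_tokens  # tokens left over for the input
    while history and sum(history) > budget:
        history.popleft()  # discard the oldest message first
    return list(history)


kept = trim_history([1000, 1000, 1500, 500], context_limit=4096, max_tokens=2048)
print(kept)
```

With these numbers the budget is 4096 - 2048 = 2048, the totals go 4000, 3000, 2000 as messages are dropped, and the loop stops once the total is within the budget.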

2-3. Calling the API and streaming the conversation

    def ask(self, question):
        # Append the new question to the deque
        self._add_to_history({"role":"user", "content":question})
        self.adjust_max_tokens()  # trim the history to fit the token budget

        # Build the prompt from the system message and the previous turns
        prompt = self.system_msg + list(self.chat_history)

        dh = DisplayHandle()  # create a DisplayHandle instance
        dh.display(Markdown(''))
        
        # API request
        answer = ''
        stream = self.client.chat.completions.create(
            model=self.gpt_model,
            max_tokens=self.max_tokens,
            messages=prompt,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
            stream=True
        )

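        # Consume the stream and re-render the accumulated answer as Markdown
        for chunk in stream:
            if not chunk.choices:  # skip chunks without choices (e.g. a usage-only chunk)
                continue
            resp = chunk.choices[0].delta.content
            if resp is not None:
                answer += resp
                dh.update(Markdown(answer))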
        
        # Add the answer to the conversation history
        self._add_to_history({"role":"assistant", "content":answer})
        
        return answer

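First the user's question is added to the conversation history and the token count is adjusted. The API is then called with the `stream=True` parameter. Using a `DisplayHandle` instance from `IPython.display`, the output can be re-rendered as Markdown each time a token arrives.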
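The display part can be tried without an API key. The sketch below is one possible way to isolate it, not the code from the class: `stream_markdown` and its `render` parameter are hypothetical names, and the chunk list stands in for the `delta.content` values of a real stream (with `None` mimicking chunks that carry no content). IPython is imported lazily so that the accumulation logic can also be exercised outside a notebook.

```python
import time


def stream_markdown(chunks, render=None, delay=0.0):
    """Accumulate text chunks and re-render the whole answer after each one."""
    if render is None:
        # Imported lazily so the function can be exercised outside a notebook
        from IPython.display import DisplayHandle, Markdown

        dh = DisplayHandle()
        dh.display(Markdown(""))  # create the output area once

        def render(text):
            dh.update(Markdown(text))  # replace the existing output in place

    answer = ""
    for chunk in chunks:
        if chunk is None:  # mimic deltas that carry no content
            continue
        answer += chunk
        render(answer)
        time.sleep(delay)  # simulated latency between chunks
    return answer


# Outside a notebook, pass a custom renderer to inspect what would be shown
frames = []
answer = stream_markdown(["# Ti", "tle\n\n", None, "- item"], render=frames.append)
```

In a notebook cell, calling `stream_markdown(["# Ti", "tle\n\n", "- item"], delay=0.3)` without a `render` argument should show the heading and list forming in a single output area. The whole accumulated string is rendered each time because Markdown cannot be rendered reliably one fragment at a time.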

https://github.com/woojangchang/TIL/blob/master/LLM/Chatbot_jupyter.py
