[Fast Campus Challenge Day 41] Model Tracking

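In the previous post, I covered printing model metrics to the console and using the open-source tool CML to publish them as a markdown report. The drawback of that approach is that a separate report is attached to each commit, so comparing results across commits is cumbersome. In this post, I check whether DVC lets us compare and track those results in one place.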

This time I walk through tracking with a different dataset (the data itself is not the important part). Start by forking this GitHub repository.

After forking, run the process_data.py script included in the repository to produce the preprocessed data, then run train.py. (If you get the error "AttributeError: 'Series' object has no attribute 'to_numpy'", run "pip install --upgrade pandas".)
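
The upgrade helps because Series.to_numpy was added in pandas 0.24.0, so older installs do not have the method. As a minimal sketch, the check below decides from a version string whether to_numpy should be available. The helper name supports_to_numpy is hypothetical and not part of the repository.

```python
import re


def supports_to_numpy(pandas_version: str) -> bool:
    """Return True if the given pandas version string is 0.24 or newer."""
    match = re.match(r"(\d+)\.(\d+)", pandas_version)
    if match is None:
        raise ValueError(f"unrecognized version string: {pandas_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Series.to_numpy first appeared in pandas 0.24.0
    return (major, minor) >= (0, 24)


if __name__ == "__main__":
    # In a real environment you would pass pandas.__version__ instead
    for version in ("0.23.4", "0.24.0", "1.3.5"):
        print(version, supports_to_numpy(version))
```

If the check returns False for your installed version, the upgrade command above is the fix.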

- train.py

import pandas as pd 
import numpy as np
from sklearn.linear_model import LogisticRegression
# from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn import preprocessing
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split
import json
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer

df = pd.read_csv("data_processed.csv")

#### Get features ready to model! 
y = df.pop("cons_general").to_numpy()
y[y< 4] = 0
y[y>= 4] = 1

X = df.to_numpy()
X = preprocessing.scale(X) # Standardize features to zero mean and unit variance
# Impute NaNs with the column mean

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(X)
X = imp.transform(X)


# Linear model
# clf = QuadraticDiscriminantAnalysis()
clf = LogisticRegression()
yhat = cross_val_predict(clf, X, y, cv=5)

acc = np.mean(yhat==y)
tn, fp, fn, tp = confusion_matrix(y, yhat).ravel()
specificity = tn / (tn+fp)
sensitivity = tp / (tp + fn)

# Now print to file
with open("metrics.json", 'w') as outfile:
    json.dump({"accuracy": acc, "specificity": specificity, "sensitivity": sensitivity}, outfile)

# Let's visualize within several slices of the dataset
score = yhat == y
score_int = [int(s) for s in score]
df['pred_accuracy'] = score_int

# Bar plot by region

sns.set_color_codes("dark")
ax = sns.barplot(x="region", y="pred_accuracy", data=df, palette = "Greens_d")
ax.set(xlabel="Region", ylabel = "Model accuracy")
plt.savefig("by_region.png",dpi=80)
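
To make the three numbers written to metrics.json concrete, here is a small sketch that computes them by hand from 0/1 labels, without scikit-learn. The function name binary_metrics and the toy labels are made up for illustration; train.py itself gets the four counts from confusion_matrix.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, specificity and sensitivity for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "specificity": tn / (tn + fp),  # share of actual 0s predicted as 0
        "sensitivity": tp / (tp + fn),  # share of actual 1s predicted as 1
    }


# Toy labels, invented for this example only
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

With these toy labels there are 3 true negatives, 1 false positive, 2 false negatives and 2 true positives, which gives an accuracy of 0.625, a specificity of 0.75 and a sensitivity of 0.5.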

Now let's put dvc to use.

pip install dvc

Next, build the pipeline. First initialize DVC, then create the dvc.yaml file.

- initialize

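# Initialize
dvc init
# Create dvc.yaml (on DVC 3.x, where `dvc run` was removed, `dvc stage add` takes the same options without --no-exec)
dvc run -n process -d process_data.py -d data_raw.csv -o data_processed.csv --no-exec python process_data.py
# The train stage listed in dvc.yaml can be added the same way (-M marks metrics.json as an uncached metrics file)
dvc run -n train -d train.py -d data_processed.csv -o by_region.png -M metrics.json --no-exec python train.py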

- dvc.yaml

stages:
  process:
    cmd: python process_data.py
    deps:
    - process_data.py
    - data_raw.csv
    outs:
    - data_processed.csv
  train:
    cmd: python train.py
    deps:
    - train.py
    - data_processed.csv
    outs:
    - by_region.png
    metrics:
    - metrics.json:
        cache: false
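
The deps and outs in dvc.yaml are what link the two stages: train depends on data_processed.csv, which process produces. As a rough sketch of how an execution order follows from that (not DVC's actual implementation), the snippet below rewrites the same stages as a plain dict and sorts them topologically. The function name stage_order is hypothetical, and graphlib requires Python 3.9 or newer.

```python
from graphlib import TopologicalSorter

# The same stages as in dvc.yaml, written as a plain dict for illustration
stages = {
    "process": {
        "deps": ["process_data.py", "data_raw.csv"],
        "outs": ["data_processed.csv"],
    },
    "train": {
        "deps": ["train.py", "data_processed.csv"],
        "outs": ["by_region.png", "metrics.json"],
    },
}


def stage_order(stages):
    """Return stage names so that each stage comes after the stages it needs."""
    # Map every output file to the stage that produces it
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    # A stage must run after any stage that produces one of its dependencies
    graph = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = stage_order(stages)
print(order)
```

Here process comes before train, because train reads the file that process writes.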

Then run the reproduction, which regenerates the outputs according to the pipeline.

dvc repro
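
dvc repro only reruns a stage when something it depends on has changed. The sketch below illustrates that idea with content hashes, assuming a dict of recorded hashes that stands in for what DVC keeps in dvc.lock. The helpers file_md5 and is_stale are hypothetical names, and this is a simplification rather than DVC's real logic.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path):
    """Return the MD5 hex digest of a file's contents."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def is_stale(deps, recorded):
    """True if any dependency's current hash differs from the recorded one."""
    return any(recorded.get(str(dep)) != file_md5(dep) for dep in deps)


with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "train.py"
    script.write_text("clf = 'LogisticRegression'\n")
    # Record the hash after the first "run" (stand-in for dvc.lock)
    recorded = {str(script): file_md5(script)}
    stale_before_edit = is_stale([script], recorded)
    # Editing the script changes its hash, so the stage would need to rerun
    script.write_text("clf = 'QuadraticDiscriminantAnalysis'\n")
    stale_after_edit = is_stale([script], recorded)

print(stale_before_edit, stale_after_edit)
```

The stage is up to date right after the hash is recorded and becomes stale as soon as the script content changes.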

Now create train.yaml in a separate path (GitHub Actions reads workflow files from the .github/workflows/ directory).

- train.yaml

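This workflow produces a diff. git diff is normally used to see how code changed rather than to compare numbers, and it makes it hard to follow how a metrics file has changed over time. dvc diff, by contrast, shows the numeric comparison and makes the effect of a change in modeling strategy visible.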

name: dvc-cml
on: [push]
jobs:
  run:
    runs-on: [ubuntu-latest]
    container: docker://dvcorg/cml-py3:latest
    steps:
      - uses: actions/checkout@v2
      - name: cml_run
        env:
          repo_token: ${{ secrets.GITHUB_TOKEN }}
        run: |
          pip install -r requirements.txt
          dvc repro 

          git fetch --prune ## https://git-scm.com/docs/git-fetch
          dvc metrics diff --show-md master > report.md

          echo "## Validating results by region" >> report.md
          cml-publish by_region.png --md >> report.md
          cml-send-comment report.md

Next, I will test by changing the model. Create a new branch named "experiment".

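Then modify part of the train.py shown above by swapping which lines are commented out, so that a discriminant analysis model (QuadraticDiscriminantAnalysis) is used instead of logistic regression.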

# from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
...
# clf = LogisticRegression()
clf = QuadraticDiscriminantAnalysis()

Now commit this change and run the build.

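If you open the commit, you can see the resulting comparison report. Old is logistic regression and New is the discriminant analysis model. The takeaway is that when the model changes on a branch, you can see the before and after results together with the difference between them (Change).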
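
As a minimal sketch of what such an Old / New / Change comparison involves, the function below builds a markdown table from two metrics dicts. The name metrics_diff_md, the exact column layout, and the numbers are placeholders for illustration. They are not the real output of dvc metrics diff or results from this experiment.

```python
def metrics_diff_md(old, new, path="metrics.json"):
    """Build a markdown table comparing two metrics dicts."""
    lines = [
        "| Path | Metric | Old | New | Change |",
        "|---|---|---|---|---|",
    ]
    for name in sorted(new):
        old_value, new_value = old.get(name), new[name]
        # A metric that is missing on the old side has no change to report
        old_text = "-" if old_value is None else f"{old_value:.4f}"
        change = "-" if old_value is None else f"{new_value - old_value:.4f}"
        lines.append(f"| {path} | {name} | {old_text} | {new_value:.4f} | {change} |")
    return "\n".join(lines)


# Placeholder values, not measurements from the post
old = {"accuracy": 0.50, "sensitivity": 0.25}
new = {"accuracy": 0.75, "sensitivity": 0.50}
report = metrics_diff_md(old, new)
print(report)
```

In the workflow, the old side corresponds to the metrics.json on master and the new side to the one produced on the experiment branch.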

In the next post, I will cover Jenkins.

https://bit.ly/37BpXiC


* This post was written as part of the Fast Campus refund challenge.
