Last Updated on 2021-05-31 by Clay
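"Hotel review sentiment analysis" is a natural language processing (NLP) task that is well suited to training a classification model. Today is day one of a casual experiment: I found a hotel review dataset project on GitHub, trained a classifier built from fully-connected layers, and evaluated it with accuracy, precision, recall, and F1-score.
Introduction
Leaving aside my "PM25 prediction model", IMDB movie review sentiment analysis (binary classification) was how I got started in machine learning, and it was also my first contact with NLP (natural language processing). Having learned bit by bit since then, I wanted to challenge myself again and see how much I have grown, so I picked another "binary review classification" problem to see how far I can get. It is definitely not because I have been too bored lately.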
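Since this is the first day of messing around, I may ramble a bit. In short, the setup is:
- I found a "hotel review dataset" online whose reviews are already labeled "positive" or "negative"
- I use Accuracy, Precision, Recall, and F1 as my metrics; higher is better
- Today I start with a model thrown together from Fully Connected layers as the Baseline
OK, enough chatter. Let's get started!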
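Hotel Review Dataset
First, the dataset. I found it at https://github.com/Chunshan-Theta/NLPLab; many thanks to the developer who compiled and shared the data.
The positive reviews are in a folder named positiveReviews and the negative reviews are in a folder named negativeReviews. This mirrors the usual layout of the IMDB dataset and is very convenient to work with.
There are 196,337 positive reviews and 145,321 negative reviews.
The data is in Traditional Chinese, and each review is stored in its own file.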
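With the two counts above, a quick sanity check shows how imbalanced the classes are, because that fixes the accuracy a model would get by always guessing the larger class. A minimal sketch (the variable names are mine, not from the dataset repository):

```python
# Review counts as stated above
n_pos = 196_337
n_neg = 145_321

total = n_pos + n_neg

# Accuracy of a trivial classifier that always predicts the majority class
majority_baseline = max(n_pos, n_neg) / total

print(total)                        # 341658
print(f"{majority_baseline:.1%}")   # 57.5%
```

Any accuracy figure for this task is best read against roughly 57.5%, assuming the test split keeps about the same class ratio as the full dataset.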
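Preprocessing
Because I'm lazy and just wanted a simple Baseline, I kept everything minimal. My approach is crude (I didn't even do word segmentation) and goes like this:
- Read all the files with a for-loop
- Split each review into individual characters and map each distinct character to its own Index (a computer can only work with numeric data, not raw text)
- Pad each review with 0 up to a length of 50, and truncate reviews longer than that
- Give each review a Label according to the folder it came from
- Split the data into Training data and Test data with Scikit-Learn's train_test_split() (see my earlier post "Using train_test_split to split data in Scikit-Learn")
- Save the Training data and Test data to separate JSON files (see my earlier post "Using the json package in Python to process, read, and write JSON files")
The code is as follows: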
# -*- coding: utf-8 -*-
import os
import json
from sklearn.model_selection import train_test_split

# Settings
pos_path = 'data/positiveReviews/'
neg_path = 'data/negativeReviews/'
seed = 666
index = 1      # next free character index; 0 is reserved for padding
c2ix = dict()  # character -> index

# Positive data
pos_data = []
for file in os.listdir(pos_path):
    # Read the review and split it into characters
    with open(pos_path + file, 'r', encoding='utf-8') as f:
        text = list(f.read())

    # Map every character to its index, assigning a new one on first sight
    data_temp = []
    for c in text:
        if c not in c2ix:
            c2ix[c] = index
            index += 1
        data_temp.append(c2ix[c])

    # Pad with 0 up to length 50, or truncate to 50
    while len(data_temp) < 50:
        data_temp.append(0)
    if len(data_temp) > 50:
        data_temp = data_temp[:50]

    pos_data.append((data_temp, 1))

# Negative data
neg_data = []
for file in os.listdir(neg_path):
    with open(neg_path + file, 'r', encoding='utf-8') as f:
        text = list(f.read())

    data_temp = []
    for c in text:
        if c not in c2ix:
            c2ix[c] = index
            index += 1
        data_temp.append(c2ix[c])

    while len(data_temp) < 50:
        data_temp.append(0)
    if len(data_temp) > 50:
        data_temp = data_temp[:50]

    neg_data.append((data_temp, 0))

# Split data (70% training, 30% test)
train, test = train_test_split(pos_data + neg_data, random_state=seed, train_size=0.7)

# Save json file
with open('train_data.json', 'w', encoding='utf-8') as f:
    json.dump(train, f)
with open('test_data.json', 'w', encoding='utf-8') as f:
    json.dump(test, f)

print('Finished.')
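To make the encoding step concrete, here is a small stand-alone sketch of the same idea on two toy strings. The helper name encode, the sample sentences, and the shortened max_len of 8 are all made up for illustration; the script above uses a length of 50 and does the work inline.

```python
def encode(text, vocab, max_len=50):
    """Turn a string into a fixed-length list of character indices."""
    ids = []
    for ch in text:
        if ch not in vocab:
            # Index 0 is reserved for padding, so real characters start at 1
            vocab[ch] = len(vocab) + 1
        ids.append(vocab[ch])

    # Truncate long reviews, then pad short ones with 0
    ids = ids[:max_len]
    return ids + [0] * (max_len - len(ids))


vocab = {}
a = encode('房間很乾淨', vocab, max_len=8)
b = encode('房間很髒', vocab, max_len=8)

print(a)  # [1, 2, 3, 4, 5, 0, 0, 0]
print(b)  # [1, 2, 3, 6, 0, 0, 0, 0]
```

The two reviews share the indices for their common characters, and only the new character in the second one gets a fresh index.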
Custom Dataset
Since I'm building this classifier with PyTorch, I need to use PyTorch's Dataset to create my own dataset.
The training and evaluation scripts later in this post import it as reviewDataset from customDataset.py and construct it with train=True or train=False.
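For reference, here is a minimal sketch of what such a Dataset could look like. It is a reconstruction under assumptions, not the original customDataset.py: the class name reviewDataset and the train flag come from the training script, while the file-name parameters, the float dtype, and the (1,)-shaped label are my own choices so that the tensors fit nn.Linear and nn.BCELoss.

```python
import json

import torch
from torch.utils.data import Dataset


class reviewDataset(Dataset):
    """Hypothetical sketch: serves the (indices, label) pairs saved as JSON."""

    def __init__(self, train=True,
                 train_file='train_data.json', test_file='test_data.json'):
        # The file-name parameters are assumptions added for flexibility
        path = train_file if train else test_file
        with open(path, 'r', encoding='utf-8') as f:
            self.samples = json.load(f)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        indices, label = self.samples[idx]
        # Float tensors, since nn.Linear and nn.BCELoss work on floating point
        inputs = torch.tensor(indices, dtype=torch.float32)
        # Shape (1,), so a batch of labels is (N, 1) like the model output
        target = torch.tensor([label], dtype=torch.float32)
        return inputs, target
```

Wrapped in a DataLoader, this yields batches of shape (N, L) for the inputs, where L is the padded review length, and (N, 1) for the labels.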
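Model Definition
Here I wrote a separate model.py to define the model; I like to keep every part of the training pipeline clearly separated. (Of course, some of my friends think my code isn't clear at all XDDD. Everyone has their own Coding Style.)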
# -*- coding: utf-8 -*-
import torch.nn as nn


# Model
class fully_connected_model(nn.Module):
    def __init__(self):
        super(fully_connected_model, self).__init__()
        self.main = nn.Sequential(
            nn.Linear(5000, 2048),
            nn.ReLU(),
            nn.Linear(2048, 1024),
            nn.ReLU(),
            nn.Linear(1024, 256),
            nn.ReLU(),
            nn.Linear(256, 16),
            nn.ReLU(),
            nn.Linear(16, 1),
            nn.Sigmoid()
        )

    def forward(self, input):
        return self.main(input)
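Remember that I padded every review to a length of 50? The input goes straight into Fully Connected Layers that shrink step by step until a single neuron is left, followed by a Sigmoid (see my earlier post "Machine Learning Supplementary Notes: Sigmoid function"). Note that the first layer in the listing above is nn.Linear(5000, 2048), so it expects 5000 input features rather than 50; the input size of the first layer has to match the padded length that is actually used.
The Sigmoid function maps its input into the interval (0, 1), so afterwards we only need to pick a Threshold between 0 and 1 to decide which class the output belongs to.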
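As a small illustration of the Sigmoid-plus-threshold idea, here is a plain-Python sketch. The helper names sigmoid and classify are mine, and 0.5 is the threshold used later in the evaluation script.

```python
import math


def sigmoid(x):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def classify(x, threshold=0.5):
    """Return 1 (positive) if sigmoid(x) exceeds the threshold, else 0."""
    return 1 if sigmoid(x) > threshold else 0


print(sigmoid(0.0))    # 0.5
print(classify(2.0))   # 1
print(classify(-2.0))  # 0
```

Because the comparison is strictly greater than, an output of exactly 0.5 falls into the negative class.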
Training the Model
Here I train for 20 epochs and then save the model.
# -*- coding: utf-8 -*-
import time
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
from model import fully_connected_model
from customDataset import reviewDataset

start_time = time.time()

# GPU
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
print('GPU State:', device)


# Loss
def loss_function(inputs, targets):
    return nn.BCELoss()(inputs, targets)


# Model
model = fully_connected_model().to(device)
print(model)

# Settings
epochs = 20
lr = 0.002
batch_size = 16
optimizer = optim.Adam(model.parameters(), lr=lr)

# DataLoader
train_set = reviewDataset(train=True)
train_loader = data.DataLoader(train_set, batch_size=batch_size, shuffle=True)

# Train
for epoch in range(epochs):
    epoch += 1

    # Named batch rather than data, so the torch.utils.data alias is not shadowed
    for times, batch in enumerate(train_loader):
        times += 1
        inputs = batch[0].to(device)
        labels = batch[1].to(device)

        # Zero gradients
        optimizer.zero_grad()

        # Forward & Backward
        outputs = model(inputs)
        loss = loss_function(outputs, labels)
        loss.backward()
        optimizer.step()

        # Display loss
        if times % 100 == 0 or times == len(train_loader):
            print('[{}/{}, {}/{}] loss: {:.3f}'.format(epoch, epochs, times, len(train_loader), loss.item()))

print('Training Finished.')

# Saved
torch.save(model, 'fc.pth')
print('Model saved.')
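The training script uses nn.BCELoss, which for a predicted probability p and a label y in {0, 1} averages -(y * log(p) + (1 - y) * log(1 - p)) over the batch. Below is a plain-Python sketch of that formula, just to show the numbers involved; binary_cross_entropy is my own helper name, not a PyTorch function, and the probabilities are made up.

```python
import math


def binary_cross_entropy(probs, targets):
    """Mean binary cross-entropy over a batch of predicted probabilities."""
    total = 0.0
    for p, y in zip(probs, targets):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)


# Two confident, correct predictions give a small loss
good = binary_cross_entropy([0.9, 0.2], [1, 0])

# A confident, wrong prediction is punished heavily
bad = binary_cross_entropy([0.1], [1])

print(round(good, 4))  # 0.1643
print(round(bad, 4))   # 2.3026
```

This is the value the training loop prints as loss: it approaches 0 as the predicted probabilities move toward the correct labels.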
Evaluating the Model
Finally, the last step. As mentioned at the beginning, I use Accuracy, Precision, Recall, and F1 to evaluate the model.
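As a reminder of what these four scores mean, here is a small sketch that computes them from the counts of a binary confusion matrix. The function name binary_metrics and the counts are made up for illustration, and the sketch assumes none of the denominators is zero.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of everything predicted positive, how much is right
    recall = tp / (tp + fn)      # of everything truly positive, how much is found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Made-up counts, purely for illustration
acc, prec, rec, f1 = binary_metrics(tp=40, fp=10, fn=5, tn=45)

print(f'{acc:.3f} {prec:.3f} {rec:.3f} {f1:.3f}')  # 0.850 0.800 0.889 0.842
```

All four values are fractions between 0 and 1, which is also what the scikit-learn metric functions return.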
First I load the model, then predict on the Test data:
# -*- coding: utf-8 -*-
import torch
import torch.utils.data as data
from customDataset import reviewDataset
from sklearn import metrics

# GPU
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
print('GPU State:', device)

# Settings
batch_size = 16
threshold = torch.tensor([0.5]).to(device)

# Data
test_set = reviewDataset(train=False)
test_loader = data.DataLoader(test_set, batch_size=batch_size)

# Model
model = torch.load('fc.pth')
model.eval()
print(model)

# Test
pred = []
true = []
with torch.no_grad():
    for batch in test_loader:
        inputs = batch[0].to(device)
        inputs = inputs.view(-1, 5000)
        labels = batch[1].to(device)

        # Probabilities above the threshold count as positive (1.0), the rest as negative (0.0)
        outputs = model(inputs)
        outputs = (outputs > threshold).float()

        pred.extend(outputs.view(-1).tolist())
        true.extend(labels.view(-1).tolist())

# The scores are fractions in [0, 1]; multiply by 100 to print them as percentages
print('Accuracy: {:.1f}%'.format(100 * metrics.accuracy_score(true, pred)))
print('Precision: {:.1f}%'.format(100 * metrics.precision_score(true, pred)))
print('Recall: {:.1f}%'.format(100 * metrics.recall_score(true, pred)))
print('F1: {:.1f}%'.format(100 * metrics.f1_score(true, pred)))
Output:
Accuracy: 75.6%
Precision: 76.6%
Recall: 82.8%
F1: 79.6%
As you can see, the final F1 is 79.6%! That is already higher than I expected. After all, I was just messing around!
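Thinking about it more carefully, though, that's still not high enough. With an accuracy of 75.6%, a model that judges customer messages would get roughly one in four wrong...
...wouldn't my boss kill me?
Anyway, time to keep working on it!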