
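[PyTorch] Hotel Review Sentiment Analysis: A Hands-On Log (0)

Hotel review sentiment analysis is a natural language processing (NLP) task that is well suited to training a classification model. Today is day one of a casual experiment: I found a hotel review dataset in a GitHub project, trained a classifier built from fully connected layers, and evaluated it with accuracy, precision, recall, and F1-score.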


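Preface

Leaving aside the "PM2.5 prediction model", IMDB movie review sentiment analysis (binary classification) was my entry into machine learning and also my first contact with NLP (natural language processing). Having slowly learned my way up to now, I wanted to challenge myself again and see how much I have actually grown, so I picked another "binary review classification" problem to see how far I can get. It is definitely not because I have been too bored lately.

Since this is day one of messing around, I may ramble a bit. In short, the plan is:

  1. I found a "hotel review dataset" online that is already labeled as "positive" or "negative"
  2. I use Accuracy, Precision, Recall, and F1 as my metrics; higher is better
  3. Today I start with a model thrown together from Fully Connected layers as the Baseline

Enough preamble, let's get started!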


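Hotel Review Dataset

First, the dataset. I found it at https://github.com/Chunshan-Theta/NLPLab, and I would like to thank the developer who compiled and shared the data.

Positive reviews are stored in a folder named positiveReviews and negative reviews in a folder named negativeReviews. This mirrors the layout commonly used for IMDB, which makes the data very convenient to work with.

There are 196,337 positive reviews and 145,321 negative reviews.

The reviews are written in Traditional Chinese, and each one is stored in its own file.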


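Preprocessing

Since I am lazy and just want a simple Baseline, I kept everything minimal. My approach is crude (I did not even do word segmentation) and goes roughly like this:

  1. Read all the files with a for-loop
  2. Split each review into individual characters and map each distinct character to its own index (a computer can only process numeric data, not raw text)
  3. Pad each review with 0 up to length 50, and truncate reviews that are longer
  4. Give each review a label according to the folder it came from
  5. Split the data into Training data and Test data with Scikit-Learn's train_test_split() (see my earlier post "Splitting data with train_test_split in Scikit-Learn")
  6. Save the Training data and Test data to separate JSON files (see my earlier post "Using the json package in Python to process, read, and write JSON files")

The code is as follows: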

# -*- coding: utf-8 -*-
import os
import json
from sklearn.model_selection import train_test_split


# Settings
pos_path = 'data/positiveReviews/'
neg_path = 'data/negativeReviews/'
seed = 666
max_len = 50      # every review is padded or truncated to this length
c2ix = dict()     # character -> index; index 0 is reserved for padding


def encode_reviews(path, label):
    """Read every file in `path` and return a list of (indices, label) pairs."""
    encoded = []
    for file in os.listdir(path):
        # A context manager closes each file handle as soon as it has been read
        with open(os.path.join(path, file), 'r', encoding='utf-8') as f:
            text = f.read()

        # Map each character to its index, assigning a new index on first sight
        data_temp = []
        for c in text:
            if c not in c2ix:
                c2ix[c] = len(c2ix) + 1
            data_temp.append(c2ix[c])

        # Truncate to max_len, then pad with 0 up to max_len
        data_temp = data_temp[:max_len]
        data_temp += [0] * (max_len - len(data_temp))

        encoded.append((data_temp, label))

    return encoded


# Positive data (label 1) and negative data (label 0)
pos_data = encode_reviews(pos_path, 1)
neg_data = encode_reviews(neg_path, 0)


# Split data
train, test = train_test_split(pos_data+neg_data, random_state=seed, train_size=0.7)


# Save json file
with open('train_data.json', 'w', encoding='utf-8') as f:
    json.dump(train, f)

with open('test_data.json', 'w', encoding='utf-8') as f:
    json.dump(test, f)

print('Finished.')
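
To make the character-to-index step more concrete, here is a small toy run of the same idea on two made-up strings. The helper name `encode`, the sample sentences, and the shortened `max_len=8` are all assumptions chosen for illustration; the real script uses length 50 and reads the text from files.

```python
def encode(text, c2ix, max_len=50):
    """Map each character to an index, then truncate and pad to max_len."""
    ids = []
    for c in text:
        if c not in c2ix:
            c2ix[c] = len(c2ix) + 1   # index 0 is reserved for padding
        ids.append(c2ix[c])
    ids = ids[:max_len]
    return ids + [0] * (max_len - len(ids))


vocab = {}
# Hypothetical sample reviews, not taken from the dataset
short = encode("房間很乾淨", vocab, max_len=8)
long_ = encode("很乾淨很安靜很舒服很棒", vocab, max_len=8)

print(short)  # 5 characters followed by 3 padding zeros
print(long_)  # 11 characters truncated to the first 8
```

Note that the indices carry no meaning by themselves: they only reflect the order in which characters were first seen, which is one reason this encoding is a weak input for a fully connected model.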

Custom Dataset

Since I am building this classifier with PyTorch, I need PyTorch's Dataset class to wrap my own data.

The class is named reviewDataset and lives in customDataset.py. The training and evaluation scripts below import it from there and choose the split with reviewDataset(train=True) or reviewDataset(train=False).
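
A minimal sketch of such a Dataset might look like the following. It is an illustration under assumptions rather than the exact customDataset.py behind the results in this post: the `train_path` and `test_path` parameters, the float32 dtype, and the label shape `(1,)` are all guesses made so that the samples fit `nn.Linear` and `nn.BCELoss`. Each sample keeps the length produced by the preprocessing script (50), so the model's first layer would need a matching input size.

```python
import json

import torch
from torch.utils.data import Dataset


class reviewDataset(Dataset):
    """Hypothetical sketch: serves the (indices, label) pairs saved as JSON."""

    def __init__(self, train=True,
                 train_path='train_data.json', test_path='test_data.json'):
        # The path arguments are assumptions; the defaults match the
        # file names written by the preprocessing script.
        path = train_path if train else test_path
        with open(path, 'r', encoding='utf-8') as f:
            self.samples = json.load(f)   # list of [indices, label]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        indices, label = self.samples[idx]
        # Linear layers and BCELoss both work on float tensors
        x = torch.tensor(indices, dtype=torch.float32)
        # Shape (1,) so a batch of labels matches the model output (batch, 1)
        y = torch.tensor([label], dtype=torch.float32)
        return x, y
```

With this shape convention, a DataLoader batch has inputs of shape (batch_size, 50) and labels of shape (batch_size, 1).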




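Model Definition

I wrote a separate model.py to define the model. I like to split every part of the training pipeline cleanly like this. (Of course, some of my friends think my code is not clean at all XDDD. Everyone has their own coding style.)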

# -*- coding: utf-8 -*-
import torch.nn as nn


# Model
class fully_connected_model(nn.Module):
    def __init__(self):
        super(fully_connected_model, self).__init__()
        self.main = nn.Sequential(
            nn.Linear(5000, 2048),
            nn.ReLU(),
            nn.Linear(2048, 1024),
            nn.ReLU(),
            nn.Linear(1024, 256),
            nn.ReLU(),
            nn.Linear(256, 16),
            nn.ReLU(),
            nn.Linear(16, 1),
            nn.Sigmoid()
        )

    def forward(self, input):
        return self.main(input)



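Remember that every review was given a length of 50 earlier? The reviews go straight into the Fully Connected layers, which shrink step by step until a single neuron feeds a Sigmoid (see my earlier post "Machine Learning supplementary notes: Sigmoid function"). Note that the first layer in the listing is declared with 5000 input features, not 50, so its input size has to match the number of features the dataset actually delivers per review.

The Sigmoid function maps its input into the range (0, 1), so afterwards we only need to pick a threshold between 0 and 1 to turn the output into a definite predicted class.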
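
As a small illustration of the thresholding idea, the sketch below applies a plain-Python sigmoid to a few raw scores and compares the result against 0.5. The helper names `sigmoid` and `predict` and the example scores are assumptions made for this illustration; in the actual model the sigmoid is the `nn.Sigmoid()` layer.

```python
import math


def sigmoid(x):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict(score, threshold=0.5):
    """Return 1 (positive) if the sigmoid output exceeds the threshold, else 0."""
    return 1 if sigmoid(score) > threshold else 0


for score in (-2.0, 0.0, 2.0):
    print(score, round(sigmoid(score), 3), predict(score))
```

Raising the threshold makes the model more reluctant to predict "positive", which tends to trade recall for precision; 0.5 is simply the conventional starting point.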


Training the Model

Here I train for 20 epochs and then save the model.

# -*- coding: utf-8 -*-
import time
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
from model import fully_connected_model
from customDataset import reviewDataset


start_time = time.time()


# GPU
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
print('GPU State:', device)


# Loss
def loss_function(inputs, targets):
    return nn.BCELoss()(inputs, targets)


# Model
model = fully_connected_model().to(device)
print(model)


# Settings
epochs = 20
lr = 0.002
batch_size = 16
optimizer = optim.Adam(model.parameters(), lr=lr)


# DataLoader
train_set = reviewDataset(train=True)
train_loader = data.DataLoader(train_set, batch_size=batch_size, shuffle=True)


# Train
for epoch in range(1, epochs + 1):
    # `batch` avoids shadowing the imported `data` module
    for times, batch in enumerate(train_loader, start=1):
        inputs = batch[0].to(device)
        labels = batch[1].to(device)

        # Zero gradients
        optimizer.zero_grad()

        # Forward & Backward (the model already lives on `device`)
        outputs = model(inputs)
        loss = loss_function(outputs, labels)
        loss.backward()
        optimizer.step()

        # Display loss
        if times % 100 == 0 or times == len(train_loader):
            print('[{}/{}, {}/{}] loss: {:.3f}'.format(epoch, epochs, times, len(train_loader), loss.item()))


print('Training Finished.')

# Saved
torch.save(model, 'fc.pth')
print('Model saved.')
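
The training script relies on `nn.BCELoss`, which with its default settings averages the binary cross-entropy over the batch. The plain-Python sketch below shows that formula on made-up probabilities; the function name `binary_cross_entropy` and the sample values are assumptions for illustration only.

```python
import math


def binary_cross_entropy(probs, targets):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over the batch."""
    total = 0.0
    for p, y in zip(probs, targets):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)


# Confident and correct predictions give a small loss
good = binary_cross_entropy([0.9, 0.1], [1, 0])
# Confident and wrong predictions are penalised heavily
bad = binary_cross_entropy([0.1, 0.9], [1, 0])

print(round(good, 3), round(bad, 3))
```

This is why the printed loss is a useful progress signal during training: it falls as the sigmoid outputs move toward the correct labels.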




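Evaluating the Model

Finally, the last step. As mentioned at the start, I use Accuracy, Precision, Recall, and F1 to judge how good the model is.

First I load the model, then predict on the Test data: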

# -*- coding: utf-8 -*-
import torch
import torch.utils.data as data
from customDataset import reviewDataset
from sklearn import metrics


# GPU
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
print('GPU State:', device)

# Settings
batch_size = 16
threshold = torch.tensor([0.5]).to(device)

# Data
test_set = reviewDataset(train=False)
test_loader = data.DataLoader(test_set, batch_size=batch_size)

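# Model
# torch.save(model, ...) pickled the whole model object, so model.py must be importable here.
# map_location lets a model trained on GPU load on a CPU-only machine.
# Recent PyTorch releases default torch.load to weights_only=True; loading a
# fully pickled model there requires passing weights_only=False.
model = torch.load('fc.pth', map_location=device)
model.eval()
print(model)

# Test
pred = []
true = []

with torch.no_grad():
    # `batch` avoids shadowing the imported `data` module
    for batch in test_loader:
        inputs = batch[0].to(device)
        inputs = inputs.view(-1, 5000)

        labels = batch[1].to(device)

        outputs = model(inputs)
        outputs = (outputs > threshold).float()

        # Flatten so both lists hold plain 0.0 / 1.0 values
        pred.extend(outputs.view(-1).tolist())
        true.extend(labels.view(-1).tolist())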

# The sklearn scores are fractions in [0, 1], so multiply by 100 before printing them as percentages
print('Accuracy: {:.1f}%'.format(100 * metrics.accuracy_score(true, pred)))
print('Precision: {:.1f}%'.format(100 * metrics.precision_score(true, pred)))
print('Recall: {:.1f}%'.format(100 * metrics.recall_score(true, pred)))
print('F1: {:.1f}%'.format(100 * metrics.f1_score(true, pred)))



Output:

Accuracy: 75.6%
Precision: 76.6%
Recall: 82.8%
F1: 79.6%
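
For readers who want to see what these four numbers mean, here is a small plain-Python sketch that derives them from the counts of true/false positives and negatives. The function name `binary_metrics` and the toy label lists are assumptions for illustration; the scores reported in this post come from `sklearn.metrics`.

```python
def binary_metrics(true, pred):
    """Return (accuracy, precision, recall, f1) for 0/1 labels."""
    tp = sum(1 for t, p in zip(true, pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(true, pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(true, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true, pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1


# Toy labels, not taken from the dataset
true = [1, 1, 1, 0, 0, 1, 0, 1]
pred = [1, 0, 1, 0, 1, 1, 0, 1]

acc, prec, rec, f1 = binary_metrics(true, pred)
# The values are fractions; the '%' format spec multiplies by 100 for display
print('Accuracy: {:.1%}  Precision: {:.1%}  Recall: {:.1%}  F1: {:.1%}'.format(acc, prec, rec, f1))
```

Because the dataset has more positive than negative reviews, accuracy alone can look better than the model deserves, which is why precision, recall, and F1 are reported alongside it.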

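As you can see, the final F1 is 0.796 (79.6%)! That is honestly higher than I expected, given how haphazardly I put this together.

Thinking about it more, though, this is still not good enough: if this model had to judge customer comments, it would misclassify roughly one in four (accuracy is only about 75.6%)…

…wouldn't my boss kill me?

Anyway, time to keep working on it!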
