[PyTorch] Hotel Review Sentiment Analysis: Hands-On Notes (0)

Last Updated on 2021-05-31 by Clay

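Hotel review sentiment analysis is a natural language processing (NLP) task that is well suited to training a classification model. Today is the first day of a casual experiment: I found a hotel review dataset in a GitHub project, trained a classifier built from fully-connected layers, and evaluated it with accuracy, precision, recall, and F1-score.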

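Preface

Setting aside my "PM2.5 prediction model", IMDB movie review sentiment analysis (binary classification) was where I started in machine learning, and it was also my first contact with NLP (natural language processing). Having learned bit by bit up to now, I wanted to test how much I have actually grown, so I picked another "binary review classification" problem to see how far I can get. ~~It is definitely not because I have been too bored lately.~~

Since this is day one of messing around, I may ramble a bit. In any case, the plan is:

I found a "hotel review dataset" online, already labeled as "positive" or "negative"
I use Accuracy, Precision, Recall, and F1 as my metrics; higher is better
Today I start with a model thrown together from Fully Connected layers as the Baseline

Enough preamble, let's get started!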

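Hotel Review Dataset

First, the dataset. I found it at https://github.com/Chunshan-Theta/NLPLab; thanks to the developer who organized and shared the data.

Positive reviews are stored in a folder named positiveReviews and negative reviews in a folder named negativeReviews. This mirrors the layout commonly used for IMDB, so it is very convenient to work with.

There are 196,337 positive reviews and 145,321 negative reviews.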
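As a quick sanity check on those counts, the sketch below computes the class balance and the accuracy that a trivial "always predict positive" classifier would reach. The variable names are my own; only the two counts come from the text above.

```python
# Review counts quoted above
n_pos = 196_337
n_neg = 145_321

total = n_pos + n_neg
pos_share = n_pos / total

# Accuracy of a classifier that always answers with the larger class
majority_baseline = max(n_pos, n_neg) / total

print(total)
print(round(pos_share, 3))
print(round(majority_baseline, 3))
```

Any trained model should beat that majority-class figure to be worth keeping.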

The data is in Traditional Chinese, and each review is stored in its own file.

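Preprocessing

Because I'm lazy and just want a simple Baseline, I kept everything minimal. My approach is simple and crude, with no word segmentation at all. Roughly:

Read all the files with a for-loop
Split each review into individual characters and map each distinct character to its own index (computers can only process numeric data, not text)
Pad each review with 0 up to length 50, and truncate reviews that are longer
Assign each review a label according to the folder it came from
Split into Training data and Test data with Scikit-Learn's train_test_split() (see my earlier post "Using train_test_split to split data in Scikit-Learn")
Save Training data and Test data to separate JSON files (see my earlier post "Using the json package in Python to process, read, and write JSON files")

The code is as follows: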

# -*- coding: utf-8 -*-
import os
import json
from sklearn.model_selection import train_test_split


# Settings
pos_path = 'data/positiveReviews/'
neg_path = 'data/negativeReviews/'
seed = 666
max_len = 50
c2ix = dict()  # character -> index; 0 is reserved for padding


def load_reviews(path, label):
    """Read every file under path and return (padded index list, label) pairs."""
    data = []
    for file in os.listdir(path):
        # Close each file after reading instead of leaving the handle open
        with open(os.path.join(path, file), 'r', encoding='utf-8') as f:
            text = f.read()

        # Map each character to an index, numbering new characters from 1 upward
        data_temp = [c2ix.setdefault(c, len(c2ix) + 1) for c in text]

        # Truncate to max_len, then pad with 0 up to max_len
        data_temp = data_temp[:max_len]
        data_temp += [0] * (max_len - len(data_temp))

        data.append((data_temp, label))

    return data


# Positive data (label 1) and negative data (label 0)
pos_data = load_reviews(pos_path, 1)
neg_data = load_reviews(neg_path, 0)


# Split data
train, test = train_test_split(pos_data + neg_data, random_state=seed, train_size=0.7)


# Save json file
with open('train_data.json', 'w', encoding='utf-8') as f:
    json.dump(train, f)

with open('test_data.json', 'w', encoding='utf-8') as f:
    json.dump(test, f)

print('Finished.')
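To make the character-index encoding and padding concrete, here is a minimal sketch on two made-up strings. The helper name `encode` and the short `max_len` are assumptions for illustration only; the script above uses a length of 50.

```python
def encode(text, vocab, max_len):
    """Map characters to indices (new ones numbered from 1), then truncate or pad with 0."""
    ids = [vocab.setdefault(ch, len(vocab) + 1) for ch in text]
    ids = ids[:max_len]
    return ids + [0] * (max_len - len(ids))


vocab = {}
short_review = encode("很好", vocab, max_len=4)
long_review = encode("很好很乾淨的房間", vocab, max_len=4)

print(short_review)  # [1, 2, 0, 0]
print(long_review)   # [1, 2, 1, 3]
print(vocab)
```

The short review is padded with zeros, the long one is cut off after four characters, and a repeated character reuses its index.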

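Custom Dataset

Since I'm building this classifier with PyTorch, I need PyTorch's Dataset class to create my own dataset. The training and evaluation scripts import it as reviewDataset from customDataset.py.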
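For readers who have not written a Dataset before, here is a minimal sketch of what such a class could look like. It is an assumption, not the author's customDataset.py: the class name `ReviewDatasetSketch` and its `json_path` argument are hypothetical (the original selects the split with `train=True` or `train=False`), and it does not cover how the inputs reach the 5000 features the model expects.

```python
import json
import os
import tempfile

import torch
from torch.utils.data import Dataset


class ReviewDatasetSketch(Dataset):
    """Hypothetical stand-in for reviewDataset; the original class may differ."""

    def __init__(self, json_path):
        with open(json_path, 'r', encoding='utf-8') as f:
            # Each record is [list of character indices, label]
            self.records = json.load(f)

    def __len__(self):
        return len(self.records)

    def __getitem__(self, i):
        indices, label = self.records[i]
        # Float tensors, with the label shaped (1,) to match a single sigmoid output
        x = torch.tensor(indices, dtype=torch.float32)
        y = torch.tensor([label], dtype=torch.float32)
        return x, y


# Tiny demo on a temporary file instead of the real train_data.json
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.json')
    with open(path, 'w', encoding='utf-8') as f:
        json.dump([[[1, 2, 0], 1], [[3, 0, 0], 0]], f)
    ds = ReviewDatasetSketch(path)

x0, y0 = ds[0]
print(len(ds), x0.tolist(), y0.tolist())
```

A DataLoader only needs `__len__` and `__getitem__` from the dataset, which is why the class can stay this small.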

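Model Definition

I wrote a separate model.py to define the model; I like splitting every part of the training pipeline out cleanly like this. (Of course, some of my friends think my code is not clear at all XDDD. Everyone has their own coding style.)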

# -*- coding: utf-8 -*-
import torch.nn as nn


# Model
class fully_connected_model(nn.Module):
    def __init__(self):
        super(fully_connected_model, self).__init__()
        self.main = nn.Sequential(
            nn.Linear(5000, 2048),
            nn.ReLU(),
            nn.Linear(2048, 1024),
            nn.ReLU(),
            nn.Linear(1024, 256),
            nn.ReLU(),
            nn.Linear(256, 16),
            nn.ReLU(),
            nn.Linear(16, 1),
            nn.Sigmoid()
        )

    def forward(self, input):
        return self.main(input)

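Remember that every review was fixed at length 50 earlier? The model feeds its input straight into Fully Connected layers and narrows step by step down to a single neuron followed by a Sigmoid (see my earlier post "Machine Learning supplementary notes: Sigmoid function"). Note that the first layer in the code is nn.Linear(5000, 2048), so it expects 5000 features per review rather than 50; the step that turns a length-50 index sequence into 5000 features is not shown in this post.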
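To get a feel for the size of this network, the sketch below counts its trainable parameters from the layer widths in model.py. The variable names are my own; the widths are copied from the nn.Linear calls.

```python
# Layer widths copied from the nn.Linear calls in model.py
layer_sizes = [5000, 2048, 1024, 256, 16, 1]

params_per_layer = []
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    # A linear layer has an n_in x n_out weight matrix plus n_out biases
    params_per_layer.append(n_in * n_out + n_out)

total_params = sum(params_per_layer)
print(params_per_layer)
print(total_params)
```

The first layer alone holds about 10.2 million of the roughly 12.6 million parameters, so the input width dominates the model size.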

The Sigmoid function maps its input to a value between 0 and 1, so afterwards we only need to pick a threshold between 0 and 1 to decide which class each prediction belongs to.
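A small pure-Python sketch of that idea, assuming the same 0.5 threshold and the same strict "greater than" comparison used in the evaluation script; the function names are illustrative.

```python
import math


def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(z, threshold=0.5):
    """Return 1 (positive) when the sigmoid output exceeds the threshold, else 0."""
    return 1 if sigmoid(z) > threshold else 0


scores = [-2.0, 0.0, 3.0]
probs = [sigmoid(z) for z in scores]
labels = [predict_label(z) for z in scores]

print([round(p, 3) for p in probs])
print(labels)  # [0, 0, 1]
```

A score of exactly 0 maps to 0.5, which is not above the threshold, so it falls on the negative side.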

Training the Model

Here I train for 20 epochs and then save the model.

# -*- coding: utf-8 -*-
import time
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
from model import fully_connected_model
from customDataset import reviewDataset


start_time = time.time()


# GPU
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
print('GPU State:', device)


# Loss
def loss_function(inputs, targets):
    return nn.BCELoss()(inputs, targets)


# Model
model = fully_connected_model().to(device)
print(model)


# Settings
epochs = 20
lr = 0.002
batch_size = 16
optimizer = optim.Adam(model.parameters(), lr=lr)


# DataLoader
train_set = reviewDataset(train=True)
train_loader = data.DataLoader(train_set, batch_size=batch_size, shuffle=True)


# Train
for epoch in range(1, epochs + 1):
    # `batch` avoids shadowing the imported torch.utils.data module (`data`)
    for times, batch in enumerate(train_loader, start=1):
        inputs = batch[0].to(device)
        labels = batch[1].to(device)

        # Zero gradients
        optimizer.zero_grad()

        # Forward & Backward (the model is already on `device`)
        outputs = model(inputs)
        loss = loss_function(outputs, labels)
        loss.backward()
        optimizer.step()

        # Display loss
        if times % 100 == 0 or times == len(train_loader):
            print('[{}/{}, {}/{}] loss: {:.3f}'.format(epoch, epochs, times, len(train_loader), loss.item()))


print('Training Finished.')

# Saved
torch.save(model, 'fc.pth')
print('Model saved.')
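The loss_function in the training script wraps nn.BCELoss, which with its default settings averages the binary cross-entropy over the batch. The sketch below computes the same quantity by hand in plain Python on made-up probabilities, just to show what the number means; the function name is my own.

```python
import math


def binary_cross_entropy(probs, targets):
    """Mean of -(t*log(p) + (1-t)*log(1-p)) over all examples."""
    total = 0.0
    for p, t in zip(probs, targets):
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(probs)


# Targets: first review positive, second negative
loss_confident_right = binary_cross_entropy([0.9, 0.1], [1, 0])
loss_confident_wrong = binary_cross_entropy([0.1, 0.9], [1, 0])

print(round(loss_confident_right, 3))
print(round(loss_confident_wrong, 3))
```

Confident correct predictions give a loss near 0, while confident wrong ones are penalized heavily, which is what pushes the Sigmoid output toward the right side of the threshold.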

Evaluating the Model

Finally, the last step. As mentioned at the start, I use Accuracy, Precision, Recall, and F1 to judge how good the model is.
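As a reminder of what these four metrics measure, here is a minimal pure-Python sketch that derives them from the confusion counts on a made-up set of labels. The function name is illustrative, and it assumes there is at least one predicted positive and one actual positive so that no division by zero occurs.

```python
def binary_metrics(true, pred):
    """Accuracy, precision, recall and F1 for 0/1 labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(true, pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(true, pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(true, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true, pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(true)
    precision = tp / (tp + fp)  # of the predicted positives, how many were right
    recall = tp / (tp + fn)     # of the actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


true = [1, 1, 1, 0, 0, 1, 0, 1]
pred = [1, 0, 1, 0, 1, 1, 0, 1]
accuracy, precision, recall, f1 = binary_metrics(true, pred)

print(accuracy, precision, recall, round(f1, 3))
```

All four values are fractions between 0 and 1, which matters when formatting them for display.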

First I load the model, then predict on the Test data:

# -*- coding: utf-8 -*-
import torch
import torch.utils.data as data
from customDataset import reviewDataset
from sklearn import metrics


# GPU
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
print('GPU State:', device)

# Settings
batch_size = 16
threshold = torch.tensor([0.5]).to(device)

# Data
test_set = reviewDataset(train=False)
test_loader = data.DataLoader(test_set, batch_size=batch_size)

# Model
# torch.save(model, ...) pickled the whole model, so model.py must be importable here.
# map_location lets a GPU-trained model load on a CPU-only machine; recent PyTorch
# releases may additionally need weights_only=False to unpickle a full model.
model = torch.load('fc.pth', map_location=device)
model.eval()
print(model)

# Test
pred = []
true = []

with torch.no_grad():
    for batch in test_loader:
        inputs = batch[0].to(device)
        inputs = inputs.view(-1, 5000)

        labels = batch[1].to(device)

        # Sigmoid outputs above the threshold count as positive (1.0), the rest as negative (0.0)
        outputs = model(inputs)
        outputs = (outputs > threshold).float()

        for n in range(len(outputs)):
            pred.append(outputs[n].tolist()[0])
            true.append(labels[n].tolist())

# The scores are fractions in [0, 1], so they are printed without a % sign
print('Accuracy: {:.3f}'.format(metrics.accuracy_score(true, pred)))
print('Precision: {:.3f}'.format(metrics.precision_score(true, pred)))
print('Recall: {:.3f}'.format(metrics.recall_score(true, pred)))
print('F1: {:.3f}'.format(metrics.f1_score(true, pred)))

Output:

Accuracy: 0.756
Precision: 0.766
Recall: 0.828
F1: 0.796
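The four scores are fractions between 0 and 1, not percentages. A short sketch of how they read as percentages, plus a check that the reported F1 agrees with the reported precision and recall; the dictionary is just the numbers above typed in by hand.

```python
reported = {'Accuracy': 0.756, 'Precision': 0.766, 'Recall': 0.828, 'F1': 0.796}

# '{:.1%}' multiplies by 100 and appends the % sign
as_percent = {name: '{:.1%}'.format(value) for name, value in reported.items()}
print(as_percent)

# F1 is the harmonic mean of precision and recall
p, r = reported['Precision'], reported['Recall']
f1_check = 2 * p * r / (p + r)
print(round(f1_check, 3))
```

The recomputed F1 rounds to the same 0.796, so the reported numbers are internally consistent.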

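The final F1 came out at 0.796 (about 79.6%)! That is honestly higher than I expected. ~~After all, I was just winging it!~~

Thinking about it more, though, this is still not high enough: if I built a model to judge customer comments and it misclassified roughly one comment in four (accuracy 0.756)...

......wouldn't my boss kill me?

Anyway, time to keep working on it!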

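3 thoughts on “[PyTorch] Hotel Review Sentiment Analysis: Hands-On Notes (0)”

Pingback: [PyTorch] Hotel Review Sentiment Analysis: Hands-On Notes (1) - Clay-Technology World
Anonymous visitor 2020-09-13 at 14:30

The evaluation output is wrong: once a % is appended, the numbers should be multiplied by 100.

Accuracy: 0.756%
Precision: 0.766%
Recall: 0.828%
F1: 0.796%

ccs96307 2020-09-13 at 14:36

Thanks, the values should indeed be multiplied by 100, or the % should be removed.
Thanks for letting me know.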