[PyTorch] Hotel Review Sentiment Analysis: A Hands-On Log (1)

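This post continues my earlier work on hotel review classification; for the background, see "[PyTorch] Hotel Review Sentiment Analysis: A Hands-On Log (0)". I have made one simple improvement to the original classifier. What changed? As I mentioned in the previous post, I had naively split every review into individual characters and assigned each distinct character its own index. In short, the text was simply turned into a sequence of integers so that it could be used for machine learning.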
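
As a reminder of what that character-level indexing looks like, here is a minimal sketch. The function names, the toy reviews, and the choice to start indices at 0 are assumptions made for illustration; the previous post's actual code may differ.

```python
def build_char_index(texts):
    """Assign each distinct character an integer index, in order of first appearance."""
    char_to_index = {}
    for text in texts:
        for ch in text:
            if ch not in char_to_index:
                char_to_index[ch] = len(char_to_index)
    return char_to_index


def encode(text, char_to_index):
    """Replace every character with its index."""
    return [char_to_index[ch] for ch in text]


# Hypothetical toy reviews, not taken from the dataset
reviews = ["房間乾淨", "房間很吵"]
char_to_index = build_char_index(reviews)
encoded = [encode(r, char_to_index) for r in reviews]
print(encoded)
```

The indices carry no notion of similarity between characters, which is the gap a vector representation is meant to fill.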

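That approach is actually quite rare in natural language processing (NLP) tasks; at least I have seldom seen it since I started learning. What I usually do instead is use the Gensim package in Python to train a Word2Vec model, which maps text into a vector space so that every word has its own vector.

For details on using Gensim, see my post "Using Gensim in Python to Convert Text into Vectors".

Although this is only a small change, the results were surprisingly good, which once again showed me how effective Word2Vec can be.

Below is a brief walkthrough of what I did.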

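Hotel Review Dataset

As usual, let me start by introducing the dataset again.

I found this dataset at https://github.com/Chunshan-Theta/NLPLab; many thanks to the developer who organized and shared it.

Positive reviews are stored in a folder named positiveReviews and negative reviews in a folder named negativeReviews. This layout is similar to the common IMDB arrangement, which makes the data very convenient to use.

There are 196,337 positive reviews and 145,321 negative reviews.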
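
Since the two classes are not the same size, it can help to know what a trivial classifier would score. The short calculation below uses only the two counts quoted above; the "always predict positive" baseline is an illustrative assumption of mine, not something measured in this post.

```python
n_pos = 196_337  # positive reviews
n_neg = 145_321  # negative reviews

total = n_pos + n_neg
pos_ratio = n_pos / total

# Accuracy of a trivial classifier that always predicts the larger class
majority_baseline = max(n_pos, n_neg) / total

print(total)
print(round(majority_baseline, 3))
```

Any accuracy reported later should be read against this baseline rather than against 50%.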

The data is in Traditional Chinese, and each review is stored in its own file.

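Preprocessing

The preprocessing this time is completely different from last time. I first trained a Word2Vec model on Chinese Wikipedia data; the steps are again described in "Using Gensim in Python to Convert Text into Vectors".

Before using the model, however, I also need to compute what I call the "average vector":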

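import json
import numpy as np
from gensim.models import word2vec

model = word2vec.Word2Vec.load('word2vec.model')

# Mean of every word vector in the vocabulary. model.wv.vectors is the
# (vocab_size, vector_size) embedding matrix, so this equals summing the
# vector of each word and dividing by the vocabulary size.
vec_avg = np.mean(model.wv.vectors, axis=0)

with open('vec_avg.json', 'w', encoding='utf-8') as f:
    json.dump(vec_avg.tolist(), f)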

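There is a reason for doing this: what happens if a word in the hotel reviews never appeared in the Wikipedia corpus? Looking it up in the Word2Vec model would raise an error. So I compute the average vector and use it specifically as the stand-in for words the model has never seen.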
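
The fallback idea can be sketched without a trained model. In the snippet below, the `embeddings` dictionary, its 4-dimensional vectors, and the `lookup` helper are all hypothetical stand-ins for the real Word2Vec model.

```python
import numpy as np

# Hypothetical 4-dimensional embeddings standing in for the Word2Vec model
embeddings = {
    "飯店": np.array([1.0, 0.0, 2.0, 0.0]),
    "乾淨": np.array([0.0, 1.0, 0.0, 2.0]),
    "服務": np.array([2.0, 2.0, 1.0, 1.0]),
}

# Element-wise mean over every known word vector
vec_avg = np.mean(list(embeddings.values()), axis=0)


def lookup(word):
    """Return the word's vector, or the average vector for unseen words."""
    return embeddings.get(word, vec_avg)


print(lookup("乾淨"))  # known word: its own vector
print(lookup("蟑螂"))  # unseen word: the average vector
```

An unseen word therefore lands at the centre of the embedding space instead of raising an error, and stays distinct from the all-zero vector used for padding.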

Then I split the data into training data and test data again:

# -*- coding: utf-8 -*-
import os
import json
import jieba
from sklearn.model_selection import train_test_split
from gensim.models import word2vec


# Settings
pos_path = 'data/positiveReviews/'
neg_path = 'data/negativeReviews/'
seed = 666
max_len = 50    # number of tokens kept per review
vec_size = 100  # must match the vector size of the Word2Vec model
w2v_model = word2vec.Word2Vec.load('word2vec.model')
vec_padding = [0 for _ in range(vec_size)]
with open('vec_avg.json', 'r', encoding='utf-8') as f:
    vec_avg = json.load(f)


def load_reviews(path, label):
    """Convert every review file under path into (max_len word vectors, label)."""
    samples = []
    for file in os.listdir(path):
        with open(os.path.join(path, file), 'r', encoding='utf-8') as f:
            text = f.read()
        words = [w for w in jieba.lcut(text) if w != ' ']

        data_temp = []
        # Keep at most max_len tokens per review
        for w in words[:max_len]:
            try:
                # Look vectors up through .wv; indexing the model directly is deprecated
                word_vec = w2v_model.wv[w].tolist()
            except KeyError:
                # Word not in the Word2Vec vocabulary: use the average vector
                word_vec = vec_avg

            data_temp.append(word_vec)

        # Pad short reviews with zero vectors up to max_len
        while len(data_temp) < max_len:
            data_temp.append(vec_padding)

        samples.append((data_temp, label))

    return samples


# Positive data (label 1) and negative data (label 0)
pos_data = load_reviews(pos_path, 1)
neg_data = load_reviews(neg_path, 0)


# Split data
train, test = train_test_split(pos_data+neg_data, random_state=seed, train_size=0.7)


# Save json file
with open('train_data.json', 'w', encoding='utf-8') as f:
    json.dump(train, f)

with open('test_data.json', 'w', encoding='utf-8') as f:
    json.dump(test, f)

print(len(train))
print(len(test))
print('Finished.')
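
The script above forces every review to exactly 50 vectors by truncating long reviews and padding short ones with zero vectors. Here is that step in isolation on a toy case; the `pad_or_truncate` helper and the tiny sizes (3 tokens, 2 dimensions) are assumptions chosen so the result is easy to inspect.

```python
def pad_or_truncate(vectors, max_len, pad_vector):
    """Return exactly max_len vectors: cut long sequences, pad short ones."""
    fixed = list(vectors[:max_len])
    fixed.extend([pad_vector] * (max_len - len(fixed)))
    return fixed


# Toy setting: 3 tokens per review and 2-dimensional vectors
# (the post itself uses 50 tokens and 100-dimensional vectors)
pad = [0.0, 0.0]
short = [[1.0, 1.0]]
long = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]

print(pad_or_truncate(short, 3, pad))
print(pad_or_truncate(long, 3, pad))
```

A fixed length is what allows the reviews to be stacked into a single tensor and later flattened to a constant input size.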

Custom Dataset

As before, since I am building this classifier with PyTorch, I need to use PyTorch's Dataset class to create my own dataset.

Let's go straight to the code:

# -*- coding: utf-8 -*-
import json
import torch
from torch.utils.data.dataset import Dataset


# Data
class reviewDataset(Dataset):
    def __init__(self, train):
        self.train = train
        self.data = []
        self.label = []

        # Train
        if self.train:
            with open('train_data.json', 'r', encoding='utf-8') as f:
                self.train_data = json.load(f)

            for train_data in self.train_data:
                self.data.append(train_data[0])
                self.label.append(train_data[1])

        # Test
        else:
            with open('test_data.json', 'r', encoding='utf-8') as f:
                self.test_data = json.load(f)

            for test_data in self.test_data:
                self.data.append(test_data[0])
                self.label.append(test_data[1])

        # List convert to tensor
        self.data = torch.tensor(self.data).float()
        self.label = torch.tensor(self.label).float()

    def __getitem__(self, index):
        return (self.data[index], self.label[index])

    def __len__(self):
        return self.label.shape[0]

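The dataset definition is the same as last time: even though the dimensions of the data have changed, the data and label it returns do not need to change.

Model Definition

The model definition does differ from last time. Because the dimensions are different, the input size has to change: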

# -*- coding: utf-8 -*-
import torch.nn as nn


# Model
class fully_connected_model(nn.Module):
    def __init__(self):
        super(fully_connected_model, self).__init__()
        self.main = nn.Sequential(
            nn.Linear(5000, 2048),
            nn.ReLU(),
            nn.Linear(2048, 1024),
            nn.ReLU(),
            nn.Linear(1024, 256),
            nn.ReLU(),
            nn.Linear(256, 16),
            nn.ReLU(),
            nn.Linear(16, 1),
            nn.Sigmoid()
        )

    def forward(self, input):
        return self.main(input)
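
The first layer's input size of 5000 comes from flattening 50 tokens of 100 dimensions each. The short calculation below reproduces that number and also counts the trainable parameters implied by the layer sizes above; it is plain arithmetic on those sizes, not output taken from PyTorch.

```python
# Layer sizes of the fully connected model: 50 tokens x 100 dimensions in, 1 probability out
seq_len, vec_size = 50, 100
layer_sizes = [seq_len * vec_size, 2048, 1024, 256, 16, 1]

# Each nn.Linear(n_in, n_out) holds n_in * n_out weights plus n_out biases
params_per_layer = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]
total_params = sum(params_per_layer)

print(layer_sizes[0])
print(total_params)
```

Most of the roughly 12.6 million parameters sit in the first layer, a direct consequence of flattening the whole review into one long vector.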

Training

# -*- coding: utf-8 -*-
import time
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
from model import fully_connected_model
from customDataset import reviewDataset


start_time = time.time()


# GPU
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
print('GPU State:', device)


# Loss
def loss_function(inputs, targets):
    return nn.BCELoss()(inputs, targets)


# Model
model = fully_connected_model().to(device)
print(model)


# Settings
epochs = 20
lr = 0.002
batch_size = 16
optimizer = optim.Adam(model.parameters(), lr=lr)


# DataLoader
train_set = reviewDataset(train=True)
train_loader = data.DataLoader(train_set, batch_size=batch_size, shuffle=True)


# Train
for epoch in range(1, epochs + 1):
    # "batch" avoids shadowing the torch.utils.data module imported as "data"
    for times, batch in enumerate(train_loader, start=1):
        # Flatten each (50, 100) review matrix into a 5000-dimensional vector
        inputs = batch[0].to(device).view(-1, 5000)
        labels = batch[1].to(device)

        # Zero gradients
        optimizer.zero_grad()

        # Forward & Backward
        # squeeze(1) turns the (batch, 1) output into (batch,) so it matches the labels,
        # which BCELoss requires in recent PyTorch versions
        outputs = model(inputs).squeeze(1)
        loss = loss_function(outputs, labels)
        loss.backward()
        optimizer.step()

        # Display loss
        if times % 100 == 0 or times == len(train_loader):
            print('[{}/{}, {}/{}] loss: {:.3f}'.format(epoch, epochs, times, len(train_loader), loss.item()))


print('Training Finished. Elapsed: {:.1f}s'.format(time.time() - start_time))

# Saved
torch.save(model, 'fc.pth')
print('Model saved.')
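
The loss in the training script is binary cross-entropy, averaged over the batch. For readers who want to see what that number measures, here is a pure-Python sketch of the formula; the probabilities and labels are made up, and the sketch omits the numerical safeguards a real implementation needs when a probability is exactly 0 or 1.

```python
import math


def binary_cross_entropy(probs, targets):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for p, y in zip(probs, targets)
    ]
    return sum(losses) / len(losses)


# Hypothetical sigmoid outputs and labels (1 = positive, 0 = negative)
confident = binary_cross_entropy([0.9, 0.1], [1, 0])
unsure = binary_cross_entropy([0.5, 0.5], [1, 0])
print(confident < unsure)
```

A model that outputs 0.5 for everything scores log(2), about 0.693, so a printed training loss well below that value means the model is doing better than guessing.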

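The training procedure is the same as before, except that the data fed to the model has to be reshaped with view(), since the input size the model accepts is different from last time.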
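
The reshape itself is only a change of shape: each (50, 100) review matrix becomes one 5000-element row. The sketch below shows this with NumPy's `reshape` as a stand-in for the tensor `view()` call, so that it runs without PyTorch; the dummy all-zero batch is an assumption for illustration.

```python
import numpy as np

batch_size, seq_len, vec_size = 16, 50, 100

# A dummy batch shaped like the DataLoader output: (batch, tokens, vector dimensions)
batch = np.zeros((batch_size, seq_len, vec_size), dtype=np.float32)

# -1 lets the batch dimension be inferred, which also covers a smaller final batch
flat = batch.reshape(-1, seq_len * vec_size)

print(batch.shape)
print(flat.shape)
```

Using -1 for the first dimension matters because the last batch of an epoch may hold fewer than 16 reviews.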

Evaluating the Model

We have finally reached the last step. Last time the F1 score was 0.796; how much improvement does Word2Vec bring?

# -*- coding: utf-8 -*-
import torch
import torch.utils.data as data
from customDataset import reviewDataset
from sklearn import metrics


# GPU
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
print('GPU State:', device)

# Settings
batch_size = 16
threshold = 0.5

# Data
test_set = reviewDataset(train=False)
test_loader = data.DataLoader(test_set, batch_size=batch_size)

# Model
# torch.save(model, ...) pickled the whole model object, so model.py must be importable here.
# Recent PyTorch releases may also require torch.load(..., weights_only=False) for such files.
model = torch.load('fc.pth', map_location=device)
model.eval()
print(model)

# Test
pred = []
true = []

with torch.no_grad():
    for batch in test_loader:
        # Flatten each (50, 100) review matrix into a 5000-dimensional vector
        inputs = batch[0].to(device).view(-1, 5000)
        labels = batch[1]

        outputs = model(inputs)
        # Probabilities above the threshold count as positive (1.0), the rest as negative (0.0)
        outputs = (outputs > threshold).float()

        pred.extend(outputs.view(-1).tolist())
        true.extend(labels.tolist())

# The scores are fractions between 0 and 1, not percentages
print('Accuracy: {:.3f}'.format(metrics.accuracy_score(true, pred)))
print('Precision: {:.3f}'.format(metrics.precision_score(true, pred)))
print('Recall: {:.3f}'.format(metrics.recall_score(true, pred)))
print('F1: {:.3f}'.format(metrics.f1_score(true, pred)))

Output:

Accuracy: 0.940
Precision: 0.948
Recall: 0.947
F1: 0.947
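
To make these scores concrete, here is a small pure-Python sketch of how precision, recall, and F1 follow from thresholded predictions. The probabilities and labels are invented for illustration, and the helper assumes at least one predicted and one actual positive. The last two lines check that the reported F1 is consistent with the reported precision and recall.

```python
def precision_recall_f1(true, pred):
    """Binary precision, recall and F1 with label 1 as the positive class."""
    tp = sum(1 for t, p in zip(true, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(true, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true, pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical sigmoid outputs, thresholded at 0.5 as in the evaluation script
probs = [0.9, 0.8, 0.4, 0.7, 0.2, 0.1]
true = [1, 1, 1, 0, 0, 0]
pred = [1 if p > 0.5 else 0 for p in probs]
print(precision_recall_f1(true, pred))

# Consistency check: F1 is the harmonic mean of precision and recall
reported_f1 = 2 * 0.948 * 0.947 / (0.948 + 0.947)
print(round(reported_f1, 3))
```

All of these values are fractions between 0 and 1, so 0.947 corresponds to 94.7%.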

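The improvement is very significant!

Afterword

To be honest, when I saw the results of the trained model, I was not sure whether I would keep writing this series. The score for this hotel review classification task was already much higher than I expected back when I simply converted characters to indices. That is probably because the words used in positive and negative reviews really are very different!

With Word2Vec, the performance has improved to the point where I think a bit more model tuning would hit the ceiling. Perhaps next I will try neural networks such as RNN, LSTM, or GRU?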
