Machine Learning (ML) - Text Tokenization
I split this into several steps mainly because I could not get through the whole workflow in one sitting while practicing, so I broke the process up myself. The main reference is "Python Machine Learning" by Sebastian Raschka; this covers the material of Chapter 8.
I also referred to 『今天不學機器學習,明天就被機器取代:從Python入手+演算法』!
STEP-1_Download the data and persist it to disk
# Download the sample data from http://ai.stanford.edu/~amaas/data/sentiment/
# Here we pull the 50,000 reviews out of the sample data and persist them to disk with pickle.
import os
import pandas as pd
import pyprind
import pickle
def writeobj(path, bunchobj):
    # serialize an object to the given path with pickle
    file_obj = open(path, "wb")
    pickle.dump(bunchobj, file_obj)
    file_obj.close()
pBar = pyprind.ProgBar(50000)
labels = {'pos': 1, 'neg': 0}
df = pd.DataFrame()
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = 'D:\\python\\ML\\aclImdb\\%s\\%s' % (s, l)
        for file in os.listdir(path):
            # os.path.join combines path and file,
            # e.g. D:\python\ML\aclImdb\test\neg\xxx.txt,
            # so we use join to build the full path before opening each file
            with open(os.path.join(path, file), encoding='utf8') as infile:
                txt = infile.read()
            # ignore_index=True resets the index
            # the key into labels is a lowercase L (the second loop variable above), not the digit 1
            # (on pandas 2.0+ df.append is gone; see the concat note after this step)
            df = df.append([[txt, labels[l]]], ignore_index=True)
            pBar.update()
writeobj('D:\\python\\ML\\df.dat', df)
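One caveat for newer environments: DataFrame.append was removed in pandas 2.0, so on a current pandas the append line in the loop above has to be replaced. A minimal sketch of the drop-in replacement, using the same variables:
# pandas >= 2.0 replacement for df = df.append([[txt, labels[l]]], ignore_index=True)
df = pd.concat([df, pd.DataFrame([[txt, labels[l]]])], ignore_index=True)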
STEP-2_Read back the persisted file
# This simply tests reading back the file persisted in STEP-1.
# Checking with type() shows that the loaded object is a pandas DataFrame.
import os
import pandas as pd
import pyprind
import pickle
def readobj(path):
    with open(path, "rb") as file:
        context = pickle.load(file)
    return context
path = 'D:\\python\\ML\\df.dat'
# df = pd.DataFrame()
df = readobj(path)
print(df)
# print(type(df))
STEP-3_Shuffle the data and export it to CSV
# Read back the persisted file built earlier with pickle and write it out as CSV,
# with the rows shuffled into random order first.
import os
import pandas as pd
import numpy as np
import pickle
def readobj(path):
    with open(path, "rb") as file:
        context = pickle.load(file)
    return context
df = pd.DataFrame()
path = 'D:\\python\\ML\\df.dat'
df = readobj(path)
# numpy.random.seed makes the random shuffle reproducible across runs.
# With seed(0) the result is identical no matter how many times you run it (see the small sketch after this step)!
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
# header defines the column names written to the CSV
df.to_csv('D:\\python\\ML\\move_data.csv', index=False, header=('review', 'sentiment'))
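A minimal standalone sketch of the reproducibility point above: re-seeding with the same value gives back exactly the same permutation, which is why the shuffled CSV comes out identical on every run.
import numpy as np
np.random.seed(0)
first = np.random.permutation(10)
np.random.seed(0)
second = np.random.permutation(10)
# same seed, same shuffle order
print(np.array_equal(first, second))  # True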
STEP-4_Read back the CSV
# -*- coding: UTF-8 -*-
# Test reading back the CSV file that was just written out.
import pandas as pd
import codecs
df = pd.DataFrame()
path = 'D:\\python\\ML\\move_data.csv'
# Encoding problems come up a lot here; the workaround I settled on was to import codecs and open the file with codecs.open.
with codecs.open(path, "r", encoding='utf-8', errors='ignore') as fdata:
    df = pd.read_csv(fdata)
print(df.head(3))
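If the installed pandas is 1.3 or newer, read_csv can handle the encoding fallback directly, without codecs; a minimal sketch under that assumption, reading the same file:
import pandas as pd
path = 'D:\\python\\ML\\move_data.csv'
# encoding_errors requires pandas >= 1.3; on older versions keep the codecs.open approach
df = pd.read_csv(path, encoding='utf-8', encoding_errors='ignore')
print(df.head(3))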
STEP-5_Clean up the review text
# -*- coding: UTF-8 -*-
import pandas as pd
import re
import codecs
def preprocessor(text):
    # re.sub substitutes matches of a pattern
    # re.sub(pattern, replacement, source string)
    text = re.sub(r'<[^>]*>', '', text)
    # re.findall returns a list of every match, here the emoticons
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # drop every non-word character, lowercase the text, then append the emoticons at the end
    text = re.sub(r'[\W]+', ' ', text.lower()) + \
        ' '.join(emoticons).replace('-', '')
    return text
df = pd.DataFrame()
path = 'D:\\python\\ML\\move_data.csv'
with codecs.open(path, "r", encoding='utf-8', errors='ignore') as fdata:
    df = pd.read_csv(fdata)
# dd ='is seven.<br /><br />Title (Brazil): Not Available :)'
# print(preprocessor(dd))
# Clean the unnecessary markup and symbols out of the review column.
df['review'] = df['review'].apply(preprocessor)
print(df)
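A quick sanity check of what preprocessor does, using the commented-out sample string above: the HTML tags are stripped, the text is lowercased with punctuation collapsed to spaces, and the emoticon is re-attached at the end without its hyphen "nose".
sample = 'is seven.<br /><br />Title (Brazil): Not Available :)'
print(preprocessor(sample))
# -> is seven title brazil not available :)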
STEP-6_Bag-of-words and tf-idf
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
# When building the CountVectorizer, ngram_range controls the length of the token sequences used as features.
# ngram_range=(2, 2) means 2-grams only; (2, 3) would produce sequences of 2 to 3 words (see the bigram sketch after this step).
count = CountVectorizer(ngram_range=(1, 1))
docs = np.array([
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining and the weather is sweet'
])
bag = count.fit_transform(docs)
# print the vocabulary (word -> column index) mapping
# {'the': 5, 'sun': 3, 'is': 1, 'shining': 2, 'weather': 6, 'sweet': 4, 'and': 0}
print(count.vocabulary_)
# print the raw count vectors; each column corresponds to an index shown above
# 'and' has index 0 and does not appear in the first sentence, so the first value there is 0, and so on for all three documents
print(bag.toarray())
# apply tf-idf weighting
tfidf = TfidfTransformer()
np.set_printoptions(precision=2)
# print(tfidf.fit_transform(count.fit_transform(docs)).toarray())
# the commented line above is equivalent to the line below
print(tfidf.fit_transform(bag).toarray())
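Two standalone sketches of the knobs used above, assuming scikit-learn's default TfidfTransformer settings (smooth_idf=True, norm='l2'): first, what ngram_range=(2, 2) produces on the same three documents, then the tf-idf weights recomputed by hand to show the formula TfidfTransformer applies.
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
docs = np.array([
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining and the weather is sweet'
])
# ngram_range=(2, 2): every feature is a pair of consecutive words
bigram_count = CountVectorizer(ngram_range=(2, 2))
bigram_count.fit(docs)
print(bigram_count.vocabulary_)
# reproduce the default tf-idf weighting by hand:
# idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1, then each row is l2-normalized
tf = CountVectorizer().fit_transform(docs).toarray()
df_t = (tf > 0).sum(axis=0)
idf = np.log((1 + len(docs)) / (1 + df_t)) + 1
tfidf_manual = tf * idf
tfidf_manual = tfidf_manual / np.linalg.norm(tfidf_manual, axis=1, keepdims=True)
print(np.round(tfidf_manual, 2))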
STEP-7_Porter stemming and stop words
# The Porter stemming algorithm
# The stop word list can be downloaded with nltk.download('stopwords')
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
porter = PorterStemmer()
stop = stopwords.words('english')
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]
print([w for w in tokenizer_porter('runners like running and thus they run')[-10:] if w not in stop])
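For comparison, NLTK also ships the Snowball ("Porter2") stemmer, a revised and usually slightly less aggressive variant of the Porter algorithm; whether it helps is data-dependent, so treat this as an optional variation rather than part of the main flow.
from nltk.stem.snowball import SnowballStemmer
snowball = SnowballStemmer('english')
words = 'runners like running and thus they run'.split()
# compare against the Porter output ['runner', 'like', 'run', 'and', 'thu', 'they', 'run']
print([snowball.stem(w) for w in words])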
STEP-8_Grid search over a tf-idf + logistic regression pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import codecs
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]
def tokenizer(text):
    return text.split()
df = pd.DataFrame()
path = 'D:\\python\\ML\\move_data.csv'
porter = PorterStemmer()
stop = stopwords.words('english')
# Encoding problems come up here as well, so the file is opened with codecs.open again.
with codecs.open(path, "r", encoding='utf-8', errors='ignore') as fdata:
    df = pd.read_csv(fdata)
x_train = df.loc[:500, 'review'].values
y_train = df.loc[:500, 'sentiment'].values
x_test = df.loc[500:, 'review'].values
y_test = df.loc[500:, 'sentiment'].values
tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)
param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf': [False],
               'vect__norm': [None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]
lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0))])
gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)
if __name__ == '__main__':
    gs_lr_tfidf.fit(x_train, y_train)
    print('Best parameter set: %s ' % gs_lr_tfidf.best_params_)
    print('CV Accuracy: %.3f' % gs_lr_tfidf.best_score_)
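Once the search finishes, the refit best pipeline can be scored on the held-out rows; the two lines below belong under the same if __name__ == '__main__': guard. One caveat for newer scikit-learn: from 0.22 the default LogisticRegression solver is lbfgs, which does not support penalty='l1', so solver='liblinear' may need to be passed explicitly for this grid.
    # best_estimator_ is the pipeline refit on the training split with the best parameter combination
    clf = gs_lr_tfidf.best_estimator_
    print('Test Accuracy: %.3f' % clf.score(x_test, y_test))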
STEP-9_Out-of-core learning with HashingVectorizer and SGDClassifier
import numpy as np
import re
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
import pyprind
stop = stopwords.words('english')
path = "D:\\python\\ML\\move_data.csv"
def tokenizer(text):
    # same cleanup as the STEP-5 preprocessor, plus stop word removal
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = re.sub(r'[\W]+', ' ', text.lower()) + \
        ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized
def stream_docs(path):
    with open(path, 'r') as csv:
        # skip the header row
        next(csv)
        for line in csv:
            text, label = line[:-3], int(line[-2])
            # yield works like return, except the function's state is kept,
            # so the next call resumes where it left off - more on this in another post!
            yield text, label
# The built-in next() returns the next value from an iterator.
# Test that the call actually works - testing a little bit at a time makes debugging much easier!
# print(next(stream_docs(path)))
# define a function that pulls mini-batches of documents off the stream
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y
vect = HashingVectorizer(decode_error='ignore',
                         n_features=2**21,
                         preprocessor=None,
                         tokenizer=tokenizer)
# note: newer scikit-learn uses max_iter instead of n_iter, and loss='log' has been renamed to 'log_loss'
clf = SGDClassifier(loss='log', random_state=1, n_iter=1)
doc_stream = stream_docs(path=path)
pbar = pyprind.ProgBar(45)
classes = np.array([0,1])
for _ in range(45):
    x_train, y_train = get_minibatch(doc_stream, size=1000)
    if not x_train:
        break
    x_train = vect.transform(x_train)
    clf.partial_fit(x_train, y_train, classes=classes)
    pbar.update()
x_test, y_test = get_minibatch(doc_stream, size=5000)
x_test = vect.transform(x_test)
print('Accuracy: %.3f' % clf.score(x_test, y_test))
# finally, use these last 5,000 reviews to update the model as well
clf = clf.partial_fit(x_test, y_test)
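Because the streaming pass takes a while, it is worth persisting the trained classifier and the stop word list the same way the DataFrame was persisted in STEP-1, so they can be reloaded later without retraining. A minimal sketch; the .pkl file names below are just examples.
import pickle
# example output paths - pick whatever location suits your project
with open('D:\\python\\ML\\stopwords.pkl', 'wb') as f:
    pickle.dump(stop, f)
with open('D:\\python\\ML\\classifier.pkl', 'wb') as f:
    pickle.dump(clf, f)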