Text Classification with sklearn (Feature Extraction, KNN/SVM, Clustering)



From the WeChat account 数据挖掘入门与实战 (datadw). The workflow consists of the following steps:

- Load the dataset
- Extract features
- Classification: Naive Bayes, KNN, SVM
- Clustering
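As a quick preview of these steps, here is a minimal self-contained sketch using a few toy documents instead of the 20 Newsgroups data (the documents, labels, and variable names are illustrative assumptions, not part of the original article):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Step 1: "load" a tiny toy corpus (a stand-in for fetch_20newsgroups)
docs = ["graphics rendering opengl image",
        "x11 window manager display",
        "opengl shader graphics texture",
        "x11 xterm window session"]
labels = [0, 1, 0, 1]  # 0 = graphics-like, 1 = windows-like

# Step 2: extract features (tf-idf term vectors)
vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# Step 3: classify with Naive Bayes
clf = MultinomialNB()
clf.fit(X, labels)
pred = clf.predict(vec.transform(["opengl graphics image"]))
print(pred)
```

The rest of the article fills in each step with the real dataset and additional classifiers.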

The 20 Newsgroups homepage, http://qwone.com/~jason/20Newsgroups/, offers three versions of the dataset; here we use the original one, 20news-19997.tar.gz:

http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz

1. Load the dataset

Download 20news-19997.tar.gz and extract it into the scikit_learn_data folder, then load the data; see the code comments for details.

```python
# first extract the 20 news_group dataset to /scikit_learn_data
from sklearn.datasets import fetch_20newsgroups

# all categories
# newsgroup_train = fetch_20newsgroups(subset='train')

# part categories
categories = ['comp.graphics', 'comp.os.ms-windows.misc',
              'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
              'comp.windows.x']
newsgroup_train = fetch_20newsgroups(subset='train', categories=categories)
```

You can verify that the data loaded correctly:

```python
# print category names
from pprint import pprint
pprint(list(newsgroup_train.target_names))
```

Result:

```
['comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x']
```

2. Feature extraction

The newsgroup_train we just loaded is a collection of documents. We need to extract feature vectors from them, such as term frequencies, using fit_transform.

Method 1. HashingVectorizer, with a fixed number of features:

```python
# newsgroup_train.data is the original documents, but we need to extract
# feature vectors in order to model the text data
from sklearn.feature_extraction.text import HashingVectorizer

# load the matching test split (it was not defined in the original snippet)
newsgroup_test = fetch_20newsgroups(subset='test', categories=categories)

vectorizer = HashingVectorizer(stop_words='english', non_negative=True,
                               n_features=10000)
fea_train = vectorizer.fit_transform(newsgroup_train.data)
fea_test = vectorizer.fit_transform(newsgroup_test.data)
```
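Note that HashingVectorizer is stateless, which is why calling fit_transform separately on the train and test data is safe here (unlike CountVectorizer or TfidfVectorizer, where you must fit on the training set and only transform the test set). A minimal sketch on toy documents, assuming a recent scikit-learn where the deprecated non_negative=True parameter has been replaced by alternate_sign=False:

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["the quick brown fox", "jumps over the lazy dog"]

# alternate_sign=False keeps all hashed feature values non-negative,
# mirroring the older non_negative=True used in the article's snippet
vec = HashingVectorizer(stop_words='english', n_features=10000,
                        alternate_sign=False)
X = vec.transform(docs)  # no fit needed: hashing is stateless
print(X.shape)
```

Because no vocabulary is stored, the feature space is identical for any input, at the cost of occasional hash collisions and no way to map columns back to words.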