Text classification with sklearn (feature extraction, KNN/SVM, clustering)
The walkthrough is divided into the following steps:

- Load the dataset
- Extract features
- Classify (Naive Bayes, KNN, SVM)
- Cluster
The 20 Newsgroups website, http://qwone.com/~jason/20Newsgroups/, provides three versions of the dataset; here we use the original one, 20news-19997.tar.gz:

http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz
1. Loading the dataset

Download 20news-19997.tar.gz, extract it into the scikit_learn_data folder, and load the data as follows (see the comments in the code for details).
```python
# First extract the 20 Newsgroups dataset to /scikit_learn_data
from sklearn.datasets import fetch_20newsgroups

# All categories:
# newsgroup_train = fetch_20newsgroups(subset='train')

# Only part of the categories:
categories = ['comp.graphics', 'comp.os.ms-windows.misc',
              'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
              'comp.windows.x']
newsgroup_train = fetch_20newsgroups(subset='train', categories=categories)
```
You can verify that the dataset loaded correctly:

```python
# Print the category names
from pprint import pprint
pprint(list(newsgroup_train.target_names))
```
Result:

```
['comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x']
```
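Beyond the category names, it can also help to look at the size of the training split and a few labels. The following is a minimal sketch assuming the newsgroup_train object loaded above:

```python
# The Bunch returned by fetch_20newsgroups exposes .data (raw texts),
# .target (integer labels), and .target_names (label strings)
print(len(newsgroup_train.data))       # number of training documents
print(newsgroup_train.target[:10])     # integer labels of the first ten documents
print(newsgroup_train.target_names[newsgroup_train.target[0]])  # label of document 0
```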
2. Extracting features

The newsgroup_train we just loaded is a set of raw documents; we need to extract feature vectors from them (term frequencies and the like) with a vectorizer's fit_transform.

Method 1: HashingVectorizer, with the number of features fixed in advance.
```python
# newsgroup_train.data holds the original documents, but we need to extract
# feature vectors in order to model the text data
from sklearn.feature_extraction.text import HashingVectorizer

# Load the matching test split so it can be vectorized too
newsgroup_test = fetch_20newsgroups(subset='test', categories=categories)

# Older sklearn releases used non_negative=True; in current releases the
# equivalent option is alternate_sign=False
vectorizer = HashingVectorizer(stop_words='english', alternate_sign=False,
                               n_features=10000)
fea_train = vectorizer.fit_transform(newsgroup_train.data)
fea_test = vectorizer.fit_transform(newsgroup_test.data)
```
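Note that HashingVectorizer is stateless (feature hashing needs no fitted vocabulary), so calling fit_transform independently on the train and test data is safe here; with a vocabulary-based vectorizer such as CountVectorizer, you would fit on the training data and only transform the test data. As a quick sanity check on the resulting sparse matrices:

```python
# Both matrices have exactly n_features = 10000 columns
print(fea_train.shape)   # (n_train_docs, 10000)
print(fea_test.shape)    # (n_test_docs, 10000)
```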