Text classification with sklearn (feature extraction, KNN/SVM, clustering)
The walkthrough is divided into the following steps:

- Load the dataset
- Extract features
- Classify (Naive Bayes, KNN, SVM)
- Cluster
The 20 Newsgroups website, http://qwone.com/~jason/20Newsgroups/, provides three versions of the dataset; here we use the original one, 20news-19997.tar.gz:

http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz
1. Loading the dataset

Download 20news-19997.tar.gz, extract it into the scikit_learn_data folder, and load the data as follows (see the comments in the code for details).
```python
# First extract the 20 Newsgroups dataset to /scikit_learn_data
from sklearn.datasets import fetch_20newsgroups

# All categories:
# newsgroup_train = fetch_20newsgroups(subset='train')

# Only part of the categories:
categories = ['comp.graphics', 'comp.os.ms-windows.misc',
              'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
              'comp.windows.x']
newsgroup_train = fetch_20newsgroups(subset='train', categories=categories)
```
You can verify that the dataset loaded correctly:

```python
# Print the category names
from pprint import pprint
pprint(list(newsgroup_train.target_names))
```
Result:

```
['comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x']
```
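Beyond the category names, it can also help to look at the size of the training split and a few labels. The following is a minimal sketch assuming the newsgroup_train object loaded above:

```python
# The Bunch returned by fetch_20newsgroups exposes .data (raw texts),
# .target (integer labels), and .target_names (label strings)
print(len(newsgroup_train.data))       # number of training documents
print(newsgroup_train.target[:10])     # integer labels of the first ten documents
print(newsgroup_train.target_names[newsgroup_train.target[0]])  # label of document 0
```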
2. Extracting features

The newsgroup_train we just loaded is a set of raw documents; we need to extract feature vectors from them (term frequencies and the like) with a vectorizer's fit_transform.

Method 1: HashingVectorizer, with the number of features fixed in advance.
```python
# newsgroup_train.data holds the original documents, but we need to extract
# feature vectors in order to model the text data
from sklearn.feature_extraction.text import HashingVectorizer

# Load the matching test split so it can be vectorized too
newsgroup_test = fetch_20newsgroups(subset='test', categories=categories)

# Older sklearn releases used non_negative=True; in current releases the
# equivalent option is alternate_sign=False
vectorizer = HashingVectorizer(stop_words='english', alternate_sign=False,
                               n_features=10000)
fea_train = vectorizer.fit_transform(newsgroup_train.data)
fea_test = vectorizer.fit_transform(newsgroup_test.data)
```
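Note that HashingVectorizer is stateless (feature hashing needs no fitted vocabulary), so calling fit_transform independently on the train and test data is safe here; with a vocabulary-based vectorizer such as CountVectorizer, you would fit on the training data and only transform the test data. As a quick sanity check on the resulting sparse matrices:

```python
# Both matrices have exactly n_features = 10000 columns
print(fea_train.shape)   # (n_train_docs, 10000)
print(fea_test.shape)    # (n_test_docs, 10000)
```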