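This repository does not declare an open-source license file (LICENSE); before use, check the project description and the upstream dependencies of its code.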
text2stemedwords.py 1.55 KB
张亚飞 committed on 2019-03-29 21:17 . Commit of March 29
# encoding=utf-8
"""
Created on March 1, 2019
@author: yuqi
"""
# Requires the NLTK tokenizer models and the 'stopwords' corpus to be downloaded beforehand.
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize

# Dataset name used to build the input and output paths ('低关注' means "low attention").
dataset = '低关注'

# Punctuation tokens to drop after tokenization.
english_punctuations = {',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%', '-'}

if __name__ == '__main__':
    # Build the stop-word set and the stemmer once instead of once per input line.
    english_stopwords = set(stopwords.words('english'))
    st = PorterStemmer()
    with open('data/%s_title.txt' % dataset, 'w', encoding='utf-8') as op:
        with open('data/%s.txt' % dataset, 'r', encoding='utf-8') as f:
            for line in f:
                # Fields are tab-separated; field 1 holds the title. Only the title is processed.
                fields = line.strip().split('\t')
                if len(fields) < 2:
                    # Skip blank or malformed lines that have no title field.
                    continue
                title = fields[1].lower()
                # Tokenize, then drop stop words and punctuation.
                words_title = word_tokenize(title)
                words_title_filtered = [
                    word for word in words_title
                    if word not in english_stopwords and word not in english_punctuations
                ]
                # Reduce each remaining token to its Porter stem.
                words_title_stemmed = [st.stem(word) for word in words_title_filtered]
                # Write one line of space-separated stems per input title.
                op.write('%s\n' % ' '.join(words_title_stemmed))