# encoding=utf-8
"""
Created on March 1, 2019
@author: yuqi

Tokenize, filter, and stem the title column of a tab-separated file.
Requires the NLTK tokenizer and stopwords data (e.g. via nltk.download).
"""
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize

# Category name used to build the file paths ('低关注' means "low attention").
# Renamed from `type` to avoid shadowing the built-in.
category = '低关注'

if __name__ == '__main__':
    # Build these once instead of once per input line.
    english_stopwords = set(stopwords.words('english'))
    english_punctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%', '-']
    st = PorterStemmer()

    with open('data/%s_title.txt' % category, 'w', encoding='utf-8') as op:
        with open('data/%s.txt' % category, 'r', encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                fields = line.split('\t')
                # Skip lines without a title column; fields[1] would raise IndexError.
                if len(fields) < 2:
                    continue
                title = fields[1].lower()
                # The abstract is parsed but not used; only the title is processed below.
                abstract = ''
                if len(fields) == 3:
                    abstract = fields[2].lower()
                # Tokenize the title (the variable names say "abstract" but hold title tokens).
                words_abstract = word_tokenize(title)
                # Drop English stop words.
                words_abstract_filtered = [word for word in words_abstract if word not in english_stopwords]
                # Drop punctuation tokens.
                words_abstract_filtered = [word for word in words_abstract_filtered if word not in english_punctuations]
                # Reduce each remaining token to its Porter stem.
                words_abstract_stemed = [st.stem(word) for word in words_abstract_filtered]
                # Write one line of space-separated stems per input title.
                op.write('%s\n' % ' '.join(words_abstract_stemed))