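This repository does not declare an open-source license file (LICENSE); before use, check the project description and the upstream dependencies of its code.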
get_git.py 1.55 KB
snow212-cn committed on 2020-01-22 13:51: Add files via upload
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import time
import os
site = 'https://www.liaoxuefeng.com'
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'}
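
def save_soup(title, s, i=0):
    # Point each lazy-loaded image (data-src) at its absolute URL.
    for img in s.select('img'):
        if img.has_attr('data-src'):
            img['src'] = site + img['data-src']
    # Drop embedded iframes.
    for ifr in s.select('iframe'):
        ifr.extract()
    # Declare the encoding and set the page title.
    s.head.append(s.new_tag('meta', charset='utf-8'))
    title_tag = s.new_tag('title')
    title_tag.string = title
    s.head.append(title_tag)
    # Show the title as a heading at the top of the body.
    h1 = s.new_tag('h1')
    h1.string = title
    s.body.insert(0, h1)
    # Write as UTF-8 explicitly so non-ASCII titles and text survive on any platform.
    with open('%s.%s.html' % (i, title), 'w', encoding='utf-8') as f:
        f.write(str(s))
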
# Fetch the tutorial's index page, collect the chapter links, and save the intro page.
r = requests.get(site + '/wiki/896043488029600', headers=headers)
soup = BeautifulSoup(r.content, 'html5lib')
links = soup.body.select('#x-wiki-index>div')[0].find_all('a')
div = soup.body.select('#x-content>div.x-wiki-content.x-main-content')[0]
save_soup('Git教程', BeautifulSoup(div.prettify(), 'html5lib'))
for i, a in enumerate(links):
    # '/' cannot appear in a file name, so swap it for a full-width slash.
    title = a.string.replace('/', '/')
    filename = '%s.%s.html' % (i, title)
    # Skip pages that were already downloaded on an earlier run.
    if os.path.exists(filename):
        print('file exists ' + filename)
        continue
    print('getting page:%s %s' % (title, a['href']))
    r = requests.get(site + a['href'], headers=headers)
    soup = BeautifulSoup(r.content, 'html5lib')
    div = soup.body.select('#x-content>div.x-wiki-content.x-main-content')[0]
    save_soup(title, BeautifulSoup(div.prettify(), 'html5lib'), i)
    # Pause between requests to avoid hammering the server.
    time.sleep(1)
print('all finished')
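
The script builds each output file name from the link text and only handles the `/` character. As a minimal sketch of a more defensive alternative (not part of the original script), the helper below replaces every character that is commonly rejected by file systems. The name `safe_filename`, its `ext` parameter, and the `untitled` fallback are hypothetical and chosen only for illustration.

```python
import re


def safe_filename(index, title, ext="html"):
    """Build '<index>.<title>.<ext>' with path-unsafe characters replaced.

    Hypothetical helper: the original script only replaces '/'.
    """
    # Replace characters that Windows, macOS, or Linux reject in file names.
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    # Fall back to a placeholder if nothing usable is left.
    return "%s.%s.%s" % (index, cleaned or "untitled", ext)


print(safe_filename(3, "fetch/pull"))
```

If adopted, the same helper would need to be used both where the script checks `os.path.exists` and where `save_soup` opens the file, so that the two names stay in sync.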