RuiJi Scraper is a browser plug-in built on RuiJi expressions. It provides visual rule editing and generates RuiJi expressions for RuiJi.Net. It is available for Firefox.
We cannot withdraw donations from Open Collective, so we have had to shut down the demo and documentation servers for ruiji.net.
If you would like to restart the donation support project, please contact us by email at 416803633@qq.com.
RuiJi.Net is a distributed crawling framework written in .NET Core.
RuiJi.Net is a self-hosted Web API built on Microsoft.AspNetCore.Owin. Major features include a distributed crawler, a distributed extractor, and managed cookies.
RuiJi.Net supports IP polling across the server's public network addresses and proxy servers.
Documentation (building): http://doc.ruijihg.com/
Feature | Support |
---|---|
web headers | custom |
method | GET/POST |
auto redirection | supported |
cookie | managed/custom |
service point IP | auto/custom bind |
encoding | auto-detect/specified |
response | raw/string |
proxy | http |
Type |
---|
CSS |
REGEX |
REGEXSPLIT |
TEXTRANGE |
EXCLUDE |
REGEXREPLACE |
JPATH |
XPATH |
CLEAR |
EXPRESSION |
SELECTORPROCESSOR |
// Basic crawl: request a page with the default settings.
var crawler = new RuiJiCrawler();
var request = new Request("https://www.baidu.com");
var response = crawler.Request(request);

// Crawl from a specific IP address of this machine (custom service point binding).
var crawler = new RuiJiCrawler();
var request = new Request("https://www.baidu.com");
request.Ip = "192.168.31.196";
var response = crawler.Request(request);

// Crawl through an HTTP proxy.
var crawler = new RuiJiCrawler();
var request = new Request("https://www.baidu.com");
request.Proxy = new RequestProxy("223.93.172.248", 3128);
var response = crawler.Request(request);
// Extract links: select the href of each blog title link,
// then keep only the URLs that match the wildcard expression.
var crawler = new RuiJiCrawler();
var request = new Request("https://www.oschina.net/blog");
var response = crawler.Request(request);
var content = response.Data.ToString();
var parser = new RuiJiParser();
var eb = parser.ParseExtract("css a.blog-title-link[href]\nexp https://my.oschina.net/*/blog/*");
var result = RuiJiExtractor.Extract(content, eb.Block);
// Extract tiles: each article is one tile, with title and summary as its meta.
var crawler = new RuiJiCrawler();
var request = new Request("http://www.ruijihg.com/archives/category/tech/bigdata");
var response = crawler.Request(request);
var content = response.Data.ToString();
var parser = new RuiJiParser();
// In a verbatim string (@"..."), \n is not an escape sequence, so a real line break is used.
var eb = parser.ParseExtract(@"[tile]
css article:html
[meta]
#title
css .entry-header:text
#summary
css .entry-header + p:text
ex /Read more »/ -e");
var result = RuiJiExtractor.Extract(content, eb.Block);
// Extract meta from a single article page: title, author, date, word count, and content.
var crawler = new RuiJiCrawler();
var request = new Request("https://my.oschina.net/zhupingqi/blog/1826317");
var response = crawler.Request(request);
var content = response.Data.ToString();
var parser = new RuiJiParser();
var eb = parser.ParseExtract(@"[meta]
#title
css h1.header:text
#author
css div.blog-meta .avatar + span:text
#date
css div.blog-meta > div.item:first:text
regS /发布于/ 1
#words_i
css div.blog-meta > div.item:eq(1):text
regS / / 1
#content
css #articleContent:html");
var result = RuiJiExtractor.Extract(content, eb.Block);
// Detect the MIME type of a response.
var crawler = new RuiJiCrawler();
var request = new Request("http://img10.jiuxian.com/2018/0111/cd51bb851410404388155b3ec2c505cf4.jpg");
var response = crawler.Request(request);
var ex = response.Extensions;
Download ZooKeeper from an Apache mirror: http://mirrors.hust.edu.cn/apache/zookeeper/zookeeper-3.4.12/
In the conf folder, make a copy of zoo_sample.cfg, rename it to zoo.cfg, and change dataDir to a directory of your own.
Make sure that a Java runtime environment is installed.
Run bin/zkServer.cmd in your ZooKeeper folder to start ZooKeeper.
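For reference, a minimal standalone zoo.cfg might look like the sketch below. The values other than dataDir are the defaults shipped in zoo_sample.cfg. The dataDir path is only a hypothetical example and should point to a directory that exists on your machine. The clientPort of 2181 matches the port RuiJi.Net connects to in the startup log.

```properties
# Length of a single tick in milliseconds (ZooKeeper's basic time unit)
tickTime=2000
# Ticks allowed for the initial synchronization phase
initLimit=10
# Ticks allowed between sending a request and receiving an acknowledgement
syncLimit=5
# Directory where snapshots are stored (hypothetical path, change it to your own)
dataDir=C:/zookeeper/data
# Port on which clients connect
clientPort=2181
```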
Compile RuiJi.Net.Cmd and run RuiJi.Net.Cmd.exe.
If you see the following output, the service has started successfully:
Server Start At http://x.x.x.x:x
proxy x.x.x.x:x ready to startup!
try connect to zookeeper server : x.x.x.x:2181
zookeeper server connected!
the service startup is complete!
// Crawl the page and stop if the request did not succeed.
var request = new Request("http://www.ruijihg.com/%e5%bc%80%e5%8f%91/");
var response = Crawler.Request(request);
if (response.StatusCode != System.Net.HttpStatusCode.OK)
    return;
var content = response.Data.ToString();

// The block selects the region of the page that contains the article list.
var block = new ExtractBlock();
block.Selectors = new List<ISelector>
{
    new CssSelector(".entry-content", CssTypeEnum.InnerHtml)
};

// The tile selector matches each repeated item inside the block.
block.TileSelector = new ExtractTile
{
    Selectors = new List<ISelector>
    {
        new CssSelector(".pt-cv-content-item", CssTypeEnum.InnerHtml)
    }
};

// Meta extracted from every tile: the title and the "read more" link.
block.TileSelector.Metas.AddMeta("title", new List<ISelector> {
    new CssSelector(".pt-cv-title")
});
block.TileSelector.Metas.AddMeta("url", new List<ISelector> {
    new CssSelector(".pt-cv-readmore", "href")
});

// Run the extraction on the downloaded content.
var r = Extractor.Extract(new ExtractRequest {
    Block = block,
    Content = content
});
A RuiJi expression is a quick way to define page extraction rules. RuiJi expressions are designed to be as simple and readable as possible. Before we start, we should first understand the rule model of RuiJi.Net.
A RuiJi expression uses the structure described in the figure above to extract content from a page. The unit of extraction is the Block, as shown in the following figure.
Selectors is a list of selectors; Tiles is a region that needs to be extracted repeatedly; Metas is the metadata that needs to be extracted; Blocks are sub-blocks that need to be extracted within the Block.
If you need to extract http://www.ruijihg.com/开发, first inspect the structure of the page. You can press F12 in the browser to look at it.
First, make sure that the Block selector matches exactly one element.
The Block can be defined as follows:
#content
css .pt-cv-view:ohtml
Next, add the tile:
[tile]
	#tiles
	css .pt-cv-content-item:ohtml
	[meta]
	#title
	css .pt-cv-title:text
	#content
	css .pt-cv-content:html
	ex 阅读更多... -e
Note the tab (\t) indentation. Because both a block and a tile can contain meta, the tile's selector and the tile's meta are indented with \t to mark them as belonging to the current tile.
The complete Block description structure is as follows:
[Block]
#blockname
selector
[blocks]
@subblockname1
@subblockname2
[tile]
	#tilename
	tile selector
	[meta]
	#meta1
	selector
	#meta2
	selector
[meta]
#blockmeta1
selector
#blockmeta2
selector
Please contact me with any suggestions.
My website: www.ruijihg.com
QQ discussion group: 545931923