克隆/下载
贡献代码
同步代码
取消
提示: 由于 Git 不支持空文件夾,创建文件夹后会生成空的 .keep 文件
Loading...
README
MIT

DotnetSpider

免责申明:本框架是为了帮助开发人员简化开发流程、提高开发效率,请勿使用此框架做任何违法国家法律的事情,使用者所做任何事情也与本框架的作者无关。

Build Status NuGet Member project of .NET Core Community GitHub license

DotnetSpider, a .NET Standard web crawling library. It is a lightweight, efficient, and fast high-level web crawling & scraping framework.

If you want to get the latest beta packages, you should add the myget feed:

<add key="myget.org" value="https://www.myget.org/F/zlzforever/api/v3/index.json" protocolVersion="3" />

DESIGN

DESIGN IMAGE

DEVELOP ENVIROMENT

  1. Visual Studio 2017 (15.3 or later) or Jetbrains Rider

  2. .NET Core 2.2 or later

  3. Docker

  4. MySql

     docker run --name mysql -d -p 3306:3306 --restart always -e MYSQL_ROOT_PASSWORD=1qazZAQ! mysql:5.7
    
  5. Redis (option)

     docker run --name redis -d -p 6379:6379 --restart always redis
    
  6. SqlServer

     docker run --name sqlserver -d -p 1433:1433 --restart always  -e 'ACCEPT_EULA=Y' -e 'SA_PASSWORD=1qazZAQ!' mcr.microsoft.com/mssql/server:2017-latest
    
  7. PostgreSQL (option)

     docker run --name postgres -d  -p 5432:5432 --restart always -e POSTGRES_PASSWORD=1qazZAQ! postgres
    
  8. MongoDb (option)

     docker run --name mongo -d -p 27017:27017 --restart always mongo
    
  9. RabbitMQ

    docker run -d --restart always --name rabbimq -p 4369:4369 -p 5671-5672:5671-5672 -p 25672:25672 -p 15671-15672:15671-15672 \
           -e RABBITMQ_DEFAULT_USER=user -e RABBITMQ_DEFAULT_PASS=password \
           rabbitmq:3-management
    
  10. Docker remote api for mac

    docker run -d  --restart always --name socat -v /var/run/docker.sock:/var/run/docker.sock -p 2376:2375 bobrik/socat TCP4-LISTEN:2375,fork,reuseaddr UNIX-CONNECT:/var/run/docker.sock
    
  11. HBase

    docker run -d --restart always --name hbase -p 20550:8080 -p 8085:8085 -p 9090:9090 -p 9095:9095 -p 16010:16010 dajobe/hbase
    

MORE DOCUMENTS

https://github.com/dotnetcore/DotnetSpider/wiki

SAMPLES

Please see the Project DotnetSpider.Sample in the solution.

BASE USAGE

Base usage Codes

ADDITIONAL USAGE: Configurable Entity Spider

View complete Codes

public class EntitySpider : Spider
{
    public EntitySpider(IOptions<SpiderOptions> options, SpiderServices services, ILogger<Spider> logger) : base(
        options, services, logger)
    {
    }

    #region Nested type: CnblogsEntry

    [Schema("cnblogs", "news")]
    [EntitySelector(Expression = ".//div[@class='news_block']", Type = SelectorType.XPath)]
    [GlobalValueSelector(Expression = ".//a[@class='current']", Name = "类别", Type = SelectorType.XPath)]
    [FollowRequestSelector(XPaths = new[]
    {
        "//div[@class='pager']"
    })]
    public class CnblogsEntry : EntityBase<CnblogsEntry>
    {
        public int Id { get; set; }

        [Required]
        [StringLength(200)]
        [ValueSelector(Expression = "类别", Type = SelectorType.Environment)]
        public string Category { get; set; }

        [Required]
        [StringLength(200)]
        [ValueSelector(Expression = "网站", Type = SelectorType.Environment)]
        public string WebSite { get; set; }

        [StringLength(200)]
        [ValueSelector(Expression = "//title")]
        [ReplaceFormatter(NewValue = "", OldValue = " - 博客园")]
        public string Title { get; set; }

        [StringLength(40)]
        [ValueSelector(Expression = "GUID", Type = SelectorType.Environment)]
        public string Guid { get; set; }

        [ValueSelector(Expression = ".//h2[@class='news_entry']/a")]
        public string News { get; set; }

        [ValueSelector(Expression = ".//h2[@class='news_entry']/a/@href")]
        public string Url { get; set; }

        [ValueSelector(Expression = ".//div[@class='entry_summary']")]
        public string PlainText { get; set; }

        [ValueSelector(Expression = "DATETIME", Type = SelectorType.Environment)]
        public DateTime CreationTime { get; set; }

        protected override void Configure()
        {
            HasIndex(x => x.Title);
            HasIndex(x => new
            {
                x.WebSite,
                x.Guid
            }, true);
        }
    }

    #endregion

    public static async Task RunAsync()
    {
        var builder = Builder.CreateDefaultBuilder<EntitySpider>();
        builder.UseSerilog();
        builder.UseQueueDistinctBfsScheduler<HashSetDuplicateRemover>();
        await builder.Build()
            .RunAsync();
    }

    protected override async Task InitializeAsync(CancellationToken stoppingToken)
    {
        AddDataFlow(new DataParser<CnblogsEntry>());
        AddDataFlow(GetDefaultStorage());
        await AddRequestsAsync(new Request("https://news.cnblogs.com/n/page/1/", new Dictionary<string, string>
        {
            {
                "网站", "博客园"
            }
        }), new Request("https://news.cnblogs.com/n/page/2/", new Dictionary<string, string>
        {
            {
                "网站", "博客园"
            }
        }));
    }

    protected override (string Id, string Name) GetIdAndName()
    {
        return (ObjectId.NewId.ToString(), "博客园");
    }
}

Distributed spider

Read this document

Puppeteer downloader

Coming soon

NOTICE

when you use redis scheduler, please update your redis config:

timeout 0
tcp-keepalive 60

Dependencies

Package License
Bert.RateLimiters Apache 2.0
MessagePack MIT
Newtonsoft.Json MIT
Dapper Apache 2.0
HtmlAgilityPack MIT
ZCJ.HashedWheelTimer MIT
murmurhash Apache 2.0
Serilog.AspNetCore Apache 2.0
Serilog.Sinks.Console Apache 2.0
Serilog.Sinks.RollingFile Apache 2.0
Serilog.Sinks.PeriodicBatching Apache 2.0
MongoDB.Driver Apache 2.0
MySqlConnector MIT
AutoMapper.Extensions.Microsoft.DependencyInjection MIT
Docker.DotNet MIT
BuildBundlerMinifier Apache 2.0
Pomelo.EntityFrameworkCore.MySql MIT
Quartz.AspNetCore Apache 2.0
Quartz.AspNetCore.MySqlConnector Apache 2.0
Npgsql PostgreSQL License
RabbitMQ.Client Apache 2.0
Polly BSD 3-C

AREAS FOR IMPROVEMENTS

QQ Group: 477731655 Email: zlzforever@163.com

MIT License Copyright (c) 2019 Lewis Zou Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

简介

https://github.com/dotnetcore/DotnetSpider 展开 收起
MIT
取消

发行版

暂无发行版

贡献者

全部

近期动态

不能加载更多了
马建仓 AI 助手
尝试更多
代码解读
代码找茬
代码优化