DotnetSpider

DotnetSpider is a .NET Standard web crawling library similar to WebMagic and Scrapy. It is a lightweight, efficient, and fast high-level web crawling and scraping framework for .NET.

DESIGN

DEVELOPMENT ENVIRONMENT

Visual Studio 2017 (15.3 or later)
.NET Core 2.0 or later
MySQL, if you store data in MySQL. Download MySQL, then grant privileges to the account the spider uses:
grant all on *.* to 'root'@'localhost' IDENTIFIED BY '' with grant option;
flush privileges;

OPTIONAL ENVIRONMENT

Redis, for distributed crawling. Download Redis for Windows
SQL Server
PostgreSQL
MongoDB

SAMPLES

Please see the DotnetSpider.Sample project in the solution.

BASE USAGE

See the base usage code.

ADDITIONAL USAGE: Configurable Entity Spider

View the complete code.

using System.Collections.Generic; // Dictionary<string, dynamic>

public class EntityModelSpider
{
	public static void Run()
	{
		Spider spider = new Spider();
		spider.Run();
	}

	private class Spider : EntitySpider
	{
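		protected override void OnInit(params string[] arguments)
		{
			// Search keywords ("cola|sprite").
			var word = "可乐|雪碧";
			// Queue the search URL and attach the keyword under the "Keyword" key,
			// which the Keyword field of BaiduSearchEntry refers to.
			AddRequest(string.Format("http://news.baidu.com/ns?word={0}&tn=news&from=news&cl=2&pn=0&rn=20&ct=1", word), new Dictionary<string, dynamic> { { "Keyword", word } });
			// Register the entity type to extract and add a console pipeline for the results.
			AddEntityType<BaiduSearchEntry>();
			AddPipeline(new ConsoleEntityPipeline());
		}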

		[Schema("baidu", "baidu_search_entity_model")]
		[Entity(Expression = ".//div[@class='result']", Type = SelectorType.XPath)]
		class BaiduSearchEntry : BaseEntity
		{
			[Column]
			[Field(Expression = "Keyword", Type = SelectorType.Enviroment)]
			public string Keyword { get; set; }

			[Column]
			[Field(Expression = ".//h3[@class='c-title']/a")]
			[ReplaceFormatter(NewValue = "", OldValue = "<em>")]
			[ReplaceFormatter(NewValue = "", OldValue = "</em>")]
			public string Title { get; set; }

			[Column]
			[Field(Expression = ".//h3[@class='c-title']/a/@href")]
			public string Url { get; set; }

			[Column]
			[Field(Expression = ".//div/p[@class='c-author']/text()")]
			[ReplaceFormatter(NewValue = "-", OldValue = "&nbsp;")]
			public string Website { get; set; }

			[Column]
			[Field(Expression = ".//div/span/a[@class='c-cache']/@href")]
			public string Snapshot { get; set; }

			[Column]
			[Field(Expression = ".//div[@class='c-summary c-row ']", Option = FieldOptions.InnerText)]
			[ReplaceFormatter(NewValue = "", OldValue = "<em>")]
			[ReplaceFormatter(NewValue = "", OldValue = "</em>")]
			[ReplaceFormatter(NewValue = " ", OldValue = "&nbsp;")]
			public string Details { get; set; }

			[Column(Length = 0)]
			[Field(Expression = ".", Option = FieldOptions.InnerText)]
			[ReplaceFormatter(NewValue = "", OldValue = "<em>")]
			[ReplaceFormatter(NewValue = "", OldValue = "</em>")]
			[ReplaceFormatter(NewValue = " ", OldValue = "&nbsp;")]
			public string PlainText { get; set; }
		}
	}
}

public class Program
{
	public static void Main()
	{
		EntityModelSpider.Run();
	}
}

Run via Startup

Command: -s:[spider type name | TaskName attribute] -i:[identity] -a:[arg1,arg2...] --tid:[taskId] -n:[name] -c:[configuration file path or name]

-s: Type name of the spider, or the name set in its TaskName attribute, for example: DotnetSpider.Sample.BaiduSearchSpider
-i: Set the identity.
-a: Pass arguments to the spider's Run method.
--tid: Set the task id.
-n: Set the name.
-c: Set the config file path or name. For example, to run with a custom config: -c:app.my.config
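
To make the shape of these `-key:value` options concrete, here is a minimal sketch in C# that splits such arguments into a dictionary. It is not DotnetSpider's own Startup code; the class name `StartupOptionsSketch` and its `Parse` method are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of DotnetSpider: splits "-key:value"
// (or "--key:value") arguments into a dictionary keyed by option name.
public static class StartupOptionsSketch
{
    public static Dictionary<string, string> Parse(string[] args)
    {
        var options = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (var arg in args)
        {
            // Split at the first ':' only, so values such as file paths keep their own colons.
            int colon = arg.IndexOf(':');
            if (!arg.StartsWith("-", StringComparison.Ordinal) || colon < 0)
            {
                continue; // skip anything that is not of the form "-key:value"
            }

            string key = arg.Substring(0, colon).TrimStart('-');
            string value = arg.Substring(colon + 1);
            options[key] = value; // a repeated option overwrites the earlier value
        }
        return options;
    }
}
```

With this sketch, `-s:DotnetSpider.Sample.BaiduSearchSpider --tid:42 -a:arg1,arg2` would produce the keys `s`, `tid`, and `a`; the value of `a` stays the raw string `arg1,arg2`, leaving the comma split to the caller.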

WebDriver Support

To crawl a page whose content is loaded by JavaScript, you only need to do one thing: set the downloader to WebDriverDownloader.

Downloader = new WebDriverDownloader(Browser.Chrome);

See a complete sample

NOTE:

Make sure there is a ChromeDriver.exe in the bin folder when you use Chrome. You can install it into your project via the NuGet package manager: Chromium.ChromeDriver
Make sure you have already added a *.webdriver Firefox profile when you use Firefox: https://support.mozilla.org/en-US/kb/profile-manager-create-and-remove-firefox-profiles
Make sure there is a PhantomJS.exe in the bin folder when you use PhantomJS. You can install it into your project via the NuGet package manager: PhantomJS

Store logs and status in a database

DotnetSpider.Hub

https://github.com/zlzforever/DotnetSpider.Hub

Depends on a CI platform; for example, I currently use TeamCity.
Depends on Scheduler.NET: https://github.com/zlzforever/Scheduler.NET
More documentation to come...

NOTICE

When you use the Redis scheduler, please update your Redis config:

timeout 0
tcp-keepalive 60

AREAS FOR IMPROVEMENT

QQ Group: 477731655
Email: zlzforever@163.com
