microfisher / Strong Web Crawler

An advanced web crawler built on C#/.NET, PhantomJS, and Selenium. It can execute JavaScript, trigger all kinds of events, and manipulate the page's DOM structure.

Projects that are alternatives to or similar to Strong Web Crawler

Maman
Rust Web Crawler saving pages on Redis
Stars: ✭ 39 (-83.61%)
Mutual labels:  crawler, web-crawler
Terpene Profile Parser For Cannabis Strains
Parser and database to index the terpene profile of different strains of Cannabis from online databases
Stars: ✭ 63 (-73.53%)
Mutual labels:  crawler, web-crawler
Crawlab
Distributed web crawler admin platform for spider management, regardless of language or framework.
Stars: ✭ 8,392 (+3426.05%)
Mutual labels:  crawler, web-crawler
Awesome Crawler
A collection of awesome web crawlers and spiders in different languages
Stars: ✭ 4,793 (+1913.87%)
Mutual labels:  crawler, web-crawler
Awesome Web Scraper
A collection of awesome web scrapers and crawlers.
Stars: ✭ 147 (-38.24%)
Mutual labels:  web-crawler, phantomjs
Spidr
A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
Stars: ✭ 656 (+175.63%)
Mutual labels:  crawler, web-crawler
Goose Parser
Universal scraping tool, which allows you to extract data using multiple environments
Stars: ✭ 211 (-11.34%)
Mutual labels:  crawler, phantomjs
Spidy
The simple, easy to use command line web crawler.
Stars: ✭ 257 (+7.98%)
Mutual labels:  crawler, web-crawler
Crawlab Lite
Lite version of Crawlab, a lightweight crawler management platform.
Stars: ✭ 122 (-48.74%)
Mutual labels:  crawler, web-crawler
Pspider
A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560
Stars: ✭ 1,611 (+576.89%)
Mutual labels:  crawler, web-crawler
Spider Flow
A next-generation crawler platform that defines crawl flows graphically, so a crawler can be built without writing code.
Stars: ✭ 365 (+53.36%)
Mutual labels:  crawler, web-crawler
Zhihu Crawler People
A simple distributed crawler for zhihu && data analysis
Stars: ✭ 182 (-23.53%)
Mutual labels:  crawler, web-crawler
Supercrawler
A web crawler. Supercrawler automatically crawls websites. Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits.
Stars: ✭ 306 (+28.57%)
Mutual labels:  crawler, web-crawler
Appcrawler
A web crawler for Android app markets
Stars: ✭ 25 (-89.5%)
Mutual labels:  crawler, phantomjs
Gopa
[WIP] GOPA, a spider written in Golang, for Elasticsearch. DEMO: http://index.elasticsearch.cn
Stars: ✭ 277 (+16.39%)
Mutual labels:  crawler, web-crawler
Boj Autocommit
When you solve the problem of Baekjoon Online Judge, it automatically commits and pushes to the remote repository.
Stars: ✭ 60 (-74.79%)
Mutual labels:  crawler, phantomjs
flink-crawler
Continuous scalable web crawler built on top of Flink and crawler-commons
Stars: ✭ 48 (-79.83%)
Mutual labels:  crawler, web-crawler
CrawlBox
Easy way to brute-force web directory.
Stars: ✭ 118 (-50.42%)
Mutual labels:  crawler, web-crawler
Infinitycrawler
A simple but powerful web crawler library for .NET
Stars: ✭ 97 (-59.24%)
Mutual labels:  crawler, web-crawler
Abot
Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.
Stars: ✭ 1,961 (+723.95%)
Mutual labels:  crawler, web-crawler

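An advanced crawler based on a browser engine

An advanced web crawler built on C#/.NET, PhantomJS, and Selenium. It can execute JavaScript, trigger all kinds of events, manipulate the page's DOM structure, and even remove CSS styles you don't like.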

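Many websites load and paginate content dynamically with Ajax; the review data on Ctrip is one example. With the earlier simple crawler, it is hard to fetch all of the reviews directly. You have to dig through masses of JavaScript to find the API endpoint, and stay on guard in case the site adds data traps or changes the endpoint address.

With the advanced crawler, these problems can be ignored entirely. No matter how a site obfuscates its JavaScript to hide the API, the final data must be rendered into the DOM of the page, otherwise ordinary users could not see it either. So there is no need to analyze the API at all: the data can be extracted directly from the DOM, without even writing complicated regular expressions.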
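
To make the "read it from the DOM instead of writing regular expressions" idea concrete, here is a minimal sketch. It is not part of this project: it uses the .NET standard library's XML/XPath support as a stand-in for a browser-rendered DOM, and the markup, the `DomExtractionSketch` class, and the `ExtractTexts` helper are hypothetical names chosen for illustration. The snippet assumes the markup is well-formed; a real crawler would query the live page through the browser driver instead.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class DomExtractionSketch
{
    // Returns the trimmed text of every element matched by the XPath expression.
    // Assumption: the markup is well-formed enough to be parsed as XML.
    public static string[] ExtractTexts(string markup, string xpath)
    {
        var document = XDocument.Parse(markup);
        return document.XPathSelectElements(xpath)
                       .Select(element => element.Value.Trim())
                       .ToArray();
    }

    public static void Main()
    {
        // Hypothetical stand-in for a comment list after the page has rendered.
        const string renderedMarkup =
            "<div id='commentList'>" +
            "<p class='comment'>Clean room, friendly staff.</p>" +
            "<p class='comment'>Close to the metro.</p>" +
            "</div>";

        // One XPath query selects the comments; no regular expression is needed.
        var comments = ExtractTexts(renderedMarkup, "//div[@id='commentList']/p[@class='comment']");
        foreach (var comment in comments)
        {
            Console.WriteLine(comment);
        }
    }
}
```

The point of the sketch is only that a structural query survives changes to the hidden API, as long as the rendered element layout stays the same.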

Key features

  • Triggering and capturing Ajax request events;
  • Asynchronous, concurrent crawling;
  • Automatic event notifications;
  • Proxy switching;
  • Cookie manipulation.

Screenshots

  • Crawling hotel data

[Screenshot: crawling hotel data]

  • Crawling review data

[Screenshot: crawling hotel reviews]

Sample code

    /// <summary>
    /// Crawl hotel reviews.
    /// </summary>
    static void Main(string[] args)
    {
        var hotelUrl = "http://hotels.ctrip.com/hotel/434938.html";
        var hotelCrawler = new StrongCrawler();
        hotelCrawler.OnStart += (s, e) =>
        {
            Console.WriteLine("Crawler started fetching: " + e.Uri.ToString());
        };
        hotelCrawler.OnError += (s, e) =>
        {
            Console.WriteLine("Crawler error at: " + e.Uri.ToString() + ", exception: " + e.Exception.ToString());
        };
        hotelCrawler.OnCompleted += (s, e) =>
        {
            // HotelCrawler is the result handler; it is not shown in this snippet.
            HotelCrawler(e);
        };
        var operation = new Operation
        {
            Action = (x) => {
                // Use the Selenium driver to click the page's "Hotel Reviews" tab.
                x.FindElement(By.XPath("//*[@id='commentTab']")).Click();
            },
            Condition = (x) => {
                // Check whether the Ajax-loaded review content has finished loading.
                // "点评载入中" means "reviews loading"; it stays in Chinese to match the page text.
                return x.FindElement(By.XPath("//*[@id='commentList']")).Displayed
                    && x.FindElement(By.XPath("//*[@id='hotel_info_comment']/div[@id='commentList']")).Displayed
                    && !x.FindElement(By.XPath("//*[@id='hotel_info_comment']/div[@id='commentList']")).Text.Contains("点评载入中");
            },
            Timeout = 5000
        };

        // When no JS needs to be run, set the parameter to null.
        hotelCrawler.Start(new Uri(hotelUrl), null, operation);

        Console.ReadKey();
    }
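
The `Operation` in the sample pairs an `Action`, a `Condition`, and a `Timeout`. The README does not show how the crawler evaluates them, so the following is only a guess at the general pattern: run the action once, then poll the condition until it holds or the time budget runs out. `WaitSketch`, `RunAndWait`, and the `pollIntervalMs` parameter are hypothetical names, and treating the timeout as milliseconds is an assumption.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class WaitSketch
{
    // Runs the action once, then polls the condition until it returns true
    // or timeoutMs has elapsed. Returns true if the condition was met in time.
    public static bool RunAndWait(Action action, Func<bool> condition, int timeoutMs, int pollIntervalMs = 100)
    {
        action();
        var stopwatch = Stopwatch.StartNew();
        while (true)
        {
            try
            {
                if (condition()) return true;
            }
            catch (Exception)
            {
                // Assumption: a throwing condition (for example, an element that
                // does not exist yet) just means "not ready", so keep polling.
            }
            if (stopwatch.ElapsedMilliseconds >= timeoutMs) return false;
            Thread.Sleep(pollIntervalMs);
        }
    }

    public static void Main()
    {
        // Simulate content that becomes available 300 ms after the "click".
        var loadedAt = DateTime.MaxValue;
        bool loaded = RunAndWait(
            action: () => loadedAt = DateTime.UtcNow.AddMilliseconds(300),
            condition: () => DateTime.UtcNow >= loadedAt,
            timeoutMs: 5000);
        Console.WriteLine(loaded ? "content loaded" : "timed out");
    }
}
```

Polling with a deadline keeps the crawl from hanging forever on a page whose Ajax content never arrives, which is presumably why the sample sets `Timeout = 5000`.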