The crawler framework's dependencies are obtained from the open-source author on GitHub; run a local install to package the contents into a jar and place it in the local Maven repository.
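With a standard Maven project layout this is a single command run from the project root; a sketch (the skip-tests flag is optional):

# build all modules and install the resulting jars into the local Maven repository (~/.m2/repository)
mvn clean install -DskipTests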
 
 
 
 
Note on the latest change (commit d3bbece202): PhantomJS support was added for Selenium. The configuration file is config.ini, and the dependencies are updated in pom.xml. SeleniumDownloader and WebDriverPool were updated to support PhantomJS; the versions of GhostDriver, Selenium, and PhantomJS used are stable and validated. A Google Play example is under the samples package: GooglePlayProcessor.java.
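A sketch of how such a downloader is typically plugged into a Spider. The no-argument SeleniumDownloader constructor reading driver settings from config.ini is an assumption here, as are the seed URL and the import of GooglePlayProcessor from the samples package:

import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.downloader.selenium.SeleniumDownloader;

public class PhantomJsExample {
    public static void main(String[] args) {
        // Assumption: a no-argument SeleniumDownloader constructor that reads
        // driver settings (e.g. the PhantomJS binary path) from config.ini.
        Spider.create(new GooglePlayProcessor()) // the sample processor mentioned above
                .setDownloader(new SeleniumDownloader())
                .addUrl("https://play.google.com/store") // hypothetical seed URL
                .thread(5)
                .run();
    }
}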



Readme in Chinese

User Manual (Chinese)


A scalable crawler framework. It covers the whole lifecycle of a crawler: downloading, URL management, content extraction, and persistence. It simplifies the development of a specific crawler.

Features:

  • Simple core with high flexibility.
  • Simple API for HTML extraction.
  • Annotations on POJOs to customize a crawler; no configuration needed.
  • Multi-threading and distributed-crawling support.
  • Easy to integrate.

Install:

Add dependencies to your pom.xml:

<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.5.2</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.5.2</version>
</dependency>

WebMagic uses slf4j with the slf4j-log4j12 implementation. If you use your own slf4j implementation, please exclude slf4j-log4j12:

<!-- place inside the webmagic dependency declarations above -->
<exclusions>
    <exclusion>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
    </exclusion>
</exclusions>

Get Started:

First crawler:

Write a class that implements PageProcessor. For example, here is a crawler for GitHub repository information.

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class GithubRepoPageProcessor implements PageProcessor {

    // crawler settings: retry failed downloads 3 times, wait 1000 ms between requests
    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // queue every repository link found on the page for crawling
        page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
        // extract the author from the URL and the repository name from the HTML
        page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
        page.putField("name", page.getHtml().xpath("//h1[@class='entry-title public']/strong/a/text()").toString());
        if (page.getResultItems().get("name") == null) {
            // no repository name found, so this is not a repository page; skip it
            page.setSkip(true);
        }
        page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        // start crawling from the seed URL with 5 threads
        Spider.create(new GithubRepoPageProcessor()).addUrl("https://github.com/code4craft").thread(5).run();
    }
}
  • page.addTargetRequests(links)

    Add URLs to be crawled.
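Extracted fields go to the console by default; to persist them, attach a Pipeline when building the Spider. A minimal sketch, assuming the JsonFilePipeline shipped in webmagic-extension (the output directory is arbitrary):

import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.JsonFilePipeline;

public class PersistingCrawler {
    public static void main(String[] args) {
        Spider.create(new GithubRepoPageProcessor())
                // write each page's result items as a JSON file under this directory
                .addPipeline(new JsonFilePipeline("/tmp/webmagic"))
                .addUrl("https://github.com/code4craft")
                .thread(5)
                .run();
    }
}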

You can also use the annotation style:

import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.model.OOSpider;
import us.codecraft.webmagic.model.annotation.ExtractBy;
import us.codecraft.webmagic.model.annotation.ExtractByUrl;
import us.codecraft.webmagic.model.annotation.HelpUrl;
import us.codecraft.webmagic.model.annotation.TargetUrl;
import us.codecraft.webmagic.pipeline.ConsolePageModelPipeline;

// pages matching @TargetUrl are extracted into this POJO;
// pages matching @HelpUrl are crawled only to discover links
@TargetUrl("https://github.com/\\w+/\\w+")
@HelpUrl("https://github.com/\\w+")
public class GithubRepo {

    @ExtractBy(value = "//h1[@class='entry-title public']/strong/a/text()", notNull = true)
    private String name;

    @ExtractByUrl("https://github\\.com/(\\w+)/.*")
    private String author;

    @ExtractBy("//div[@id='readme']/tidyText()")
    private String readme;

    public static void main(String[] args) {
        OOSpider.create(Site.me().setSleepTime(1000)
                , new ConsolePageModelPipeline(), GithubRepo.class)
                .addUrl("https://github.com/code4craft").thread(5).run();
    }
}
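Here ConsolePageModelPipeline prints each populated GithubRepo to the console, and notNull = true on the name field skips pages where the extraction returns nothing, mirroring the setSkip(true) check in the first example.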

Docs and samples:

Documents: http://webmagic.io/docs/

The architecture of WebMagic (inspired by Scrapy):

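The main components (downloading, URL management, extraction, persistence) are pluggable. As a sketch, URL management can be swapped by setting a different Scheduler; the FileCacheQueueScheduler below (mentioned in the release notes) and its cache directory are assumptions:

import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.scheduler.FileCacheQueueScheduler;

public class ResumableCrawler {
    public static void main(String[] args) {
        Spider.create(new GithubRepoPageProcessor())
                // keep the URL queue on disk so a crawl can resume after a restart
                .setScheduler(new FileCacheQueueScheduler("/tmp/webmagic/urls"))
                .addUrl("https://github.com/code4craft")
                .run();
    }
}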

Javadocs: http://code4craft.github.io/webmagic/docs/en/

There are some samples in the webmagic-samples package.

License:

Licensed under the Apache 2.0 license.

Contributors:

Thanks to these people for contributing source code, reporting bugs, or suggesting new features:

Thanks:

To write WebMagic, I referred to the projects below:

Mailing list:

https://groups.google.com/forum/#!forum/webmagic-java

http://list.qq.com/cgi-bin/qf_invite?id=023a01f505246785f77c5a5a9aff4e57ab20fcdde871e988

QQ Group: 373225642
