Merge tag 'webmagic-0.4.0' of github.com:code4craft/webmagic

[maven-release-plugin]  copy for tag webmagic-0.4.0

Conflicts:
	pom.xml
	webmagic-core/pom.xml
	webmagic-extension/pom.xml
master
yihua.huang 2013-11-06 22:48:26 +08:00
commit e40b48e77b
49 changed files with 1240 additions and 580 deletions


@@ -25,12 +25,12 @@ Add dependencies to your pom.xml:
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-core</artifactId>
<version>0.3.2</version>
<version>0.4.0</version>
</dependency>
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-extension</artifactId>
<version>0.3.2</version>
<version>0.4.0</version>
</dependency>
## Get Started:


@@ -28,12 +28,12 @@ Add dependencies to your project:
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-core</artifactId>
<version>0.3.2</version>
<version>0.4.0</version>
</dependency>
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-extension</artifactId>
<version>0.3.2</version>
<version>0.4.0</version>
</dependency>
## Get Started:

pom.xml

@@ -6,7 +6,7 @@
<version>7</version>
</parent>
<groupId>us.codecraft</groupId>
<version>0.3.2</version>
<version>0.4.0</version>
<modelVersion>4.0.0</modelVersion>
<packaging>pom</packaging>
<properties>
@@ -36,7 +36,7 @@
<connection>scm:git:git@github.com:code4craft/webmagic.git</connection>
<developerConnection>scm:git:git@github.com:code4craft/webmagic.git</developerConnection>
<url>git@github.com:code4craft/webmagic.git</url>
<tag>HEAD</tag>
<tag>webmagic-0.4.0</tag>
</scm>
<licenses>
<license>
@@ -62,7 +62,12 @@
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.2.4</version>
<version>4.3.1</version>
</dependency>
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>15.0</version>
</dependency>
<dependency>
<groupId>us.codecraft</groupId>


@@ -1,5 +1,7 @@
Release Notes
----
See old versions in [https://github.com/code4craft/webmagic/releases](https://github.com/code4craft/webmagic/releases)
*2012-9-4* `version0.3.0`
* Change default XPath selector from HtmlCleaner to [Xsoup](https://github.com/code4craft/xsoup).


@@ -1,5 +1,5 @@
webmagic User Manual
------
========
>webmagic is an open-source Java vertical crawler framework. Its goal is to simplify crawler development so that developers can focus on the logic. The core of webmagic is very simple, yet it covers the whole crawling workflow, which also makes it good material for learning crawler development.
>Web crawling is a technique; webmagic strives to lower the cost of implementing it. Out of respect for resource providers, however, webmagic does not implement anti-blocking measures such as CAPTCHA cracking, proxy switching, or automatic login.
@@ -16,8 +17,9 @@ webmagic User Manual
<div style="page-break-after:always"></div>
--------
## Quick Start
## Download and Installation
### Using Maven
@@ -26,12 +27,12 @@ webmagic uses Maven to manage dependencies; adding the corresponding dependencies to your project is all it takes to use w
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-core</artifactId>
<version>0.3.2</version>
<version>0.4.0</version>
</dependency>
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-extension</artifactId>
<version>0.3.2</version>
<version>0.4.0</version>
</dependency>
#### Project Structure
@@ -66,9 +67,11 @@ webmagic also includes two usable extension packages; because both of them depend on rather
All the jars the project depends on are in the **bin/lib** directory; simply import them in your IDE.
### The First Crawler
--------
#### Writing a Custom PageProcessor
## The First Crawler
### Writing a Custom PageProcessor
PageProcessor is part of webmagic-core; implementing a custom PageProcessor is all it takes to realize your own crawling logic. Below is a snippet that crawls OSChina blogs:
@@ -137,10 +140,13 @@ webmagic-extension includes an annotation-based way of writing crawlers; just base it on a
This example defines a Model class whose fields 'title', 'content', and 'tags' are the attributes to extract. The class can be reused in a Pipeline.
See the webmagic-extension annotation module later in this manual for detailed annotation usage.
See the webmagic-extension annotation module later in this manual for detailed annotation usage.
<div style="page-break-after:always"></div>
--------
## Modules in Detail
## webmagic-core
@@ -213,7 +219,7 @@ Spider also includes a method test(String url), which crawls only a single
webmagic includes a class for automatic extraction of page body text, **SmartContentSelector**. Anyone who has used Evernote Clearly will be impressed by its automatic body-text extraction, a technique also known as **Readability**. webmagic's implementation of Readability is still rough, but it has some learning value.
webmagic's XPath parsing uses another open-source project by the author, [Xsoup](https://github.com/code4craft/xsoup), an XPath parser based on Jsoup. Xsoup extends XPath syntax with some custom functions.
webmagic's XPath parsing uses another open-source project by the author, [Xsoup](https://github.com/code4craft/xsoup), an XPath parser based on Jsoup. Xsoup extends XPath syntax with some custom functions. These functions are used by appending `/name-of-function()` to the end of an XPath, for example: `"//div[@class='BlogStat']/regex('\\d+-\\d+-\\d+\\s+\\d+:\\d+')"`.
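Xsoup's `regex()` function behaves like standard Java regular-expression matching over the selected element's text. As a rough, self-contained illustration of what the expression above extracts (the sample blog-stat text below is made up, not taken from OSChina):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XsoupRegexDemo {

    // Roughly what regex('\d+-\d+-\d+\s+\d+:\d+') pulls out of the text
    // of //div[@class='BlogStat']: the first date-time-shaped substring.
    static String extractDateTime(String text) {
        Matcher m = Pattern.compile("\\d+-\\d+-\\d+\\s+\\d+:\\d+").matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Hypothetical blog-stat text; only the date-time part matches.
        System.out.println(extractDateTime("posted 2013-11-06 22:48, read 1024 times"));
        // prints: 2013-11-06 22:48
    }
}
```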
<table>
<tr>
@@ -325,6 +331,8 @@ webmagic currently does not support persisting to a database, but combined with other tools, persist
<div style="page-break-after:always"></div>
-----
## webmagic-extension
webmagic-extension provides convenience modules for crawler development. They are built entirely on the webmagic-core framework and include annotation-based crawlers, pagination, distributed crawling, and more.
@@ -355,6 +363,10 @@ webmagic-extension includes an annotation module. Why have an annotation mode?
@ExtractBy(value = "//div[@class='BlogTags']/a/text()", multi = true)
private List<String> tags;
@Formatter("yyyy-MM-dd HH:mm")
@ExtractBy("//div[@class='BlogStat']/regex('\\d+-\\d+-\\d+\\s+\\d+:\\d+')")
private Date date;
public static void main(String[] args) {
OOSpider.create(
Site.me().addStartUrl("http://my.oschina.net/flashsword/blog"),
@@ -395,10 +407,21 @@ webmagic-extension includes an annotation module. Why have an annotation mode?
* #### Type Conversion
webmagic's annotation mode supports type conversion of extraction results, so a result does not have to be a String but can be any type. webmagic has built-in support for the basic types (the extraction result must be convertible to the target type).
```java
@ExtractBy("//ul[@class='pagehead-actions']/li[1]//a[@class='social-count js-social-count']/text()")
private int star;
```
The extraction result can also be of type `java.util.Date`; you just need to specify the date format:
```java
@Formatter("yyyy-MM-dd HH:mm")
@ExtractBy("//div[@class='BlogStat']/regex('\\d+-\\d+-\\d+\\s+\\d+:\\d+')")
private Date date;
```
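For a `Date` field, the `@Formatter` pattern is a standard `SimpleDateFormat` pattern, so the conversion the annotation performs amounts to something like this sketch (the helper name `toDate` is ours, not webmagic's):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;

public class DateFormatterDemo {

    // Roughly what @Formatter("yyyy-MM-dd HH:mm") does with the extracted string.
    static Date toDate(String raw) {
        try {
            return new SimpleDateFormat("yyyy-MM-dd HH:mm").parse(raw);
        } catch (ParseException e) {
            throw new IllegalArgumentException("unparseable date: " + raw, e);
        }
    }

    public static void main(String[] args) {
        Date date = toDate("2013-11-06 22:48");
        Calendar c = Calendar.getInstance();
        c.setTime(date);
        System.out.println(c.get(Calendar.YEAR)); // prints: 2013
    }
}
```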
You can also write a class implementing the `ObjectFormatter` interface to perform your own type parsing. To use such a class, register it by calling `ObjectFormatters.put()`.
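The register-then-convert pattern can be sketched with a simplified stand-in (webmagic's real `ObjectFormatter` interface has additional methods and its registration API differs; the interface, `put`, and `convert` below are our own simplified analogs):

```java
import java.util.HashMap;
import java.util.Map;

public class FormatterRegistryDemo {

    // Simplified analog of webmagic's ObjectFormatter: one parse method.
    interface Formatter<T> {
        T format(String raw);
    }

    // Simplified analog of ObjectFormatters.put(): formatters keyed by target type.
    static final Map<Class<?>, Formatter<?>> REGISTRY = new HashMap<>();

    static <T> void put(Class<T> type, Formatter<T> formatter) {
        REGISTRY.put(type, formatter);
    }

    @SuppressWarnings("unchecked")
    static <T> T convert(String raw, Class<T> type) {
        Formatter<T> f = (Formatter<T>) REGISTRY.get(type);
        return f == null ? null : f.format(raw);
    }

    public static void main(String[] args) {
        // A custom parser for integers with thousands separators.
        put(Integer.class, raw -> Integer.valueOf(raw.replace(",", "")));
        System.out.println(convert("1,240", Integer.class)); // prints: 1240
    }
}
```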
* #### AfterExtractor


@@ -3,7 +3,7 @@
<parent>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-parent</artifactId>
<version>0.3.2</version>
<version>0.4.0</version>
</parent>
<modelVersion>4.0.0</modelVersion>
@@ -20,6 +20,12 @@
<artifactId>junit</artifactId>
</dependency>
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>15.0</version>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>


@@ -68,4 +68,13 @@ public class ResultItems {
this.skip = skip;
return this;
}
@Override
public String toString() {
return "ResultItems{" +
"fields=" + fields +
", request=" + request +
", skip=" + skip +
'}';
}
}


@@ -1,5 +1,6 @@
package us.codecraft.webmagic;
import org.apache.http.HttpHost;
import us.codecraft.webmagic.utils.UrlUtils;
import java.util.*;
@@ -8,8 +9,8 @@ import java.util.*;
* Object contains setting for crawler.<br>
*
* @author code4crafter@gmail.com <br>
* @since 0.1.0
* @see us.codecraft.webmagic.processor.PageProcessor
* @since 0.1.0
*/
public class Site {
@@ -24,18 +25,32 @@ public class Site {
/**
* startUrls is the urls the crawler to start with.
*/
private List<String> startUrls = new ArrayList<String>();
private List<Request> startRequests = new ArrayList<Request>();
private int sleepTime = 3000;
private int sleepTime = 5000;
private int retryTimes = 0;
private int cycleRetryTimes = 0;
private int timeOut = 5000;
private static final Set<Integer> DEFAULT_STATUS_CODE_SET = new HashSet<Integer>();
private Set<Integer> acceptStatCode = DEFAULT_STATUS_CODE_SET;
private Map<String, String> headers = new HashMap<String, String>();
private HttpHost httpProxy;
private boolean useGzip = true;
public static interface HeaderConst {
public static final String REFERER = "Referer";
}
static {
DEFAULT_STATUS_CODE_SET.add(200);
}
@@ -131,6 +146,20 @@ public class Site {
return charset;
}
public int getTimeOut() {
return timeOut;
}
/**
* set timeout for downloader in ms
*
* @param timeOut
*/
public Site setTimeOut(int timeOut) {
this.timeOut = timeOut;
return this;
}
/**
* Set acceptStatCode.<br>
* When status code of http response is in acceptStatCodes, it will be processed.<br>
@@ -158,23 +187,44 @@
* get start urls
*
* @return start urls
* @see #getStartRequests
* @deprecated
*/
@Deprecated
public List<String> getStartUrls() {
return startUrls;
return UrlUtils.convertToUrls(startRequests);
}
public List<Request> getStartRequests() {
return startRequests;
}
/**
* Add a url to start url.<br>
* Because urls are more a Spider's property than Site, move it to {@link Spider#addUrl(String...)}}
*
* @deprecated
* @see Spider#addUrl(String...)
* @param startUrl
* @return this
*/
public Site addStartUrl(String startUrl) {
this.startUrls.add(startUrl);
if (domain == null) {
if (startUrls.size() > 0) {
domain = UrlUtils.getDomain(startUrls.get(0));
}
return addStartRequest(new Request(startUrl));
}
/**
* Add a url to start url.<br>
* Because urls are more a Spider's property than Site, move it to {@link Spider#addRequest(Request...)}}
*
* @deprecated
* @see Spider#addRequest(Request...)
* @param startUrl
* @return this
*/
public Site addStartRequest(Request startRequest) {
this.startRequests.add(startRequest);
if (domain == null && startRequest.getUrl() != null) {
domain = UrlUtils.getDomain(startRequest.getUrl());
}
return this;
}
@@ -202,7 +252,7 @@
}
/**
* Get retry times when download fail immediately, 0 by default.<br>
* Get retry times immediately when download fail, 0 by default.<br>
*
* @return retry times when download fail
*/
@@ -210,6 +260,23 @@
return retryTimes;
}
public Map<String, String> getHeaders() {
return headers;
}
/**
* Put an Http header for downloader. <br/>
* Use {@link #addCookie(String, String)} for cookie and {@link #setUserAgent(String)} for user-agent. <br/>
*
* @param key key of http header, there are some keys constant in {@link HeaderConst}
* @param value value of header
* @return
*/
public Site addHeader(String key, String value) {
headers.put(key, value);
return this;
}
/**
* Set retry times when download fail, 0 by default.<br>
*
@@ -239,21 +306,34 @@
return this;
}
@Override
public boolean equals(Object o) {
if (this == o) return true;
if (o == null || getClass() != o.getClass()) return false;
public HttpHost getHttpProxy() {
return httpProxy;
}
Site site = (Site) o;
/**
* set up httpProxy for this site
* @param httpProxy
* @return
*/
public Site setHttpProxy(HttpHost httpProxy) {
this.httpProxy = httpProxy;
return this;
}
if (acceptStatCode != null ? !acceptStatCode.equals(site.acceptStatCode) : site.acceptStatCode != null)
return false;
if (!domain.equals(site.domain)) return false;
if (!startUrls.equals(site.startUrls)) return false;
if (charset != null ? !charset.equals(site.charset) : site.charset != null) return false;
if (userAgent != null ? !userAgent.equals(site.userAgent) : site.userAgent != null) return false;
public boolean isUseGzip() {
return useGzip;
}
return true;
/**
* Whether use gzip. <br>
* Default is true, you can set it to false to disable gzip.
*
* @param useGzip
* @return
*/
public Site setUseGzip(boolean useGzip) {
this.useGzip = useGzip;
return this;
}
public Task toTask() {
@@ -270,13 +350,60 @@ public class Site {
};
}
@Override
public boolean equals(Object o) {
if (this == o) return true;
if (o == null || getClass() != o.getClass()) return false;
Site site = (Site) o;
if (cycleRetryTimes != site.cycleRetryTimes) return false;
if (retryTimes != site.retryTimes) return false;
if (sleepTime != site.sleepTime) return false;
if (timeOut != site.timeOut) return false;
if (acceptStatCode != null ? !acceptStatCode.equals(site.acceptStatCode) : site.acceptStatCode != null)
return false;
if (charset != null ? !charset.equals(site.charset) : site.charset != null) return false;
if (cookies != null ? !cookies.equals(site.cookies) : site.cookies != null) return false;
if (domain != null ? !domain.equals(site.domain) : site.domain != null) return false;
if (headers != null ? !headers.equals(site.headers) : site.headers != null) return false;
if (startRequests != null ? !startRequests.equals(site.startRequests) : site.startRequests != null)
return false;
if (userAgent != null ? !userAgent.equals(site.userAgent) : site.userAgent != null) return false;
return true;
}
@Override
public int hashCode() {
int result = domain.hashCode();
result = 31 * result + (startUrls != null ? startUrls.hashCode() : 0);
int result = domain != null ? domain.hashCode() : 0;
result = 31 * result + (userAgent != null ? userAgent.hashCode() : 0);
result = 31 * result + (cookies != null ? cookies.hashCode() : 0);
result = 31 * result + (charset != null ? charset.hashCode() : 0);
result = 31 * result + (startRequests != null ? startRequests.hashCode() : 0);
result = 31 * result + sleepTime;
result = 31 * result + retryTimes;
result = 31 * result + cycleRetryTimes;
result = 31 * result + timeOut;
result = 31 * result + (acceptStatCode != null ? acceptStatCode.hashCode() : 0);
result = 31 * result + (headers != null ? headers.hashCode() : 0);
return result;
}
@Override
public String toString() {
return "Site{" +
"domain='" + domain + '\'' +
", userAgent='" + userAgent + '\'' +
", cookies=" + cookies +
", charset='" + charset + '\'' +
", startRequests=" + startRequests +
", sleepTime=" + sleepTime +
", retryTimes=" + retryTimes +
", cycleRetryTimes=" + cycleRetryTimes +
", timeOut=" + timeOut +
", acceptStatCode=" + acceptStatCode +
", headers=" + headers +
'}';
}
}


@@ -1,9 +1,12 @@
package us.codecraft.webmagic;
import com.google.common.collect.Lists;
import org.apache.commons.collections.CollectionUtils;
import org.apache.log4j.Logger;
import us.codecraft.webmagic.downloader.Downloader;
import us.codecraft.webmagic.downloader.HttpClientDownloader;
import us.codecraft.webmagic.pipeline.CollectorPipeline;
import us.codecraft.webmagic.pipeline.ResultItemsCollectorPipeline;
import us.codecraft.webmagic.pipeline.ConsolePipeline;
import us.codecraft.webmagic.pipeline.Pipeline;
import us.codecraft.webmagic.processor.PageProcessor;
@@ -11,13 +14,18 @@ import us.codecraft.webmagic.scheduler.QueueScheduler;
import us.codecraft.webmagic.scheduler.Scheduler;
import us.codecraft.webmagic.utils.EnvironmentUtil;
import us.codecraft.webmagic.utils.ThreadUtils;
import us.codecraft.webmagic.utils.UrlUtils;
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;
/**
* Entrance of a crawler.<br>
@@ -52,381 +60,520 @@
*/
public class Spider implements Runnable, Task {
protected Downloader downloader;
protected Downloader downloader;
protected List<Pipeline> pipelines = new ArrayList<Pipeline>();
protected List<Pipeline> pipelines = new ArrayList<Pipeline>();
protected PageProcessor pageProcessor;
protected PageProcessor pageProcessor;
protected List<String> startUrls;
protected List<Request> startRequests;
protected Site site;
protected Site site;
protected String uuid;
protected String uuid;
protected Scheduler scheduler = new QueueScheduler();
protected Scheduler scheduler = new QueueScheduler();
protected Logger logger = Logger.getLogger(getClass());
protected Logger logger = Logger.getLogger(getClass());
protected ExecutorService executorService;
protected ExecutorService executorService;
protected int threadNum = 1;
protected int threadNum = 1;
protected AtomicInteger stat = new AtomicInteger(STAT_INIT);
protected AtomicInteger stat = new AtomicInteger(STAT_INIT);
protected final static int STAT_INIT = 0;
protected boolean exitWhenComplete = true;
protected final static int STAT_RUNNING = 1;
protected final static int STAT_INIT = 0;
protected final static int STAT_STOPPED = 2;
protected final static int STAT_RUNNING = 1;
/**
* create a spider with pageProcessor.
*
* @param pageProcessor
* @return new spider
* @see PageProcessor
*/
public static Spider create(PageProcessor pageProcessor) {
return new Spider(pageProcessor);
}
protected final static int STAT_STOPPED = 2;
/**
* create a spider with pageProcessor.
*
* @param pageProcessor
*/
public Spider(PageProcessor pageProcessor) {
this.pageProcessor = pageProcessor;
this.site = pageProcessor.getSite();
this.startUrls = pageProcessor.getSite().getStartUrls();
}
protected boolean spawnUrl = true;
/**
* Set startUrls of Spider.<br>
* Prior to startUrls of Site.
*
* @param startUrls
* @return this
*/
public Spider startUrls(List<String> startUrls) {
checkIfRunning();
this.startUrls = startUrls;
return this;
}
protected boolean destroyWhenExit = true;
/**
* Set an uuid for spider.<br>
* Default uuid is domain of site.<br>
*
* @param uuid
* @return this
*/
public Spider setUUID(String uuid) {
this.uuid = uuid;
return this;
}
private ReentrantLock newUrlLock = new ReentrantLock();
/**
* set scheduler for Spider
*
* @param scheduler
* @return this
* @Deprecated
* @see #setScheduler(us.codecraft.webmagic.scheduler.Scheduler)
*/
public Spider scheduler(Scheduler scheduler) {
return setScheduler(scheduler);
}
private Condition newUrlCondition = newUrlLock.newCondition();
/**
* set scheduler for Spider
*
* @param scheduler
* @return this
* @see Scheduler
* @since 0.2.1
*/
public Spider setScheduler(Scheduler scheduler) {
checkIfRunning();
this.scheduler = scheduler;
return this;
}
/**
* create a spider with pageProcessor.
*
* @param pageProcessor
* @return new spider
* @see PageProcessor
*/
public static Spider create(PageProcessor pageProcessor) {
return new Spider(pageProcessor);
}
/**
* add a pipeline for Spider
*
* @param pipeline
* @return this
* @see #setPipeline(us.codecraft.webmagic.pipeline.Pipeline)
* @deprecated
*/
public Spider pipeline(Pipeline pipeline) {
return addPipeline(pipeline);
}
/**
* create a spider with pageProcessor.
*
* @param pageProcessor
*/
public Spider(PageProcessor pageProcessor) {
this.pageProcessor = pageProcessor;
this.site = pageProcessor.getSite();
this.startRequests = pageProcessor.getSite().getStartRequests();
}
/**
* add a pipeline for Spider
*
* @param pipeline
* @return this
* @see Pipeline
* @since 0.2.1
*/
public Spider addPipeline(Pipeline pipeline) {
checkIfRunning();
this.pipelines.add(pipeline);
return this;
}
/**
* Set startUrls of Spider.<br>
* Prior to startUrls of Site.
*
* @param startUrls
* @return this
*/
public Spider startUrls(List<String> startUrls) {
checkIfRunning();
this.startRequests = UrlUtils.convertToRequests(startUrls);
return this;
}
/**
* clear the pipelines set
*
* @return this
*/
public Spider clearPipeline() {
pipelines = new ArrayList<Pipeline>();
return this;
}
/**
* Set startUrls of Spider.<br>
* Prior to startUrls of Site.
*
* @param startUrls
* @return this
*/
public Spider startRequest(List<Request> startRequests) {
checkIfRunning();
this.startRequests = startRequests;
return this;
}
/**
* set the downloader of spider
*
* @param downloader
* @return this
* @see #setDownloader(us.codecraft.webmagic.downloader.Downloader)
* @deprecated
*/
public Spider downloader(Downloader downloader) {
return setDownloader(downloader);
}
/**
* Set an uuid for spider.<br>
* Default uuid is domain of site.<br>
*
* @param uuid
* @return this
*/
public Spider setUUID(String uuid) {
this.uuid = uuid;
return this;
}
/**
* set the downloader of spider
*
* @param downloader
* @return this
* @see Downloader
*/
public Spider setDownloader(Downloader downloader) {
checkIfRunning();
this.downloader = downloader;
return this;
}
/**
* set scheduler for Spider
*
* @param scheduler
* @return this
* @Deprecated
* @see #setScheduler(us.codecraft.webmagic.scheduler.Scheduler)
*/
public Spider scheduler(Scheduler scheduler) {
return setScheduler(scheduler);
}
protected void checkComponent() {
if (downloader == null) {
this.downloader = new HttpClientDownloader();
}
if (pipelines.isEmpty()) {
pipelines.add(new ConsolePipeline());
}
downloader.setThread(threadNum);
}
/**
* set scheduler for Spider
*
* @param scheduler
* @return this
* @see Scheduler
* @since 0.2.1
*/
public Spider setScheduler(Scheduler scheduler) {
checkIfRunning();
this.scheduler = scheduler;
return this;
}
@Override
public void run() {
if (!stat.compareAndSet(STAT_INIT, STAT_RUNNING) && !stat.compareAndSet(STAT_STOPPED, STAT_RUNNING)) {
throw new IllegalStateException("Spider is already running!");
}
checkComponent();
if (startUrls != null) {
for (String startUrl : startUrls) {
scheduler.push(new Request(startUrl), this);
}
startUrls.clear();
}
Request request = scheduler.poll(this);
/**
* add a pipeline for Spider
*
* @param pipeline
* @return this
* @see #setPipeline(us.codecraft.webmagic.pipeline.Pipeline)
* @deprecated
*/
public Spider pipeline(Pipeline pipeline) {
return addPipeline(pipeline);
}
/**
* add a pipeline for Spider
*
* @param pipeline
* @return this
* @see Pipeline
* @since 0.2.1
*/
public Spider addPipeline(Pipeline pipeline) {
checkIfRunning();
this.pipelines.add(pipeline);
return this;
}
/**
* clear the pipelines set
*
* @return this
*/
public Spider clearPipeline() {
pipelines = new ArrayList<Pipeline>();
return this;
}
/**
* set the downloader of spider
*
* @param downloader
* @return this
* @see #setDownloader(us.codecraft.webmagic.downloader.Downloader)
* @deprecated
*/
public Spider downloader(Downloader downloader) {
return setDownloader(downloader);
}
/**
* set the downloader of spider
*
* @param downloader
* @return this
* @see Downloader
*/
public Spider setDownloader(Downloader downloader) {
checkIfRunning();
this.downloader = downloader;
return this;
}
protected void initComponent() {
if (downloader == null) {
this.downloader = new HttpClientDownloader();
}
if (pipelines.isEmpty()) {
pipelines.add(new ConsolePipeline());
}
downloader.setThread(threadNum);
if (executorService == null || executorService.isShutdown()) {
executorService = ThreadUtils.newFixedThreadPool(threadNum);
}
if (startRequests != null) {
for (Request request : startRequests) {
scheduler.push(request, this);
}
startRequests.clear();
}
}
@Override
public void run() {
checkRunningStat();
initComponent();
logger.info("Spider " + getUUID() + " started!");
// single thread
if (threadNum <= 1) {
while (request != null && stat.compareAndSet(STAT_RUNNING, STAT_RUNNING)) {
processRequest(request);
request = scheduler.poll(this);
}
} else {
synchronized (this) {
this.executorService = ThreadUtils.newFixedThreadPool(threadNum);
}
// multi thread
final AtomicInteger threadAlive = new AtomicInteger(0);
while (true && stat.compareAndSet(STAT_RUNNING, STAT_RUNNING)) {
if (request == null) {
// when no request found but some thread is alive, sleep a
// while.
try {
Thread.sleep(100);
} catch (InterruptedException e) {
}
} else {
final Request requestFinal = request;
threadAlive.incrementAndGet();
executorService.execute(new Runnable() {
@Override
public void run() {
processRequest(requestFinal);
threadAlive.decrementAndGet();
}
});
}
request = scheduler.poll(this);
if (threadAlive.get() == 0) {
request = scheduler.poll(this);
if (request == null) {
break;
}
}
}
executorService.shutdown();
}
stat.compareAndSet(STAT_RUNNING, STAT_STOPPED);
// release some resources
destroy();
}
final AtomicInteger threadAlive = new AtomicInteger(0);
while (!Thread.currentThread().isInterrupted() && stat.get() == STAT_RUNNING) {
Request request = scheduler.poll(this);
if (request == null) {
if (threadAlive.get() == 0 && exitWhenComplete) {
break;
}
// wait until new url added
waitNewUrl();
} else {
final Request requestFinal = request;
threadAlive.incrementAndGet();
executorService.execute(new Runnable() {
@Override
public void run() {
try {
processRequest(requestFinal);
} catch (Exception e) {
logger.error("download " + requestFinal + " error", e);
} finally {
threadAlive.decrementAndGet();
signalNewUrl();
}
}
});
}
}
stat.set(STAT_STOPPED);
// release some resources
if (destroyWhenExit) {
close();
}
}
protected void destroy() {
destroyEach(downloader);
destroyEach(pageProcessor);
for (Pipeline pipeline : pipelines) {
destroyEach(pipeline);
}
}
private void checkRunningStat() {
while (true) {
int statNow = stat.get();
if (statNow == STAT_RUNNING) {
throw new IllegalStateException("Spider is already running!");
}
if (stat.compareAndSet(statNow, STAT_RUNNING)) {
break;
}
}
}
private void destroyEach(Object object) {
if (object instanceof Closeable) {
try {
((Closeable) object).close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
public void close() {
destroyEach(downloader);
destroyEach(pageProcessor);
for (Pipeline pipeline : pipelines) {
destroyEach(pipeline);
}
executorService.shutdown();
}
/**
* Process specific urls without url discovering.
*
* @param urls
* urls to process
*/
public void test(String... urls) {
checkComponent();
if (urls.length > 0) {
for (String url : urls) {
processRequest(new Request(url));
}
}
}
private void destroyEach(Object object) {
if (object instanceof Closeable) {
try {
((Closeable) object).close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
protected void processRequest(Request request) {
Page page = downloader.download(request, this);
if (page == null) {
sleep(site.getSleepTime());
return;
}
// for cycle retry
if (page.getHtml() == null) {
addRequest(page);
sleep(site.getSleepTime());
return;
}
pageProcessor.process(page);
addRequest(page);
if (!page.getResultItems().isSkip()) {
for (Pipeline pipeline : pipelines) {
pipeline.process(page.getResultItems(), this);
}
}
sleep(site.getSleepTime());
}
/**
* Process specific urls without url discovering.
*
* @param urls urls to process
*/
public void test(String... urls) {
initComponent();
if (urls.length > 0) {
for (String url : urls) {
processRequest(new Request(url));
}
}
}
protected void sleep(int time) {
try {
Thread.sleep(time);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
protected void processRequest(Request request) {
Page page = downloader.download(request, this);
if (page == null) {
sleep(site.getSleepTime());
return;
}
// for cycle retry
if (page.getHtml() == null) {
extractAndAddRequests(page);
sleep(site.getSleepTime());
return;
}
pageProcessor.process(page);
extractAndAddRequests(page);
if (!page.getResultItems().isSkip()) {
for (Pipeline pipeline : pipelines) {
pipeline.process(page.getResultItems(), this);
}
}
sleep(site.getSleepTime());
}
protected void addRequest(Page page) {
if (CollectionUtils.isNotEmpty(page.getTargetRequests())) {
for (Request request : page.getTargetRequests()) {
scheduler.push(request, this);
}
}
}
protected void sleep(int time) {
try {
Thread.sleep(time);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
protected void checkIfRunning() {
if (!stat.compareAndSet(STAT_INIT, STAT_INIT) && !stat.compareAndSet(STAT_STOPPED, STAT_STOPPED)) {
throw new IllegalStateException("Spider is already running!");
}
}
protected void extractAndAddRequests(Page page) {
if (spawnUrl && CollectionUtils.isNotEmpty(page.getTargetRequests())) {
for (Request request : page.getTargetRequests()) {
addRequest(request);
}
}
}
public void runAsync() {
Thread thread = new Thread(this);
thread.setDaemon(false);
thread.start();
}
private void addRequest(Request request) {
if (site.getDomain() == null && request != null && request.getUrl() != null) {
site.setDomain(UrlUtils.getDomain(request.getUrl()));
}
scheduler.push(request, this);
}
public void start() {
runAsync();
}
protected void checkIfRunning() {
if (stat.get() == STAT_RUNNING) {
throw new IllegalStateException("Spider is already running!");
}
}
public void stop() {
if (stat.compareAndSet(STAT_RUNNING, STAT_STOPPED)) {
if (executorService != null) {
executorService.shutdown();
}
logger.info("Spider " + getUUID() + " stop success!");
} else {
logger.info("Spider " + getUUID() + " stop fail!");
}
}
public void runAsync() {
Thread thread = new Thread(this);
thread.setDaemon(false);
thread.start();
}
public void stopAndDestroy() {
stop();
destroy();
}
/**
* Add urls to crawl. <br/>
*
* @param urls
* @return
*/
public Spider addUrl(String... urls) {
for (String url : urls) {
addRequest(new Request(url));
}
signalNewUrl();
return this;
}
/**
* start with more than one threads
*
* @param threadNum
* @return this
*/
public Spider thread(int threadNum) {
checkIfRunning();
this.threadNum = threadNum;
if (threadNum <= 0) {
throw new IllegalArgumentException("threadNum should be more than one!");
}
if (threadNum == 1) {
return this;
}
return this;
}
/**
* Download urls synchronizing.
*
* @param urls
* @return
*/
public <T> List<T> getAll(Collection<String> urls) {
destroyWhenExit = false;
spawnUrl = false;
startRequests.clear();
for (Request request : UrlUtils.convertToRequests(urls)) {
addRequest(request);
}
CollectorPipeline collectorPipeline = getCollectorPipeline();
pipelines.add(collectorPipeline);
run();
spawnUrl = true;
destroyWhenExit = true;
return collectorPipeline.getCollected();
}
/**
* switch off xsoup
*
* @return
*/
public static void xsoupOff() {
EnvironmentUtil.setUseXsoup(false);
}
protected CollectorPipeline getCollectorPipeline() {
return new ResultItemsCollectorPipeline();
}
@Override
public String getUUID() {
if (uuid != null) {
return uuid;
}
if (site != null) {
return site.getDomain();
}
return null;
}
public <T> T get(String url) {
List<String> urls = Lists.newArrayList(url);
List<T> resultItemses = getAll(urls);
if (resultItemses != null && resultItemses.size() > 0) {
return resultItemses.get(0);
} else {
return null;
}
}
@Override
public Site getSite() {
return site;
}
/**
* Add urls with information to crawl.<br/>
*
* @param urls
* @return
*/
public Spider addRequest(Request... requests) {
for (Request request : requests) {
addRequest(request);
}
signalNewUrl();
return this;
}
private void waitNewUrl() {
try {
newUrlLock.lock();
try {
newUrlCondition.await();
} catch (InterruptedException e) {
}
} finally {
newUrlLock.unlock();
}
}
private void signalNewUrl() {
try {
newUrlLock.lock();
newUrlCondition.signalAll();
} finally {
newUrlLock.unlock();
}
}
public void start() {
runAsync();
}
public void stop() {
if (stat.compareAndSet(STAT_RUNNING, STAT_STOPPED)) {
logger.info("Spider " + getUUID() + " stop success!");
} else {
logger.info("Spider " + getUUID() + " stop fail!");
}
}
/**
* start with more than one threads
*
* @param threadNum
* @return this
*/
public Spider thread(int threadNum) {
checkIfRunning();
this.threadNum = threadNum;
if (threadNum <= 0) {
throw new IllegalArgumentException("threadNum should be more than one!");
}
return this;
}
/**
* switch off xsoup
*
* @return
*/
public static void xsoupOff() {
EnvironmentUtil.setUseXsoup(false);
}
public boolean isExitWhenComplete() {
return exitWhenComplete;
}
/**
* Exit when complete. <br/>
* True: exit when all url of the site is downloaded. <br/>
* False: not exit until call stop() manually.<br/>
*
* @param exitWhenComplete
* @return
*/
public Spider setExitWhenComplete(boolean exitWhenComplete) {
this.exitWhenComplete = exitWhenComplete;
return this;
}
public boolean isSpawnUrl() {
return spawnUrl;
}
/**
* Whether add urls extracted to download.<br>
* Add urls to download when it is true, and just download seed urls when it is false. <br>
* DO NOT set it unless you know what it means!
*
* @param spawnUrl
* @return
* @since 0.4.0
*/
public Spider setSpawnUrl(boolean spawnUrl) {
this.spawnUrl = spawnUrl;
return this;
}
@Override
public String getUUID() {
if (uuid != null) {
return uuid;
}
if (site != null) {
return site.getDomain();
}
uuid = UUID.randomUUID().toString();
return uuid;
}
@Override
public Site getSite() {
return site;
}
}


@@ -1,13 +1,15 @@
package us.codecraft.webmagic.downloader;
import com.google.common.collect.Sets;
import org.apache.commons.io.IOUtils;
import org.apache.http.Header;
import org.apache.http.HeaderElement;
import org.apache.http.HttpResponse;
import org.apache.http.annotation.ThreadSafe;
import org.apache.http.client.HttpClient;
import org.apache.http.client.entity.GzipDecompressingEntity;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.config.CookieSpecs;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.RequestBuilder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.util.EntityUtils;
import org.apache.log4j.Logger;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Request;
@@ -18,7 +20,8 @@ import us.codecraft.webmagic.selector.PlainText;
import us.codecraft.webmagic.utils.UrlUtils;
import java.io.IOException;
import java.util.HashSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
@ -33,7 +36,9 @@ public class HttpClientDownloader implements Downloader {
private Logger logger = Logger.getLogger(getClass());
private int poolSize = 1;
private final Map<String, CloseableHttpClient> httpClients = new HashMap<String, CloseableHttpClient>();
private HttpClientGenerator httpClientGenerator = new HttpClientGenerator();
/**
* A simple method to download a url.
@ -57,63 +62,59 @@ public class HttpClientDownloader implements Downloader {
return (Html) page.getHtml();
}
private CloseableHttpClient getHttpClient(Site site) {
if (site == null) {
return httpClientGenerator.getClient(null);
}
String domain = site.getDomain();
CloseableHttpClient httpClient = httpClients.get(domain);
if (httpClient == null) {
synchronized (this) {
//re-read under the lock: the local copy may be stale
httpClient = httpClients.get(domain);
if (httpClient == null) {
httpClient = httpClientGenerator.getClient(site);
httpClients.put(domain, httpClient);
}
}
}
return httpClient;
}
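The per-domain client cache above is classic double-checked lazy initialization keyed by domain. On Java 8+ the same per-key laziness can be expressed with `ConcurrentHashMap.computeIfAbsent`, which makes the at-most-once construction explicit; a minimal standalone sketch (`PerKeyCache` and `newClient` are illustrative stand-ins, not webmagic API):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

public class PerKeyCache {
    private final ConcurrentMap<String, Object> clients = new ConcurrentHashMap<>();
    private final AtomicInteger created = new AtomicInteger();

    // Stand-in for httpClientGenerator.getClient(site); counts constructions.
    private Object newClient(String domain) {
        created.incrementAndGet();
        return new Object();
    }

    // Atomic per-key lazy init: the factory runs at most once per domain.
    public Object getClient(String domain) {
        return clients.computeIfAbsent(domain, this::newClient);
    }

    public int createdCount() {
        return created.get();
    }

    public static void main(String[] args) {
        PerKeyCache cache = new PerKeyCache();
        Object a = cache.getClient("example.com");
        Object b = cache.getClient("example.com");
        if (a != b) throw new AssertionError("same client instance expected");
        if (cache.createdCount() != 1) throw new AssertionError("factory should run once");
    }
}
```

`computeIfAbsent` locks only the affected bin, so lookups for unrelated domains do not contend on a single monitor the way `synchronized (this)` does.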
@Override
public Page download(Request request, Task task) {
Site site = null;
if (task != null) {
site = task.getSite();
}
int retryTimes = 0;
Set<Integer> acceptStatCode;
String charset = null;
Map<String, String> headers = null;
if (site != null) {
retryTimes = site.getRetryTimes();
acceptStatCode = site.getAcceptStatCode();
charset = site.getCharset();
headers = site.getHeaders();
} else {
acceptStatCode = new HashSet<Integer>();
acceptStatCode.add(200);
acceptStatCode = Sets.newHashSet(200);
}
logger.info("downloading page " + request.getUrl());
HttpClient httpClient = HttpClientPool.getInstance(poolSize).getClient(site);
RequestBuilder requestBuilder = RequestBuilder.get().setUri(request.getUrl());
if (headers != null) {
for (Map.Entry<String, String> headerEntry : headers.entrySet()) {
requestBuilder.addHeader(headerEntry.getKey(), headerEntry.getValue());
}
}
RequestConfig.Builder requestConfigBuilder = RequestConfig.custom()
.setCookieSpec(CookieSpecs.BEST_MATCH);
if (site != null) {
//site may be null when no task is given, so guard before dereferencing
requestConfigBuilder.setConnectionRequestTimeout(site.getTimeOut())
.setConnectTimeout(site.getTimeOut());
if (site.getHttpProxy() != null) {
requestConfigBuilder.setProxy(site.getHttpProxy());
}
}
requestBuilder.setConfig(requestConfigBuilder.build());
CloseableHttpResponse httpResponse = null;
try {
HttpGet httpGet = new HttpGet(request.getUrl());
HttpResponse httpResponse = null;
int tried = 0;
boolean retry;
do {
try {
httpResponse = httpClient.execute(httpGet);
retry = false;
} catch (IOException e) {
tried++;
if (tried > retryTimes) {
logger.warn("download page " + request.getUrl() + " error", e);
if (site.getCycleRetryTimes() > 0) {
Page page = new Page();
Object cycleTriedTimesObject = request.getExtra(Request.CYCLE_TRIED_TIMES);
if (cycleTriedTimesObject == null) {
page.addTargetRequest(request.setPriority(0).putExtra(Request.CYCLE_TRIED_TIMES, 1));
} else {
int cycleTriedTimes = (Integer) cycleTriedTimesObject;
cycleTriedTimes++;
if (cycleTriedTimes >= site.getCycleRetryTimes()) {
return null;
}
page.addTargetRequest(request.setPriority(0).putExtra(Request.CYCLE_TRIED_TIMES, 1));
}
return page;
}
return null;
}
logger.info("download page " + request.getUrl() + " error, retry the " + tried + " time!");
retry = true;
}
} while (retry);
httpResponse = getHttpClient(site).execute(requestBuilder.build());
int statusCode = httpResponse.getStatusLine().getStatusCode();
if (acceptStatCode.contains(statusCode)) {
handleGzip(httpResponse);
//charset
if (charset == null) {
String value = httpResponse.getEntity().getContentType().getValue();
@ -122,16 +123,44 @@ public class HttpClientDownloader implements Downloader {
return handleResponse(request, charset, httpResponse, task);
} else {
logger.warn("code error " + statusCode + "\t" + request.getUrl());
return null;
}
} catch (Exception e) {
} catch (IOException e) {
logger.warn("download page " + request.getUrl() + " error", e);
if (site != null && site.getCycleRetryTimes() > 0) {
return addToCycleRetry(request, site);
}
return null;
} finally {
try {
if (httpResponse != null) {
//ensure the connection is released back to pool
EntityUtils.consume(httpResponse.getEntity());
}
} catch (IOException e) {
logger.warn("close response fail", e);
}
}
return null;
}
private Page addToCycleRetry(Request request, Site site) {
Page page = new Page();
Object cycleTriedTimesObject = request.getExtra(Request.CYCLE_TRIED_TIMES);
if (cycleTriedTimesObject == null) {
page.addTargetRequest(request.setPriority(0).putExtra(Request.CYCLE_TRIED_TIMES, 1));
} else {
int cycleTriedTimes = (Integer) cycleTriedTimesObject;
cycleTriedTimes++;
if (cycleTriedTimes >= site.getCycleRetryTimes()) {
return null;
}
page.addTargetRequest(request.setPriority(0).putExtra(Request.CYCLE_TRIED_TIMES, cycleTriedTimes));
}
return page;
}
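The cycle-retry bookkeeping in addToCycleRetry reduces to a small counter: a first failure records 1, each later failure increments the recorded count, and the request is dropped once the budget is exhausted. A standalone sketch of that counting (names are illustrative, not webmagic API):

```java
public class CycleRetryCounter {
    // Returns the next tried-times value to record on the request,
    // or -1 when the retry budget is exhausted and the request should be dropped.
    // Mirrors the intended bookkeeping of addToCycleRetry above.
    public static int nextTriedTimes(Integer currentTriedTimes, int cycleRetryTimes) {
        if (currentTriedTimes == null) {
            return 1; // first failure: schedule the first cycle retry
        }
        int next = currentTriedTimes + 1;
        return next >= cycleRetryTimes ? -1 : next;
    }

    public static void main(String[] args) {
        // With a budget of 3 cycle retries: two more attempts pass, then the request is dropped.
        if (nextTriedTimes(null, 3) != 1) throw new AssertionError();
        if (nextTriedTimes(1, 3) != 2) throw new AssertionError();
        if (nextTriedTimes(2, 3) != -1) throw new AssertionError();
    }
}
```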
protected Page handleResponse(Request request, String charset, HttpResponse httpResponse, Task task) throws IOException {
String content = IOUtils.toString(httpResponse.getEntity().getContent(),
charset);
String content = IOUtils.toString(httpResponse.getEntity().getContent(), charset);
Page page = new Page();
page.setHtml(new Html(UrlUtils.fixAllRelativeHrefs(content, request.getUrl())));
page.setUrl(new PlainText(request.getUrl()));
@ -141,19 +170,6 @@ public class HttpClientDownloader implements Downloader {
@Override
public void setThread(int thread) {
poolSize = thread;
}
private void handleGzip(HttpResponse httpResponse) {
Header ceheader = httpResponse.getEntity().getContentEncoding();
if (ceheader != null) {
HeaderElement[] codecs = ceheader.getElements();
for (HeaderElement codec : codecs) {
if (codec.getName().equalsIgnoreCase("gzip")) {
httpResponse.setEntity(
new GzipDecompressingEntity(httpResponse.getEntity()));
}
}
}
httpClientGenerator.setPoolSize(thread);
}
}


@ -0,0 +1,106 @@
package us.codecraft.webmagic.downloader;
import org.apache.http.*;
import org.apache.http.client.CookieStore;
import org.apache.http.client.protocol.ResponseContentEncoding;
import org.apache.http.config.Registry;
import org.apache.http.config.RegistryBuilder;
import org.apache.http.config.SocketConfig;
import org.apache.http.conn.socket.ConnectionSocketFactory;
import org.apache.http.conn.socket.PlainConnectionSocketFactory;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.impl.client.*;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.impl.cookie.BasicClientCookie;
import org.apache.http.protocol.HttpContext;
import us.codecraft.webmagic.Site;
import java.io.IOException;
import java.util.Map;
/**
* @author code4crafter@gmail.com <br>
* @since 0.4.0
*/
public class HttpClientGenerator {
private PoolingHttpClientConnectionManager connectionManager;
public HttpClientGenerator() {
Registry<ConnectionSocketFactory> reg = RegistryBuilder.<ConnectionSocketFactory>create()
.register("http", PlainConnectionSocketFactory.INSTANCE)
.register("https", SSLConnectionSocketFactory.getSocketFactory())
.build();
connectionManager = new PoolingHttpClientConnectionManager(reg);
connectionManager.setDefaultMaxPerRoute(100);
}
public HttpClientGenerator setPoolSize(int poolSize){
connectionManager.setMaxTotal(poolSize);
return this;
}
public CloseableHttpClient getClient(Site site) {
return generateClient(site);
}
private CloseableHttpClient generateClient(Site site) {
HttpClientBuilder httpClientBuilder = HttpClients.custom().setConnectionManager(connectionManager);
if (site != null && site.getUserAgent() != null) {
httpClientBuilder.setUserAgent(site.getUserAgent());
} else {
httpClientBuilder.setUserAgent("");
}
if (site == null || site.isUseGzip()) {
httpClientBuilder.addInterceptorFirst(new HttpRequestInterceptor() {
public void process(
final HttpRequest request,
final HttpContext context) throws HttpException, IOException {
if (!request.containsHeader("Accept-Encoding")) {
request.addHeader("Accept-Encoding", "gzip");
}
}
});
}
SocketConfig socketConfig = SocketConfig.custom().setSoKeepAlive(true).setTcpNoDelay(true).build();
httpClientBuilder.setDefaultSocketConfig(socketConfig);
// HttpClient has trouble handling compressed entities across redirects,
// so content compression is disabled here and decoding is done manually.
// https://issues.apache.org/jira/browse/HTTPCLIENT-1432
httpClientBuilder.disableContentCompression();
httpClientBuilder.addInterceptorFirst(new HttpResponseInterceptor() {
private ResponseContentEncoding contentEncoding = new ResponseContentEncoding();
public void process(
final HttpResponse response,
final HttpContext context) throws HttpException, IOException {
if (response.getStatusLine().getStatusCode() == 301 || response.getStatusLine().getStatusCode() == 302) {
return;
}
contentEncoding.process(response, context);
}
});
if (site != null) {
httpClientBuilder.setRetryHandler(new DefaultHttpRequestRetryHandler(site.getRetryTimes(), true));
}
generateCookie(httpClientBuilder, site);
return httpClientBuilder.build();
}
private void generateCookie(HttpClientBuilder httpClientBuilder, Site site) {
CookieStore cookieStore = new BasicCookieStore();
if (site != null && site.getCookies() != null) {
for (Map.Entry<String, String> cookieEntry : site.getCookies().entrySet()) {
BasicClientCookie cookie = new BasicClientCookie(cookieEntry.getKey(), cookieEntry.getValue());
cookie.setDomain(site.getDomain());
cookieStore.addCookie(cookie);
}
}
httpClientBuilder.setDefaultCookieStore(cookieStore);
}
}


@ -1,93 +0,0 @@
package us.codecraft.webmagic.downloader;
import org.apache.http.HttpVersion;
import org.apache.http.client.CookieStore;
import org.apache.http.client.HttpClient;
import org.apache.http.client.params.ClientPNames;
import org.apache.http.client.params.CookiePolicy;
import org.apache.http.conn.scheme.PlainSocketFactory;
import org.apache.http.conn.scheme.Scheme;
import org.apache.http.conn.scheme.SchemeRegistry;
import org.apache.http.conn.ssl.SSLSocketFactory;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.impl.conn.PoolingClientConnectionManager;
import org.apache.http.impl.cookie.BasicClientCookie;
import org.apache.http.params.*;
import us.codecraft.webmagic.Site;
import java.util.Map;
/**
* @author code4crafter@gmail.com <br>
* @since 0.1.0
*/
public class HttpClientPool {
public static volatile HttpClientPool INSTANCE;
public static HttpClientPool getInstance(int poolSize) {
if (INSTANCE == null) {
synchronized (HttpClientPool.class) {
if (INSTANCE == null) {
INSTANCE = new HttpClientPool(poolSize);
}
}
}
return INSTANCE;
}
private int poolSize;
private HttpClientPool(int poolSize) {
this.poolSize = poolSize;
}
public HttpClient getClient(Site site) {
return generateClient(site);
}
private HttpClient generateClient(Site site) {
HttpParams params = new BasicHttpParams();
if (site != null && site.getUserAgent() != null) {
params.setParameter(CoreProtocolPNames.USER_AGENT, site.getUserAgent());
}
params.setIntParameter(CoreConnectionPNames.SO_TIMEOUT, 1000);
params.setIntParameter(CoreConnectionPNames.CONNECTION_TIMEOUT, 2000);
HttpProtocolParamBean paramsBean = new HttpProtocolParamBean(params);
paramsBean.setVersion(HttpVersion.HTTP_1_1);
if (site != null && site.getCharset() != null) {
paramsBean.setContentCharset(site.getCharset());
}
paramsBean.setUseExpectContinue(false);
SchemeRegistry schemeRegistry = new SchemeRegistry();
schemeRegistry.register(new Scheme("http", 80, PlainSocketFactory.getSocketFactory()));
schemeRegistry.register(new Scheme("https", 443, SSLSocketFactory.getSocketFactory()));
PoolingClientConnectionManager connectionManager = new PoolingClientConnectionManager(schemeRegistry);
connectionManager.setMaxTotal(poolSize);
connectionManager.setDefaultMaxPerRoute(100);
DefaultHttpClient httpClient = new DefaultHttpClient(connectionManager, params);
if (site != null) {
generateCookie(httpClient, site);
}
httpClient.getParams().setIntParameter("http.socket.timeout", 60000);
httpClient.getParams().setParameter(ClientPNames.COOKIE_POLICY, CookiePolicy.BEST_MATCH);
return httpClient;
}
private void generateCookie(DefaultHttpClient httpClient, Site site) {
CookieStore cookieStore = new BasicCookieStore();
if (site.getCookies() != null) {
for (Map.Entry<String, String> cookieEntry : site.getCookies().entrySet()) {
BasicClientCookie cookie = new BasicClientCookie(cookieEntry.getKey(), cookieEntry.getValue());
cookie.setDomain(site.getDomain());
cookieStore.addCookie(cookie);
}
}
httpClient.setCookieStore(cookieStore);
}
}


@ -0,0 +1,20 @@
package us.codecraft.webmagic.pipeline;
import java.util.List;
/**
* Pipeline that can collect and store results. <br>
* Used for {@link us.codecraft.webmagic.Spider#getAll(java.util.Collection)}
*
* @author code4crafter@gmail.com
* @since 0.4.0
*/
public interface CollectorPipeline<T> extends Pipeline {
/**
* Get all results collected.
*
* @return collected results
*/
public List<T> getCollected();
}


@ -0,0 +1,26 @@
package us.codecraft.webmagic.pipeline;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import java.util.ArrayList;
import java.util.List;
/**
* @author code4crafter@gmail.com
* @since 0.4.0
*/
public class ResultItemsCollectorPipeline implements CollectorPipeline<ResultItems> {
private List<ResultItems> collector = new ArrayList<ResultItems>();
@Override
public synchronized void process(ResultItems resultItems, Task task) {
collector.add(resultItems);
}
@Override
public List<ResultItems> getCollected() {
return collector;
}
}


@ -0,0 +1,51 @@
package us.codecraft.webmagic.processor.example;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import java.util.ArrayList;
import java.util.List;
/**
* @author code4crafter@gmail.com <br>
* @since 0.4.0
*/
public class BaiduBaikePageProcesser implements PageProcessor {
private Site site = Site.me()//.setHttpProxy(new HttpHost("127.0.0.1",8888))
.setRetryTimes(3).setSleepTime(1000).setUseGzip(true);
@Override
public void process(Page page) {
page.putField("name", page.getHtml().$("h1.title div.lemmaTitleH1","text").toString());
page.putField("description", page.getHtml().xpath("//div[@id='lemmaContent-0']//div[@class='para']/allText()"));
}
@Override
public Site getSite() {
return site;
}
public static void main(String[] args) {
//single download
Spider spider = Spider.create(new BaiduBaikePageProcesser()).thread(2);
String urlTemplate = "http://baike.baidu.com/search/word?word=%s&pic=1&sug=1&enc=utf8";
ResultItems resultItems = spider.<ResultItems>get(String.format(urlTemplate, "水力发电"));
System.out.println(resultItems);
//multi download
List<String> list = new ArrayList<String>();
list.add(String.format(urlTemplate,"风力发电"));
list.add(String.format(urlTemplate,"太阳能"));
list.add(String.format(urlTemplate,"地热发电"));
list.add(String.format(urlTemplate,"地热发电"));
List<ResultItems> resultItemses = spider.<ResultItems>getAll(list);
for (ResultItems resultItemse : resultItemses) {
System.out.println(resultItemse.getAll());
}
spider.close();
}
}


@ -11,7 +11,7 @@ import us.codecraft.webmagic.processor.PageProcessor;
*/
public class GithubRepoPageProcesser implements PageProcessor {
private Site site = Site.me().addStartUrl("https://github.com/code4craft").setRetryTimes(3).setSleepTime(100);
private Site site = Site.me().setRetryTimes(3).setSleepTime(100);
@Override
public void process(Page page) {
@ -31,6 +31,6 @@ public class GithubRepoPageProcesser implements PageProcessor {
}
public static void main(String[] args) {
Spider.create(new GithubRepoPageProcesser()).thread(5).run();
Spider.create(new GithubRepoPageProcesser()).addUrl("https://github.com/code4craft").thread(5).run();
}
}


@ -12,7 +12,7 @@ import java.util.List;
*/
public class OschinaBlogPageProcesser implements PageProcessor {
private Site site = Site.me().setDomain("my.oschina.net").addStartUrl("http://my.oschina.net/flashsword/blog");
private Site site = Site.me().setDomain("my.oschina.net");
@Override
public void process(Page page) {
@ -34,6 +34,6 @@ public class OschinaBlogPageProcesser implements PageProcessor {
}
public static void main(String[] args) {
Spider.create(new OschinaBlogPageProcesser()).thread(2).run();
Spider.create(new OschinaBlogPageProcesser()).addUrl("http://my.oschina.net/flashsword/blog").thread(2).run();
}
}


@ -4,6 +4,7 @@ import org.apache.http.annotation.ThreadSafe;
import org.apache.log4j.Logger;
import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.utils.NumberUtils;
import java.util.Comparator;
import java.util.HashSet;
@ -30,14 +31,14 @@ public class PriorityScheduler implements Scheduler {
private PriorityBlockingQueue<Request> priorityQueuePlus = new PriorityBlockingQueue<Request>(INITIAL_CAPACITY, new Comparator<Request>() {
@Override
public int compare(Request o1, Request o2) {
return -(new Long(o1.getPriority()).compareTo(o2.getPriority()));
return -NumberUtils.compareLong(o1.getPriority(), o2.getPriority());
}
});
private PriorityBlockingQueue<Request> priorityQueueMinus = new PriorityBlockingQueue<Request>(INITIAL_CAPACITY, new Comparator<Request>() {
@Override
public int compare(Request o1, Request o2) {
return -(new Long(o1.getPriority()).compareTo(o2.getPriority()));
return -NumberUtils.compareLong(o1.getPriority(), o2.getPriority());
}
});


@ -0,0 +1,17 @@
package us.codecraft.webmagic.utils;
/**
* @author yihua.huang@dianping.com
*/
public abstract class NumberUtils {
public static int compareLong(long o1, long o2) {
if (o1 < o2) {
return -1;
} else if (o1 == o2) {
return 0;
} else {
return 1;
}
}
}
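The comparator code removed from PriorityScheduler used new Long(o1.getPriority()).compareTo(...), which allocates a boxed Long on every comparison; compareLong avoids that, and on Java 7+ the JDK's Long.compare(long, long) has identical semantics. A quick standalone check of the contract, including the negated descending use seen in the priority queues:

```java
import java.util.Arrays;

public class CompareLongDemo {
    // Same contract as NumberUtils.compareLong above; on Java 7+,
    // Long.compare(long, long) does this without boxing.
    public static int compareLong(long o1, long o2) {
        return (o1 < o2) ? -1 : (o1 == o2 ? 0 : 1);
    }

    public static void main(String[] args) {
        if (compareLong(1L, 2L) != -1) throw new AssertionError();
        if (compareLong(2L, 2L) != 0) throw new AssertionError();
        if (compareLong(3L, 2L) != 1) throw new AssertionError();
        // Negating the result, as PriorityScheduler does, sorts descending.
        Long[] priorities = {3L, 1L, 2L};
        Arrays.sort(priorities, (a, b) -> -compareLong(a, b));
        if (!Arrays.equals(priorities, new Long[]{3L, 2L, 1L})) throw new AssertionError();
    }
}
```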


@ -1,5 +1,7 @@
package us.codecraft.webmagic.utils;
import com.google.common.util.concurrent.MoreExecutors;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
@ -11,11 +13,15 @@ import java.util.concurrent.TimeUnit;
*/
public class ThreadUtils {
public static ExecutorService newFixedThreadPool(int threadSize) {
if (threadSize <= 1) {
throw new IllegalArgumentException("ThreadSize must be greater than 1!");
}
return new ThreadPoolExecutor(threadSize - 1, threadSize - 1, 0L, TimeUnit.MILLISECONDS,
new SynchronousQueue<Runnable>(), new ThreadPoolExecutor.CallerRunsPolicy());
}
public static ExecutorService newFixedThreadPool(int threadSize) {
if (threadSize <= 0) {
throw new IllegalArgumentException("ThreadSize must be greater than 0!");
}
if (threadSize == 1) {
return MoreExecutors.sameThreadExecutor();
}
return new ThreadPoolExecutor(threadSize - 1, threadSize - 1, 0L, TimeUnit.MILLISECONDS,
new SynchronousQueue<Runnable>(), new ThreadPoolExecutor.CallerRunsPolicy());
}
}
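The pool above is sized threadSize - 1 because CallerRunsPolicy effectively turns the submitting thread into the remaining worker: when every worker is busy and a SynchronousQueue cannot hand the task off, the rejected task runs directly on the caller. A minimal standalone demonstration of that behavior:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class CallerRunsDemo {
    public static void main(String[] args) {
        try {
            // One worker plus CallerRunsPolicy, i.e. what newFixedThreadPool(2) above builds.
            ThreadPoolExecutor pool = new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS,
                    new SynchronousQueue<Runnable>(), new ThreadPoolExecutor.CallerRunsPolicy());
            CountDownLatch block = new CountDownLatch(1);
            pool.execute(() -> {
                try {
                    block.await(); // keep the single worker busy
                } catch (InterruptedException ignored) {
                }
            });
            // The worker is occupied and a SynchronousQueue holds no elements,
            // so this submission is rejected and runs directly on the caller thread.
            final String[] ranOn = new String[1];
            pool.execute(() -> ranOn[0] = Thread.currentThread().getName());
            block.countDown();
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.SECONDS);
            if (!Thread.currentThread().getName().equals(ranOn[0])) {
                throw new AssertionError("expected the caller thread to run the second task");
            }
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }
}
```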


@ -1,10 +1,14 @@
package us.codecraft.webmagic.utils;
import org.apache.commons.lang3.StringUtils;
import us.codecraft.webmagic.Request;
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
@ -18,7 +22,7 @@ public class UrlUtils {
/**
* canonicalizeUrl
*
* <p/>
* Borrowed from Jsoup.
*
* @param url
@ -85,6 +89,22 @@ public class UrlUtils {
return stringBuilder.toString();
}
public static List<Request> convertToRequests(Collection<String> urls) {
List<Request> requestList = new ArrayList<Request>(urls.size());
for (String url : urls) {
requestList.add(new Request(url));
}
return requestList;
}
public static List<String> convertToUrls(Collection<Request> requests) {
List<String> urlList = new ArrayList<String>(requests.size());
for (Request request : requests) {
urlList.add(request.getUrl());
}
return urlList;
}
private static final Pattern patternForCharset = Pattern.compile("charset\\s*=\\s*['\"]*([^\\s;'\"]*)");
public static String getCharset(String contentType) {
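The charset pattern above can be exercised on its own; a minimal sketch of the extraction (the getCharset body here is reconstructed for illustration and may differ from the actual method):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetPatternDemo {
    // Same pattern as UrlUtils.patternForCharset above.
    private static final Pattern PATTERN =
            Pattern.compile("charset\\s*=\\s*['\"]*([^\\s;'\"]*)");

    // Extracts the charset token from a Content-Type header value, or null if absent.
    public static String getCharset(String contentType) {
        Matcher matcher = PATTERN.matcher(contentType);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        if (!"utf-8".equals(getCharset("text/html; charset=utf-8"))) throw new AssertionError();
        if (!"GBK".equals(getCharset("text/html;charset='GBK'"))) throw new AssertionError();
        if (getCharset("text/html") != null) throw new AssertionError();
    }
}
```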


@ -13,6 +13,11 @@
<appender-ref ref="stdout" />
</logger>
<logger name="org.apache" additivity="false">
<level value="warn" />
<appender-ref ref="stdout" />
</logger>
<logger name="net.sf.ehcache" additivity="false">
<level value="warn" />
<appender-ref ref="stdout" />


@ -18,7 +18,7 @@ public class SpiderTest {
public void process(ResultItems resultItems, Task task) {
System.out.println(1);
}
}).thread(2);
}).thread(1);
spider.start();
Thread.sleep(10000);
spider.stop();


@ -22,4 +22,5 @@ public class HttpClientDownloaderTest {
Page download = httpClientDownloader.download(new Request("http://www.diandian.com"), site.toTask());
Assert.assertTrue(download.getHtml().toString().contains("flashsword30"));
}
}


@ -3,7 +3,7 @@
<parent>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-parent</artifactId>
<version>0.3.2</version>
<version>0.4.0</version>
</parent>
<modelVersion>4.0.0</modelVersion>


@ -0,0 +1,49 @@
package us.codecraft.webmagic.example;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.model.OOSpider;
import us.codecraft.webmagic.model.annotation.ExtractBy;
import java.util.ArrayList;
import java.util.List;
/**
* @since 0.4.0
* @author code4crafter@gmail.com
*/
public class BaiduBaike{
@ExtractBy("//h1[@class=title]/div[@class=lemmaTitleH1]/text()")
private String name;
@ExtractBy("//div[@id='lemmaContent-0']//div[@class='para']/allText()")
private String description;
@Override
public String toString() {
return "BaiduBaike{" +
"name='" + name + '\'' +
", description='" + description + '\'' +
'}';
}
public static void main(String[] args) {
OOSpider ooSpider = OOSpider.create(Site.me().setSleepTime(0), BaiduBaike.class);
//single download
String urlTemplate = "http://baike.baidu.com/search/word?word=%s&pic=1&sug=1&enc=utf8";
BaiduBaike baike = ooSpider.<BaiduBaike>get("http://baike.baidu.com/search/word?word=httpclient&pic=1&sug=1&enc=utf8");
System.out.println(baike);
//multi download
List<String> list = new ArrayList<String>();
list.add(String.format(urlTemplate,"风力发电"));
list.add(String.format(urlTemplate,"太阳能"));
list.add(String.format(urlTemplate,"地热发电"));
list.add(String.format(urlTemplate,"地热发电"));
List<BaiduBaike> resultItemses = ooSpider.<BaiduBaike>getAll(list);
for (BaiduBaike resultItemse : resultItemses) {
System.out.println(resultItemse);
}
ooSpider.close();
}
}


@ -41,8 +41,9 @@ public class GithubRepo implements HasKey {
private String url;
public static void main(String[] args) {
OOSpider.create(Site.me().addStartUrl("https://github.com/code4craft").setSleepTime(100)
, new ConsolePageModelPipeline(), GithubRepo.class).thread(10).run();
OOSpider.create(Site.me().setSleepTime(100)
, new ConsolePageModelPipeline(), GithubRepo.class)
.addUrl("https://github.com/code4craft").thread(10).run();
}
@Override


@ -31,8 +31,9 @@ public class OschinaBlog {
private Date date;
public static void main(String[] args) {
OOSpider.create(Site.me().addStartUrl("http://my.oschina.net/flashsword/blog")
, new JsonFilePageModelPipeline("/data/webmagic/"), OschinaBlog.class).run();
OOSpider.create(Site.me().setSleepTime(0)
, new JsonFilePageModelPipeline("/data/webmagic/"), OschinaBlog.class)
.addUrl("http://my.oschina.net/flashsword/blog").run();
}
public String getTitle() {


@ -2,6 +2,7 @@ package us.codecraft.webmagic.model;
import org.apache.commons.lang3.builder.ToStringBuilder;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.PageModelPipeline;
/**
* Print page model in console.<br>


@ -3,6 +3,7 @@ package us.codecraft.webmagic.model;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.model.annotation.ExtractBy;
import us.codecraft.webmagic.pipeline.PageModelPipeline;
import us.codecraft.webmagic.pipeline.Pipeline;
import java.lang.annotation.Annotation;


@ -2,8 +2,13 @@ package us.codecraft.webmagic.model;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.CollectorPipeline;
import us.codecraft.webmagic.pipeline.PageModelPipeline;
import us.codecraft.webmagic.processor.PageProcessor;
import java.util.ArrayList;
import java.util.List;
/**
* The spider for page model extractor.<br>
* In webmagic, a POJO that holds the extraction results is called a "page model". <br>
@ -22,22 +27,27 @@ import us.codecraft.webmagic.processor.PageProcessor;
* {@literal @}ExtractBy(value = "//div[@class='BlogTags']/a/text()", multi = true)
* private List<String> tags;
* }
</pre>
* </pre>
* And start the spider by:
* <pre>
* OOSpider.create(Site.me().addStartUrl("http://my.oschina.net/flashsword/blog")
* ,new JsonFilePageModelPipeline(), OschinaBlog.class).run();
* }
</pre>
* </pre>
*
* @author code4crafter@gmail.com <br>
* @since 0.2.0
*/
public class OOSpider extends Spider {
public class OOSpider<T> extends Spider {
private ModelPageProcessor modelPageProcessor;
private ModelPipeline modelPipeline;
private PageModelPipeline pageModelPipeline;
private List<Class> pageModelClasses = new ArrayList<Class>();
protected OOSpider(ModelPageProcessor modelPageProcessor) {
super(modelPageProcessor);
this.modelPageProcessor = modelPageProcessor;
@ -49,6 +59,7 @@ public class OOSpider extends Spider {
/**
* create a spider
*
* @param site
* @param pageModelPipeline
* @param pageModels
@ -57,13 +68,19 @@ public class OOSpider extends Spider {
this(ModelPageProcessor.create(site, pageModels));
this.modelPipeline = new ModelPipeline();
super.addPipeline(modelPipeline);
if (pageModelPipeline!=null){
for (Class pageModel : pageModels) {
for (Class pageModel : pageModels) {
if (pageModelPipeline != null) {
this.modelPipeline.put(pageModel, pageModelPipeline);
}
pageModelClasses.add(pageModel);
}
}
@Override
protected CollectorPipeline getCollectorPipeline() {
return new PageModelCollectorPipeline<T>(pageModelClasses.get(0));
}
public static OOSpider create(Site site, Class... pageModels) {
return new OOSpider(site, null, pageModels);
}


@ -0,0 +1,46 @@
package us.codecraft.webmagic.model;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.model.annotation.ExtractBy;
import us.codecraft.webmagic.pipeline.CollectorPageModelPipeline;
import us.codecraft.webmagic.pipeline.CollectorPipeline;
import java.lang.annotation.Annotation;
import java.util.List;
/**
* @author code4crafter@gmail.com
* @since 0.4.0
*/
class PageModelCollectorPipeline<T> implements CollectorPipeline<T> {
private final CollectorPageModelPipeline<T> classPipeline = new CollectorPageModelPipeline<T>();
private final Class<?> clazz;
PageModelCollectorPipeline(Class<?> clazz) {
this.clazz = clazz;
}
@Override
public List<T> getCollected() {
return classPipeline.getCollected();
}
@Override
public synchronized void process(ResultItems resultItems, Task task) {
Object o = resultItems.get(clazz.getCanonicalName());
if (o != null) {
Annotation annotation = clazz.getAnnotation(ExtractBy.class);
if (annotation == null || !((ExtractBy) annotation).multi()) {
classPipeline.process((T) o, task);
} else {
List<Object> list = (List<Object>) o;
for (Object o1 : list) {
classPipeline.process((T) o1, task);
}
}
}
}
}


@ -195,7 +195,7 @@ class PageModelExtractor {
private void initClassExtractors() {
Annotation annotation = clazz.getAnnotation(TargetUrl.class);
if (annotation == null) {
targetUrlPatterns.add(Pattern.compile(".*"));
targetUrlPatterns.add(Pattern.compile("(.*)"));
} else {
TargetUrl targetUrl = (TargetUrl) annotation;
String[] value = targetUrl.value();


@ -0,0 +1,23 @@
package us.codecraft.webmagic.pipeline;
import us.codecraft.webmagic.Task;
import java.util.ArrayList;
import java.util.List;
/**
* @author code4crafter@gmail.com
*/
public class CollectorPageModelPipeline<T> implements PageModelPipeline<T> {
private List<T> collected = new ArrayList<T>();
@Override
public synchronized void process(T t, Task task) {
collected.add(t);
}
public List<T> getCollected() {
return collected;
}
}


@ -5,7 +5,6 @@ import org.apache.commons.lang3.builder.ToStringBuilder;
import org.apache.log4j.Logger;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.model.HasKey;
import us.codecraft.webmagic.model.PageModelPipeline;
import us.codecraft.webmagic.utils.FilePersistentBase;
import java.io.FileWriter;


@ -6,7 +6,6 @@ import org.apache.commons.lang3.builder.ToStringBuilder;
import org.apache.log4j.Logger;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.model.HasKey;
import us.codecraft.webmagic.model.PageModelPipeline;
import us.codecraft.webmagic.utils.FilePersistentBase;
import java.io.FileWriter;


@ -1,4 +1,4 @@
package us.codecraft.webmagic.model;
package us.codecraft.webmagic.pipeline;
import us.codecraft.webmagic.Task;


@ -1,7 +1,7 @@
package us.codecraft.webmagic;
import junit.framework.Assert;
import us.codecraft.webmagic.model.PageModelPipeline;
import us.codecraft.webmagic.pipeline.PageModelPipeline;
/**
* @author code4crafter@gmail.com


@ -6,6 +6,7 @@ import us.codecraft.webmagic.MockDownloader;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.example.GithubRepo;
import us.codecraft.webmagic.pipeline.PageModelPipeline;
/**
* @author code4crafter@gmail.com <br>


@ -5,7 +5,7 @@
<parent>
<artifactId>webmagic-parent</artifactId>
<groupId>us.codecraft</groupId>
<version>0.2.1</version>
<version>0.4.0-SNAPSHOT</version>
</parent>
<modelVersion>4.0.0</modelVersion>


@ -5,7 +5,7 @@
<parent>
<artifactId>webmagic-parent</artifactId>
<groupId>us.codecraft</groupId>
<version>0.3.2</version>
<version>0.4.0-SNAPSHOT</version>
</parent>
<modelVersion>4.0.0</modelVersion>


@ -1,8 +1,9 @@
package us.codecraft.webmagic.model.samples;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.model.ConsolePageModelPipeline;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.model.OOSpider;
import us.codecraft.webmagic.pipeline.PageModelPipeline;
import us.codecraft.webmagic.model.annotation.ExtractBy;
import us.codecraft.webmagic.model.annotation.ExtractByUrl;
import us.codecraft.webmagic.model.annotation.HelpUrl;
@ -18,14 +19,31 @@ public class Kr36NewsModel {
@ExtractBy("//h1[@class='entry-title sep10']")
private String title;
@ExtractBy("//div[@class='mainContent sep-10']")
@ExtractBy("//div[@class='mainContent sep-10']/tidyText()")
private String content;
@ExtractByUrl
private String url;
public static void main(String[] args) {
OOSpider.create(Site.me().addStartUrl("http://www.36kr.com/"), new ConsolePageModelPipeline(),
Kr36NewsModel.class).run();
//Just for benchmark
OOSpider.create(Site.me().addStartUrl("http://www.36kr.com/").setSleepTime(0), new PageModelPipeline() {
@Override
public void process(Object o, Task task) {
}
},Kr36NewsModel.class).thread(20).run();
}
public String getTitle() {
return title;
}
public String getContent() {
return content;
}
public String getUrl() {
return url;
}
}


@ -1,10 +1,11 @@
package us.codecraft.webmagic.model.samples;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.model.OOSpider;
import us.codecraft.webmagic.pipeline.PageModelPipeline;
import us.codecraft.webmagic.model.annotation.ExtractBy;
import us.codecraft.webmagic.model.annotation.TargetUrl;
import us.codecraft.webmagic.pipeline.JsonFilePageModelPipeline;
import java.util.List;
@ -24,8 +25,16 @@ public class OschinaBlog{
private List<String> tags;
public static void main(String[] args) {
-OOSpider.create(Site.me().addStartUrl("http://my.oschina.net/flashsword/blog")
-        ,new JsonFilePageModelPipeline(), OschinaBlog.class).run();
+OOSpider.create(Site.me()
+        .setUserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36").addStartUrl("http://my.oschina.net/flashsword/blog")
+        .setSleepTime(0)
+        .setRetryTimes(3)
+        ,new PageModelPipeline() {
+    @Override
+    public void process(Object o, Task task) {
+    }
+}, OschinaBlog.class).thread(10).run();
}
public String getTitle() {
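The rewritten `main` above swaps `JsonFilePageModelPipeline` for an anonymous `PageModelPipeline` whose `process` body does nothing, sets `setSleepTime(0)`, and raises the thread count: a benchmark setup that measures fetch-and-extract throughput without the output sink skewing the numbers. The same idea in plain, self-contained Java (names and numbers below are illustrative, not part of webmagic):

```java
import java.util.function.Consumer;

public class NoOpSinkBenchmark {
    // Time a producer loop against a pluggable sink. Passing a no-op
    // sink isolates the cost of producing items, just as the no-op
    // pipeline isolates crawl/extraction cost in the diff above.
    public static long timeWork(int items, Consumer<String> sink) {
        long start = System.nanoTime();
        for (int i = 0; i < items; i++) {
            sink.accept("item-" + i); // stand-in for an extracted page model
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long noOp = timeWork(100_000, s -> { });
        System.out.printf("no-op sink: %d ns%n", noOp);
    }
}
```

Comparing the no-op timing against a real sink (file or console output) shows how much of the total runtime the sink itself contributes.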


@@ -5,7 +5,7 @@
<parent>
<artifactId>webmagic-parent</artifactId>
<groupId>us.codecraft</groupId>
-<version>0.3.2</version>
+<version>0.4.0-SNAPSHOT</version>
</parent>
<modelVersion>4.0.0</modelVersion>


@@ -5,7 +5,7 @@
<parent>
<artifactId>webmagic-parent</artifactId>
<groupId>us.codecraft</groupId>
-<version>0.3.2</version>
+<version>0.4.0-SNAPSHOT</version>
</parent>
<modelVersion>4.0.0</modelVersion>


@@ -34,12 +34,12 @@ webmagic uses Maven to manage dependencies; add the corresponding dependencies to your project to use w
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-core</artifactId>
-<version>0.3.2</version>
+<version>0.4.0</version>
</dependency>
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-extension</artifactId>
-<version>0.3.2</version>
+<version>0.4.0</version>
</dependency>
#### Project structure


@@ -4,7 +4,7 @@
<date-generated>Sat Aug 17 14:14:45 CST 2013</date-generated>
</meta>
<comment>
-<key><![CDATA[us.codecraft.webmagic.downloader.HttpClientPool]]></key>
+<key><![CDATA[us.codecraft.webmagic.downloader.HttpClientGenerator]]></key>
<data><![CDATA[ @author code4crafter@gmail.com <br>
Date: 13-4-21
Time: 12:29 PM


@@ -12,7 +12,7 @@
]]></data>
</comment>
<comment>
-<key><![CDATA[us.codecraft.webmagic.model.OOSpider(us.codecraft.webmagic.Site, us.codecraft.webmagic.model.PageModelPipeline, java.lang.Class...)]]></key>
+<key><![CDATA[us.codecraft.webmagic.model.OOSpider(us.codecraft.webmagic.Site, us.codecraft.webmagic.pipeline.PageModelPipeline, java.lang.Class...)]]></key>
<data><![CDATA[ Create a spider.<br>
@param site
@param pageModelPipeline


@@ -4,7 +4,7 @@
<date-generated>Sat Aug 17 14:14:46 CST 2013</date-generated>
</meta>
<comment>
-<key><![CDATA[us.codecraft.webmagic.model.PageModelPipeline]]></key>
+<key><![CDATA[us.codecraft.webmagic.pipeline.PageModelPipeline]]></key>
<data><![CDATA[ @author code4crafter@gmail.com <br>
Date: 13-8-3 <br>
Time: 9:34 AM <br>