Merge branch 'master' of git.oschina.net:flashsword20/webmagic into osc

Conflicts:
	pom.xml
master
yihua.huang 2013-12-21 07:59:28 +08:00
commit 2a8e1b654d
36 changed files with 534 additions and 148 deletions

126
README.md

@ -1,25 +1,41 @@
![logo](https://raw.github.com/code4craft/webmagic/master/asserts/logo.jpg)
[Readme in Chinese](https://github.com/code4craft/webmagic/tree/master/zh_docs)
[User Manual (Chinese)](https://github.com/code4craft/webmagic/blob/master/user-manual.md)
[![Build Status](https://travis-ci.org/code4craft/webmagic.png?branch=master)](https://travis-ci.org/code4craft/webmagic)
>A scalable crawler framework. It covers the whole lifecycle of a crawler: downloading, URL management, content extraction and persistence. It can simplify the development of a specific crawler.
## Features:
[Readme in English](https://github.com/code4craft/webmagic/tree/master/en_docs)
* Simple core with high flexibility.
* Simple API for HTML extraction.
* Customize a crawler with annotated POJOs, no configuration needed.
* Multi-threading and distributed crawling support.
* Easy to integrate.
[User Manual](https://github.com/code4craft/webmagic/blob/master/user-manual.md)
## Install:
Add dependencies to your pom.xml:
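>webmagic is an open-source vertical crawler framework for Java. Its goal is to simplify crawler development so that developers can focus on their own logic. The core of webmagic is very simple, yet it covers the whole crawler workflow, which also makes it good material for learning crawler development. The author spent a year developing vertical crawlers at a previous company, and webmagic was created to remove some of the repetitive work involved.
>Web crawling is a technique, and webmagic aims to lower the cost of implementing it. Out of respect for content providers, however, webmagic does not do anything to circumvent blocking, including CAPTCHA cracking, proxy switching, and automatic login.
Main features of webmagic:
* Fully modular design with strong extensibility.
* A simple core that still covers the whole crawler workflow; flexible and powerful, and good introductory material for learning to write crawlers.
* A rich API for extracting page content.
* No configuration; a crawler can be implemented with POJOs plus annotations.
* Multi-threading support.
* Distributed crawling support.
* Support for crawling pages rendered dynamically by JavaScript.
* No framework dependencies; can be embedded flexibly in your project.
The architecture and design of webmagic draw on the following two projects; thanks to their authors:
Python crawler **scrapy** [https://github.com/scrapy/scrapy](https://github.com/scrapy/scrapy)
Java crawler **Spiderman** [https://gitcafe.com/laiweiwei/Spiderman](https://gitcafe.com/laiweiwei/Spiderman)
webmagic on GitHub: [https://github.com/code4craft/webmagic](https://github.com/code4craft/webmagic).
## Quick Start
### Using Maven
webmagic uses Maven to manage dependencies. Add the corresponding dependency to your project to use webmagic: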
<dependency>
<groupId>us.codecraft</groupId>
@ -32,13 +48,40 @@ Add dependencies to your pom.xml:
<version>0.4.2</version>
</dependency>
## Get Started:
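#### Project structure
webmagic consists mainly of two packages:
### First crawler:
* **webmagic-core**
The core of webmagic, containing only the basic crawler modules and basic extractors. The goal of webmagic-core is to be a textbook implementation of a web crawler.
* **webmagic-extension**
The extension module of webmagic, providing tools that make writing crawlers more convenient, including annotation-based crawler definitions, JSON, and distributed crawling support.
webmagic also includes two usable extension packages. Because both depend on fairly heavyweight tools, they are split out of the main packages; to use them, download the source and compile it yourself:
Write a class that implements PageProcessor
* **webmagic-saxon**
The module that integrates webmagic with Saxon. Saxon is an XPath and XSLT processing tool; webmagic relies on Saxon for XPath 2.0 syntax support.
* **webmagic-selenium**
The module that integrates webmagic with Selenium. Selenium is a tool that simulates a browser to render pages; webmagic relies on Selenium to crawl dynamic pages.
You can depend on different packages in your project as needed.
### Without Maven
The **lib** directory of the project contains all the dependency jars; simply import them in your IDE.
### First crawler
#### Customizing PageProcessor
PageProcessor is part of webmagic-core. Implementing a custom PageProcessor is all it takes to define your own crawling logic. The following code crawls OSC blogs: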
```java
public class OschinaBlogPageProcesser implements PageProcessor {
private Site site = Site.me().setDomain("my.oschina.net")
@ -64,15 +107,17 @@ Write a class implements PageProcessor
.pipeline(new ConsolePipeline()).run();
}
}
```
* `page.addTargetRequests(links)`
Add urls for crawling.
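You can also use annotations:
Here, page.addTargetRequests() adds URLs to crawl, and page.putField() saves the extracted results. page.getHtml().xpath() extracts from the result according to a given rule; extraction supports chained calls. At the end of a chain, toString() converts the result to a single String, and all() converts it to a list of Strings.
Spider is the entry class of the crawler. Pipeline is the interface for result output and persistence; here ConsolePipeline prints the results to the console.
Run this main method to see the crawl results in the console. webmagic has a default crawl interval of 3 seconds, so please be patient.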
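The difference between `toString()` and `all()` at the end of an extraction chain can be pictured with plain `java.util.regex`. The sketch below is not webmagic code: the class `FirstVsAllSketch`, its methods `first` and `all`, and the sample HTML are hypothetical and exist only for illustration. It assumes the rule contains one capturing group, as the sample processors in this commit do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of webmagic.
final class FirstVsAllSketch {

    // Analogous to ending a chain with toString(): group 1 of the first match, or null.
    static String first(Pattern rule, String text) {
        Matcher matcher = rule.matcher(text);
        return matcher.find() ? matcher.group(1) : null;
    }

    // Analogous to ending a chain with all(): group 1 of every match, in order.
    static List<String> all(Pattern rule, String text) {
        List<String> results = new ArrayList<>();
        Matcher matcher = rule.matcher(text);
        while (matcher.find()) {
            results.add(matcher.group(1));
        }
        return results;
    }

    public static void main(String[] args) {
        // Made-up page fragment with two blog links.
        String html = "<a href=\"http://my.oschina.net/flashsword/blog/1\">a</a>"
                + "<a href=\"http://my.oschina.net/flashsword/blog/2\">b</a>";
        Pattern rule = Pattern.compile("href=\"(http://my\\.oschina\\.net/flashsword/blog/\\d+)\"");

        System.out.println(first(rule, html)); // the first link only
        System.out.println(all(rule, html));   // both links as a list
    }
}
```

Returning `null` for "no match" in `first` is a choice made for this sketch; it is not a statement about how webmagic reports an empty result.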
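#### Using annotations
webmagic-extension provides an annotation-based way to write crawlers: adding annotations to a POJO is enough to define a crawler. The following code still crawls oschina blogs and does exactly the same thing as OschinaBlogPageProcesser: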
```java
@TargetUrl("http://my.oschina.net/flashsword/blog/\\d+")
public class OschinaBlog {
@ -91,38 +136,23 @@ You can also use annotation way:
new ConsolePageModelPipeline(), OschinaBlog.class).run();
}
}
```
### Docs and samples:
The architecture of webmagic (designed with reference to [Scrapy](http://scrapy.org/))
This example defines a Model class whose fields 'title', 'content', and 'tags' are the properties to extract. The class can be reused in a Pipeline.
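To make the "POJO plus annotations" idea concrete, here is a minimal sketch of how annotated fields could be filled by reflection. It is an assumption-laden illustration, not webmagic's implementation: the annotation `RegexField`, the model `BlogModel`, and the class `AnnotationExtractSketch` are all hypothetical (webmagic's own annotation, as used in the samples, is `@ExtractBy`), and only regex rules are handled.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical extractor; not part of webmagic.
final class AnnotationExtractSketch {

    // Creates a model instance and fills every @RegexField field with group 1 of its rule.
    static <T> T extract(Class<T> type, String html) {
        try {
            T model = type.getDeclaredConstructor().newInstance();
            for (Field field : type.getDeclaredFields()) {
                RegexField rule = field.getAnnotation(RegexField.class);
                if (rule == null) {
                    continue; // field carries no extraction rule
                }
                Matcher matcher = Pattern.compile(rule.value(), Pattern.DOTALL).matcher(html);
                if (matcher.find()) {
                    field.setAccessible(true);
                    field.set(model, matcher.group(1));
                }
            }
            return model;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot populate " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        // Made-up page fragment.
        String html = "<html><title>Hello</title><div class=\"tags\">java, crawler</div></html>";
        BlogModel blog = extract(BlogModel.class, html);
        System.out.println(blog.title + " | " + blog.tags);
    }
}

// Hypothetical annotation: the value is a regex whose group 1 becomes the field value.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface RegexField {
    String value();
}

// Hypothetical model class, analogous in spirit to the annotated POJO in the README.
class BlogModel {
    @RegexField("<title>(.*?)</title>")
    String title;

    @RegexField("<div class=\"tags\">(.*?)</div>")
    String tags;
}
```

Because the rules live on the model class, the same class can describe both what to extract and the shape of the object handed to later stages, which is the reuse the README refers to.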
![image](http://code4craft.github.io/images/posts/webmagic.png)
### Detailed documentation
Javadocs: [http://code4craft.github.io/webmagic/docs/en/](http://code4craft.github.io/webmagic/docs/en/)
See [webmagic manual.md](https://github.com/code4craft/webmagic/blob/master/user-manual.md).
There are some samples in the `webmagic-samples` package.
### Samples
The webmagic-samples directory contains examples of custom PageProcessors that extract data from different sites.
### License:
The author also has a project, [JobHunter](http://git.oschina.net/flashsword20/jobhunter), that uses webmagic for extraction and persists the results to a database. It integrates Spring, defines a custom Pipeline, and uses mybatis for data persistence.
Licensed under the [Apache 2.0 license](http://opensource.org/licenses/Apache-2.0)
### License
### Thanks:
webmagic is licensed under the [Apache 2.0 license](http://opensource.org/licenses/Apache-2.0)
To write webmagic, I referred to the projects below:
* **Scrapy**
A crawler framework in Python.
[http://scrapy.org/](http://scrapy.org/)
* **Spiderman**
Another crawler framework in Java.
[https://gitcafe.com/laiweiwei/Spiderman](https://gitcafe.com/laiweiwei/Spiderman)
### Mail-list:

Binary file not shown.

BIN
lib/guava-15.0.jar 100644

Binary file not shown.

BIN
lib/jdom2-2.0.4.jar 100644

Binary file not shown.

BIN
lib/jedis-2.0.0.jar 100644

Binary file not shown.

BIN
lib/jsoup-1.7.2.jar 100644

Binary file not shown.

BIN
lib/junit-4.7.jar 100644

Binary file not shown.

BIN
lib/xsoup-0.1.0.jar 100644

Binary file not shown.

3
make.sh 100644

@ -0,0 +1,3 @@
#!/bin/sh
mvn clean package
rsync -avz --delete ./webmagic-samples/target/lib/ ./lib/


@ -6,7 +6,7 @@
<version>7</version>
</parent>
<groupId>us.codecraft</groupId>
<version>0.4.3-SNAPSHOT</version>
<version>0.4.2</version>
<modelVersion>4.0.0</modelVersion>
<packaging>pom</packaging>
<properties>


@ -0,0 +1,156 @@
<?xml version="1.0" encoding="UTF-8"?>
<project name="module_webmagic-core" default="compile.module.webmagic-core">
<dirname property="module.webmagic-core.basedir" file="${ant.file.module_webmagic-core}"/>
<property name="module.jdk.home.webmagic-core" value="${project.jdk.home}"/>
<property name="module.jdk.bin.webmagic-core" value="${project.jdk.bin}"/>
<property name="module.jdk.classpath.webmagic-core" value="${project.jdk.classpath}"/>
<property name="compiler.args.webmagic-core" value="${compiler.args}"/>
<property name="webmagic-core.output.dir" value="${module.webmagic-core.basedir}/target/classes"/>
<property name="webmagic-core.testoutput.dir" value="${module.webmagic-core.basedir}/target/test-classes"/>
<path id="webmagic-core.module.bootclasspath">
<!-- Paths to be included in compilation bootclasspath -->
</path>
<path id="webmagic-core.module.production.classpath">
<path refid="${module.jdk.classpath.webmagic-core}"/>
<path refid="library.maven:_org.apache.httpcomponents:httpclient:4.2.4.classpath"/>
<path refid="library.maven:_org.apache.httpcomponents:httpcore:4.2.4.classpath"/>
<path refid="library.maven:_commons-logging:commons-logging:1.1.1.classpath"/>
<path refid="library.maven:_commons-codec:commons-codec:1.6.classpath"/>
<path refid="library.maven:_com.google.guava:guava:13.0.1.classpath"/>
<path refid="library.maven:_org.apache.commons:commons-lang3:3.1.classpath"/>
<path refid="library.maven:_log4j:log4j:1.2.17.classpath"/>
<path refid="library.maven:_commons-collections:commons-collections:3.2.1.classpath"/>
<path refid="library.maven:_net.sourceforge.htmlcleaner:htmlcleaner:2.4.classpath"/>
<path refid="library.maven:_org.jdom:jdom2:2.0.4.classpath"/>
<path refid="library.maven:_commons-io:commons-io:1.3.2.classpath"/>
</path>
<path id="webmagic-core.runtime.production.module.classpath">
<pathelement location="${webmagic-core.output.dir}"/>
<path refid="library.maven:_org.apache.httpcomponents:httpclient:4.2.4.classpath"/>
<path refid="library.maven:_org.apache.httpcomponents:httpcore:4.2.4.classpath"/>
<path refid="library.maven:_commons-logging:commons-logging:1.1.1.classpath"/>
<path refid="library.maven:_commons-codec:commons-codec:1.6.classpath"/>
<path refid="library.maven:_com.google.guava:guava:13.0.1.classpath"/>
<path refid="library.maven:_org.apache.commons:commons-lang3:3.1.classpath"/>
<path refid="library.maven:_log4j:log4j:1.2.17.classpath"/>
<path refid="library.maven:_commons-collections:commons-collections:3.2.1.classpath"/>
<path refid="library.maven:_net.sourceforge.htmlcleaner:htmlcleaner:2.4.classpath"/>
<path refid="library.maven:_org.jdom:jdom2:2.0.4.classpath"/>
<path refid="library.maven:_commons-io:commons-io:1.3.2.classpath"/>
</path>
<path id="webmagic-core.module.classpath">
<path refid="${module.jdk.classpath.webmagic-core}"/>
<pathelement location="${webmagic-core.output.dir}"/>
<path refid="library.maven:_org.apache.httpcomponents:httpclient:4.2.4.classpath"/>
<path refid="library.maven:_org.apache.httpcomponents:httpcore:4.2.4.classpath"/>
<path refid="library.maven:_commons-logging:commons-logging:1.1.1.classpath"/>
<path refid="library.maven:_commons-codec:commons-codec:1.6.classpath"/>
<path refid="library.maven:_junit:junit:4.7.classpath"/>
<path refid="library.maven:_com.google.guava:guava:13.0.1.classpath"/>
<path refid="library.maven:_org.apache.commons:commons-lang3:3.1.classpath"/>
<path refid="library.maven:_log4j:log4j:1.2.17.classpath"/>
<path refid="library.maven:_commons-collections:commons-collections:3.2.1.classpath"/>
<path refid="library.maven:_net.sourceforge.htmlcleaner:htmlcleaner:2.4.classpath"/>
<path refid="library.maven:_org.jdom:jdom2:2.0.4.classpath"/>
<path refid="library.maven:_commons-io:commons-io:1.3.2.classpath"/>
</path>
<path id="webmagic-core.runtime.module.classpath">
<pathelement location="${webmagic-core.testoutput.dir}"/>
<pathelement location="${webmagic-core.output.dir}"/>
<path refid="library.maven:_org.apache.httpcomponents:httpclient:4.2.4.classpath"/>
<path refid="library.maven:_org.apache.httpcomponents:httpcore:4.2.4.classpath"/>
<path refid="library.maven:_commons-logging:commons-logging:1.1.1.classpath"/>
<path refid="library.maven:_commons-codec:commons-codec:1.6.classpath"/>
<path refid="library.maven:_junit:junit:4.7.classpath"/>
<path refid="library.maven:_com.google.guava:guava:13.0.1.classpath"/>
<path refid="library.maven:_org.apache.commons:commons-lang3:3.1.classpath"/>
<path refid="library.maven:_log4j:log4j:1.2.17.classpath"/>
<path refid="library.maven:_commons-collections:commons-collections:3.2.1.classpath"/>
<path refid="library.maven:_net.sourceforge.htmlcleaner:htmlcleaner:2.4.classpath"/>
<path refid="library.maven:_org.jdom:jdom2:2.0.4.classpath"/>
<path refid="library.maven:_commons-io:commons-io:1.3.2.classpath"/>
</path>
<patternset id="excluded.from.module.webmagic-core">
<patternset refid="ignored.files"/>
</patternset>
<patternset id="excluded.from.compilation.webmagic-core">
<patternset refid="excluded.from.module.webmagic-core"/>
</patternset>
<path id="webmagic-core.module.sourcepath">
<dirset dir="${module.webmagic-core.basedir}">
<include name="src/main/java"/>
<include name="src/main/resources"/>
</dirset>
</path>
<path id="webmagic-core.module.test.sourcepath">
<dirset dir="${module.webmagic-core.basedir}">
<include name="src/test/java"/>
<include name="src/test/resources"/>
</dirset>
</path>
<target name="compile.module.webmagic-core" depends="compile.module.webmagic-core.production,compile.module.webmagic-core.tests" description="Compile module webmagic-core"/>
<target name="compile.module.webmagic-core.production" depends="register.custom.compilers" description="Compile module webmagic-core; production classes">
<mkdir dir="${webmagic-core.output.dir}"/>
<javac2 destdir="${webmagic-core.output.dir}" debug="${compiler.debug}" nowarn="${compiler.generate.no.warnings}" memorymaximumsize="${compiler.max.memory}" fork="true" executable="${module.jdk.bin.webmagic-core}/javac">
<compilerarg line="${compiler.args.webmagic-core}"/>
<bootclasspath refid="webmagic-core.module.bootclasspath"/>
<classpath refid="webmagic-core.module.production.classpath"/>
<src refid="webmagic-core.module.sourcepath"/>
<patternset refid="excluded.from.compilation.webmagic-core"/>
</javac2>
<copy todir="${webmagic-core.output.dir}">
<fileset dir="${module.webmagic-core.basedir}/src/main/java">
<patternset refid="compiler.resources"/>
<type type="file"/>
</fileset>
<fileset dir="${module.webmagic-core.basedir}/src/main/resources">
<patternset refid="compiler.resources"/>
<type type="file"/>
</fileset>
</copy>
</target>
<target name="compile.module.webmagic-core.tests" depends="register.custom.compilers,compile.module.webmagic-core.production" description="compile module webmagic-core; test classes" unless="skip.tests">
<mkdir dir="${webmagic-core.testoutput.dir}"/>
<javac2 destdir="${webmagic-core.testoutput.dir}" debug="${compiler.debug}" nowarn="${compiler.generate.no.warnings}" memorymaximumsize="${compiler.max.memory}" fork="true" executable="${module.jdk.bin.webmagic-core}/javac">
<compilerarg line="${compiler.args.webmagic-core}"/>
<bootclasspath refid="webmagic-core.module.bootclasspath"/>
<classpath refid="webmagic-core.module.classpath"/>
<src refid="webmagic-core.module.test.sourcepath"/>
<patternset refid="excluded.from.compilation.webmagic-core"/>
</javac2>
<copy todir="${webmagic-core.testoutput.dir}">
<fileset dir="${module.webmagic-core.basedir}/src/test/java">
<patternset refid="compiler.resources"/>
<type type="file"/>
</fileset>
<fileset dir="${module.webmagic-core.basedir}/src/test/resources">
<patternset refid="compiler.resources"/>
<type type="file"/>
</fileset>
</copy>
</target>
<target name="clean.module.webmagic-core" description="cleanup module">
<delete dir="${webmagic-core.output.dir}"/>
<delete dir="${webmagic-core.testoutput.dir}"/>
</target>
</project>


@ -3,7 +3,7 @@
<parent>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-parent</artifactId>
<version>0.4.3-SNAPSHOT</version>
<version>0.4.2</version>
</parent>
<modelVersion>4.0.0</modelVersion>
@ -63,4 +63,4 @@
</dependencies>
</project>
</project>


@ -1,93 +1,102 @@
package us.codecraft.webmagic.selector;
import org.apache.commons.lang3.StringUtils;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;
/**
* Selector in regex.<br>
*
* @author code4crafter@gmail.com <br>
* @since 0.1.0
*/
public class RegexSelector implements Selector {
private String regexStr;
private Pattern regex;
private int group = 1;
public RegexSelector(String regexStr, int group) {
if (StringUtils.isBlank(regexStr)) {
throw new IllegalArgumentException("regex must not be empty");
}
if (!StringUtils.contains(regexStr, "(") && !StringUtils.contains(regexStr, ")")) {
regexStr = "(" + regexStr + ")";
}
if (!StringUtils.contains(regexStr, "(") || !StringUtils.contains(regexStr, ")")) {
throw new IllegalArgumentException("regex must have capture group 1");
}
this.regexStr = regexStr;
try {
regex = Pattern.compile(regexStr, Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
} catch (PatternSyntaxException e) {
throw new IllegalArgumentException("invalid regex", e);
}
this.group = group;
}
public RegexSelector(String regexStr) {
this(regexStr, 1);
}
@Override
public String select(String text) {
return selectGroup(text).get(group);
}
@Override
public List<String> selectList(String text) {
List<String> strings = new ArrayList<String>();
List<RegexResult> results = selectGroupList(text);
for (RegexResult result : results) {
strings.add(result.get(group));
}
return strings;
}
public RegexResult selectGroup(String text) {
Matcher matcher = regex.matcher(text);
if (matcher.find()) {
String[] groups = new String[matcher.groupCount() + 1];
for (int i = 0; i < groups.length; i++) {
groups[i] = matcher.group(i);
}
return new RegexResult(groups);
}
return RegexResult.EMPTY_RESULT;
}
public List<RegexResult> selectGroupList(String text) {
Matcher matcher = regex.matcher(text);
List<RegexResult> resultList = new ArrayList<RegexResult>();
while (matcher.find()) {
String[] groups = new String[matcher.groupCount() + 1];
for (int i = 0; i < groups.length; i++) {
groups[i] = matcher.group(i);
}
resultList.add(new RegexResult(groups));
}
return resultList;
}
@Override
public String toString() {
return regexStr;
}
}
package us.codecraft.webmagic.selector;
import org.apache.commons.lang3.StringUtils;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;
/**
* Selector in regex.<br>
*
* @author code4crafter@gmail.com <br>
* @since 0.1.0
*/
public class RegexSelector implements Selector {
private String regexStr;
private Pattern regex;
private int group = 1;
public RegexSelector(String regexStr, int group) {
if (StringUtils.isBlank(regexStr)) {
throw new IllegalArgumentException("regex must not be empty");
}
/* Can't detect '\(', '(?:)' so that would result in ArrayIndexOutOfBoundsException
if (!StringUtils.contains(regexStr, "(") && !StringUtils.contains(regexStr, ")")) {
regexStr = "(" + regexStr + ")";
}
if (!StringUtils.contains(regexStr, "(") || !StringUtils.contains(regexStr, ")")) {
throw new IllegalArgumentException("regex must have capture group 1");
}
*/
// Fix: only check whether an unescaped, capturing left parenthesis exists; leave regex validation to Pattern.
// countMatches() takes a literal substring, so an escaped parenthesis \( is written "\\(" in Java source.
if (StringUtils.countMatches(regexStr, "(") - StringUtils.countMatches(regexStr, "\\(") ==
StringUtils.countMatches(regexStr, "(?:") - StringUtils.countMatches(regexStr, "\\(?:")) {
regexStr = "(" + regexStr + ")";
}
this.regexStr = regexStr;
try {
regex = Pattern.compile(regexStr, Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
} catch (PatternSyntaxException e) {
throw new IllegalArgumentException("invalid regex", e);
}
this.group = group;
}
public RegexSelector(String regexStr) {
this(regexStr, 1);
}
@Override
public String select(String text) {
return selectGroup(text).get(group);
}
@Override
public List<String> selectList(String text) {
List<String> strings = new ArrayList<String>();
List<RegexResult> results = selectGroupList(text);
for (RegexResult result : results) {
strings.add(result.get(group));
}
return strings;
}
public RegexResult selectGroup(String text) {
Matcher matcher = regex.matcher(text);
if (matcher.find()) {
String[] groups = new String[matcher.groupCount() + 1];
for (int i = 0; i < groups.length; i++) {
groups[i] = matcher.group(i);
}
return new RegexResult(groups);
}
return RegexResult.EMPTY_RESULT;
}
public List<RegexResult> selectGroupList(String text) {
Matcher matcher = regex.matcher(text);
List<RegexResult> resultList = new ArrayList<RegexResult>();
while (matcher.find()) {
String[] groups = new String[matcher.groupCount() + 1];
for (int i = 0; i < groups.length; i++) {
groups[i] = matcher.group(i);
}
resultList.add(new RegexResult(groups));
}
return resultList;
}
@Override
public String toString() {
return regexStr;
}
}
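One alternative to counting parenthesis substrings, sketched here under the assumption that compiling the pattern one extra time is acceptable, is to let `Pattern` itself report how many capturing groups a regex has. `Matcher.groupCount()` counts capturing groups only, so escaped parentheses and non-capturing groups such as `(?:...)` do not register. The class and method names below are hypothetical, and this is not the approach taken in the commit.

```java
import java.util.regex.Pattern;

// Hypothetical helper; not part of webmagic.
final class CaptureGroupSketch {

    // Returns the regex unchanged if it already has a capturing group,
    // otherwise wraps the whole expression so that group 1 exists.
    static String ensureCaptureGroup(String regexStr) {
        // Pattern.compile throws PatternSyntaxException for an invalid regex.
        int capturingGroups = Pattern.compile(regexStr).matcher("").groupCount();
        return capturingGroups == 0 ? "(" + regexStr + ")" : regexStr;
    }

    public static void main(String[] args) {
        System.out.println(ensureCaptureGroup("\\d+"));       // prints (\d+)
        System.out.println(ensureCaptureGroup("a\\(b"));      // prints (a\(b)
        System.out.println(ensureCaptureGroup("(?:ab)+"));    // prints ((?:ab)+)
        System.out.println(ensureCaptureGroup("id=(\\d+)"));  // prints id=(\d+)
    }
}
```

This only guarantees that group 1 exists. A constructor that accepts an arbitrary group index would still need to compare that index against the group count before calling `Matcher.group(int)`.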


@ -3,7 +3,7 @@
<parent>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-parent</artifactId>
<version>0.4.3-SNAPSHOT</version>
<version>0.4.2</version>
</parent>
<modelVersion>4.0.0</modelVersion>
@ -35,4 +35,4 @@
</dependency>
</dependencies>
</project>
</project>


@ -0,0 +1,37 @@
package us.codecraft.webmagic.model.samples;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.model.AfterExtractor;
import us.codecraft.webmagic.model.OOSpider;
import us.codecraft.webmagic.model.annotation.ExtractBy;
import us.codecraft.webmagic.model.annotation.TargetUrl;
import java.util.List;
/**
* @author yihua.huang@dianping.com <br>
* Date: 13-8-13 <br>
* Time: 10:13 <br>
*/
@TargetUrl("http://*.alpha.dp/*")
public class DianpingFtlDataScanner implements AfterExtractor {
@ExtractBy(value = "(DP\\.data\\(\\{.*\\}\\));", type = ExtractBy.Type.Regex, notNull = true, multi = true)
private List<String> data;
public static void main(String[] args) {
OOSpider.create(Site.me().addStartUrl("http://w.alpha.dp/").setSleepTime(0), DianpingFtlDataScanner.class)
.thread(5).run();
}
@Override
public void afterProcess(Page page) {
if (data.size() > 1) {
System.err.println(page.getUrl());
}
if (data.size() > 0 && data.get(0).length() > 100) {
System.err.println(page.getUrl());
}
}
}


@ -0,0 +1,46 @@
package us.codecraft.webmagic.samples;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.PlainText;
import java.util.List;
/**
* @author code4crafter@gmail.com <br>
* Date: 13-4-21
* Time: 8:08
*/
public class DiaoyuwengProcessor implements PageProcessor {
private Site site;
@Override
public void process(Page page) {
List<String> requests = page.getHtml().links().regex("(http://www\\.diaoyuweng\\.com/home\\.php\\?mod=space&uid=88304&do=thread&view=me&type=thread&order=dateline&from=space&page=\\d+)").all();
page.addTargetRequests(requests);
requests = page.getHtml().links().regex("(http://www\\.diaoyuweng\\.com/thread-\\d+-1-1.html)").all();
page.addTargetRequests(requests);
if (page.getUrl().toString().contains("thread")){
page.putField("title", page.getHtml().xpath("//a[@id='thread_subject']"));
page.putField("content", page.getHtml().xpath("//div[@class='pcb']//tbody/tidyText()"));
page.putField("date",page.getHtml().regex("发表于 (\\d{4}-\\d+-\\d+ \\d+:\\d+:\\d+)"));
page.putField("id",new PlainText("1000"+page.getUrl().regex("http://www\\.diaoyuweng\\.com/thread-(\\d+)-1-1.html").toString()));
}
}
@Override
public Site getSite() {
if (site==null){
site= Site.me().setDomain("www.diaoyuweng.com").addStartUrl("http://www.diaoyuweng.com/home.php?mod=space&uid=88304&do=thread&view=me&type=thread&from=space").
setUserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.65 Safari/537.31").setCharset("GBK").setSleepTime(500);
}
return site;
}
public static void main(String[] args) {
Spider.create(new DiaoyuwengProcessor()).run();
}
}


@ -0,0 +1,34 @@
package us.codecraft.webmagic.samples;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.scheduler.RedisScheduler;
import java.util.List;
/**
* @author code4crafter@gmail.com <br>
* Date: 13-4-21
* Time: 1:48
*/
public class F58PageProcesser implements PageProcessor {
@Override
public void process(Page page) {
List<String> strings = page.getHtml().links().regex(".*/yewu/.*").all();
page.addTargetRequests(strings);
page.putField("title",page.getHtml().regex("<title>(.*)</title>"));
page.putField("body",page.getHtml().xpath("//dd"));
}
@Override
public Site getSite() {
return Site.me().setDomain("sh.58.com").addStartUrl("http://sh1.51a8.com/").setCycleRetryTimes(2); //To change body of implemented methods use File | Settings | File Templates.
}
public static void main(String[] args) {
Spider.create(new F58PageProcesser()).setScheduler(new RedisScheduler("localhost")).run();
}
}


@ -27,4 +27,5 @@ public class HuxiuProcessor implements PageProcessor {
public static void main(String[] args) {
Spider.create(new HuxiuProcessor()).run();
}
}


@ -0,0 +1,32 @@
package us.codecraft.webmagic.samples;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
/**
* @author code4crafter@gmail.com <br>
* Date: 13-5-20
* Time: 5:31
*/
public class KaichibaProcessor implements PageProcessor {
@Override
public void process(Page page) {
//http://progressdaily.diandian.com/post/2013-01-24/40046867275
int i = Integer.valueOf(page.getUrl().regex("shop/(\\d+)").toString()) + 1;
page.addTargetRequest("http://kaichiba.com/shop/" + i);
page.putField("title",page.getHtml().xpath("//Title"));
page.putField("items", page.getHtml().xpath("//li[@class=\"foodTitle\"]").replace("^\\s+", "").replace("\\s+$", "").replace("<span>.*?</span>", ""));
}
@Override
public Site getSite() {
return Site.me().setDomain("kaichiba.com").addStartUrl("http://kaichiba.com/shop/41725781").setCharset("utf-8").
setUserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.65 Safari/537.31");
}
public static void main(String[] args) {
Spider.create(new KaichibaProcessor()).run();
}
}


@ -0,0 +1,38 @@
package us.codecraft.webmagic.samples;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import java.util.List;
/**
* @author code4crafter@gmail.com <br>
* Date: 13-5-20
* Time: 5:31
*/
public class MeicanProcessor implements PageProcessor {
@Override
public void process(Page page) {
//http://progressdaily.diandian.com/post/2013-01-24/40046867275
List<String> requests = page.getHtml().xpath("//a[@class=\"area_link flat_btn\"]/@href").all();
if (requests.size() > 2) {
requests = requests.subList(0, 2);
}
page.addTargetRequests(requests);
page.addTargetRequests(page.getHtml().links().regex("(.*/restaurant/[^#]+)").all());
page.putField("items", page.getHtml().xpath("//ul[@class=\"dishes menu_dishes\"]/li/span[@class=\"name\"]/text()"));
page.putField("prices", page.getHtml().xpath("//ul[@class=\"dishes menu_dishes\"]/li/span[@class=\"price_outer\"]/span[@class=\"price\"]/text()"));
}
@Override
public Site getSite() {
return Site.me().setDomain("meican.com").addStartUrl("http://www.meican.com/shanghai/districts").setCharset("utf-8").
setUserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.65 Safari/537.31");
}
public static void main(String[] args) {
Spider.create(new MeicanProcessor()).run();
}
}


@ -3,7 +3,7 @@
<parent>
<artifactId>webmagic-parent</artifactId>
<groupId>us.codecraft</groupId>
<version>0.4.3-SNAPSHOT</version>
<version>0.4.2</version>
</parent>
<modelVersion>4.0.0</modelVersion>
@ -91,4 +91,4 @@
</build>
</project>
</project>