webmagic

A scalable crawler framework. It covers the whole crawler lifecycle: downloading, URL management, content extraction, and persistence. It simplifies the development of a specific crawler.

Features:

- Simple core with high flexibility.
- Simple API for HTML extraction.
- Annotated POJOs to customize a crawler, with no configuration.
- Multi-threading and distributed crawling support.
- Easy to integrate.

Install:

Add dependencies to your pom.xml:

    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-core</artifactId>
        <version>0.3.0</version>
    </dependency>
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-extension</artifactId>
        <version>0.3.0</version>
    </dependency>

Get Started:

First crawler:

Write a class that implements PageProcessor:

    import java.util.List;

    import us.codecraft.webmagic.Page;
    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.Spider;
    import us.codecraft.webmagic.pipeline.ConsolePipeline;
    import us.codecraft.webmagic.processor.PageProcessor;

    public class OschinaBlogPageProcesser implements PageProcessor {

        // Site settings: the domain and the URL the crawl starts from.
        private Site site = Site.me().setDomain("my.oschina.net")
                .addStartUrl("http://my.oschina.net/flashsword/blog");

        @Override
        public void process(Page page) {
            // Collect links to individual blog posts and queue them for crawling.
            List<String> links = page.getHtml().links().regex("http://my\\.oschina\\.net/flashsword/blog/\\d+").all();
            page.addTargetRequests(links);
            // Extract fields with XPath and CSS selectors.
            page.putField("title", page.getHtml().xpath("//div[@class='BlogEntity']/div[@class='BlogTitle']/h1").toString());
            page.putField("content", page.getHtml().$("div.content").toString());
            page.putField("tags", page.getHtml().xpath("//div[@class='BlogTags']/a/text()").all());
        }

        @Override
        public Site getSite() {
            return site;
        }

        public static void main(String[] args) {
            // Print the extracted fields to the console.
            Spider.create(new OschinaBlogPageProcesser())
                    .pipeline(new ConsolePipeline()).run();
        }
    }

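The link filter in `process` keeps only URLs that look like individual blog posts. As a rough illustration of what that pattern accepts, here is a minimal sketch using plain `java.util.regex`. It is not webmagic's selector code, it assumes full-string matching (webmagic's own matching semantics may differ), and the class `LinkFilterSketch`, the method `filterLinks`, and the sample URLs are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Sketch only: plain java.util.regex, not webmagic's selector implementation.
final class LinkFilterSketch {

    // Keep only the links that match the regex in full.
    static List<String> filterLinks(List<String> links, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> matched = new ArrayList<>();
        for (String link : links) {
            if (pattern.matcher(link).matches()) {
                matched.add(link);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        // Hypothetical sample links, as they might be found on the start page.
        List<String> links = Arrays.asList(
                "http://my.oschina.net/flashsword/blog/145796",
                "http://my.oschina.net/flashsword/blog",
                "http://my.oschina.net/flashsword/home");
        // Same pattern as in the PageProcessor example.
        String regex = "http://my\\.oschina\\.net/flashsword/blog/\\d+";
        // prints [http://my.oschina.net/flashsword/blog/145796]
        System.out.println(filterLinks(links, regex));
    }
}
```

Only the URL ending in a numeric post id passes; the blog index and the home page are dropped, so the crawler would not queue them again as posts.
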
- page.addTargetRequests(links)

    Add URLs for crawling.

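`addTargetRequests` hands newly found URLs to the crawler's URL management. One way to picture that step is a FIFO queue that skips URLs it has already seen. The sketch below is hypothetical: `UrlQueueSketch` and its methods are not part of webmagic, and the real scheduler may order or deduplicate requests differently.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch of URL management; not webmagic's scheduler.
final class UrlQueueSketch {

    private final Queue<String> pending = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    // Queue each URL once; URLs that were added before are skipped.
    void addTargetRequests(List<String> urls) {
        for (String url : urls) {
            if (seen.add(url)) {
                pending.add(url);
            }
        }
    }

    // Next URL to download, or null when nothing is pending.
    String poll() {
        return pending.poll();
    }

    // Number of URLs still waiting to be downloaded.
    int size() {
        return pending.size();
    }

    public static void main(String[] args) {
        UrlQueueSketch queue = new UrlQueueSketch();
        queue.addTargetRequests(Arrays.asList("http://example.com/a", "http://example.com/b"));
        queue.addTargetRequests(Arrays.asList("http://example.com/b", "http://example.com/c"));
        System.out.println(queue.size()); // prints 3
    }
}
```

Deduplicating at insertion time keeps the crawl from looping when pages link to each other.
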
You can also use annotations:

    import java.util.List;

    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.model.ConsolePageModelPipeline;
    import us.codecraft.webmagic.model.OOSpider;
    import us.codecraft.webmagic.model.annotation.ExtractBy;
    import us.codecraft.webmagic.model.annotation.TargetUrl;

    // Pages whose URL matches this pattern are mapped onto the class.
    @TargetUrl("http://my.oschina.net/flashsword/blog/\\d+")
    public class OschinaBlog {

        @ExtractBy("//title")
        private String title;

        @ExtractBy(value = "div.BlogContent", type = ExtractBy.Type.Css)
        private String content;

        // multi = true collects every match into the list.
        @ExtractBy(value = "//div[@class='BlogTags']/a/text()", multi = true)
        private List<String> tags;

        public static void main(String[] args) {
            OOSpider.create(
                    Site.me().addStartUrl("http://my.oschina.net/flashsword/blog"),
                    new ConsolePageModelPipeline(), OschinaBlog.class).run();
        }
    }

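To give a feel for how annotation-driven extraction can work in general, here is a minimal sketch that fills annotated fields through reflection. It is an illustration under stated assumptions, not webmagic's implementation: the `@ExtractByRegex` annotation, the `Blog` class, and the `extract` method are all hypothetical, and a regular expression stands in for XPath or CSS selectors.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of annotation-driven extraction; not webmagic code.
final class AnnotationSketch {

    // Hypothetical annotation: a regex whose first group becomes the field value.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface ExtractByRegex {
        String value();
    }

    // Example POJO: the extraction rule lives on the field itself.
    static class Blog {
        @ExtractByRegex("<title>(.*?)</title>")
        String title;
    }

    // Create an instance of the type and fill each annotated field from the text.
    static <T> T extract(String text, Class<T> type) {
        try {
            T model = type.getDeclaredConstructor().newInstance();
            for (Field field : type.getDeclaredFields()) {
                ExtractByRegex rule = field.getAnnotation(ExtractByRegex.class);
                if (rule == null) {
                    continue; // field carries no extraction rule
                }
                Matcher matcher = Pattern.compile(rule.value()).matcher(text);
                if (matcher.find()) {
                    field.setAccessible(true);
                    field.set(model, matcher.group(1));
                }
            }
            return model;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot populate " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        Blog blog = extract("<html><title>Hello</title></html>", Blog.class);
        System.out.println(blog.title); // prints Hello
    }
}
```

Keeping the rule next to the field is what lets a plain class describe a crawler without a separate configuration file.
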
Docs and samples:

The architecture of webmagic is modeled on Scrapy.

Javadocs: http://code4craft.github.io/webmagic/docs/en/

Samples are available in the webmagic-samples package.

License:

Licensed under the Apache 2.0 license.

Thanks:

To write webmagic, I referred to the projects below:

- Scrapy

    A crawler framework in Python.

- Spiderman

    Another crawler framework in Java.