diff --git a/.gitignore b/.gitignore index c0dc326..d7d63fe 100644 --- a/.gitignore +++ b/.gitignore @@ -2,4 +2,5 @@ target *.iml out/ .idea - +.classpath +.project diff --git a/README.md b/README.md index fba7076..b336367 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,4 @@ -![logo](https://raw.github.com/code4craft/webmagic/master/assets/logo.jpg) +![logo](https://raw.github.com/code4craft/webmagic/master/asserts/logo.jpg) [![Build Status](https://travis-ci.org/code4craft/webmagic.png?branch=master)](https://travis-ci.org/code4craft/webmagic) @@ -175,6 +175,7 @@ webmagic遵循[Apache 2.0协议](http://opensource.org/licenses/Apache-2.0) 以下是为WebMagic提交过代码或者issue的朋友: +* [ccliangbo](https://github.com/ccliangbo) * [yuany](https://github.com/yuany) * [yxssfxwzy](https://github.com/yxssfxwzy) * [linkerlin](https://github.com/linkerlin) @@ -188,8 +189,9 @@ webmagic遵循[Apache 2.0协议](http://opensource.org/licenses/Apache-2.0) * [ywooer](https://github.com/ywooer) * [yyw258520](https://github.com/yyw258520) * [perfecking](https://github.com/perfecking) -* [ccliangbo](https://github.com/ccliangbo) * [lidongyang](http://my.oschina.net/lidongyang) +* [seveniu](https://github.com/seveniu) +* [sebastian1118](https://github.com/sebastian1118) ### 邮件组: @@ -201,4 +203,4 @@ QQ: ### QQ群: -330192938 +373225642 diff --git a/asserts/data.plist b/assets/data.plist similarity index 100% rename from asserts/data.plist rename to assets/data.plist diff --git a/asserts/image1.pdf b/assets/image1.pdf similarity index 100% rename from asserts/image1.pdf rename to assets/image1.pdf diff --git a/asserts/logo-simple.jpg b/assets/logo-simple.jpg similarity index 100% rename from asserts/logo-simple.jpg rename to assets/logo-simple.jpg diff --git a/asserts/logo.graffle b/assets/logo.graffle similarity index 100% rename from asserts/logo.graffle rename to assets/logo.graffle diff --git a/asserts/logo.jpg b/assets/logo.jpg similarity index 100% rename from asserts/logo.jpg rename to assets/logo.jpg diff 
--git a/asserts/logo2.graffle/data.plist b/assets/logo2.graffle/data.plist similarity index 100% rename from asserts/logo2.graffle/data.plist rename to assets/logo2.graffle/data.plist diff --git a/asserts/logo2.graffle/image1.tiff b/assets/logo2.graffle/image1.tiff similarity index 100% rename from asserts/logo2.graffle/image1.tiff rename to assets/logo2.graffle/image1.tiff diff --git a/asserts/logo3.graffle/data.plist b/assets/logo3.graffle/data.plist similarity index 100% rename from asserts/logo3.graffle/data.plist rename to assets/logo3.graffle/data.plist diff --git a/asserts/logo3.graffle/image1.tiff b/assets/logo3.graffle/image1.tiff similarity index 100% rename from asserts/logo3.graffle/image1.tiff rename to assets/logo3.graffle/image1.tiff diff --git a/asserts/logo3.graffle/image2.tiff b/assets/logo3.graffle/image2.tiff similarity index 100% rename from asserts/logo3.graffle/image2.tiff rename to assets/logo3.graffle/image2.tiff diff --git a/asserts/logo3.graffle/image4.tiff b/assets/logo3.graffle/image4.tiff similarity index 100% rename from asserts/logo3.graffle/image4.tiff rename to assets/logo3.graffle/image4.tiff diff --git a/asserts/logo3.graffle/image5.tiff b/assets/logo3.graffle/image5.tiff similarity index 100% rename from asserts/logo3.graffle/image5.tiff rename to assets/logo3.graffle/image5.tiff diff --git a/asserts/logo3.png b/assets/logo3.png similarity index 100% rename from asserts/logo3.png rename to assets/logo3.png diff --git a/asserts/logo4.png b/assets/logo4.png similarity index 100% rename from asserts/logo4.png rename to assets/logo4.png diff --git a/assets/page-extract-rule.bmml b/assets/page-extract-rule.bmml new file mode 100644 index 0000000..fec8d3e --- /dev/null +++ b/assets/page-extract-rule.bmml @@ -0,0 +1,9 @@ + + + + + A%20Web%20Page%0Ahttp%3A// + + + + \ No newline at end of file diff --git a/asserts/webmagic-create-spider.bmml b/assets/webmagic-create-spider.bmml similarity index 100% rename from 
asserts/webmagic-create-spider.bmml rename to assets/webmagic-create-spider.bmml diff --git a/asserts/webmagic-create-spider.png b/assets/webmagic-create-spider.png similarity index 100% rename from asserts/webmagic-create-spider.png rename to assets/webmagic-create-spider.png diff --git a/asserts/webmagic-spider-manage.bmml b/assets/webmagic-spider-manage.bmml similarity index 100% rename from asserts/webmagic-spider-manage.bmml rename to assets/webmagic-spider-manage.bmml diff --git a/asserts/webmagic-spider-manage.png b/assets/webmagic-spider-manage.png similarity index 100% rename from asserts/webmagic-spider-manage.png rename to assets/webmagic-spider-manage.png diff --git a/asserts/webmagic.psd b/assets/webmagic.psd similarity index 100% rename from asserts/webmagic.psd rename to assets/webmagic.psd diff --git a/en_docs/README.md b/en_docs/README.md index 684da90..cccbf3f 100644 --- a/en_docs/README.md +++ b/en_docs/README.md @@ -1,10 +1,13 @@ -webmagic ---- +![logo](https://raw.github.com/code4craft/webmagic/master/asserts/logo.jpg) + [Readme in Chinese](https://github.com/code4craft/webmagic/tree/master/zh_docs) +[User Manual (Chinese)](https://github.com/code4craft/webmagic/blob/master/user-manual.md) + + [![Build Status](https://travis-ci.org/code4craft/webmagic.png?branch=master)](https://travis-ci.org/code4craft/webmagic) ->A scalable crawler framework. It covers the whole lifecycle of crawler: downloading, url management, content extraction and persistent. It can simply the development of a specific crawler. +>A scalable crawler framework. It covers the whole lifecycle of crawler: downloading, url management, content extraction and persistent. It can simplify the development of a specific crawler. ## Features: @@ -14,26 +17,19 @@ webmagic * Multi-thread and Distribution support. * Easy to be integrated. 
- ## Install: - -Clone the repo and build: - - git clone https://github.com/code4craft/webmagic.git - cd webmagic - mvn clean install - -Add dependencies to your project: + +Add dependencies to your pom.xml: us.codecraft webmagic-core - 0.4.2 + 0.4.3 us.codecraft webmagic-extension - 0.4.2 + 0.4.3 ## Get Started: @@ -42,10 +38,10 @@ Add dependencies to your project: Write a class implements PageProcessor: +```java public class OschinaBlogPageProcesser implements PageProcessor { - private Site site = Site.me().setDomain("my.oschina.net") - .addStartUrl("http://my.oschina.net/flashsword/blog"); + private Site site = Site.me().setDomain("my.oschina.net"); @Override public void process(Page page) { @@ -63,10 +59,11 @@ Write a class implements PageProcessor: } public static void main(String[] args) { - Spider.create(new OschinaBlogPageProcesser()) - .pipeline(new ConsolePipeline()).run(); + Spider.create(new OschinaBlogPageProcesser()).addUrl("http://my.oschina.net/flashsword/blog") + .addPipeline(new ConsolePipeline()).run(); } } +``` * `page.addTargetRequests(links)` @@ -74,6 +71,7 @@ Write a class implements PageProcessor: You can also use annotation way: +```java @TargetUrl("http://my.oschina.net/flashsword/blog/\\d+") public class OschinaBlog { @@ -88,10 +86,11 @@ You can also use annotation way: public static void main(String[] args) { OOSpider.create( - Site.me().addStartUrl("http://my.oschina.net/flashsword/blog"), - new ConsolePageModelPipeline(), OschinaBlog.class).run(); + Site.me(), + new ConsolePageModelPipeline(), OschinaBlog.class).addUrl("http://my.oschina.net/flashsword/blog").run(); } } +``` ### Docs and samples: @@ -103,11 +102,30 @@ Javadocs: [http://code4craft.github.io/webmagic/docs/en/](http://code4craft.gith There are some samples in `webmagic-samples` package. 
- ### Lisence: Lisenced under [Apache 2.0 lisence](http://opensource.org/licenses/Apache-2.0) +### Contributors: + +Thanks to these people for committing source code, reporting bugs, or suggesting new features: + +* [yuany](https://github.com/yuany) +* [yxssfxwzy](https://github.com/yxssfxwzy) +* [linkerlin](https://github.com/linkerlin) +* [d0ngw](https://github.com/d0ngw) +* [xuchaoo](https://github.com/xuchaoo) +* [supermicah](https://github.com/supermicah) +* [SimpleExpress](https://github.com/SimpleExpress) +* [aruanruan](https://github.com/aruanruan) +* [l1z2g9](https://github.com/l1z2g9) +* [zhegexiaohuozi](https://github.com/zhegexiaohuozi) +* [ywooer](https://github.com/ywooer) +* [yyw258520](https://github.com/yyw258520) +* [perfecking](https://github.com/perfecking) +* [lidongyang](http://my.oschina.net/lidongyang) + + ### Thanks: To write webmagic, I refered to the projects below : @@ -124,3 +142,10 @@ To write webmagic, I refered to the projects below : [https://gitcafe.com/laiweiwei/Spiderman](https://gitcafe.com/laiweiwei/Spiderman) +### Mail-list: + +[https://groups.google.com/forum/#!forum/webmagic-java](https://groups.google.com/forum/#!forum/webmagic-java) + + +[![Bitdeli Badge](https://d2weczhvl823v0.cloudfront.net/code4craft/webmagic/trend.png)](https://bitdeli.com/free "Bitdeli Badge") + diff --git a/pom.xml b/pom.xml index 5405d5e..9bfc505 100644 --- a/pom.xml +++ b/pom.xml @@ -6,7 +6,7 @@ 7 us.codecraft - 0.4.3 + 0.5.0 4.0.0 pom @@ -51,11 +51,10 @@ webmagic-core webmagic-extension/ webmagic-scripts/ - webmagic-avalon - webmagic-lucene - webmagic-samples - webmagic-saxon webmagic-selenium + webmagic-saxon + webmagic-samples + webmagic-avalon @@ -63,7 +62,7 @@ junit junit - 4.7 + 4.11 test @@ -89,12 +88,7 @@ us.codecraft xsoup - 0.2.0 - - - net.sf.saxon - Saxon-HE - 9.5.1-1 + 0.2.2 com.alibaba @@ -121,11 +115,6 @@ commons-collections 3.2.1 - - net.sourceforge.htmlcleaner - htmlcleaner - 2.5 - org.apache.commons commons-io @@ -136,6 +125,12 @@ 
jsoup 1.7.2 + + org.mockito + mockito-all + 1.9.5 + test + @@ -159,26 +154,26 @@ UTF-8 - - org.apache.maven.plugins - maven-dependency-plugin - 2.8 - - - copy-dependencies - package - - copy-dependencies - - - ${project.build.directory}/lib - false - false - true - - - - + + + + + + + + + + + + + + + + + + + + org.apache.maven.plugins maven-resources-plugin @@ -187,6 +182,15 @@ UTF-8 + + org.apache.maven.plugins + maven-jar-plugin + + + log4j.xml + + + org.apache.maven.plugins maven-source-plugin diff --git a/user-manual.md b/user-manual.md index f225c8a..d191965 100644 --- a/user-manual.md +++ b/user-manual.md @@ -65,7 +65,7 @@ webmagic还包含两个可用的扩展包,因为这两个包都依赖了比较 git clone http://git.oschina.net/flashsword20/webmagic.git -在**bin/lib**目录下,有项目依赖的所有jar包,直接在IDE里import即可。 +在**lib**目录下,有项目依赖的所有jar包,直接在IDE里import即可。 -------- diff --git a/webmagic-avalon/README.md b/webmagic-avalon/README.md new file mode 100644 index 0000000..4b15ed3 --- /dev/null +++ b/webmagic-avalon/README.md @@ -0,0 +1,5 @@ +WebMagic-Avalon +======== +> Spiders Manage Web + +see [#issue43](https://github.com/code4craft/webmagic/issues/43) \ No newline at end of file diff --git a/webmagic-avalon/forger/LICENSE b/webmagic-avalon/forger/LICENSE new file mode 100644 index 0000000..e06d208 --- /dev/null +++ b/webmagic-avalon/forger/LICENSE @@ -0,0 +1,202 @@ +Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. 
For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. + + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. 
For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. Grant of Patent License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. 
If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative Works, in at least one + of the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. 
You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. + + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. 
Limitation of Liability. In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. + + END OF TERMS AND CONDITIONS + + APPENDIX: How to apply the Apache License to your work. + + To apply the Apache License to your work, attach the following + boilerplate notice, with the fields enclosed by brackets "{}" + replaced with your own identifying information. (Don't include + the brackets!) The text should be enclosed in the appropriate + comment syntax for the file format. We also recommend that a + file or class name and description of purpose be included on the + same "printed page" as the copyright notice for easier + identification within third-party archives. 
+ + Copyright {yyyy} {name of copyright owner} + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. + diff --git a/webmagic-avalon/forger/README.md b/webmagic-avalon/forger/README.md new file mode 100644 index 0000000..1e4d7f5 --- /dev/null +++ b/webmagic-avalon/forger/README.md @@ -0,0 +1,27 @@ +forger +====== + +Dynamic Java object generator with template class and configuration. + +## Compiler + +Use groovy compiler. Compile source code to Java class. + +## PropertyLoader + +Load properties of object from user input. + +## API + +```java + @Test + public void testForgerCreateByClassAnnotationCompile() throws Exception { + ForgerFactory forgerFactory = new ForgerFactory(new AnnotationPropertyLoader(), new GroovyForgerCompiler()); + Forger forger = forgerFactory.compile(Foo.SOURCE_CODE); + Fooable foo = forger.forge(ImmutableMap.of("fooa", "test")); + Field field = forger.getClazz().getDeclaredField("foo"); + field.setAccessible(true); + assertThat(field.get(foo)).isEqualTo("test"); + assertThat(foo.foo()).isEqualTo("test"); + } +``` \ No newline at end of file diff --git a/webmagic-avalon/forger/pom.xml b/webmagic-avalon/forger/pom.xml new file mode 100644 index 0000000..44b42f9 --- /dev/null +++ b/webmagic-avalon/forger/pom.xml @@ -0,0 +1,193 @@ + + + + org.sonatype.oss + oss-parent + 7 + + us.codecraft + forger + 0.1.1-SNAPSHOT + 4.0.0 + jar + + UTF-8 + UTF-8 + + forger + + Dynamic Java object generator with template class and configuration. 
+ + https://github.com/code4craft/forger/ + + + code4craft + Yihua huang + code4crafer@gmail.com + + + + scm:git:git@github.com:code4craft/forger.git + scm:git:git@github.com:code4craft/forger.git + git@github.com:code4craft/forger.git + HEAD + + + + Apache License,Version 2 + http://www.apache.org/licenses/LICENSE-2.0 + repo + + + + + + junit + junit + 4.11 + test + + + org.assertj + assertj-core + 1.5.0 + + + org.codehaus.groovy + groovy + 2.2.2 + + + org.slf4j + slf4j-api + 1.7.6 + + + + org.slf4j + slf4j-log4j12 + 1.7.6 + + + + org.apache.commons + commons-lang3 + 3.1 + + + + com.google.guava + guava + 15.0 + + + + + + + org.apache.maven.plugins + maven-compiler-plugin + 3.1 + + 1.6 + 1.6 + UTF-8 + + + + org.apache.maven.plugins + maven-dependency-plugin + 2.8 + + + copy-dependencies + package + + copy-dependencies + + + ${project.build.directory}/lib + false + false + true + + + + + + org.apache.maven.plugins + maven-resources-plugin + 2.6 + + UTF-8 + + + + org.apache.maven.plugins + maven-source-plugin + 2.2.1 + + + attach-sources + + jar + + + + + + org.apache.maven.plugins + maven-javadoc-plugin + 2.9.1 + + UTF-8 + + + + attach-javadocs + + jar + + + + + + org.apache.maven.plugins + maven-release-plugin + 2.4.1 + + + + + + + release-sign-artifacts + + + performRelease + true + + + + + + org.apache.maven.plugins + maven-gpg-plugin + 1.1 + + + sign-artifacts + verify + + sign + + + + + + + + + + + diff --git a/webmagic-avalon/forger/src/main/java/us/codecraft/forger/Forger.java b/webmagic-avalon/forger/src/main/java/us/codecraft/forger/Forger.java new file mode 100644 index 0000000..57ec2ab --- /dev/null +++ b/webmagic-avalon/forger/src/main/java/us/codecraft/forger/Forger.java @@ -0,0 +1,36 @@ +package us.codecraft.forger; + +import us.codecraft.forger.property.Property; +import us.codecraft.forger.property.PropertyLoader; + +import java.util.List; +import java.util.Map; + +/** + * @author code4crafter@gmail.com + */ +public class Forger { + + private final 
Class clazz; + + private final PropertyLoader propertyLoader; + + public Forger(Class clazz,PropertyLoader propertyLoader) { + this.clazz = clazz; + this.propertyLoader = propertyLoader; + } + + public T forge(Map properties) throws IllegalAccessException, InstantiationException { + T t = clazz.newInstance(); + propertyLoader.load(t, properties); + return t; + } + + public List getPropertyNames() { + return propertyLoader.getProperties(clazz); + } + + public Class getClazz() { + return clazz; + } +} diff --git a/webmagic-avalon/forger/src/main/java/us/codecraft/forger/ForgerFactory.java b/webmagic-avalon/forger/src/main/java/us/codecraft/forger/ForgerFactory.java new file mode 100644 index 0000000..84b507b --- /dev/null +++ b/webmagic-avalon/forger/src/main/java/us/codecraft/forger/ForgerFactory.java @@ -0,0 +1,28 @@ +package us.codecraft.forger; + +import us.codecraft.forger.compiler.ForgerCompiler; +import us.codecraft.forger.property.PropertyLoader; + +/** + * @author code4crafter@gmail.com + */ +public class ForgerFactory { + + private final PropertyLoader propertyLoader; + + private final ForgerCompiler forgerCompiler; + + public ForgerFactory(PropertyLoader propertyLoader, ForgerCompiler forgerCompiler) { + this.propertyLoader = propertyLoader; + this.forgerCompiler = forgerCompiler; + } + + public Forger compile(String sourceCode) { + Class clazz = forgerCompiler.compile(sourceCode); + return new Forger(clazz, propertyLoader); + } + + public Forger create(Class clazz) { + return new Forger(clazz, propertyLoader); + } +} diff --git a/webmagic-avalon/forger/src/main/java/us/codecraft/forger/compiler/ForgerCompiler.java b/webmagic-avalon/forger/src/main/java/us/codecraft/forger/compiler/ForgerCompiler.java new file mode 100644 index 0000000..5e9e378 --- /dev/null +++ b/webmagic-avalon/forger/src/main/java/us/codecraft/forger/compiler/ForgerCompiler.java @@ -0,0 +1,9 @@ +package us.codecraft.forger.compiler; + +/** + * @author code4crafter@gmail.com + */ +public 
interface ForgerCompiler { + + public Class compile(String sourceCode); +} diff --git a/webmagic-avalon/forger/src/main/java/us/codecraft/forger/compiler/GroovyForgerCompiler.java b/webmagic-avalon/forger/src/main/java/us/codecraft/forger/compiler/GroovyForgerCompiler.java new file mode 100644 index 0000000..26a137e --- /dev/null +++ b/webmagic-avalon/forger/src/main/java/us/codecraft/forger/compiler/GroovyForgerCompiler.java @@ -0,0 +1,16 @@ +package us.codecraft.forger.compiler; + +import groovy.lang.GroovyClassLoader; + +/** + * @author code4crafter@gmail.com + */ +public class GroovyForgerCompiler implements ForgerCompiler{ + + private GroovyClassLoader groovyClassLoader = new GroovyClassLoader(); + + @Override + public Class compile(String sourceCode) { + return groovyClassLoader.parseClass(sourceCode); + } +} diff --git a/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/AbstractPropertyLoader.java b/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/AbstractPropertyLoader.java new file mode 100644 index 0000000..f0b638d --- /dev/null +++ b/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/AbstractPropertyLoader.java @@ -0,0 +1,112 @@ +package us.codecraft.forger.property; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; +import us.codecraft.forger.property.format.*; + +import java.lang.reflect.Field; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +/** + * @author code4crafter@gmail.com + */ +public abstract class AbstractPropertyLoader implements PropertyLoader { + + private TypeFormatterFactory typeFormatterFactory = new TypeFormatterFactory(); + + protected Logger logger = LoggerFactory.getLogger(getClass()); + + protected TypeFormatterFactory getTypeFormatterFactory() { + return typeFormatterFactory; + } + + @Override + public T load(T object, Map propertyConfigs) { + List properties = getProperties(object.getClass()); + for (Property 
property : properties) { + Object value = propertyConfigs.get(property.getName()); + if (value == null) { + throw new IllegalArgumentException("Config for property " + property.getName() + " is missing!"); + } + ObjectFormatter objectFormatter = property.getObjectFormatter(); + switch (property.getType()) { + case PropertyString: + Object fieldValue = objectFormatter.format(String.valueOf(value)); + try { + property.getField().set(object, fieldValue); + } catch (IllegalAccessException e) { + logger.warn("Set field " + property.getField() + " error!", e); + } + break; + case PropertyList: + if (!List.class.isAssignableFrom(value.getClass())) { + throw new IllegalArgumentException("Config for property " + property.getName() + " should be subclass of List!"); + } + List listField = new ArrayList(); + List listConfigs = (List) value; + for (String listConfig : listConfigs) { + listField.add(objectFormatter.format(listConfig)); + } + try { + property.getField().set(object, listField); + } catch (IllegalAccessException e) { + logger.warn("Set field " + property.getField() + " error!", e); + } + break; + case PropertyMap: + if (!Map.class.isAssignableFrom(value.getClass())) { + throw new IllegalArgumentException("Config for property " + property.getName() + " should be subclass of Map!"); + } + Map mapField = new HashMap(); + Map mapConfigs = (Map) value; + for (Map.Entry entry : mapConfigs.entrySet()) { + mapField.put(entry.getKey(), objectFormatter.format(entry.getValue())); + } + try { + property.getField().set(object, mapField); + } catch (IllegalAccessException e) { + logger.warn("Set field " + property.getField() + " error!", e); + } + break; + } + } + return object; + } + + protected ObjectFormatter prepareTypeFormatterParam(TypeFormatter objectFormatter, String[] params) { + if (params == null) { + return objectFormatter; + } + return new ObjectFormatterWithParams().setTypeFormatter(objectFormatter).setParams(params); + } + + protected ObjectFormatter 
getObjectFormatter(Field field) { + Class type = field.getType(); + if (List.class.isAssignableFrom(type) || Map.class.isAssignableFrom(type)) { + type = String.class; + } + if (field.isAnnotationPresent(Formatter.class)) { + Formatter formatter = field.getAnnotation(Formatter.class); + if (!formatter.formatter().equals(TypeFormatter.class)) { + TypeFormatter typeFormatter = typeFormatterFactory.getByFormatterClass(formatter.formatter()); + if (typeFormatter != null) { + return prepareTypeFormatterParam(typeFormatter,formatter.value()); + } + typeFormatterFactory.put(formatter.formatter()); + return prepareTypeFormatterParam(typeFormatterFactory.getByFormatterClass(formatter.formatter()), formatter.value()); + } else if (!formatter.subClazz().equals(String.class)) { + type = formatter.subClazz(); + TypeFormatter typeFormatter = typeFormatterFactory.get(type); + if (typeFormatter == null) { + throw new IllegalArgumentException("No typeFormatter for class " + type); + } + return prepareTypeFormatterParam(typeFormatter, formatter.value()); + } + } + return getTypeFormatterFactory().get(BasicTypeFormatter.detectBasicClass(type)); + } + +} diff --git a/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/AnnotationPropertyLoader.java b/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/AnnotationPropertyLoader.java new file mode 100644 index 0000000..ea630b9 --- /dev/null +++ b/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/AnnotationPropertyLoader.java @@ -0,0 +1,32 @@ +package us.codecraft.forger.property; + +import java.lang.reflect.Field; +import java.util.ArrayList; +import java.util.List; + +/** + * @author code4crafter@gmail.com + */ +public class AnnotationPropertyLoader extends AbstractPropertyLoader { + + @Override + public List getProperties(Class clazz) { + Field[] fields = clazz.getDeclaredFields(); + List properties = new ArrayList(fields.length); + for (Field field : fields) { + Inject inject = 
field.getAnnotation(Inject.class); + if (inject != null) { + if (!field.isAccessible()) { + field.setAccessible(true); + } + Property property = Property.fromField(field); + if (inject.value().length() > 0) { + property.setName(inject.value()); + } + property.setObjectFormatter(getObjectFormatter(field)); + properties.add(property); + } + } + return properties; + } +} diff --git a/webmagic-extension/src/main/java/us/codecraft/webmagic/configurable/Inject.java b/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/Inject.java similarity index 77% rename from webmagic-extension/src/main/java/us/codecraft/webmagic/configurable/Inject.java rename to webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/Inject.java index c6608ae..262e45a 100644 --- a/webmagic-extension/src/main/java/us/codecraft/webmagic/configurable/Inject.java +++ b/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/Inject.java @@ -1,15 +1,16 @@ -package us.codecraft.webmagic.configurable; +package us.codecraft.forger.property; import java.lang.annotation.ElementType; import java.lang.annotation.Retention; import java.lang.annotation.Target; /** - * @author yihua.huang@dianping.com + * @author code4crafter@gmail.com */ @Retention(java.lang.annotation.RetentionPolicy.RUNTIME) @Target({ElementType.FIELD}) public @interface Inject { String value() default ""; + } diff --git a/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/Property.java b/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/Property.java new file mode 100644 index 0000000..66b196c --- /dev/null +++ b/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/Property.java @@ -0,0 +1,60 @@ +package us.codecraft.forger.property; + +import us.codecraft.forger.property.format.ObjectFormatter; + +import java.lang.reflect.Field; + +/** + * @author code4crafter@gmail.com + */ +public class Property { + + private String name; + + private PropertyType type; + + private 
Field field; + + private ObjectFormatter objectFormatter; + + public ObjectFormatter getObjectFormatter() { + return objectFormatter; + } + + public Property setObjectFormatter(ObjectFormatter objectFormatter) { + this.objectFormatter = objectFormatter; + return this; + } + + public String getName() { + return name; + } + + public Property setName(String name) { + this.name = name; + return this; + } + + public PropertyType getType() { + return type; + } + + public Property setType(PropertyType type) { + this.type = type; + return this; + } + + public Field getField() { + return field; + } + + public Property setField(Field field) { + this.field = field; + return this; + } + + public static Property fromField(Field field) { + return new Property().setName(field.getName()).setType(PropertyType.from(field.getType())).setField(field); + } + +} diff --git a/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/PropertyLoader.java b/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/PropertyLoader.java new file mode 100644 index 0000000..226407a --- /dev/null +++ b/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/PropertyLoader.java @@ -0,0 +1,15 @@ +package us.codecraft.forger.property; + +import java.util.List; +import java.util.Map; + +/** + * @author code4crafter@gmail.com + */ +public interface PropertyLoader { + + public T load(T object, Map propertyConfigs); + + public List getProperties(Class clazz); + +} diff --git a/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/PropertyType.java b/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/PropertyType.java new file mode 100644 index 0000000..aa0df51 --- /dev/null +++ b/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/PropertyType.java @@ -0,0 +1,23 @@ +package us.codecraft.forger.property; + +import java.util.List; +import java.util.Map; + +/** + * @author code4crafter@gmail.com + */ +public enum PropertyType { + + 
PropertyString,PropertyMap,PropertyList; + + public static PropertyType from(Class clazz){ + if (Map.class.isAssignableFrom(clazz)){ + return PropertyMap; + } + if (List.class.isAssignableFrom(clazz)){ + return PropertyList; + } + return PropertyString; + } + +} diff --git a/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/SimpleFieldPropertyLoader.java b/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/SimpleFieldPropertyLoader.java new file mode 100644 index 0000000..13ff68a --- /dev/null +++ b/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/SimpleFieldPropertyLoader.java @@ -0,0 +1,28 @@ +package us.codecraft.forger.property; + +import java.lang.reflect.Field; +import java.lang.reflect.Modifier; +import java.util.ArrayList; +import java.util.List; + +/** + * @author code4crafter@gmail.com + */ +public class SimpleFieldPropertyLoader extends AbstractPropertyLoader { + + @Override + public List getProperties(Class clazz) { + Field[] fields = clazz.getDeclaredFields(); + List properties = new ArrayList(fields.length); + for (Field field : fields) { + if (Modifier.isStatic(field.getModifiers())){ + continue; + } + if (!field.isAccessible()){ + field.setAccessible(true); + } + properties.add(Property.fromField(field).setObjectFormatter(getObjectFormatter(field))); + } + return properties; + } +} diff --git a/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/format/BasicTypeFormatter.java b/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/format/BasicTypeFormatter.java new file mode 100644 index 0000000..a6d0e5f --- /dev/null +++ b/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/format/BasicTypeFormatter.java @@ -0,0 +1,168 @@ +package us.codecraft.forger.property.format; + +import java.util.Arrays; +import java.util.List; + +/** + * @author code4crafter@gmail.com + * @since 0.3.2 + */ +public abstract class BasicTypeFormatter implements TypeFormatter { + + @Override 
+ public T format(String text) { + if (text == null) { + return null; + } + text = text.trim(); + return formatTrimmed(text); + } + + @Override + public T format(String text, String[] params) { + return format(text); + } + + protected abstract T formatTrimmed(String raw); + + public static final List> basicTypeFormatters = Arrays.>asList(IntegerFormatter.class, + LongFormatter.class, DoubleFormatter.class, FloatFormatter.class, ShortFormatter.class, + CharactorFormatter.class, ByteFormatter.class, BooleanFormatter.class, DateFormatter.class, StringFormatter.class); + + public static Class detectBasicClass(Class type) { + if (type.equals(Integer.TYPE) || type.equals(Integer.class)) { + return Integer.class; + } else if (type.equals(Long.TYPE) || type.equals(Long.class)) { + return Long.class; + } else if (type.equals(Double.TYPE) || type.equals(Double.class)) { + return Double.class; + } else if (type.equals(Float.TYPE) || type.equals(Float.class)) { + return Float.class; + } else if (type.equals(Short.TYPE) || type.equals(Short.class)) { + return Short.class; + } else if (type.equals(Character.TYPE) || type.equals(Character.class)) { + return Character.class; + } else if (type.equals(Byte.TYPE) || type.equals(Byte.class)) { + return Byte.class; + } else if (type.equals(Boolean.TYPE) || type.equals(Boolean.class)) { + return Boolean.class; + } + return type; + } + + public static class IntegerFormatter extends BasicTypeFormatter { + @Override + public Integer formatTrimmed(String raw) { + return Integer.parseInt(raw); + } + + @Override + public Class clazz() { + return Integer.class; + } + } + + public static class LongFormatter extends BasicTypeFormatter { + @Override + public Long formatTrimmed(String raw) { + return Long.parseLong(raw); + } + + @Override + public Class clazz() { + return Long.class; + } + } + + public static class DoubleFormatter extends BasicTypeFormatter { + @Override + public Double formatTrimmed(String raw) { + return Double.parseDouble(raw); 
+ } + + @Override + public Class clazz() { + return Double.class; + } + } + + public static class FloatFormatter extends BasicTypeFormatter { + @Override + public Float formatTrimmed(String raw) { + return Float.parseFloat(raw); + } + + @Override + public Class clazz() { + return Float.class; + } + } + + public static class ShortFormatter extends BasicTypeFormatter { + @Override + public Short formatTrimmed(String raw) { + return Short.parseShort(raw); + } + + @Override + public Class clazz() { + return Short.class; + } + } + + public static class CharactorFormatter extends BasicTypeFormatter { + @Override + public Character formatTrimmed(String raw) { + return raw.charAt(0); + } + + @Override + public Class clazz() { + return Character.class; + } + } + + public static class ByteFormatter extends BasicTypeFormatter { + @Override + public Byte formatTrimmed(String raw) { + return Byte.parseByte(raw, 10); + } + + @Override + public Class clazz() { + return Byte.class; + } + } + + public static class BooleanFormatter extends BasicTypeFormatter { + @Override + public Boolean formatTrimmed(String raw) { + return Boolean.parseBoolean(raw); + } + + @Override + public Class clazz() { + return Boolean.class; + } + } + + public static class StringFormatter implements TypeFormatter { + + @Override + public String format(String text) { + return text; + } + + @Override + public String format(String text, String[] params) { + return format(text); + } + + @Override + public Class clazz() { + return String.class; + } + } + + +} diff --git a/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/format/DateFormatter.java b/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/format/DateFormatter.java new file mode 100644 index 0000000..f9bdd9f --- /dev/null +++ b/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/format/DateFormatter.java @@ -0,0 +1,35 @@ +package us.codecraft.forger.property.format; + +import 
org.apache.commons.lang3.time.DateUtils; + +import java.text.ParseException; +import java.util.Date; + +/** + * @author code4crafter@gmail.com + * @since 0.3.2 + */ +public class DateFormatter implements TypeFormatter { + + public static final String[] DEFAULT_PATTERN = new String[]{"yyyy-MM-dd HH:mm"}; + + @Override + public Date format(String text) { + return format(text,DEFAULT_PATTERN); + } + + @Override + public Date format(String text, String[] params) { + try { + return DateUtils.parseDate(text, params); + } catch (ParseException e) { + throw new IllegalArgumentException(e); + } + } + + @Override + public Class clazz() { + return Date.class; + } + +} diff --git a/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/format/Formatter.java b/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/format/Formatter.java new file mode 100644 index 0000000..45b84b1 --- /dev/null +++ b/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/format/Formatter.java @@ -0,0 +1,39 @@ +package us.codecraft.forger.property.format; + +import java.lang.annotation.ElementType; +import java.lang.annotation.Retention; +import java.lang.annotation.Target; + +/** + * Define how the result string is convert to an object for field. + * + * @author code4crafter@gmail.com
+ * @since 0.3.2 + */ +@Retention(java.lang.annotation.RetentionPolicy.RUNTIME) +@Target({ElementType.FIELD}) +public @interface Formatter { + + /** + * Set formatter params. + * + * @return formatter params + */ + String[] value(); + + /** + * Specify the class of the field, or the class of the elements when the field is a collection.
+ * It does not need to be set, because the class can be detected from the field type, + * unless the field is a collection.
+ * + * @return the class of field + */ + Class subClazz() default String.class; + + /** + * If there are more than one formatter for a class, just specify the implement. + * @return implement + */ + Class formatter() default TypeFormatter.class; + +} diff --git a/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/format/ObjectFormatter.java b/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/format/ObjectFormatter.java new file mode 100644 index 0000000..a5a8134 --- /dev/null +++ b/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/format/ObjectFormatter.java @@ -0,0 +1,9 @@ +package us.codecraft.forger.property.format; + +/** + * @author code4crafter@gmail.com + */ +public interface ObjectFormatter { + + T format(String text); +} diff --git a/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/format/ObjectFormatterWithParams.java b/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/format/ObjectFormatterWithParams.java new file mode 100644 index 0000000..051cc5d --- /dev/null +++ b/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/format/ObjectFormatterWithParams.java @@ -0,0 +1,34 @@ +package us.codecraft.forger.property.format; + +/** + * @author code4crafter@gmail.com + */ +public class ObjectFormatterWithParams implements ObjectFormatter { + + private TypeFormatter typeFormatter; + + private String[] params; + + public TypeFormatter getTypeFormatter() { + return typeFormatter; + } + + public ObjectFormatterWithParams setTypeFormatter(TypeFormatter typeFormatter) { + this.typeFormatter = typeFormatter; + return this; + } + + public String[] getParams() { + return params; + } + + public ObjectFormatterWithParams setParams(String[] params) { + this.params = params; + return this; + } + + @Override + public T format(String text) { + return typeFormatter.format(text, params); + } +} diff --git 
a/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/format/TypeFormatter.java b/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/format/TypeFormatter.java new file mode 100644 index 0000000..e6e436d --- /dev/null +++ b/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/format/TypeFormatter.java @@ -0,0 +1,12 @@ +package us.codecraft.forger.property.format; + +/** + * @author code4crafter@gmail.com + */ +public interface TypeFormatter extends ObjectFormatter { + + T format(String text, String[] params); + + Class clazz(); + +} diff --git a/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/format/TypeFormatterFactory.java b/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/format/TypeFormatterFactory.java new file mode 100644 index 0000000..027d8fe --- /dev/null +++ b/webmagic-avalon/forger/src/main/java/us/codecraft/forger/property/format/TypeFormatterFactory.java @@ -0,0 +1,53 @@ +package us.codecraft.forger.property.format; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.Map; +import java.util.concurrent.ConcurrentHashMap; + +/** + * @author code4crafter@gmail.com + * @since 0.3.2 + */ +public class TypeFormatterFactory { + + private Logger logger = LoggerFactory.getLogger(getClass()); + + private Map objectFormatterMapWithPropertyAsKey = new ConcurrentHashMap(); + + private Map objectFormatterMapWithClassAsKey = new ConcurrentHashMap(); + + public TypeFormatterFactory() { + initFormatterMap(); + } + + private void initFormatterMap() { + for (Class basicTypeFormatter : BasicTypeFormatter.basicTypeFormatters) { + put(basicTypeFormatter); + } + put(DateFormatter.class); + } + + public synchronized void put(Class objectFormatterClazz) { + try { + TypeFormatter typeFormatter = objectFormatterClazz.newInstance(); + if (typeFormatter.clazz() != null) { + objectFormatterMapWithPropertyAsKey.put(typeFormatter.clazz(), typeFormatter); + } + 
objectFormatterMapWithClassAsKey.put(objectFormatterClazz, typeFormatter); + } catch (InstantiationException e) { + logger.error("Init objectFormatter error", e); + } catch (IllegalAccessException e) { + logger.error("Init objectFormatter error", e); + } + } + + public TypeFormatter get(Class clazz) { + return objectFormatterMapWithPropertyAsKey.get(clazz); + } + + public TypeFormatter getByFormatterClass(Class clazz) { + return objectFormatterMapWithClassAsKey.get(clazz); + } +} diff --git a/webmagic-avalon/src/main/resources/log/log4j.xml b/webmagic-avalon/forger/src/main/resources/log4j.xml similarity index 100% rename from webmagic-avalon/src/main/resources/log/log4j.xml rename to webmagic-avalon/forger/src/main/resources/log4j.xml diff --git a/webmagic-avalon/forger/src/test/java/us/codecraft/forger/Bar.java b/webmagic-avalon/forger/src/test/java/us/codecraft/forger/Bar.java new file mode 100644 index 0000000..3b51a5c --- /dev/null +++ b/webmagic-avalon/forger/src/test/java/us/codecraft/forger/Bar.java @@ -0,0 +1,47 @@ +package us.codecraft.forger; + +import us.codecraft.forger.property.Inject; +import us.codecraft.forger.property.format.Formatter; + +import java.util.List; +import java.util.Map; + +/** + * @author code4crafter@gmail.com + */ +public class Bar { + + @Inject("bar") + private String bar; + + @Inject + private List values; + + @Formatter(value = "", subClazz = Integer.class) + @Inject + private Map idMap; + + public String getBar() { + return bar; + } + + public void setBar(String bar) { + this.bar = bar; + } + + public List getValues() { + return values; + } + + public void setValues(List values) { + this.values = values; + } + + public Map getIdMap() { + return idMap; + } + + public void setIdMap(Map idMap) { + this.idMap = idMap; + } +} diff --git a/webmagic-avalon/forger/src/test/java/us/codecraft/forger/Foo.java b/webmagic-avalon/forger/src/test/java/us/codecraft/forger/Foo.java new file mode 100644 index 0000000..daa2e15 --- /dev/null +++ 
b/webmagic-avalon/forger/src/test/java/us/codecraft/forger/Foo.java @@ -0,0 +1,47 @@ +package us.codecraft.forger; + +import us.codecraft.forger.property.Inject; +import us.codecraft.forger.property.format.Formatter; + +/** + * @author code4crafter@gmail.com + */ +public class Foo implements Fooable{ + + @Formatter("") + @Inject("fooa") + private String foo; + + public static final String SOURCE_CODE="import us.codecraft.forger.*;\n" + + "import us.codecraft.forger.property.Inject;\n" + + "import us.codecraft.forger.property.Inject;\n" + + "import us.codecraft.forger.property.format.Formatter;\n" + + "\n" + + "/**\n" + + " * @author code4crafter@gmail.com\n" + + " */\n" + + "public class Foo implements Fooable{\n" + + "\n" + + " @Formatter(\"\")\n" + + " @Inject(\"fooa\")\n" + + " private String foo;\n" + + "\n" + + " public String getFoo() {\n" + + " return foo;\n" + + " }\n" + + "\n" + + " @Override\n" + + " public String foo() {\n" + + " return foo;\n" + + " }\n" + + "}"; + + public String getFoo() { + return foo; + } + + @Override + public String foo() { + return foo; + } +} diff --git a/webmagic-avalon/forger/src/test/java/us/codecraft/forger/Fooable.java b/webmagic-avalon/forger/src/test/java/us/codecraft/forger/Fooable.java new file mode 100644 index 0000000..86c1d02 --- /dev/null +++ b/webmagic-avalon/forger/src/test/java/us/codecraft/forger/Fooable.java @@ -0,0 +1,9 @@ +package us.codecraft.forger; + +/** + * @author code4crafter@gmail.com + */ +public interface Fooable { + + public String foo(); +} diff --git a/webmagic-avalon/forger/src/test/java/us/codecraft/forger/ForgerFactoryTest.java b/webmagic-avalon/forger/src/test/java/us/codecraft/forger/ForgerFactoryTest.java new file mode 100644 index 0000000..50f248a --- /dev/null +++ b/webmagic-avalon/forger/src/test/java/us/codecraft/forger/ForgerFactoryTest.java @@ -0,0 +1,66 @@ +package us.codecraft.forger; + +import com.google.common.collect.ImmutableMap; +import org.junit.Test; +import 
us.codecraft.forger.compiler.GroovyForgerCompiler; +import us.codecraft.forger.property.AnnotationPropertyLoader; +import us.codecraft.forger.property.SimpleFieldPropertyLoader; + +import java.lang.reflect.Field; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +import static org.assertj.core.api.Assertions.*; + +/** + * @author code4crafter@gmail.com + */ +public class ForgerFactoryTest { + + @Test + public void testForgerCreateByClassProperty() throws Exception { + ForgerFactory forgerFactory = new ForgerFactory(new SimpleFieldPropertyLoader(), null); + Forger forger = forgerFactory.create(Foo.class); + Foo foo = forger.forge(ImmutableMap.of("foo", "test")); + assertThat(foo.getFoo()).isEqualTo("test"); + } + + @Test + public void testForgerCreateByClassAnnotation() throws Exception { + ForgerFactory forgerFactory = new ForgerFactory(new AnnotationPropertyLoader(), null); + Forger forger = forgerFactory.create(Foo.class); + Foo foo = forger.forge(ImmutableMap.of("fooa", "test")); + assertThat(foo.getFoo()).isEqualTo("test"); + } + + @Test + public void testForgerCreateByClassAnnotationCompile() throws Exception { + ForgerFactory forgerFactory = new ForgerFactory(new AnnotationPropertyLoader(), new GroovyForgerCompiler()); + Forger forger = forgerFactory.compile(Foo.SOURCE_CODE); + Fooable foo = forger.forge(ImmutableMap.of("fooa", "test")); + Field field = forger.getClazz().getDeclaredField("foo"); + field.setAccessible(true); + assertThat(field.get(foo)).isEqualTo("test"); + assertThat(foo.foo()).isEqualTo("test"); + } + + @Test + public void testForgerCreateByClassAnnotationWithCollections() throws Exception { + ForgerFactory forgerFactory = new ForgerFactory(new AnnotationPropertyLoader(), null); + Forger forger = forgerFactory.create(Bar.class); + Map map = new HashMap(); + map.put("bar", "bar"); + Map submap = new HashMap(); + submap.put("1", "1"); + submap.put("2", "2"); + map.put("idMap", submap); + 
List sublist = new ArrayList(); + sublist.add("test"); + map.put("values", sublist); + Bar forge = forger.forge(map); + assertThat(forge.getValues().size() > 0); + assertThat(forge.getIdMap().get("1")).isEqualTo(1); + } +} diff --git a/webmagic-avalon/forger/src/test/java/us/codecraft/forger/compiler/GroovyForgerCompilerTest.java b/webmagic-avalon/forger/src/test/java/us/codecraft/forger/compiler/GroovyForgerCompilerTest.java new file mode 100644 index 0000000..244c25f --- /dev/null +++ b/webmagic-avalon/forger/src/test/java/us/codecraft/forger/compiler/GroovyForgerCompilerTest.java @@ -0,0 +1,19 @@ +package us.codecraft.forger.compiler; + +import org.junit.Test; +import us.codecraft.forger.Foo; + +import static org.assertj.core.api.Assertions.assertThat; + +/** + * @author code4crafter@gmail.com + */ +public class GroovyForgerCompilerTest { + + @Test + public void testGroovyClassLoader() throws Exception { + GroovyForgerCompiler groovyForgerCompiler = new GroovyForgerCompiler(); + Class compiledClass = groovyForgerCompiler.compile(Foo.SOURCE_CODE); + assertThat(compiledClass.getName()).isEqualTo("Foo"); + } +} diff --git a/webmagic-avalon/forger/src/test/resources/log4j.xml b/webmagic-avalon/forger/src/test/resources/log4j.xml new file mode 100644 index 0000000..9084694 --- /dev/null +++ b/webmagic-avalon/forger/src/test/resources/log4j.xml @@ -0,0 +1,31 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/webmagic-avalon/pom.xml b/webmagic-avalon/pom.xml index ecac421..d1fadc4 100644 --- a/webmagic-avalon/pom.xml +++ b/webmagic-avalon/pom.xml @@ -3,94 +3,130 @@ webmagic-parent us.codecraft - 0.4.3 + 0.5.0 4.0.0 us.codecraft webmagic-avalon - war + pom - - - us.codecraft - webmagic-scripts - ${project.version} - + + forger + webmagic-admin + webmagic-worker + webmagic-avalon-common + - - org.mybatis - mybatis - 3.1.1 - + + - - org.mybatis - mybatis-spring - 1.1.1 - + + us.codecraft + webmagic-scripts + ${project.version} + - - 
org.freemarker - freemarker - 2.3.19 - - - org.springframework - spring-test - ${spring-version} - test - + + org.mybatis + mybatis + 3.1.1 + - - org.springframework - spring-aop - ${spring-version} - + + org.mybatis + mybatis-spring + 1.1.1 + - - org.aspectj - aspectjrt - 1.7.2 - - - org.aspectj - aspectjweaver - 1.7.2 - - - org.springframework - spring-core - ${spring-version} - - - org.springframework - spring-webmvc - ${spring-version} - + + us.codecraft + forger + 0.1.1-SNAPSHOT + - - javax.servlet - javax.servlet-api - 3.0.1 - - - org.springframework - spring-context - ${spring-version} - - - org.springframework - spring-context-support - ${spring-version} - - - com.alibaba - fastjson - 1.1.37 - + + org.freemarker + freemarker + 2.3.19 + - + + org.springframework + spring-test + ${spring-version} + test + + + + org.assertj + assertj-core + 1.5.0 + test + + + + mysql + mysql-connector-java + 5.1.18 + + + + commons-dbcp + commons-dbcp + 1.3 + + + + org.springframework + spring-aop + ${spring-version} + + + + org.aspectj + aspectjrt + 1.7.2 + + + org.aspectj + aspectjweaver + 1.7.2 + + + org.springframework + spring-core + ${spring-version} + + + org.springframework + spring-webmvc + ${spring-version} + + + + javax.servlet + javax.servlet-api + 3.0.1 + + + org.springframework + spring-context + ${spring-version} + + + org.springframework + spring-context-support + ${spring-version} + + + com.alibaba + fastjson + 1.1.37 + + + + @@ -104,4 +140,4 @@ - \ No newline at end of file + diff --git a/webmagic-avalon/src/main/resources/spring/applicationContext-myBatis.xml b/webmagic-avalon/src/main/resources/spring/applicationContext-myBatis.xml deleted file mode 100644 index 222df02..0000000 --- a/webmagic-avalon/src/main/resources/spring/applicationContext-myBatis.xml +++ /dev/null @@ -1,21 +0,0 @@ - - - - - - - - - - - - - - - - - - \ No newline at end of file diff --git a/webmagic-avalon/src/main/webapp/WEB-INF/pages/create_spider.ftl 
b/webmagic-avalon/src/main/webapp/WEB-INF/pages/create_spider.ftl deleted file mode 100644 index e69de29..0000000 diff --git a/webmagic-avalon/webmagic-admin/README.md b/webmagic-avalon/webmagic-admin/README.md new file mode 100644 index 0000000..6e32c06 --- /dev/null +++ b/webmagic-avalon/webmagic-admin/README.md @@ -0,0 +1,3 @@ +WebMagic-Admin +===== +Admin is the control web of workers. \ No newline at end of file diff --git a/webmagic-panel/pom.xml b/webmagic-avalon/webmagic-admin/pom.xml similarity index 59% rename from webmagic-panel/pom.xml rename to webmagic-avalon/webmagic-admin/pom.xml index 288e8df..020ca8a 100644 --- a/webmagic-panel/pom.xml +++ b/webmagic-avalon/webmagic-admin/pom.xml @@ -1,23 +1,23 @@ - + - webmagic-parent + webmagic-avalon us.codecraft - 0.4.3-SNAPSHOT + 0.5.0 4.0.0 us.codecraft - webmagic-panel + webmagic-admin + war us.codecraft - webmagic-scripts + webmagic-avalon-common ${project.version} + diff --git a/webmagic-avalon/src/main/java/us/codecraft/webmagic/avalon/web/DashBoardController.java b/webmagic-avalon/webmagic-admin/src/main/java/us/codecraft/webmagic/avalon/web/DashBoardController.java similarity index 100% rename from webmagic-avalon/src/main/java/us/codecraft/webmagic/avalon/web/DashBoardController.java rename to webmagic-avalon/webmagic-admin/src/main/java/us/codecraft/webmagic/avalon/web/DashBoardController.java diff --git a/webmagic-avalon/src/main/java/us/codecraft/webmagic/avalon/web/SpiderController.java b/webmagic-avalon/webmagic-admin/src/main/java/us/codecraft/webmagic/avalon/web/SpiderController.java similarity index 100% rename from webmagic-avalon/src/main/java/us/codecraft/webmagic/avalon/web/SpiderController.java rename to webmagic-avalon/webmagic-admin/src/main/java/us/codecraft/webmagic/avalon/web/SpiderController.java diff --git a/webmagic-avalon/src/main/resources/freemarker.properties b/webmagic-avalon/webmagic-admin/src/main/resources/freemarker.properties similarity index 100% rename from 
webmagic-avalon/src/main/resources/freemarker.properties rename to webmagic-avalon/webmagic-admin/src/main/resources/freemarker.properties diff --git a/webmagic-avalon/webmagic-admin/src/main/resources/log/log4j.xml b/webmagic-avalon/webmagic-admin/src/main/resources/log/log4j.xml new file mode 100644 index 0000000..c2b5a2f --- /dev/null +++ b/webmagic-avalon/webmagic-admin/src/main/resources/log/log4j.xml @@ -0,0 +1,21 @@ + + + + + + + + + + + + + + + + + + + + + diff --git a/webmagic-avalon/src/main/webapp/WEB-INF/jsp/404.jsp b/webmagic-avalon/webmagic-admin/src/main/webapp/WEB-INF/jsp/404.jsp similarity index 100% rename from webmagic-avalon/src/main/webapp/WEB-INF/jsp/404.jsp rename to webmagic-avalon/webmagic-admin/src/main/webapp/WEB-INF/jsp/404.jsp diff --git a/webmagic-avalon/src/main/webapp/WEB-INF/jsp/500.jsp b/webmagic-avalon/webmagic-admin/src/main/webapp/WEB-INF/jsp/500.jsp similarity index 100% rename from webmagic-avalon/src/main/webapp/WEB-INF/jsp/500.jsp rename to webmagic-avalon/webmagic-admin/src/main/webapp/WEB-INF/jsp/500.jsp diff --git a/webmagic-avalon/webmagic-admin/src/main/webapp/WEB-INF/pages/create_spider.ftl b/webmagic-avalon/webmagic-admin/src/main/webapp/WEB-INF/pages/create_spider.ftl new file mode 100644 index 0000000..4cd838c --- /dev/null +++ b/webmagic-avalon/webmagic-admin/src/main/webapp/WEB-INF/pages/create_spider.ftl @@ -0,0 +1,14 @@ + + + + + +
+ +
+ +
+ +
+ + \ No newline at end of file diff --git a/webmagic-avalon/src/main/webapp/WEB-INF/pages/dashboard.ftl b/webmagic-avalon/webmagic-admin/src/main/webapp/WEB-INF/pages/dashboard.ftl similarity index 88% rename from webmagic-avalon/src/main/webapp/WEB-INF/pages/dashboard.ftl rename to webmagic-avalon/webmagic-admin/src/main/webapp/WEB-INF/pages/dashboard.ftl index 591d180..5ed6fb6 100644 --- a/webmagic-avalon/src/main/webapp/WEB-INF/pages/dashboard.ftl +++ b/webmagic-avalon/webmagic-admin/src/main/webapp/WEB-INF/pages/dashboard.ftl @@ -15,8 +15,8 @@ WebMaigc Avalon - - + + @@ -123,23 +123,10 @@ @@ -173,26 +160,6 @@ 6 - - -
Pro Members
-
228
- 4 -
- - - -
Sales
-
$13320
- $34 -
- - - -
Messages
-
25
- 12 -
diff --git a/webmagic-avalon/src/main/webapp/WEB-INF/pages/spider_list.ftl b/webmagic-avalon/webmagic-admin/src/main/webapp/WEB-INF/pages/spider_list.ftl similarity index 100% rename from webmagic-avalon/src/main/webapp/WEB-INF/pages/spider_list.ftl rename to webmagic-avalon/webmagic-admin/src/main/webapp/WEB-INF/pages/spider_list.ftl diff --git a/webmagic-avalon/src/main/webapp/WEB-INF/web.xml b/webmagic-avalon/webmagic-admin/src/main/webapp/WEB-INF/web.xml similarity index 91% rename from webmagic-avalon/src/main/webapp/WEB-INF/web.xml rename to webmagic-avalon/webmagic-admin/src/main/webapp/WEB-INF/web.xml index eb253f3..cd7ee5b 100644 --- a/webmagic-avalon/src/main/webapp/WEB-INF/web.xml +++ b/webmagic-avalon/webmagic-admin/src/main/webapp/WEB-INF/web.xml @@ -7,7 +7,7 @@ contextConfigLocation - classpath*:spring/applicationContext*.xml, + classpath*:/config/spring/applicationContext*.xml, @@ -33,7 +33,7 @@ org.springframework.web.servlet.DispatcherServlet contextConfigLocation - classpath:/spring/applicationContext*.xml + classpath*:config/spring/applicationContext*.xml 1 diff --git a/webmagic-avalon/src/main/webapp/static/css/bootstrap-cerulean.css b/webmagic-avalon/webmagic-admin/src/main/webapp/static/css/bootstrap-cerulean.css similarity index 99% rename from webmagic-avalon/src/main/webapp/static/css/bootstrap-cerulean.css rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/css/bootstrap-cerulean.css index 82037d4..3d95708 100755 --- a/webmagic-avalon/src/main/webapp/static/css/bootstrap-cerulean.css +++ b/webmagic-avalon/webmagic-admin/src/main/webapp/static/css/bootstrap-cerulean.css @@ -1,4 +1,4 @@ -@import url(https://fonts.googleapis.com/css?family=Karla|Ubuntu); +/*@import url(https://fonts.googleapis.com/css?family=Karla|Ubuntu);*/ /*! 
* Bootstrap v2.0.4 * diff --git a/webmagic-avalon/src/main/webapp/static/css/bootstrap-classic.css b/webmagic-avalon/webmagic-admin/src/main/webapp/static/css/bootstrap-classic.css similarity index 100% rename from webmagic-avalon/src/main/webapp/static/css/bootstrap-classic.css rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/css/bootstrap-classic.css diff --git a/webmagic-avalon/src/main/webapp/static/css/bootstrap-classic.min.css b/webmagic-avalon/webmagic-admin/src/main/webapp/static/css/bootstrap-classic.min.css similarity index 100% rename from webmagic-avalon/src/main/webapp/static/css/bootstrap-classic.min.css rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/css/bootstrap-classic.min.css diff --git a/webmagic-avalon/src/main/webapp/static/css/bootstrap-cyborg.css b/webmagic-avalon/webmagic-admin/src/main/webapp/static/css/bootstrap-cyborg.css similarity index 99% rename from webmagic-avalon/src/main/webapp/static/css/bootstrap-cyborg.css rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/css/bootstrap-cyborg.css index 6f4b9c4..39ec617 100755 --- a/webmagic-avalon/src/main/webapp/static/css/bootstrap-cyborg.css +++ b/webmagic-avalon/webmagic-admin/src/main/webapp/static/css/bootstrap-cyborg.css @@ -1,4 +1,4 @@ -@import url('https://fonts.googleapis.com/css?family=Droid+Sans:400,700'); +/*@import url('https://fonts.googleapis.com/css?family=Droid+Sans:400,700');*/ /*! 
* Bootstrap v2.0.4 * diff --git a/webmagic-avalon/src/main/webapp/static/css/bootstrap-journal.css b/webmagic-avalon/webmagic-admin/src/main/webapp/static/css/bootstrap-journal.css similarity index 99% rename from webmagic-avalon/src/main/webapp/static/css/bootstrap-journal.css rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/css/bootstrap-journal.css index 9c18433..e335d98 100755 --- a/webmagic-avalon/src/main/webapp/static/css/bootstrap-journal.css +++ b/webmagic-avalon/webmagic-admin/src/main/webapp/static/css/bootstrap-journal.css @@ -1,4 +1,4 @@ -@import url('https://fonts.googleapis.com/css?family=Open+Sans:400,700'); +/*@import url('https://fonts.googleapis.com/css?family=Open+Sans:400,700');*/ /*! * Bootstrap v2.0.4 * diff --git a/webmagic-avalon/src/main/webapp/static/css/bootstrap-redy.css b/webmagic-avalon/webmagic-admin/src/main/webapp/static/css/bootstrap-redy.css similarity index 99% rename from webmagic-avalon/src/main/webapp/static/css/bootstrap-redy.css rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/css/bootstrap-redy.css index f498982..2e208b2 100644 --- a/webmagic-avalon/src/main/webapp/static/css/bootstrap-redy.css +++ b/webmagic-avalon/webmagic-admin/src/main/webapp/static/css/bootstrap-redy.css @@ -1,4 +1,4 @@ -@import url(https://fonts.googleapis.com/css?family=Karla|Ubuntu); +/*@import url(https://fonts.googleapis.com/css?family=Karla|Ubuntu);*/ /*! 
* Bootstrap v2.0.4 * diff --git a/webmagic-avalon/src/main/webapp/static/css/bootstrap-responsive.css b/webmagic-avalon/webmagic-admin/src/main/webapp/static/css/bootstrap-responsive.css similarity index 100% rename from webmagic-avalon/src/main/webapp/static/css/bootstrap-responsive.css rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/css/bootstrap-responsive.css diff --git a/webmagic-avalon/src/main/webapp/static/css/bootstrap-responsive.min.css b/webmagic-avalon/webmagic-admin/src/main/webapp/static/css/bootstrap-responsive.min.css similarity index 100% rename from webmagic-avalon/src/main/webapp/static/css/bootstrap-responsive.min.css rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/css/bootstrap-responsive.min.css diff --git a/webmagic-avalon/src/main/webapp/static/css/bootstrap-simplex.css b/webmagic-avalon/webmagic-admin/src/main/webapp/static/css/bootstrap-simplex.css similarity index 100% rename from webmagic-avalon/src/main/webapp/static/css/bootstrap-simplex.css rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/css/bootstrap-simplex.css diff --git a/webmagic-avalon/src/main/webapp/static/css/bootstrap-slate.css b/webmagic-avalon/webmagic-admin/src/main/webapp/static/css/bootstrap-slate.css similarity index 100% rename from webmagic-avalon/src/main/webapp/static/css/bootstrap-slate.css rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/css/bootstrap-slate.css diff --git a/webmagic-avalon/src/main/webapp/static/css/bootstrap-spacelab.css b/webmagic-avalon/webmagic-admin/src/main/webapp/static/css/bootstrap-spacelab.css similarity index 100% rename from webmagic-avalon/src/main/webapp/static/css/bootstrap-spacelab.css rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/css/bootstrap-spacelab.css diff --git a/webmagic-avalon/src/main/webapp/static/css/bootstrap-united.css b/webmagic-avalon/webmagic-admin/src/main/webapp/static/css/bootstrap-united.css similarity index 99% rename from 
webmagic-avalon/src/main/webapp/static/css/bootstrap-united.css rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/css/bootstrap-united.css index b05b04e..94e4c79 100755 --- a/webmagic-avalon/src/main/webapp/static/css/bootstrap-united.css +++ b/webmagic-avalon/webmagic-admin/src/main/webapp/static/css/bootstrap-united.css @@ -1,4 +1,4 @@ -@import url(https://fonts.googleapis.com/css?family=Ubuntu); +/*@import url(https://fonts.googleapis.com/css?family=Ubuntu);*/ /*! * Bootstrap v2.0.4 * diff --git a/webmagic-avalon/src/main/webapp/static/css/charisma-app.css b/webmagic-avalon/webmagic-admin/src/main/webapp/static/css/charisma-app.css similarity index 99% rename from webmagic-avalon/src/main/webapp/static/css/charisma-app.css rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/css/charisma-app.css index f795fb4..5b46b39 100755 --- a/webmagic-avalon/src/main/webapp/static/css/charisma-app.css +++ b/webmagic-avalon/webmagic-admin/src/main/webapp/static/css/charisma-app.css @@ -1,4 +1,4 @@ -@import url(https://fonts.googleapis.com/css?family=Shojumaru); +/*@import url(https://fonts.googleapis.com/css?family=Shojumaru);*/ select{ background-color:#fff; diff --git a/webmagic-avalon/src/main/webapp/static/css/chosen.css b/webmagic-avalon/webmagic-admin/src/main/webapp/static/css/chosen.css similarity index 100% rename from webmagic-avalon/src/main/webapp/static/css/chosen.css rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/css/chosen.css diff --git a/webmagic-avalon/src/main/webapp/static/css/colorbox.css b/webmagic-avalon/webmagic-admin/src/main/webapp/static/css/colorbox.css similarity index 100% rename from webmagic-avalon/src/main/webapp/static/css/colorbox.css rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/css/colorbox.css diff --git a/webmagic-avalon/src/main/webapp/static/css/elfinder.min.css b/webmagic-avalon/webmagic-admin/src/main/webapp/static/css/elfinder.min.css similarity index 100% rename from 
webmagic-avalon/src/main/webapp/static/css/elfinder.min.css rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/css/elfinder.min.css diff --git a/webmagic-avalon/src/main/webapp/static/css/elfinder.theme.css b/webmagic-avalon/webmagic-admin/src/main/webapp/static/css/elfinder.theme.css similarity index 100% rename from webmagic-avalon/src/main/webapp/static/css/elfinder.theme.css rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/css/elfinder.theme.css diff --git a/webmagic-avalon/src/main/webapp/static/css/fullcalendar.css b/webmagic-avalon/webmagic-admin/src/main/webapp/static/css/fullcalendar.css similarity index 100% rename from webmagic-avalon/src/main/webapp/static/css/fullcalendar.css rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/css/fullcalendar.css diff --git a/webmagic-avalon/src/main/webapp/static/css/fullcalendar.print.css b/webmagic-avalon/webmagic-admin/src/main/webapp/static/css/fullcalendar.print.css similarity index 100% rename from webmagic-avalon/src/main/webapp/static/css/fullcalendar.print.css rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/css/fullcalendar.print.css diff --git a/webmagic-avalon/src/main/webapp/static/css/jquery-ui-1.8.21.custom.css b/webmagic-avalon/webmagic-admin/src/main/webapp/static/css/jquery-ui-1.8.21.custom.css similarity index 100% rename from webmagic-avalon/src/main/webapp/static/css/jquery-ui-1.8.21.custom.css rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/css/jquery-ui-1.8.21.custom.css diff --git a/webmagic-avalon/src/main/webapp/static/css/jquery.cleditor.css b/webmagic-avalon/webmagic-admin/src/main/webapp/static/css/jquery.cleditor.css similarity index 100% rename from webmagic-avalon/src/main/webapp/static/css/jquery.cleditor.css rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/css/jquery.cleditor.css diff --git a/webmagic-avalon/src/main/webapp/static/css/jquery.iphone.toggle.css 
b/webmagic-avalon/webmagic-admin/src/main/webapp/static/css/jquery.iphone.toggle.css similarity index 100% rename from webmagic-avalon/src/main/webapp/static/css/jquery.iphone.toggle.css rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/css/jquery.iphone.toggle.css diff --git a/webmagic-avalon/src/main/webapp/static/css/jquery.noty.css b/webmagic-avalon/webmagic-admin/src/main/webapp/static/css/jquery.noty.css similarity index 100% rename from webmagic-avalon/src/main/webapp/static/css/jquery.noty.css rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/css/jquery.noty.css diff --git a/webmagic-avalon/src/main/webapp/static/css/noty_theme_default.css b/webmagic-avalon/webmagic-admin/src/main/webapp/static/css/noty_theme_default.css similarity index 100% rename from webmagic-avalon/src/main/webapp/static/css/noty_theme_default.css rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/css/noty_theme_default.css diff --git a/webmagic-avalon/src/main/webapp/static/css/opa-icons.css b/webmagic-avalon/webmagic-admin/src/main/webapp/static/css/opa-icons.css similarity index 100% rename from webmagic-avalon/src/main/webapp/static/css/opa-icons.css rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/css/opa-icons.css diff --git a/webmagic-avalon/src/main/webapp/static/css/uniform.default.css b/webmagic-avalon/webmagic-admin/src/main/webapp/static/css/uniform.default.css similarity index 100% rename from webmagic-avalon/src/main/webapp/static/css/uniform.default.css rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/css/uniform.default.css diff --git a/webmagic-avalon/src/main/webapp/static/css/uploadify.css b/webmagic-avalon/webmagic-admin/src/main/webapp/static/css/uploadify.css similarity index 100% rename from webmagic-avalon/src/main/webapp/static/css/uploadify.css rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/css/uploadify.css diff --git a/webmagic-avalon/src/main/webapp/static/favicon.jpg 
b/webmagic-avalon/webmagic-admin/src/main/webapp/static/favicon.jpg similarity index 100% rename from webmagic-avalon/src/main/webapp/static/favicon.jpg rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/favicon.jpg diff --git a/webmagic-avalon/src/main/webapp/static/js/bootstrap-alert.js b/webmagic-avalon/webmagic-admin/src/main/webapp/static/js/bootstrap-alert.js similarity index 100% rename from webmagic-avalon/src/main/webapp/static/js/bootstrap-alert.js rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/js/bootstrap-alert.js diff --git a/webmagic-avalon/src/main/webapp/static/js/bootstrap-button.js b/webmagic-avalon/webmagic-admin/src/main/webapp/static/js/bootstrap-button.js similarity index 100% rename from webmagic-avalon/src/main/webapp/static/js/bootstrap-button.js rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/js/bootstrap-button.js diff --git a/webmagic-avalon/src/main/webapp/static/js/bootstrap-carousel.js b/webmagic-avalon/webmagic-admin/src/main/webapp/static/js/bootstrap-carousel.js similarity index 100% rename from webmagic-avalon/src/main/webapp/static/js/bootstrap-carousel.js rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/js/bootstrap-carousel.js diff --git a/webmagic-avalon/src/main/webapp/static/js/bootstrap-collapse.js b/webmagic-avalon/webmagic-admin/src/main/webapp/static/js/bootstrap-collapse.js similarity index 100% rename from webmagic-avalon/src/main/webapp/static/js/bootstrap-collapse.js rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/js/bootstrap-collapse.js diff --git a/webmagic-avalon/src/main/webapp/static/js/bootstrap-dropdown.js b/webmagic-avalon/webmagic-admin/src/main/webapp/static/js/bootstrap-dropdown.js similarity index 100% rename from webmagic-avalon/src/main/webapp/static/js/bootstrap-dropdown.js rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/js/bootstrap-dropdown.js diff --git 
a/webmagic-avalon/src/main/webapp/static/js/bootstrap-modal.js b/webmagic-avalon/webmagic-admin/src/main/webapp/static/js/bootstrap-modal.js similarity index 100% rename from webmagic-avalon/src/main/webapp/static/js/bootstrap-modal.js rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/js/bootstrap-modal.js diff --git a/webmagic-avalon/src/main/webapp/static/js/bootstrap-popover.js b/webmagic-avalon/webmagic-admin/src/main/webapp/static/js/bootstrap-popover.js similarity index 100% rename from webmagic-avalon/src/main/webapp/static/js/bootstrap-popover.js rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/js/bootstrap-popover.js diff --git a/webmagic-avalon/src/main/webapp/static/js/bootstrap-scrollspy.js b/webmagic-avalon/webmagic-admin/src/main/webapp/static/js/bootstrap-scrollspy.js similarity index 100% rename from webmagic-avalon/src/main/webapp/static/js/bootstrap-scrollspy.js rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/js/bootstrap-scrollspy.js diff --git a/webmagic-avalon/src/main/webapp/static/js/bootstrap-tab.js b/webmagic-avalon/webmagic-admin/src/main/webapp/static/js/bootstrap-tab.js similarity index 100% rename from webmagic-avalon/src/main/webapp/static/js/bootstrap-tab.js rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/js/bootstrap-tab.js diff --git a/webmagic-avalon/src/main/webapp/static/js/bootstrap-toggle.js b/webmagic-avalon/webmagic-admin/src/main/webapp/static/js/bootstrap-toggle.js similarity index 100% rename from webmagic-avalon/src/main/webapp/static/js/bootstrap-toggle.js rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/js/bootstrap-toggle.js diff --git a/webmagic-avalon/src/main/webapp/static/js/bootstrap-tooltip.js b/webmagic-avalon/webmagic-admin/src/main/webapp/static/js/bootstrap-tooltip.js similarity index 100% rename from webmagic-avalon/src/main/webapp/static/js/bootstrap-tooltip.js rename to 
webmagic-avalon/webmagic-admin/src/main/webapp/static/js/bootstrap-tooltip.js diff --git a/webmagic-avalon/src/main/webapp/static/js/bootstrap-tour.js b/webmagic-avalon/webmagic-admin/src/main/webapp/static/js/bootstrap-tour.js similarity index 100% rename from webmagic-avalon/src/main/webapp/static/js/bootstrap-tour.js rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/js/bootstrap-tour.js diff --git a/webmagic-avalon/src/main/webapp/static/js/bootstrap-transition.js b/webmagic-avalon/webmagic-admin/src/main/webapp/static/js/bootstrap-transition.js similarity index 100% rename from webmagic-avalon/src/main/webapp/static/js/bootstrap-transition.js rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/js/bootstrap-transition.js diff --git a/webmagic-avalon/src/main/webapp/static/js/bootstrap-typeahead.js b/webmagic-avalon/webmagic-admin/src/main/webapp/static/js/bootstrap-typeahead.js similarity index 100% rename from webmagic-avalon/src/main/webapp/static/js/bootstrap-typeahead.js rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/js/bootstrap-typeahead.js diff --git a/webmagic-avalon/src/main/webapp/static/js/charisma.js b/webmagic-avalon/webmagic-admin/src/main/webapp/static/js/charisma.js similarity index 100% rename from webmagic-avalon/src/main/webapp/static/js/charisma.js rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/js/charisma.js diff --git a/webmagic-avalon/src/main/webapp/static/js/excanvas.js b/webmagic-avalon/webmagic-admin/src/main/webapp/static/js/excanvas.js similarity index 100% rename from webmagic-avalon/src/main/webapp/static/js/excanvas.js rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/js/excanvas.js diff --git a/webmagic-avalon/src/main/webapp/static/js/fullcalendar.min.js b/webmagic-avalon/webmagic-admin/src/main/webapp/static/js/fullcalendar.min.js similarity index 100% rename from webmagic-avalon/src/main/webapp/static/js/fullcalendar.min.js rename to 
webmagic-avalon/webmagic-admin/src/main/webapp/static/js/fullcalendar.min.js diff --git a/webmagic-avalon/src/main/webapp/static/js/jquery-1.7.2.min.js b/webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery-1.7.2.min.js similarity index 100% rename from webmagic-avalon/src/main/webapp/static/js/jquery-1.7.2.min.js rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery-1.7.2.min.js diff --git a/webmagic-avalon/src/main/webapp/static/js/jquery-ui-1.8.21.custom.min.js b/webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery-ui-1.8.21.custom.min.js similarity index 100% rename from webmagic-avalon/src/main/webapp/static/js/jquery-ui-1.8.21.custom.min.js rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery-ui-1.8.21.custom.min.js diff --git a/webmagic-avalon/src/main/webapp/static/js/jquery.autogrow-textarea.js b/webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery.autogrow-textarea.js similarity index 100% rename from webmagic-avalon/src/main/webapp/static/js/jquery.autogrow-textarea.js rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery.autogrow-textarea.js diff --git a/webmagic-avalon/src/main/webapp/static/js/jquery.chosen.min.js b/webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery.chosen.min.js similarity index 100% rename from webmagic-avalon/src/main/webapp/static/js/jquery.chosen.min.js rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery.chosen.min.js diff --git a/webmagic-avalon/src/main/webapp/static/js/jquery.cleditor.min.js b/webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery.cleditor.min.js similarity index 100% rename from webmagic-avalon/src/main/webapp/static/js/jquery.cleditor.min.js rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery.cleditor.min.js diff --git a/webmagic-avalon/src/main/webapp/static/js/jquery.colorbox.min.js 
b/webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery.colorbox.min.js similarity index 100% rename from webmagic-avalon/src/main/webapp/static/js/jquery.colorbox.min.js rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery.colorbox.min.js diff --git a/webmagic-avalon/src/main/webapp/static/js/jquery.cookie.js b/webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery.cookie.js similarity index 100% rename from webmagic-avalon/src/main/webapp/static/js/jquery.cookie.js rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery.cookie.js diff --git a/webmagic-avalon/src/main/webapp/static/js/jquery.dataTables.min.js b/webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery.dataTables.min.js similarity index 100% rename from webmagic-avalon/src/main/webapp/static/js/jquery.dataTables.min.js rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery.dataTables.min.js diff --git a/webmagic-avalon/src/main/webapp/static/js/jquery.elfinder.min.js b/webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery.elfinder.min.js similarity index 100% rename from webmagic-avalon/src/main/webapp/static/js/jquery.elfinder.min.js rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery.elfinder.min.js diff --git a/webmagic-avalon/src/main/webapp/static/js/jquery.flot.min.js b/webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery.flot.min.js similarity index 100% rename from webmagic-avalon/src/main/webapp/static/js/jquery.flot.min.js rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery.flot.min.js diff --git a/webmagic-avalon/src/main/webapp/static/js/jquery.flot.pie.min.js b/webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery.flot.pie.min.js similarity index 100% rename from webmagic-avalon/src/main/webapp/static/js/jquery.flot.pie.min.js rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery.flot.pie.min.js diff --git 
a/webmagic-avalon/src/main/webapp/static/js/jquery.flot.resize.min.js b/webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery.flot.resize.min.js similarity index 100% rename from webmagic-avalon/src/main/webapp/static/js/jquery.flot.resize.min.js rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery.flot.resize.min.js diff --git a/webmagic-avalon/src/main/webapp/static/js/jquery.flot.stack.js b/webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery.flot.stack.js similarity index 100% rename from webmagic-avalon/src/main/webapp/static/js/jquery.flot.stack.js rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery.flot.stack.js diff --git a/webmagic-avalon/src/main/webapp/static/js/jquery.history.js b/webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery.history.js similarity index 100% rename from webmagic-avalon/src/main/webapp/static/js/jquery.history.js rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery.history.js diff --git a/webmagic-avalon/src/main/webapp/static/js/jquery.iphone.toggle.js b/webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery.iphone.toggle.js similarity index 100% rename from webmagic-avalon/src/main/webapp/static/js/jquery.iphone.toggle.js rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery.iphone.toggle.js diff --git a/webmagic-avalon/src/main/webapp/static/js/jquery.js b/webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery.js similarity index 100% rename from webmagic-avalon/src/main/webapp/static/js/jquery.js rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery.js diff --git a/webmagic-avalon/src/main/webapp/static/js/jquery.noty.js b/webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery.noty.js similarity index 100% rename from webmagic-avalon/src/main/webapp/static/js/jquery.noty.js rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery.noty.js diff --git 
a/webmagic-avalon/src/main/webapp/static/js/jquery.raty.min.js b/webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery.raty.min.js similarity index 100% rename from webmagic-avalon/src/main/webapp/static/js/jquery.raty.min.js rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery.raty.min.js diff --git a/webmagic-avalon/src/main/webapp/static/js/jquery.uniform.min.js b/webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery.uniform.min.js similarity index 100% rename from webmagic-avalon/src/main/webapp/static/js/jquery.uniform.min.js rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery.uniform.min.js diff --git a/webmagic-avalon/src/main/webapp/static/js/jquery.uploadify-3.1.min.js b/webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery.uploadify-3.1.min.js similarity index 100% rename from webmagic-avalon/src/main/webapp/static/js/jquery.uploadify-3.1.min.js rename to webmagic-avalon/webmagic-admin/src/main/webapp/static/js/jquery.uploadify-3.1.min.js diff --git a/webmagic-avalon/webmagic-avalon-common/pom.xml b/webmagic-avalon/webmagic-avalon-common/pom.xml new file mode 100644 index 0000000..0125b19 --- /dev/null +++ b/webmagic-avalon/webmagic-avalon-common/pom.xml @@ -0,0 +1,167 @@ + + + + webmagic-avalon + us.codecraft + 0.5.0 + + 4.0.0 + + webmagic-avalon-common + jar + + + + + us.codecraft + webmagic-extension + ${project.version} + + + + org.mybatis + mybatis + + + + us.codecraft + forger + 0.1.1-SNAPSHOT + + + + org.mybatis + mybatis-spring + + + + org.freemarker + freemarker + + + + org.springframework + spring-test + test + + + + org.assertj + assertj-core + test + + + + junit + junit + + + + mysql + mysql-connector-java + + + + commons-dbcp + commons-dbcp + + + + org.springframework + spring-aop + ${spring-version} + + + + org.aspectj + aspectjrt + + + + org.aspectj + aspectjweaver + + + + org.springframework + spring-core + + + + org.springframework + spring-webmvc + + + + 
com.h2database + h2 + 1.3.175 + + + + org.mockito + mockito-all + + + + javax.servlet + javax.servlet-api + + + + org.springframework + spring-context + + + + org.springframework + spring-context-support + + + + com.alibaba + fastjson + + + + + + + + maven-deploy-plugin + + true + + + + org.apache.maven.plugins + maven-jar-plugin + 2.4 + + + + true + ./lib/ + us.codecraft.webmagic.main.QuickStarter + + + + + + + + + + sonatype-nexus-snapshots + Sonatype Nexus Snapshots + https://oss.sonatype.org/content/repositories/snapshots + + false + + + true + + + + +
diff --git a/webmagic-avalon/webmagic-avalon-common/src/main/java/us/codecraft/webmagic/dao/DynamicClassDao.java b/webmagic-avalon/webmagic-avalon-common/src/main/java/us/codecraft/webmagic/dao/DynamicClassDao.java
new file mode 100644
index 0000000..b3d93ad
--- /dev/null
+++ b/webmagic-avalon/webmagic-avalon-common/src/main/java/us/codecraft/webmagic/dao/DynamicClassDao.java
@@ -0,0 +1,11 @@
+package us.codecraft.webmagic.dao;
+
+import us.codecraft.webmagic.model.DynamicClass;
+
+/**
+ * @author code4crafter@gmail.com
+ */
+public interface DynamicClassDao {
+
+    public int add(DynamicClass dynamicClass);
+}
diff --git a/webmagic-avalon/webmagic-avalon-common/src/main/java/us/codecraft/webmagic/exception/DynamicClassCompileException.java b/webmagic-avalon/webmagic-avalon-common/src/main/java/us/codecraft/webmagic/exception/DynamicClassCompileException.java
new file mode 100644
index 0000000..8512ae5
--- /dev/null
+++ b/webmagic-avalon/webmagic-avalon-common/src/main/java/us/codecraft/webmagic/exception/DynamicClassCompileException.java
@@ -0,0 +1,15 @@
+package us.codecraft.webmagic.exception;
+
+/**
+ * @author code4crafter@gmail.com
+ */
+public class DynamicClassCompileException extends Exception{
+
+    public DynamicClassCompileException(String message) {
+        super(message);
+    }
+
+    public DynamicClassCompileException(String message, Throwable cause) {
+        super(message, cause);
+    }
+}
diff --git
a/webmagic-avalon/webmagic-avalon-common/src/main/java/us/codecraft/webmagic/model/DynamicClass.java b/webmagic-avalon/webmagic-avalon-common/src/main/java/us/codecraft/webmagic/model/DynamicClass.java
new file mode 100644
index 0000000..4809128
--- /dev/null
+++ b/webmagic-avalon/webmagic-avalon-common/src/main/java/us/codecraft/webmagic/model/DynamicClass.java
@@ -0,0 +1,49 @@
+package us.codecraft.webmagic.model;
+
+import java.util.Date;
+
+/**
+ * @author code4crafter@gmail.com
+ */
+public class DynamicClass {
+
+    private String className;
+
+    private String sourceCode;
+
+    private Date addTime;
+
+    private Date updateTime;
+
+    public String getClassName() {
+        return className;
+    }
+
+    public void setClassName(String className) {
+        this.className = className;
+    }
+
+    public String getSourceCode() {
+        return sourceCode;
+    }
+
+    public void setSourceCode(String sourceCode) {
+        this.sourceCode = sourceCode;
+    }
+
+    public Date getAddTime() {
+        return addTime;
+    }
+
+    public void setAddTime(Date addTime) {
+        this.addTime = addTime;
+    }
+
+    public Date getUpdateTime() {
+        return updateTime;
+    }
+
+    public void setUpdateTime(Date updateTime) {
+        this.updateTime = updateTime;
+    }
+}
diff --git a/webmagic-avalon/webmagic-avalon-common/src/main/java/us/codecraft/webmagic/service/DynamicClassService.java b/webmagic-avalon/webmagic-avalon-common/src/main/java/us/codecraft/webmagic/service/DynamicClassService.java
new file mode 100644
index 0000000..1cd719c
--- /dev/null
+++ b/webmagic-avalon/webmagic-avalon-common/src/main/java/us/codecraft/webmagic/service/DynamicClassService.java
@@ -0,0 +1,12 @@
+package us.codecraft.webmagic.service;
+
+import us.codecraft.webmagic.exception.DynamicClassCompileException;
+
+/**
+ * @author code4crafter@gmail.com
+ */
+public interface DynamicClassService {
+
+    public Class compileAndSave(String sourceCode) throws DynamicClassCompileException;
+
+}
diff --git
a/webmagic-avalon/webmagic-avalon-common/src/main/java/us/codecraft/webmagic/service/impl/DynamicClassServiceImpl.java b/webmagic-avalon/webmagic-avalon-common/src/main/java/us/codecraft/webmagic/service/impl/DynamicClassServiceImpl.java
new file mode 100644
index 0000000..ec83efd
--- /dev/null
+++ b/webmagic-avalon/webmagic-avalon-common/src/main/java/us/codecraft/webmagic/service/impl/DynamicClassServiceImpl.java
@@ -0,0 +1,41 @@
+package us.codecraft.webmagic.service.impl;
+
+import org.codehaus.groovy.control.CompilationFailedException;
+import org.springframework.beans.factory.annotation.Autowired;
+import org.springframework.stereotype.Service;
+import us.codecraft.forger.Forger;
+import us.codecraft.forger.ForgerFactory;
+import us.codecraft.webmagic.dao.DynamicClassDao;
+import us.codecraft.webmagic.exception.DynamicClassCompileException;
+import us.codecraft.webmagic.model.DynamicClass;
+import us.codecraft.webmagic.service.DynamicClassService;
+
+/**
+ * @author code4crafter@gmail.com
+ */
+@Service
+public class DynamicClassServiceImpl implements DynamicClassService {
+
+    @Autowired
+    private DynamicClassDao dynamicClassDao;
+
+    @Autowired
+    private ForgerFactory forgerFactory;
+
+    @Override
+    public Class compileAndSave(String sourceCode) throws DynamicClassCompileException {
+        Forger forger;
+        try {
+            forger = forgerFactory.compile(sourceCode);
+        } catch (CompilationFailedException e) {
+            throw new DynamicClassCompileException(e.getMessage(),e);
+        }
+        String className = forger.getClazz().getCanonicalName();
+        DynamicClass dynamicClass = new DynamicClass();
+        dynamicClass.setClassName(className);
+        dynamicClass.setSourceCode(sourceCode);
+        dynamicClassDao.add(dynamicClass);
+        return forger.getClazz();
+    }
+
+}
diff --git a/webmagic-avalon/webmagic-avalon-common/src/main/resources/config/freemarker.properties b/webmagic-avalon/webmagic-avalon-common/src/main/resources/config/freemarker.properties
new file mode 100644
index 0000000..dbed67f
--- /dev/null
+++ b/webmagic-avalon/webmagic-avalon-common/src/main/resources/config/freemarker.properties @@ -0,0 +1,7 @@ +number_format=# +classic_compatible=true + +default_encoding=UTF-8 +template_update_delay=0 +######################### +template_exception_handler=rethrow diff --git a/webmagic-avalon/webmagic-avalon-common/src/main/resources/config/log/log4j.xml b/webmagic-avalon/webmagic-avalon-common/src/main/resources/config/log/log4j.xml new file mode 100644 index 0000000..c2b5a2f --- /dev/null +++ b/webmagic-avalon/webmagic-avalon-common/src/main/resources/config/log/log4j.xml @@ -0,0 +1,21 @@ + + + + + + + + + + + + + + + + + + + + + diff --git a/webmagic-avalon/webmagic-avalon-common/src/main/resources/config/mapper/DynamicClass.xml b/webmagic-avalon/webmagic-avalon-common/src/main/resources/config/mapper/DynamicClass.xml new file mode 100644 index 0000000..1e09b7f --- /dev/null +++ b/webmagic-avalon/webmagic-avalon-common/src/main/resources/config/mapper/DynamicClass.xml @@ -0,0 +1,32 @@ + + + + + + + + insert into DynamicClass (`ClassName`,`SourceCode`,`AddTime`,`UpdateTime`) + values (#{className},#{sourceCode},now(),now()) + + + + insert into DynamicClass (`ClassName`,`SourceCode`,`AddTime`,`UpdateTime`) + values (#{className},#{sourceCode},now(),now()) + + + \ No newline at end of file diff --git a/webmagic-avalon/webmagic-avalon-common/src/main/resources/config/spring/applicationContext-component.xml b/webmagic-avalon/webmagic-avalon-common/src/main/resources/config/spring/applicationContext-component.xml new file mode 100644 index 0000000..faba6ca --- /dev/null +++ b/webmagic-avalon/webmagic-avalon-common/src/main/resources/config/spring/applicationContext-component.xml @@ -0,0 +1,24 @@ + + + + + + + + web_messages + + + + + + + \ No newline at end of file diff --git a/webmagic-avalon/webmagic-avalon-common/src/main/resources/config/spring/applicationContext-datasource.xml 
b/webmagic-avalon/webmagic-avalon-common/src/main/resources/config/spring/applicationContext-datasource.xml new file mode 100644 index 0000000..7d468af --- /dev/null +++ b/webmagic-avalon/webmagic-avalon-common/src/main/resources/config/spring/applicationContext-datasource.xml @@ -0,0 +1,41 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/webmagic-avalon/src/main/resources/spring/applicationContext-freemarker.xml b/webmagic-avalon/webmagic-avalon-common/src/main/resources/config/spring/applicationContext-freemarker.xml similarity index 100% rename from webmagic-avalon/src/main/resources/spring/applicationContext-freemarker.xml rename to webmagic-avalon/webmagic-avalon-common/src/main/resources/config/spring/applicationContext-freemarker.xml diff --git a/webmagic-avalon/webmagic-avalon-common/src/main/resources/config/spring/applicationContext-myBatis.xml b/webmagic-avalon/webmagic-avalon-common/src/main/resources/config/spring/applicationContext-myBatis.xml new file mode 100644 index 0000000..8601852 --- /dev/null +++ b/webmagic-avalon/webmagic-avalon-common/src/main/resources/config/spring/applicationContext-myBatis.xml @@ -0,0 +1,33 @@ + + + + + + + + + + + sqlserver + db2 + oracle + mysql + h2 + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/webmagic-avalon/webmagic-avalon-common/src/main/resources/config/spring/applicationContext-service.xml b/webmagic-avalon/webmagic-avalon-common/src/main/resources/config/spring/applicationContext-service.xml new file mode 100644 index 0000000..f854d3d --- /dev/null +++ b/webmagic-avalon/webmagic-avalon-common/src/main/resources/config/spring/applicationContext-service.xml @@ -0,0 +1,22 @@ + + + + + + + + + + + + + \ No newline at end of file diff --git a/webmagic-avalon/webmagic-avalon-common/src/main/resources/config/spring/applicationContext-tx.xml 
b/webmagic-avalon/webmagic-avalon-common/src/main/resources/config/spring/applicationContext-tx.xml new file mode 100644 index 0000000..79421a2 --- /dev/null +++ b/webmagic-avalon/webmagic-avalon-common/src/main/resources/config/spring/applicationContext-tx.xml @@ -0,0 +1,18 @@ + + + + + + + + + + + \ No newline at end of file diff --git a/webmagic-avalon/src/main/resources/spring/applicationContext.xml b/webmagic-avalon/webmagic-avalon-common/src/main/resources/config/spring/applicationContext-webmvc.xml similarity index 64% rename from webmagic-avalon/src/main/resources/spring/applicationContext.xml rename to webmagic-avalon/webmagic-avalon-common/src/main/resources/config/spring/applicationContext-webmvc.xml index 7c19641..476cd39 100644 --- a/webmagic-avalon/src/main/resources/spring/applicationContext.xml +++ b/webmagic-avalon/webmagic-avalon-common/src/main/resources/config/spring/applicationContext-webmvc.xml @@ -2,29 +2,19 @@ - - - - - - web_messages - - - - - - + text/html;charset=UTF-8 @@ -38,10 +28,6 @@ - - - - - + \ No newline at end of file diff --git a/webmagic-avalon/webmagic-avalon-common/src/main/resources/sql/h2/schema.sql b/webmagic-avalon/webmagic-avalon-common/src/main/resources/sql/h2/schema.sql new file mode 100644 index 0000000..37c3758 --- /dev/null +++ b/webmagic-avalon/webmagic-avalon-common/src/main/resources/sql/h2/schema.sql @@ -0,0 +1,8 @@ +CREATE TABLE DynamicClass( + Id int(11) NOT NULL AUTO_INCREMENT PRIMARY KEY, + `ClassName` varchar(200) NOT NULL, + `SourceCode` text NOT NULL, + `AddTime` datetime NOT NULL, + `UpdateTime` datetime NOT NULL, + UNIQUE INDEX `un_class_name` (`ClassName`) +); \ No newline at end of file diff --git a/webmagic-avalon/webmagic-avalon-common/src/main/resources/sql/mysql/schema.sql b/webmagic-avalon/webmagic-avalon-common/src/main/resources/sql/mysql/schema.sql new file mode 100644 index 0000000..c75a884 --- /dev/null +++ b/webmagic-avalon/webmagic-avalon-common/src/main/resources/sql/mysql/schema.sql @@ 
-0,0 +1,30 @@ +CREATE TABLE `DynamicClass` ( + `Id` int(11) unsigned NOT NULL AUTO_INCREMENT, + `ClassName` varchar(200) NOT NULL, + `SourceCode` text NOT NULL, + `AddTime` datetime NOT NULL, + `UpdateTime` datetime NOT NULL, + PRIMARY KEY (`Id`), + UNIQUE KEY `un_class_name` (`ClassName`) +) ENGINE=InnoDB DEFAULT CHARSET=utf8; + +CREATE TABLE `Spider` ( + `Id` int(11) unsigned NOT NULL AUTO_INCREMENT, + `PageProcessorId` int(11) unsigned NOT NULL, + `PipelineId` int(11) unsigned NOT NULL, + `SchedulerId` int(11) unsigned NOT NULL, + `Config` text NOT NULL, + `AddTime` datetime NOT NULL, + `UpdateTime` datetime NOT NULL, + PRIMARY KEY (`Id`) +) ENGINE=InnoDB DEFAULT CHARSET=utf8; + +CREATE TABLE `PageProcessor` ( + `Id` int(11) unsigned NOT NULL AUTO_INCREMENT, + `ClassName` varchar(200) NOT NULL, + `Params` text NOT NULL, + `AddTime` datetime NOT NULL, + `UpdateTime` datetime NOT NULL, + PRIMARY KEY (`Id`), + UNIQUE KEY `un_class_name` (`ClassName`) +) ENGINE=InnoDB DEFAULT CHARSET=utf8; \ No newline at end of file diff --git a/webmagic-avalon/webmagic-avalon-common/src/test/java/us/codecraft/webmagic/AbstractTest.java b/webmagic-avalon/webmagic-avalon-common/src/test/java/us/codecraft/webmagic/AbstractTest.java new file mode 100644 index 0000000..b259a6d --- /dev/null +++ b/webmagic-avalon/webmagic-avalon-common/src/test/java/us/codecraft/webmagic/AbstractTest.java @@ -0,0 +1,17 @@ +package us.codecraft.webmagic; + +import org.junit.runner.RunWith; +import org.springframework.test.context.ActiveProfiles; +import org.springframework.test.context.ContextConfiguration; +import org.springframework.test.context.junit4.SpringJUnit4ClassRunner; +import org.springframework.transaction.annotation.Transactional; + +/** + * @author code4crafter@gmail.com + */ +@RunWith(SpringJUnit4ClassRunner.class) +@ContextConfiguration(locations =
{"classpath*:/config/spring/applicationContext*.xml"}) +@ActiveProfiles("test") +@Transactional +public abstract class AbstractTest { +} diff --git a/webmagic-avalon/webmagic-avalon-common/src/test/java/us/codecraft/webmagic/Foo.java b/webmagic-avalon/webmagic-avalon-common/src/test/java/us/codecraft/webmagic/Foo.java new file mode 100644 index 0000000..9078eb4 --- /dev/null +++ b/webmagic-avalon/webmagic-avalon-common/src/test/java/us/codecraft/webmagic/Foo.java @@ -0,0 +1,45 @@ +package us.codecraft.webmagic; + +import us.codecraft.forger.property.Inject; +import us.codecraft.forger.property.format.Formatter; + +/** + * @author code4crafter@gmail.com + */ +public class Foo { + + @Formatter("") + @Inject("fooa") + private String foo; + + public static final String SOURCE_CODE="package us.codecraft.webmagic;\n" + + "\n" + + "import us.codecraft.forger.property.Inject;\n" + + "import us.codecraft.forger.property.format.Formatter;\n" + + "\n" + + "/**\n" + + " * @author code4crafter@gmail.com\n" + + " */\n" + + "public class Foo {\n" + + "\n" + + " @Formatter(\"\")\n" + + " @Inject(\"fooa\")\n" + + " private String foo;\n" + + "\n" + + " public String getFoo() {\n" + + " return foo;\n" + + " }\n" + + "\n" + + " public String foo() {\n" + + " return foo;\n" + + " }\n" + + "}"; + + public String getFoo() { + return foo; + } + + public String foo() { + return foo; + } +} diff --git a/webmagic-avalon/webmagic-avalon-common/src/test/java/us/codecraft/webmagic/dao/DynamicClassDaoTest.java b/webmagic-avalon/webmagic-avalon-common/src/test/java/us/codecraft/webmagic/dao/DynamicClassDaoTest.java new file mode 100644 index 0000000..86a9a15 --- /dev/null +++ b/webmagic-avalon/webmagic-avalon-common/src/test/java/us/codecraft/webmagic/dao/DynamicClassDaoTest.java @@ -0,0 +1,27 @@ +package us.codecraft.webmagic.dao; + +import org.junit.Test; +import org.springframework.beans.factory.annotation.Autowired; +import org.springframework.test.annotation.Rollback; +import 
org.springframework.transaction.annotation.Transactional; +import us.codecraft.webmagic.AbstractTest; +import us.codecraft.webmagic.model.DynamicClass; + +/** + * @author code4crafter@gmail.com + */ +public class DynamicClassDaoTest extends AbstractTest { + + @Autowired + private DynamicClassDao dynamicClassDao; + + @Test + @Transactional + @Rollback(true) + public void testAdd() throws Exception { + DynamicClass dynamicClass = new DynamicClass(); + dynamicClass.setClassName("test"); + dynamicClass.setSourceCode("testSource"); + dynamicClassDao.add(dynamicClass); + } +} diff --git a/webmagic-avalon/webmagic-avalon-common/src/test/java/us/codecraft/webmagic/service/DynamicClassServiceImplTest.java b/webmagic-avalon/webmagic-avalon-common/src/test/java/us/codecraft/webmagic/service/DynamicClassServiceImplTest.java new file mode 100644 index 0000000..92e213a --- /dev/null +++ b/webmagic-avalon/webmagic-avalon-common/src/test/java/us/codecraft/webmagic/service/DynamicClassServiceImplTest.java @@ -0,0 +1,54 @@ +package us.codecraft.webmagic.service; + +import org.junit.Before; +import org.junit.Test; +import org.mockito.InjectMocks; +import org.mockito.Mock; +import org.mockito.MockitoAnnotations; +import org.mockito.Spy; +import org.springframework.beans.factory.annotation.Autowired; +import us.codecraft.forger.ForgerFactory; +import us.codecraft.webmagic.AbstractTest; +import us.codecraft.webmagic.Foo; +import us.codecraft.webmagic.dao.DynamicClassDao; +import us.codecraft.webmagic.exception.DynamicClassCompileException; +import us.codecraft.webmagic.service.impl.DynamicClassServiceImpl; + +import static org.assertj.core.api.Assertions.assertThat; +import static org.assertj.core.api.Assertions.failBecauseExceptionWasNotThrown; + +/** + * @author code4crafter@gmail.com + */ +public class DynamicClassServiceImplTest extends AbstractTest { + + @Before + public void setUp() { + MockitoAnnotations.initMocks(this); + } + + @Spy + @Autowired + private ForgerFactory 
forgerFactory; + + @InjectMocks + private DynamicClassService dynamicClassService = new DynamicClassServiceImpl(); + + @Mock + private DynamicClassDao dynamicClassDao; + + @Test + public void testCompileAndSave() throws Exception { + Class aClass = dynamicClassService.compileAndSave(Foo.SOURCE_CODE); + assertThat(aClass.getCanonicalName()).isEqualTo("us.codecraft.webmagic.Foo"); + } + + @Test + public void testCompileFail() { + try { + dynamicClassService.compileAndSave("class s(("); + failBecauseExceptionWasNotThrown(DynamicClassCompileException.class); + } catch (DynamicClassCompileException e) { + } + } +} diff --git a/webmagic-avalon/webmagic-worker/README.md b/webmagic-avalon/webmagic-worker/README.md new file mode 100644 index 0000000..334ab0e --- /dev/null +++ b/webmagic-avalon/webmagic-worker/README.md @@ -0,0 +1,3 @@ +WebMagic-Worker +===== +Worker is the spider container. \ No newline at end of file diff --git a/webmagic-avalon/webmagic-worker/pom.xml b/webmagic-avalon/webmagic-worker/pom.xml new file mode 100644 index 0000000..f085c82 --- /dev/null +++ b/webmagic-avalon/webmagic-worker/pom.xml @@ -0,0 +1,58 @@ + + + + webmagic-avalon + us.codecraft + 0.5.0 + + 4.0.0 + + webmagic-worker + war + + + + us.codecraft + webmagic-avalon-common + ${project.version} + + + junit + junit + + + org.mockito + mockito-all + + + org.aspectj + aspectjrt + + + + + + + maven-deploy-plugin + + true + + + + org.apache.maven.plugins + maven-jar-plugin + 2.4 + + + + true + ./lib/ + us.codecraft.webmagic.main.QuickStarter + + + + + + + + diff --git a/webmagic-avalon/webmagic-worker/src/main/java/us/codecraft/webmagic/worker/Worker.java b/webmagic-avalon/webmagic-worker/src/main/java/us/codecraft/webmagic/worker/Worker.java new file mode 100644 index 0000000..312500b --- /dev/null +++ b/webmagic-avalon/webmagic-worker/src/main/java/us/codecraft/webmagic/worker/Worker.java @@ -0,0 +1,48 @@ +package us.codecraft.webmagic.worker; + +import us.codecraft.webmagic.Spider; + +import 
java.util.Map; +import java.util.concurrent.ConcurrentHashMap; +import java.util.concurrent.ExecutorService; +import java.util.concurrent.Executors; + +/** + * Container of Spiders. + * + * @author code4crafter@gmail.com + */ +public class Worker { + + public static final int DEFAULT_POOL_SIZE = 10; + + private int poolSize; + + private ExecutorService executorService; + + private Map<String, Spider> spiderMap; + + public Worker(int poolSize) { + this.poolSize = poolSize; + this.executorService = initExecutorService(); + this.spiderMap = new ConcurrentHashMap<String, Spider>(); + } + + public Worker() { + this(DEFAULT_POOL_SIZE); + } + + protected ExecutorService initExecutorService() { + return Executors.newFixedThreadPool(poolSize); + } + + public void addSpider(Spider spider) { + spider.setExecutorService(executorService); + spiderMap.put(spider.getUUID(), spider); + } + + public Spider getSpider(String uuid){ + return spiderMap.get(uuid); + } + +} diff --git a/webmagic-avalon/webmagic-worker/src/main/java/us/codecraft/webmagic/worker/controller/SpiderController.java b/webmagic-avalon/webmagic-worker/src/main/java/us/codecraft/webmagic/worker/controller/SpiderController.java new file mode 100644 index 0000000..d33b0da --- /dev/null +++ b/webmagic-avalon/webmagic-worker/src/main/java/us/codecraft/webmagic/worker/controller/SpiderController.java @@ -0,0 +1,31 @@ +package us.codecraft.webmagic.worker.controller; + +import org.springframework.beans.factory.annotation.Autowired; +import org.springframework.stereotype.Controller; +import org.springframework.web.bind.annotation.RequestMapping; +import org.springframework.web.bind.annotation.RequestParam; +import org.springframework.web.bind.annotation.ResponseBody; +import us.codecraft.webmagic.worker.Worker; + +import java.util.HashMap; +import java.util.Map; + +/** + * @author code4crafter@gmail.com + */ +@Controller +@RequestMapping("spider") +public class SpiderController { + + @Autowired + private Worker worker; + + @RequestMapping("create") +
@ResponseBody + public Map create(@RequestParam("id") String id) { + HashMap map = new HashMap(); + map.put("code", 200); + return map; + } + +} diff --git a/webmagic-avalon/webmagic-worker/src/main/resources/freemarker.properties b/webmagic-avalon/webmagic-worker/src/main/resources/freemarker.properties new file mode 100644 index 0000000..dbed67f --- /dev/null +++ b/webmagic-avalon/webmagic-worker/src/main/resources/freemarker.properties @@ -0,0 +1,7 @@ +number_format=# +classic_compatible=true + +default_encoding=UTF-8 +template_update_delay=0 +######################### +template_exception_handler=rethrow diff --git a/webmagic-avalon/webmagic-worker/src/main/resources/log/log4j.xml b/webmagic-avalon/webmagic-worker/src/main/resources/log/log4j.xml new file mode 100644 index 0000000..c2b5a2f --- /dev/null +++ b/webmagic-avalon/webmagic-worker/src/main/resources/log/log4j.xml @@ -0,0 +1,21 @@ + + + + + + + + + + + + + + + + + + + + + diff --git a/webmagic-avalon/webmagic-worker/src/main/webapp/WEB-INF/jsp/404.jsp b/webmagic-avalon/webmagic-worker/src/main/webapp/WEB-INF/jsp/404.jsp new file mode 100644 index 0000000..9a3348f --- /dev/null +++ b/webmagic-avalon/webmagic-worker/src/main/webapp/WEB-INF/jsp/404.jsp @@ -0,0 +1,74 @@ +<%@ page language="java" contentType="text/html; charset=utf8" + pageEncoding="utf8"%> + + + + + + + Page not found · GitLab Pages + + + + +
+ +

404

+

There isn't a GitLab Page here.

+ +

Forgive my poor design.

+

You can edit 404.jsp to customize your 404 page.

+ + +
+ + diff --git a/webmagic-avalon/webmagic-worker/src/main/webapp/WEB-INF/jsp/500.jsp b/webmagic-avalon/webmagic-worker/src/main/webapp/WEB-INF/jsp/500.jsp new file mode 100644 index 0000000..150df3a --- /dev/null +++ b/webmagic-avalon/webmagic-worker/src/main/webapp/WEB-INF/jsp/500.jsp @@ -0,0 +1,18 @@ +<%@ page language="java" contentType="text/html; charset=utf8" + pageEncoding="utf8" isErrorPage="true" import="java.io.*"%> + + + + + 500 + + +Something went wrong on this page! +<% + + StringWriter stringWriter = new StringWriter(); + exception.printStackTrace(new PrintWriter(stringWriter)); + out.println(stringWriter.toString()); +%> + + \ No newline at end of file diff --git a/webmagic-avalon/webmagic-worker/src/main/webapp/WEB-INF/web.xml b/webmagic-avalon/webmagic-worker/src/main/webapp/WEB-INF/web.xml new file mode 100644 index 0000000..533832e --- /dev/null +++ b/webmagic-avalon/webmagic-worker/src/main/webapp/WEB-INF/web.xml @@ -0,0 +1,53 @@ + + + Archetype Created Web Application + + + contextConfigLocation + + classpath*:/config/spring/applicationContext*.xml, + + + + + contextClass + org.springframework.web.context.support.XmlWebApplicationContext + + + + + log4jConfigLocation + classpath:log/log4j.xml + + + + log4jRefreshInterval + 60000 + + + + + spring + org.springframework.web.servlet.DispatcherServlet + + contextConfigLocation + classpath*:/config/spring/applicationContext*.xml + + 1 + + + spring + / + + + 404 + /WEB-INF/jsp/404.jsp + + + 500 + /WEB-INF/jsp/500.jsp + + + diff --git a/webmagic-avalon/webmagic-worker/src/test/java/us/codecraft/webmagic/worker/WorkerTest.java b/webmagic-avalon/webmagic-worker/src/test/java/us/codecraft/webmagic/worker/WorkerTest.java new file mode 100644 index 0000000..24bca19 --- /dev/null +++ b/webmagic-avalon/webmagic-worker/src/test/java/us/codecraft/webmagic/worker/WorkerTest.java @@ -0,0 +1,27 @@ +package us.codecraft.webmagic.worker; + +import org.junit.Test; +import us.codecraft.webmagic.Site; +import us.codecraft.webmagic.Spider; +import
us.codecraft.webmagic.processor.PageProcessor; + +import static org.assertj.core.api.Assertions.assertThat; +import static org.mockito.Mockito.*; + +/** + * @author code4crafter@gmail.com + */ +public class WorkerTest { + + @Test + public void testWorkerAsSpiderContains() throws Exception { + PageProcessor pageProcessor = mock(PageProcessor.class); + Site site = mock(Site.class); + when(pageProcessor.getSite()).thenReturn(site); + when(site.getDomain()).thenReturn("codecraft.us"); + Worker worker = new Worker(); + Spider spider = Spider.create(pageProcessor); + worker.addSpider(spider); + assertThat(worker.getSpider("codecraft.us")).isEqualTo(spider); + } +} diff --git a/webmagic-core/pom.xml b/webmagic-core/pom.xml index cbe5939..93ced05 100644 --- a/webmagic-core/pom.xml +++ b/webmagic-core/pom.xml @@ -3,7 +3,7 @@ us.codecraft webmagic-parent - 0.4.3 + 0.5.0 4.0.0 @@ -50,11 +50,6 @@ commons-collections - - net.sourceforge.htmlcleaner - htmlcleaner - - org.assertj assertj-core @@ -70,6 +65,17 @@ commons-io + + com.jayway.jsonpath + json-path + 0.8.1 + + + + com.alibaba + fastjson + + diff --git a/webmagic-core/src/main/java/us/codecraft/webmagic/Page.java b/webmagic-core/src/main/java/us/codecraft/webmagic/Page.java index a22fbdc..a74b608 100644 --- a/webmagic-core/src/main/java/us/codecraft/webmagic/Page.java +++ b/webmagic-core/src/main/java/us/codecraft/webmagic/Page.java @@ -2,6 +2,7 @@ package us.codecraft.webmagic; import org.apache.commons.lang3.StringUtils; import us.codecraft.webmagic.selector.Html; +import us.codecraft.webmagic.selector.Json; import us.codecraft.webmagic.selector.Selectable; import us.codecraft.webmagic.utils.UrlUtils; @@ -31,6 +32,8 @@ public class Page { private Html html; + private Json json; + private String rawText; private Selectable url; @@ -72,10 +75,23 @@ public class Page { return html; } + /** + * get json content of page + * + * @return json + * @since 0.5.0 + */ + public Json getJson() { + if (json == null) { + json = new 
Json(rawText); + } + return json; + } + /** * @param html * @deprecated since 0.4.0 - * The html is parse just when first time of calling {@link #getHtml()}, so use {@link #setRawText(String)} instead. + * The html is parse just when first time of calling {@link #getHtml()}, so use {@link #setRawText(String)} instead. */ public void setHtml(Html html) { this.html = html; @@ -94,7 +110,7 @@ public class Page { synchronized (targetRequests) { for (String s : requests) { if (StringUtils.isBlank(s) || s.equals("#") || s.startsWith("javascript:")) { - break; + continue; } s = UrlUtils.canonicalizeUrl(s, url.toString()); targetRequests.add(new Request(s)); @@ -111,7 +127,7 @@ public class Page { synchronized (targetRequests) { for (String s : requests) { if (StringUtils.isBlank(s) || s.equals("#") || s.startsWith("javascript:")) { - break; + continue; } s = UrlUtils.canonicalizeUrl(s, url.toString()); targetRequests.add(new Request(s).setPriority(priority)); diff --git a/webmagic-core/src/main/java/us/codecraft/webmagic/Request.java b/webmagic-core/src/main/java/us/codecraft/webmagic/Request.java index 142a20c..1f8a194 100644 --- a/webmagic-core/src/main/java/us/codecraft/webmagic/Request.java +++ b/webmagic-core/src/main/java/us/codecraft/webmagic/Request.java @@ -21,6 +21,8 @@ public class Request implements Serializable { private String url; + private String method; + /** * Store additional information in extras. */ @@ -106,10 +108,25 @@ public class Request implements Serializable { this.url = url; } + /** + * The http method of the request. Get for default. 
+ * @return httpMethod + * @see us.codecraft.webmagic.utils.HttpConstant.Method + * @since 0.5.0 + */ + public String getMethod() { + return method; + } + + public void setMethod(String method) { + this.method = method; + } + @Override public String toString() { return "Request{" + "url='" + url + '\'' + + ", method='" + method + '\'' + ", extras=" + extras + ", priority=" + priority + '}'; diff --git a/webmagic-core/src/main/java/us/codecraft/webmagic/ResultItems.java b/webmagic-core/src/main/java/us/codecraft/webmagic/ResultItems.java index 4791e77..7b54361 100644 --- a/webmagic-core/src/main/java/us/codecraft/webmagic/ResultItems.java +++ b/webmagic-core/src/main/java/us/codecraft/webmagic/ResultItems.java @@ -1,6 +1,7 @@ package us.codecraft.webmagic; import java.util.HashMap; +import java.util.LinkedHashMap; import java.util.Map; /** @@ -14,7 +15,7 @@ import java.util.Map; */ public class ResultItems { - private Map fields = new HashMap(); + private Map fields = new LinkedHashMap(); private Request request; diff --git a/webmagic-core/src/main/java/us/codecraft/webmagic/Site.java b/webmagic-core/src/main/java/us/codecraft/webmagic/Site.java index e83e85f..a7c7bf8 100644 --- a/webmagic-core/src/main/java/us/codecraft/webmagic/Site.java +++ b/webmagic-core/src/main/java/us/codecraft/webmagic/Site.java @@ -1,5 +1,7 @@ package us.codecraft.webmagic; +import com.google.common.collect.HashBasedTable; +import com.google.common.collect.Table; import org.apache.http.HttpHost; import us.codecraft.webmagic.utils.UrlUtils; @@ -18,7 +20,9 @@ public class Site { private String userAgent; - private Map cookies = new LinkedHashMap(); + private Map defaultCookies = new LinkedHashMap(); + + private Table cookies = HashBasedTable.create(); private String charset; @@ -45,6 +49,10 @@ public class Site { private boolean useGzip = true; + /** + * @see us.codecraft.webmagic.utils.HttpConstant.Header + * @deprecated + */ public static interface HeaderConst { public static final String 
REFERER = "Referer"; @@ -72,7 +80,20 @@ public class Site { * @return this */ public Site addCookie(String name, String value) { - cookies.put(name, value); + defaultCookies.put(name, value); + return this; + } + + /** + * Add a cookie for a specific domain. + * + * @param domain + * @param name + * @param value + * @return this + */ + public Site addCookie(String domain, String name, String value) { + cookies.put(domain, name, value); return this; } @@ -93,7 +114,16 @@ public class Site { * @return get cookies */ public Map<String, String> getCookies() { - return cookies; + return defaultCookies; + } + + /** + * Get the cookies of all domains. + * + * @return cookies keyed by domain, then by cookie name + */ + public Map<String, Map<String, String>> getAllCookies() { + return cookies.rowMap(); } /** @@ -203,10 +233,10 @@ public class Site { * Add a url to start url.
* Because urls are more a Spider's property than Site, move it to {@link Spider#addUrl(String...)}} * - * @deprecated - * @see Spider#addUrl(String...) * @param startUrl * @return this + * @see Spider#addUrl(String...) + * @deprecated */ public Site addStartUrl(String startUrl) { return addStartRequest(new Request(startUrl)); @@ -216,10 +246,10 @@ public class Site { * Add a url to start url.
* Because urls are more a Spider's property than Site, move it to {@link Spider#addRequest(Request...)}} * - * @deprecated - * @see Spider#addRequest(Request...) - * @param startUrl + * @param startRequest * @return this + * @see Spider#addRequest(Request...) + * @deprecated */ public Site addStartRequest(Request startRequest) { this.startRequests.add(startRequest); @@ -312,6 +342,7 @@ public class Site { /** * set up httpProxy for this site + * * @param httpProxy * @return */ @@ -364,7 +395,8 @@ public class Site { if (acceptStatCode != null ? !acceptStatCode.equals(site.acceptStatCode) : site.acceptStatCode != null) return false; if (charset != null ? !charset.equals(site.charset) : site.charset != null) return false; - if (cookies != null ? !cookies.equals(site.cookies) : site.cookies != null) return false; + if (defaultCookies != null ? !defaultCookies.equals(site.defaultCookies) : site.defaultCookies != null) + return false; if (domain != null ? !domain.equals(site.domain) : site.domain != null) return false; if (headers != null ? !headers.equals(site.headers) : site.headers != null) return false; if (startRequests != null ? !startRequests.equals(site.startRequests) : site.startRequests != null) @@ -378,7 +410,7 @@ public class Site { public int hashCode() { int result = domain != null ? domain.hashCode() : 0; result = 31 * result + (userAgent != null ? userAgent.hashCode() : 0); - result = 31 * result + (cookies != null ? cookies.hashCode() : 0); + result = 31 * result + (defaultCookies != null ? defaultCookies.hashCode() : 0); result = 31 * result + (charset != null ? charset.hashCode() : 0); result = 31 * result + (startRequests != null ? 
startRequests.hashCode() : 0); result = 31 * result + sleepTime; @@ -395,7 +427,7 @@ public class Site { return "Site{" + "domain='" + domain + '\'' + ", userAgent='" + userAgent + '\'' + - ", cookies=" + cookies + + ", cookies=" + defaultCookies + ", charset='" + charset + '\'' + ", startRequests=" + startRequests + ", sleepTime=" + sleepTime + diff --git a/webmagic-core/src/main/java/us/codecraft/webmagic/Spider.java b/webmagic-core/src/main/java/us/codecraft/webmagic/Spider.java index b6f95ac..81cf179 100644 --- a/webmagic-core/src/main/java/us/codecraft/webmagic/Spider.java +++ b/webmagic-core/src/main/java/us/codecraft/webmagic/Spider.java @@ -13,17 +13,14 @@ import us.codecraft.webmagic.pipeline.ResultItemsCollectorPipeline; import us.codecraft.webmagic.processor.PageProcessor; import us.codecraft.webmagic.scheduler.QueueScheduler; import us.codecraft.webmagic.scheduler.Scheduler; -import us.codecraft.webmagic.utils.EnvironmentUtil; -import us.codecraft.webmagic.utils.ThreadUtils; +import us.codecraft.webmagic.selector.thread.CountableThreadPool; import us.codecraft.webmagic.utils.UrlUtils; import java.io.Closeable; import java.io.IOException; -import java.util.ArrayList; -import java.util.Collection; -import java.util.List; -import java.util.UUID; +import java.util.*; import java.util.concurrent.ExecutorService; +import java.util.concurrent.TimeUnit; import java.util.concurrent.atomic.AtomicInteger; import java.util.concurrent.atomic.AtomicLong; import java.util.concurrent.locks.Condition; @@ -78,6 +75,8 @@ public class Spider implements Runnable, Task { protected Logger logger = LoggerFactory.getLogger(getClass()); + protected CountableThreadPool threadPool; + protected ExecutorService executorService; protected int threadNum = 1; @@ -100,10 +99,14 @@ public class Spider implements Runnable, Task { private Condition newUrlCondition = newUrlLock.newCondition(); - private final AtomicInteger threadAlive = new AtomicInteger(0); + private List spiderListeners; 
private final AtomicLong pageCount = new AtomicLong(0); + private Date startTime; + + private int emptySleepTime = 30000; + /** * create a spider with pageProcessor. * @@ -143,7 +146,7 @@ public class Spider implements Runnable, Task { * Set startUrls of Spider.
* Prior to startUrls of Site. * - * @param startUrls + * @param startRequests * @return this */ public Spider startRequest(List startRequests) { @@ -186,7 +189,14 @@ public class Spider implements Runnable, Task { */ public Spider setScheduler(Scheduler scheduler) { checkIfRunning(); + Scheduler oldScheduler = this.scheduler; this.scheduler = scheduler; + if (oldScheduler != null) { + Request request; + while ((request = oldScheduler.poll(this)) != null) { + this.scheduler.push(request, this); + } + } return this; } @@ -219,7 +229,7 @@ public class Spider implements Runnable, Task { /** * set pipelines for Spider * - * @param pipeline + * @param pipelines * @return this * @see Pipeline * @since 0.4.1 @@ -273,8 +283,12 @@ public class Spider implements Runnable, Task { pipelines.add(new ConsolePipeline()); } downloader.setThread(threadNum); - if (executorService == null || executorService.isShutdown()) { - executorService = ThreadUtils.newFixedThreadPool(threadNum); + if (threadPool == null || threadPool.isShutdown()) { + if (executorService != null && !executorService.isShutdown()) { + threadPool = new CountableThreadPool(threadNum, executorService); + } else { + threadPool = new CountableThreadPool(threadNum); + } } if (startRequests != null) { for (Request request : startRequests) { @@ -282,7 +296,7 @@ public class Spider implements Runnable, Task { } startRequests.clear(); } - threadAlive.set(0); + startTime = new Date(); } @Override @@ -293,23 +307,23 @@ public class Spider implements Runnable, Task { while (!Thread.currentThread().isInterrupted() && stat.get() == STAT_RUNNING) { Request request = scheduler.poll(this); if (request == null) { - if (threadAlive.get() == 0 && exitWhenComplete) { + if (threadPool.getThreadAlive() == 0 && exitWhenComplete) { break; } // wait until new url added waitNewUrl(); } else { final Request requestFinal = request; - threadAlive.incrementAndGet(); - executorService.execute(new Runnable() { + threadPool.execute(new Runnable() { 
@Override public void run() { try { processRequest(requestFinal); + onSuccess(requestFinal); } catch (Exception e) { - logger.error("download " + requestFinal + " error", e); + onError(requestFinal); + logger.error("process request " + requestFinal + " error", e); } finally { - threadAlive.decrementAndGet(); pageCount.incrementAndGet(); signalNewUrl(); } @@ -324,6 +338,22 @@ public class Spider implements Runnable, Task { } } + protected void onError(Request request) { + if (CollectionUtils.isNotEmpty(spiderListeners)) { + for (SpiderListener spiderListener : spiderListeners) { + spiderListener.onError(request); + } + } + } + + protected void onSuccess(Request request) { + if (CollectionUtils.isNotEmpty(spiderListeners)) { + for (SpiderListener spiderListener : spiderListeners) { + spiderListener.onSuccess(request); + } + } + } + private void checkRunningStat() { while (true) { int statNow = stat.get(); @@ -342,7 +372,7 @@ public class Spider implements Runnable, Task { for (Pipeline pipeline : pipelines) { destroyEach(pipeline); } - executorService.shutdown(); + threadPool.shutdown(); } private void destroyEach(Object object) { @@ -373,6 +403,7 @@ public class Spider implements Runnable, Task { Page page = downloader.download(request, this); if (page == null) { sleep(site.getSleepTime()); + onError(request); return; } // for cycle retry @@ -478,7 +509,7 @@ public class Spider implements Runnable, Task { /** * Add urls with information to crawl.
* - * @param urls + * @param requests * @return */ public Spider addRequest(Request... requests) { @@ -490,16 +521,15 @@ public class Spider implements Runnable, Task { } private void waitNewUrl() { + newUrlLock.lock(); try { - newUrlLock.lock(); //double check - if (threadAlive.get() == 0 && exitWhenComplete) { + if (threadPool.getThreadAlive() == 0 && exitWhenComplete) { return; } - try { - newUrlCondition.await(); - } catch (InterruptedException e) { - } + newUrlCondition.await(emptySleepTime, TimeUnit.MILLISECONDS); + } catch (InterruptedException e) { + logger.warn("waitNewUrl - interrupted, error {}", e); } finally { newUrlLock.unlock(); } @@ -542,12 +572,18 @@ public class Spider implements Runnable, Task { } /** - * switch off xsoup + * start with more than one threads * - * @return + * @param threadNum + * @return this */ - public static void xsoupOff() { - EnvironmentUtil.setUseXsoup(false); + public Spider thread(ExecutorService executorService, int threadNum) { + checkIfRunning(); + this.threadNum = threadNum; + if (threadNum <= 0) { + throw new IllegalArgumentException("threadNum should be more than one!"); + } + return this; } public boolean isExitWhenComplete() { @@ -624,7 +660,10 @@ public class Spider implements Runnable, Task { * @since 0.4.1 */ public int getThreadAlive() { - return threadAlive.get(); + if (threadPool == null) { + return 0; + } + return threadPool.getThreadAlive(); } /** @@ -653,8 +692,40 @@ public class Spider implements Runnable, Task { return uuid; } + public Spider setExecutorService(ExecutorService executorService) { + checkIfRunning(); + this.executorService = executorService; + return this; + } + @Override public Site getSite() { return site; } + + public List getSpiderListeners() { + return spiderListeners; + } + + public Spider setSpiderListeners(List spiderListeners) { + this.spiderListeners = spiderListeners; + return this; + } + + public Date getStartTime() { + return startTime; + } + + public Scheduler getScheduler() 
{ + return scheduler; + } + + /** + * Set wait time when no url is polled.

+ * + * @param emptySleepTime In MILLISECONDS. + */ + public void setEmptySleepTime(int emptySleepTime) { + this.emptySleepTime = emptySleepTime; + } } diff --git a/webmagic-core/src/main/java/us/codecraft/webmagic/SpiderListener.java b/webmagic-core/src/main/java/us/codecraft/webmagic/SpiderListener.java new file mode 100644 index 0000000..0678180 --- /dev/null +++ b/webmagic-core/src/main/java/us/codecraft/webmagic/SpiderListener.java @@ -0,0 +1,14 @@ +package us.codecraft.webmagic; + +/** + * Listener for a Spider's page processing. Used for monitoring and similar purposes. + * + * @author code4crafer@gmail.com + * @since 0.5.0 + */ +public interface SpiderListener { + + public void onSuccess(Request request); + + public void onError(Request request); +} diff --git a/webmagic-core/src/main/java/us/codecraft/webmagic/downloader/AbstractDownloader.java b/webmagic-core/src/main/java/us/codecraft/webmagic/downloader/AbstractDownloader.java index 2336856..5940c2f 100644 --- a/webmagic-core/src/main/java/us/codecraft/webmagic/downloader/AbstractDownloader.java +++ b/webmagic-core/src/main/java/us/codecraft/webmagic/downloader/AbstractDownloader.java @@ -34,6 +34,12 @@ public abstract class AbstractDownloader implements Downloader { return (Html) page.getHtml(); } + protected void onSuccess(Request request) { + } + + protected void onError(Request request) { + } + protected Page addToCycleRetry(Request request, Site site) { Page page = new Page(); Object cycleTriedTimesObject = request.getExtra(Request.CYCLE_TRIED_TIMES); diff --git a/webmagic-core/src/main/java/us/codecraft/webmagic/downloader/HttpClientDownloader.java b/webmagic-core/src/main/java/us/codecraft/webmagic/downloader/HttpClientDownloader.java index bcf4a53..eeae70e 100644 --- a/webmagic-core/src/main/java/us/codecraft/webmagic/downloader/HttpClientDownloader.java +++ b/webmagic-core/src/main/java/us/codecraft/webmagic/downloader/HttpClientDownloader.java @@ -3,10 +3,12 @@ package us.codecraft.webmagic.downloader;
import com.google.common.collect.Sets; import org.apache.commons.io.IOUtils; import org.apache.http.HttpResponse; +import org.apache.http.NameValuePair; import org.apache.http.annotation.ThreadSafe; import org.apache.http.client.config.CookieSpecs; import org.apache.http.client.config.RequestConfig; import org.apache.http.client.methods.CloseableHttpResponse; +import org.apache.http.client.methods.HttpUriRequest; import org.apache.http.client.methods.RequestBuilder; import org.apache.http.impl.client.CloseableHttpClient; import org.apache.http.util.EntityUtils; @@ -16,6 +18,7 @@ import us.codecraft.webmagic.Page; import us.codecraft.webmagic.Request; import us.codecraft.webmagic.Site; import us.codecraft.webmagic.Task; +import us.codecraft.webmagic.utils.HttpConstant; import us.codecraft.webmagic.selector.PlainText; import us.codecraft.webmagic.utils.UrlUtils; @@ -74,8 +77,55 @@ public class HttpClientDownloader extends AbstractDownloader { } else { acceptStatCode = Sets.newHashSet(200); } - logger.info("downloading page " + request.getUrl()); - RequestBuilder requestBuilder = RequestBuilder.get().setUri(request.getUrl()); + logger.info("downloading page {}", request.getUrl()); + CloseableHttpResponse httpResponse = null; + try { + HttpUriRequest httpUriRequest = getHttpUriRequest(request, site, headers); + httpResponse = getHttpClient(site).execute(httpUriRequest); + int statusCode = httpResponse.getStatusLine().getStatusCode(); + if (statusAccept(acceptStatCode, statusCode)) { + //charset + if (charset == null) { + String value = httpResponse.getEntity().getContentType().getValue(); + charset = UrlUtils.getCharset(value); + } + Page page = handleResponse(request, charset, httpResponse, task); + onSuccess(request); + return page; + } else { + logger.warn("code error " + statusCode + "\t" + request.getUrl()); + return null; + } + } catch (IOException e) { + logger.warn("download page " + request.getUrl() + " error", e); + if (site.getCycleRetryTimes() > 0) { + 
return addToCycleRetry(request, site); + } + onError(request); + return null; + } finally { + try { + if (httpResponse != null) { + //ensure the connection is released back to pool + EntityUtils.consume(httpResponse.getEntity()); + } + } catch (IOException e) { + logger.warn("close response fail", e); + } + } + } + + @Override + public void setThread(int thread) { + httpClientGenerator.setPoolSize(thread); + } + + protected boolean statusAccept(Set acceptStatCode, int statusCode) { + return acceptStatCode.contains(statusCode); + } + + protected HttpUriRequest getHttpUriRequest(Request request, Site site, Map headers) { + RequestBuilder requestBuilder = selectRequestMethod(request).setUri(request.getUrl()); if (headers != null) { for (Map.Entry headerEntry : headers.entrySet()) { requestBuilder.addHeader(headerEntry.getKey(), headerEntry.getValue()); @@ -90,37 +140,31 @@ public class HttpClientDownloader extends AbstractDownloader { requestConfigBuilder.setProxy(site.getHttpProxy()); } requestBuilder.setConfig(requestConfigBuilder.build()); - CloseableHttpResponse httpResponse = null; - try { - httpResponse = getHttpClient(site).execute(requestBuilder.build()); - int statusCode = httpResponse.getStatusLine().getStatusCode(); - if (acceptStatCode.contains(statusCode)) { - //charset - if (charset == null) { - String value = httpResponse.getEntity().getContentType().getValue(); - charset = UrlUtils.getCharset(value); - } - return handleResponse(request, charset, httpResponse, task); - } else { - logger.warn("code error " + statusCode + "\t" + request.getUrl()); - return null; - } - } catch (IOException e) { - logger.warn("download page " + request.getUrl() + " error", e); - if (site.getCycleRetryTimes() > 0) { - return addToCycleRetry(request, site); - } - return null; - } finally { - try { - if (httpResponse != null) { - //ensure the connection is released back to pool - EntityUtils.consume(httpResponse.getEntity()); - } - } catch (IOException e) { - 
logger.warn("close response fail", e); + return requestBuilder.build(); + } + + protected RequestBuilder selectRequestMethod(Request request) { + String method = request.getMethod(); + if (method == null || method.equalsIgnoreCase(HttpConstant.Method.GET)) { + //default get + return RequestBuilder.get(); + } else if (method.equalsIgnoreCase(HttpConstant.Method.POST)) { + RequestBuilder requestBuilder = RequestBuilder.post(); + NameValuePair[] nameValuePair = (NameValuePair[]) request.getExtra("nameValuePair"); + if (nameValuePair.length > 0) { + requestBuilder.addParameters(nameValuePair); } + return requestBuilder; + } else if (method.equalsIgnoreCase(HttpConstant.Method.HEAD)) { + return RequestBuilder.head(); + } else if (method.equalsIgnoreCase(HttpConstant.Method.PUT)) { + return RequestBuilder.put(); + } else if (method.equalsIgnoreCase(HttpConstant.Method.DELETE)) { + return RequestBuilder.delete(); + } else if (method.equalsIgnoreCase(HttpConstant.Method.TRACE)) { + return RequestBuilder.trace(); } + throw new IllegalArgumentException("Illegal HTTP Method " + method); } protected Page handleResponse(Request request, String charset, HttpResponse httpResponse, Task task) throws IOException { @@ -132,9 +176,4 @@ public class HttpClientDownloader extends AbstractDownloader { page.setStatusCode(httpResponse.getStatusLine().getStatusCode()); return page; } - - @Override - public void setThread(int thread) { - httpClientGenerator.setPoolSize(thread); - } } diff --git a/webmagic-core/src/main/java/us/codecraft/webmagic/downloader/HttpClientGenerator.java b/webmagic-core/src/main/java/us/codecraft/webmagic/downloader/HttpClientGenerator.java index edb3a49..136d9c5 100644 --- a/webmagic-core/src/main/java/us/codecraft/webmagic/downloader/HttpClientGenerator.java +++ b/webmagic-core/src/main/java/us/codecraft/webmagic/downloader/HttpClientGenerator.java @@ -36,7 +36,7 @@ public class HttpClientGenerator { connectionManager.setDefaultMaxPerRoute(100); } - public 
HttpClientGenerator setPoolSize(int poolSize){ + public HttpClientGenerator setPoolSize(int poolSize) { connectionManager.setMaxTotal(poolSize); return this; } @@ -76,10 +76,15 @@ public class HttpClientGenerator { private void generateCookie(HttpClientBuilder httpClientBuilder, Site site) { CookieStore cookieStore = new BasicCookieStore(); - if (site.getCookies() != null) { - for (Map.Entry cookieEntry : site.getCookies().entrySet()) { + for (Map.Entry cookieEntry : site.getCookies().entrySet()) { + BasicClientCookie cookie = new BasicClientCookie(cookieEntry.getKey(), cookieEntry.getValue()); + cookie.setDomain(site.getDomain()); + cookieStore.addCookie(cookie); + } + for (Map.Entry> domainEntry : site.getAllCookies().entrySet()) { + for (Map.Entry cookieEntry : domainEntry.getValue().entrySet()) { BasicClientCookie cookie = new BasicClientCookie(cookieEntry.getKey(), cookieEntry.getValue()); - cookie.setDomain(site.getDomain()); + cookie.setDomain(domainEntry.getKey()); cookieStore.addCookie(cookie); } } diff --git a/webmagic-core/src/main/java/us/codecraft/webmagic/processor/example/GithubRepoPageProcessor.java b/webmagic-core/src/main/java/us/codecraft/webmagic/processor/example/GithubRepoPageProcessor.java index 179bad4..f4ae058 100644 --- a/webmagic-core/src/main/java/us/codecraft/webmagic/processor/example/GithubRepoPageProcessor.java +++ b/webmagic-core/src/main/java/us/codecraft/webmagic/processor/example/GithubRepoPageProcessor.java @@ -11,11 +11,12 @@ import us.codecraft.webmagic.processor.PageProcessor; */ public class GithubRepoPageProcessor implements PageProcessor { - private Site site = Site.me().setRetryTimes(3).setSleepTime(100); + private Site site = Site.me().setRetryTimes(3).setSleepTime(0); @Override public void process(Page page) { page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all()); + page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+)").all()); page.putField("author", 
page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString()); page.putField("name", page.getHtml().xpath("//h1[@class='entry-title public']/strong/a/text()").toString()); if (page.getResultItems().get("name")==null){ diff --git a/webmagic-core/src/main/java/us/codecraft/webmagic/processor/example/OschinaBlogPageProcessor.java b/webmagic-core/src/main/java/us/codecraft/webmagic/processor/example/OschinaBlogPageProcessor.java index aac0ac1..053c155 100644 --- a/webmagic-core/src/main/java/us/codecraft/webmagic/processor/example/OschinaBlogPageProcessor.java +++ b/webmagic-core/src/main/java/us/codecraft/webmagic/processor/example/OschinaBlogPageProcessor.java @@ -34,6 +34,6 @@ public class OschinaBlogPageProcessor implements PageProcessor { } public static void main(String[] args) { - Spider.create(new OschinaBlogPageProcessor()).addUrl("http://my.oschina.net/flashsword/blog").thread(2).run(); + Spider.create(new OschinaBlogPageProcessor()).addUrl("http://my.oschina.net/flashsword/blog").run(); } } diff --git a/webmagic-core/src/main/java/us/codecraft/webmagic/scheduler/DuplicatedRemoveScheduler.java b/webmagic-core/src/main/java/us/codecraft/webmagic/scheduler/DuplicatedRemoveScheduler.java new file mode 100644 index 0000000..7b319b6 --- /dev/null +++ b/webmagic-core/src/main/java/us/codecraft/webmagic/scheduler/DuplicatedRemoveScheduler.java @@ -0,0 +1,45 @@ +package us.codecraft.webmagic.scheduler; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; +import us.codecraft.webmagic.Request; +import us.codecraft.webmagic.Task; + +/** + * Remove duplicate urls and only push urls which are not duplicate.

+ * + * @author code4crafer@gmail.com + * @since 0.5.0 + */ +public abstract class DuplicatedRemoveScheduler implements Scheduler { + + protected Logger logger = LoggerFactory.getLogger(getClass()); + + @Override + public void push(Request request, Task task) { + logger.trace("get a candidate url {}", request.getUrl()); + if (isDuplicate(request, task) || shouldReserved(request)) { + logger.debug("push to queue {}", request.getUrl()); + pushWhenNoDuplicate(request, task); + } + } + + /** + * Reset duplicate check. + */ + public abstract void resetDuplicateCheck(Task task); + + /** + * @param request + * @return + */ + protected abstract boolean isDuplicate(Request request, Task task); + + protected boolean shouldReserved(Request request) { + return request.getExtra(Request.CYCLE_TRIED_TIMES) != null; + } + + protected void pushWhenNoDuplicate(Request request, Task task) { + + } +} diff --git a/webmagic-core/src/main/java/us/codecraft/webmagic/scheduler/LocalDuplicatedRemoveScheduler.java b/webmagic-core/src/main/java/us/codecraft/webmagic/scheduler/LocalDuplicatedRemoveScheduler.java new file mode 100644 index 0000000..c127c98 --- /dev/null +++ b/webmagic-core/src/main/java/us/codecraft/webmagic/scheduler/LocalDuplicatedRemoveScheduler.java @@ -0,0 +1,34 @@ +package us.codecraft.webmagic.scheduler; + +import com.google.common.collect.Sets; +import us.codecraft.webmagic.Request; +import us.codecraft.webmagic.Task; + +import java.util.Set; +import java.util.concurrent.ConcurrentHashMap; + +/** + * Base Scheduler with duplicated urls removed by hash set.

+ * + * @author code4crafter@gmail.com + * @since 0.5.0 + */ +public abstract class LocalDuplicatedRemoveScheduler extends DuplicatedRemoveScheduler implements MonitorableScheduler { + + private Set urls = Sets.newSetFromMap(new ConcurrentHashMap()); + + @Override + public void resetDuplicateCheck(Task task) { + urls.clear(); + } + + @Override + protected boolean isDuplicate(Request request, Task task) { + return urls.add(request.getUrl()); + } + + @Override + public int getTotalRequestsCount(Task task) { + return urls.size(); + } +} diff --git a/webmagic-core/src/main/java/us/codecraft/webmagic/scheduler/LocalDuplicatedRemovedScheduler.java b/webmagic-core/src/main/java/us/codecraft/webmagic/scheduler/LocalDuplicatedRemovedScheduler.java deleted file mode 100644 index c4b08f3..0000000 --- a/webmagic-core/src/main/java/us/codecraft/webmagic/scheduler/LocalDuplicatedRemovedScheduler.java +++ /dev/null @@ -1,33 +0,0 @@ -package us.codecraft.webmagic.scheduler; - -import com.google.common.collect.Sets; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; -import us.codecraft.webmagic.Request; -import us.codecraft.webmagic.Task; - -import java.util.Set; -import java.util.concurrent.ConcurrentHashMap; - -/** - * Base Scheduler with duplicated urls removed locally. 
- * - * @author code4crafter@gmail.com - * @since 0.5.0 - */ -public abstract class LocalDuplicatedRemovedScheduler implements Scheduler { - - protected Logger logger = LoggerFactory.getLogger(getClass()); - - private Set urls = Sets.newSetFromMap(new ConcurrentHashMap()); - - @Override - public void push(Request request, Task task) { - logger.debug("push to queue " + request.getUrl()); - if (request.getExtra(Request.CYCLE_TRIED_TIMES) != null || urls.add(request.getUrl())) { - pushWhenNoDuplicate(request, task); - } - } - - protected abstract void pushWhenNoDuplicate(Request request, Task task); -} diff --git a/webmagic-core/src/main/java/us/codecraft/webmagic/scheduler/MonitorableScheduler.java b/webmagic-core/src/main/java/us/codecraft/webmagic/scheduler/MonitorableScheduler.java new file mode 100644 index 0000000..ca76dfa --- /dev/null +++ b/webmagic-core/src/main/java/us/codecraft/webmagic/scheduler/MonitorableScheduler.java @@ -0,0 +1,17 @@ +package us.codecraft.webmagic.scheduler; + +import us.codecraft.webmagic.Task; + +/** + * The scheduler whose requests can be counted for monitor. 
+ * + * @author code4crafter@gmail.com + * @since 0.5.0 + */ +public interface MonitorableScheduler extends Scheduler { + + public int getLeftRequestsCount(Task task); + + public int getTotalRequestsCount(Task task); + +} \ No newline at end of file diff --git a/webmagic-core/src/main/java/us/codecraft/webmagic/scheduler/PriorityScheduler.java b/webmagic-core/src/main/java/us/codecraft/webmagic/scheduler/PriorityScheduler.java index 04917ad..38c9b6c 100644 --- a/webmagic-core/src/main/java/us/codecraft/webmagic/scheduler/PriorityScheduler.java +++ b/webmagic-core/src/main/java/us/codecraft/webmagic/scheduler/PriorityScheduler.java @@ -17,7 +17,7 @@ import java.util.concurrent.PriorityBlockingQueue; * @since 0.2.1 */ @ThreadSafe -public class PriorityScheduler extends LocalDuplicatedRemovedScheduler { +public class PriorityScheduler extends LocalDuplicatedRemoveScheduler { public static final int INITIAL_CAPACITY = 5; @@ -60,4 +60,9 @@ public class PriorityScheduler extends LocalDuplicatedRemovedScheduler { } return priorityQueueMinus.poll(); } + + @Override + public int getLeftRequestsCount(Task task) { + return noPriorityQueue.size(); + } } diff --git a/webmagic-core/src/main/java/us/codecraft/webmagic/scheduler/QueueScheduler.java b/webmagic-core/src/main/java/us/codecraft/webmagic/scheduler/QueueScheduler.java index ab288df..511d8a0 100644 --- a/webmagic-core/src/main/java/us/codecraft/webmagic/scheduler/QueueScheduler.java +++ b/webmagic-core/src/main/java/us/codecraft/webmagic/scheduler/QueueScheduler.java @@ -16,7 +16,7 @@ import java.util.concurrent.LinkedBlockingQueue; * @since 0.1.0 */ @ThreadSafe -public class QueueScheduler extends LocalDuplicatedRemovedScheduler { +public class QueueScheduler extends LocalDuplicatedRemoveScheduler { private BlockingQueue queue = new LinkedBlockingQueue(); @@ -29,4 +29,9 @@ public class QueueScheduler extends LocalDuplicatedRemovedScheduler { public synchronized Request poll(Task task) { return queue.poll(); } + + 
@Override + public int getLeftRequestsCount(Task task) { + return queue.size(); + } } diff --git a/webmagic-core/src/main/java/us/codecraft/webmagic/selector/Html.java b/webmagic-core/src/main/java/us/codecraft/webmagic/selector/Html.java index 3f5df76..34386b5 100644 --- a/webmagic-core/src/main/java/us/codecraft/webmagic/selector/Html.java +++ b/webmagic-core/src/main/java/us/codecraft/webmagic/selector/Html.java @@ -4,7 +4,6 @@ import org.jsoup.Jsoup; import org.jsoup.nodes.Document; import org.slf4j.Logger; import org.slf4j.LoggerFactory; -import us.codecraft.webmagic.utils.EnvironmentUtil; import java.util.ArrayList; import java.util.List; @@ -24,7 +23,7 @@ public class Html extends PlainText { */ private Document document; - private boolean init = false; + private boolean needInitCache = true; public Html(List strings) { super(strings); @@ -34,12 +33,22 @@ public class Html extends PlainText { super(text); } + public Html(List strings, boolean needInitCache) { + super(strings); + this.needInitCache = needInitCache; + } + + public Html(String text, boolean needInitCache) { + super(text); + this.needInitCache = needInitCache; + } + /** * lazy init */ private void initDocument() { - if (this.document == null && !init) { - init = true; + if (this.document == null && needInitCache) { + needInitCache = false; //just init once whether the parsing succeeds or not try { this.document = Jsoup.parse(getText()); @@ -68,7 +77,7 @@ public class Html extends PlainText { results.add(result); } } - return new Html(results); + return new Html(results, false); } @Override @@ -79,7 +88,7 @@ public class Html extends PlainText { List result = selector.selectList(string); results.addAll(result); } - return new Html(results); + return new Html(results, false); } @Override @@ -96,23 +105,18 @@ public class Html extends PlainText { @Override public Selectable xpath(String xpath) { - if (EnvironmentUtil.useXsoup()) { - XsoupSelector xsoupSelector = new XsoupSelector(xpath); - if 
(document != null) { - return new Html(xsoupSelector.selectList(document)); - } - return selectList(xsoupSelector, strings); - } else { - XpathSelector xpathSelector = new XpathSelector(xpath); - return selectList(xpathSelector, strings); + XpathSelector xpathSelector = Selectors.xpath(xpath); + if (document != null) { + return new Html(xpathSelector.selectList(document), false); } + return selectList(xpathSelector, strings); } @Override public Selectable $(String selector) { CssSelector cssSelector = Selectors.$(selector); if (document != null) { - return new Html(cssSelector.selectList(document)); + return new Html(cssSelector.selectList(document), false); } return selectList(cssSelector, strings); } @@ -121,12 +125,13 @@ public class Html extends PlainText { public Selectable $(String selector, String attrName) { CssSelector cssSelector = Selectors.$(selector, attrName); if (document != null) { - return new Html(cssSelector.selectList(document)); + return new Html(cssSelector.selectList(document), false); } return selectList(cssSelector, strings); } public Document getDocument() { + initDocument(); return document; } diff --git a/webmagic-core/src/main/java/us/codecraft/webmagic/selector/Json.java b/webmagic-core/src/main/java/us/codecraft/webmagic/selector/Json.java new file mode 100644 index 0000000..ef45d00 --- /dev/null +++ b/webmagic-core/src/main/java/us/codecraft/webmagic/selector/Json.java @@ -0,0 +1,64 @@ +package us.codecraft.webmagic.selector; + +import com.alibaba.fastjson.JSON; +import org.jsoup.parser.TokenQueue; + +import java.util.List; + +/** + * parse json + * @author code4crafter@gmail.com + * @since 0.5.0 + */ +public class Json extends PlainText { + + public Json(List strings) { + super(strings); + } + + public Json(String text) { + super(text); + } + + /** + * remove padding for JSONP + * @param padding + * @return + */ + public Json removePadding(String padding) { + String text = getText(); + TokenQueue tokenQueue = new TokenQueue(text); + 
tokenQueue.consumeWhitespace(); + tokenQueue.consume(padding); + tokenQueue.consumeWhitespace(); + String chompBalanced = tokenQueue.chompBalanced('(', ')'); + return new Json(chompBalanced); + } + + public T toObject(Class clazz) { + if (getText() == null) { + return null; + } + return JSON.parseObject(getText(), clazz); + } + + public List toList(Class clazz) { + if (getText() == null) { + return null; + } + return JSON.parseArray(getText(), clazz); + } + + public String getText() { + if (strings != null && strings.size() > 0) { + return strings.get(0); + } + return null; + } + + @Override + public Selectable jsonPath(String jsonPath) { + JsonPathSelector jsonPathSelector = new JsonPathSelector(jsonPath); + return selectList(jsonPathSelector,strings); + } +} diff --git a/webmagic-extension/src/main/java/us/codecraft/webmagic/selector/JsonPathSelector.java b/webmagic-core/src/main/java/us/codecraft/webmagic/selector/JsonPathSelector.java similarity index 91% rename from webmagic-extension/src/main/java/us/codecraft/webmagic/selector/JsonPathSelector.java rename to webmagic-core/src/main/java/us/codecraft/webmagic/selector/JsonPathSelector.java index 781669f..f9083a8 100644 --- a/webmagic-extension/src/main/java/us/codecraft/webmagic/selector/JsonPathSelector.java +++ b/webmagic-core/src/main/java/us/codecraft/webmagic/selector/JsonPathSelector.java @@ -1,7 +1,6 @@ package us.codecraft.webmagic.selector; import com.jayway.jsonpath.JsonPath; -import us.codecraft.webmagic.utils.Experimental; import java.util.ArrayList; import java.util.List; @@ -13,7 +12,6 @@ import java.util.List; * @author code4crafter@gmail.com
* @since 0.2.1 */ -@Experimental public class JsonPathSelector implements Selector { private String jsonPathStr; @@ -22,7 +20,7 @@ public class JsonPathSelector implements Selector { public JsonPathSelector(String jsonPathStr) { this.jsonPathStr = jsonPathStr; - this.jsonPath = JsonPath.compile(jsonPathStr); + this.jsonPath = JsonPath.compile(this.jsonPathStr); } @Override diff --git a/webmagic-core/src/main/java/us/codecraft/webmagic/selector/PlainText.java b/webmagic-core/src/main/java/us/codecraft/webmagic/selector/PlainText.java index bb1b868..efa38d8 100644 --- a/webmagic-core/src/main/java/us/codecraft/webmagic/selector/PlainText.java +++ b/webmagic-core/src/main/java/us/codecraft/webmagic/selector/PlainText.java @@ -109,7 +109,12 @@ public class PlainText implements Selectable { } @Override - public String toString() { + public Selectable jsonPath(String jsonPath) { + throw new UnsupportedOperationException(); + } + + @Override + public String get() { if (CollectionUtils.isNotEmpty(all())) { return all().get(0); } else { @@ -117,6 +122,21 @@ public class PlainText implements Selectable { } } + @Override + public Selectable select(Selector selector) { + return select(selector, strings); + } + + @Override + public Selectable selectList(Selector selector) { + return selectList(selector, strings); + } + + @Override + public String toString() { + return get(); + } + @Override public boolean match() { return strings != null && strings.size() > 0; diff --git a/webmagic-core/src/main/java/us/codecraft/webmagic/selector/Selectable.java b/webmagic-core/src/main/java/us/codecraft/webmagic/selector/Selectable.java index 6b4410e..2cc4ed9 100644 --- a/webmagic-core/src/main/java/us/codecraft/webmagic/selector/Selectable.java +++ b/webmagic-core/src/main/java/us/codecraft/webmagic/selector/Selectable.java @@ -99,6 +99,13 @@ public interface Selectable { */ public String toString(); + /** + * single string result + * + * @return single string result + */ + public String 
get(); + /** * if result exist for select * @@ -112,4 +119,28 @@ public interface Selectable { * @return multi string result */ public List all(); + + /** + * extract by JSON Path expression + * + * @param jsonPath + * @return + */ + public Selectable jsonPath(String jsonPath); + + /** + * extract by custom selector + * + * @param selector + * @return + */ + public Selectable select(Selector selector); + + /** + * extract by custom selector + * + * @param selector + * @return + */ + public Selectable selectList(Selector selector); } diff --git a/webmagic-core/src/main/java/us/codecraft/webmagic/selector/Selectors.java b/webmagic-core/src/main/java/us/codecraft/webmagic/selector/Selectors.java index 0c34ead..6cac964 100644 --- a/webmagic-core/src/main/java/us/codecraft/webmagic/selector/Selectors.java +++ b/webmagic-core/src/main/java/us/codecraft/webmagic/selector/Selectors.java @@ -32,8 +32,12 @@ public abstract class Selectors { return new XpathSelector(expr); } - public static XsoupSelector xsoup(String expr) { - return new XsoupSelector(expr); + /** + * @Deprecated + * @see #xpath(String) + */ + public static XpathSelector xsoup(String expr) { + return new XpathSelector(expr); } public static AndSelector and(Selector... selectors) { diff --git a/webmagic-core/src/main/java/us/codecraft/webmagic/selector/XpathSelector.java b/webmagic-core/src/main/java/us/codecraft/webmagic/selector/XpathSelector.java index c0e428c..d1bbcae 100644 --- a/webmagic-core/src/main/java/us/codecraft/webmagic/selector/XpathSelector.java +++ b/webmagic-core/src/main/java/us/codecraft/webmagic/selector/XpathSelector.java @@ -1,70 +1,32 @@ package us.codecraft.webmagic.selector; -import org.htmlcleaner.*; +import org.jsoup.nodes.Element; +import us.codecraft.xsoup.XPathEvaluator; +import us.codecraft.xsoup.Xsoup; -import java.util.ArrayList; import java.util.List; /** - * XPath selector based on HtmlCleaner.
+ * XPath selector based on Xsoup.
* * @author code4crafter@gmail.com
- * @since 0.1.0 + * @since 0.3.0 */ -public class XpathSelector implements Selector { +public class XpathSelector extends BaseElementSelector { - private String xpathStr; + private XPathEvaluator xPathEvaluator; public XpathSelector(String xpathStr) { - this.xpathStr = xpathStr; + this.xPathEvaluator = Xsoup.compile(xpathStr); } @Override - public String select(String text) { - HtmlCleaner htmlCleaner = new HtmlCleaner(); - TagNode tagNode = htmlCleaner.clean(text); - if (tagNode == null) { - return null; - } - try { - Object[] objects = tagNode.evaluateXPath(xpathStr); - if (objects != null && objects.length >= 1) { - if (objects[0] instanceof TagNode) { - TagNode tagNode1 = (TagNode) objects[0]; - return htmlCleaner.getInnerHtml(tagNode1); - } else { - return objects[0].toString(); - } - } - } catch (XPatherException e) { - e.printStackTrace(); - } - return null; + public String select(Element element) { + return xPathEvaluator.evaluate(element).get(); } @Override - public List selectList(String text) { - HtmlCleaner htmlCleaner = new HtmlCleaner(); - TagNode tagNode = htmlCleaner.clean(text); - if (tagNode == null) { - return null; - } - List results = new ArrayList(); - try { - Object[] objects = tagNode.evaluateXPath(xpathStr); - if (objects != null && objects.length >= 1) { - for (Object object : objects) { - if (object instanceof TagNode) { - TagNode tagNode1 = (TagNode) object; - results.add(htmlCleaner.getInnerHtml(tagNode1)); - } else { - results.add(object.toString()); - } - } - } - } catch (XPatherException e) { - e.printStackTrace(); - } - return results; + public List selectList(Element element) { + return xPathEvaluator.evaluate(element).list(); } } diff --git a/webmagic-core/src/main/java/us/codecraft/webmagic/selector/XsoupSelector.java b/webmagic-core/src/main/java/us/codecraft/webmagic/selector/XsoupSelector.java deleted file mode 100644 index ea46290..0000000 --- a/webmagic-core/src/main/java/us/codecraft/webmagic/selector/XsoupSelector.java 
+++ /dev/null @@ -1,32 +0,0 @@ -package us.codecraft.webmagic.selector; - -import org.jsoup.nodes.Element; -import us.codecraft.xsoup.XPathEvaluator; -import us.codecraft.xsoup.Xsoup; - -import java.util.List; - -/** - * XPath selector based on Xsoup.
- * - * @author code4crafter@gmail.com
- * @since 0.3.0 - */ -public class XsoupSelector extends BaseElementSelector { - - private XPathEvaluator xPathEvaluator; - - public XsoupSelector(String xpathStr) { - this.xPathEvaluator = Xsoup.compile(xpathStr); - } - - @Override - public String select(Element element) { - return xPathEvaluator.evaluate(element).get(); - } - - @Override - public List selectList(Element element) { - return xPathEvaluator.evaluate(element).list(); - } -} diff --git a/webmagic-core/src/main/java/us/codecraft/webmagic/selector/thread/CountableThreadPool.java b/webmagic-core/src/main/java/us/codecraft/webmagic/selector/thread/CountableThreadPool.java new file mode 100644 index 0000000..ac41668 --- /dev/null +++ b/webmagic-core/src/main/java/us/codecraft/webmagic/selector/thread/CountableThreadPool.java @@ -0,0 +1,97 @@ +package us.codecraft.webmagic.selector.thread; + +import java.util.concurrent.ExecutorService; +import java.util.concurrent.Executors; +import java.util.concurrent.atomic.AtomicInteger; +import java.util.concurrent.locks.Condition; +import java.util.concurrent.locks.ReentrantLock; + +/** + * Thread pool for workers.

+ * Use {@link java.util.concurrent.ExecutorService} as inner implement.

+ * New feature:

+ * 1. Block when thread pool is full to avoid poll many urls without process.

+ * 2. Count of thread alive for monitor. + * + * @author code4crafer@gmail.com + * @since 0.5.0 + */ +public class CountableThreadPool { + + private int threadNum; + + private AtomicInteger threadAlive = new AtomicInteger(); + + private ReentrantLock reentrantLock = new ReentrantLock(); + + private Condition condition = reentrantLock.newCondition(); + + public CountableThreadPool(int threadNum) { + this.threadNum = threadNum; + this.executorService = Executors.newFixedThreadPool(threadNum); + } + + public CountableThreadPool(int threadNum, ExecutorService executorService) { + this.threadNum = threadNum; + this.executorService = executorService; + } + + public void setExecutorService(ExecutorService executorService) { + this.executorService = executorService; + } + + public int getThreadAlive() { + return threadAlive.get(); + } + + public int getThreadNum() { + return threadNum; + } + + private ExecutorService executorService; + + public void execute(final Runnable runnable) { + + + if (threadAlive.get() >= threadNum) { + try { + reentrantLock.lock(); + while (threadAlive.get() >= threadNum) { + try { + condition.await(); + } catch (InterruptedException e) { + } + } + } finally { + reentrantLock.unlock(); + } + } + threadAlive.incrementAndGet(); + executorService.execute(new Runnable() { + @Override + public void run() { + try { + runnable.run(); + } finally { + try { + reentrantLock.lock(); + threadAlive.decrementAndGet(); + condition.signal(); + } finally { + reentrantLock.unlock(); + } + } + } + }); + } + + public boolean isShutdown() { + return executorService.isShutdown(); + } + + public void shutdown() { + executorService.shutdown(); + } + + +} diff --git a/webmagic-core/src/main/java/us/codecraft/webmagic/utils/EnvironmentUtil.java b/webmagic-core/src/main/java/us/codecraft/webmagic/utils/EnvironmentUtil.java deleted file mode 100644 index 7aa5c13..0000000 --- a/webmagic-core/src/main/java/us/codecraft/webmagic/utils/EnvironmentUtil.java +++ /dev/null @@ 
-1,28 +0,0 @@ -package us.codecraft.webmagic.utils; - -import org.apache.commons.lang3.BooleanUtils; - -import java.util.Properties; - -/** - * @author code4crafter@gmail.com - * @since 0.3.0 - */ -public abstract class EnvironmentUtil { - - private static final String USE_XSOUP = "xsoup"; - - public static boolean useXsoup() { - Properties properties = System.getProperties(); - Object o = properties.get(USE_XSOUP); - if (o == null) { - return true; - } - return BooleanUtils.toBoolean(((String) o).toLowerCase()); - } - - public static void setUseXsoup(boolean useXsoup) { - Properties properties = System.getProperties(); - properties.setProperty(USE_XSOUP, BooleanUtils.toString(useXsoup, "true", "false")); - } -} diff --git a/webmagic-core/src/main/java/us/codecraft/webmagic/utils/HttpConstant.java b/webmagic-core/src/main/java/us/codecraft/webmagic/utils/HttpConstant.java new file mode 100644 index 0000000..2a76ecc --- /dev/null +++ b/webmagic-core/src/main/java/us/codecraft/webmagic/utils/HttpConstant.java @@ -0,0 +1,35 @@ +package us.codecraft.webmagic.utils; + +/** + * Some constants of Http protocal. 
+ * @author code4crafer@gmail.com + * @since 0.5.0 + */ +public abstract class HttpConstant { + + public static abstract class Method { + + public static final String GET = "GET"; + + public static final String HEAD = "HEAD"; + + public static final String POST = "POST"; + + public static final String PUT = "PUT"; + + public static final String DELETE = "DELETE"; + + public static final String TRACE = "TRACE"; + + public static final String CONNECT = "CONNECT"; + + } + + public static abstract class Header { + + public static final String REFERER = "Referer"; + + public static final String USER_AGENT = "User-Agent"; + } + +} diff --git a/webmagic-core/src/main/java/us/codecraft/webmagic/utils/ThreadUtils.java b/webmagic-core/src/main/java/us/codecraft/webmagic/utils/ThreadUtils.java deleted file mode 100644 index cdfe6d0..0000000 --- a/webmagic-core/src/main/java/us/codecraft/webmagic/utils/ThreadUtils.java +++ /dev/null @@ -1,27 +0,0 @@ -package us.codecraft.webmagic.utils; - -import com.google.common.util.concurrent.MoreExecutors; - -import java.util.concurrent.ExecutorService; -import java.util.concurrent.SynchronousQueue; -import java.util.concurrent.ThreadPoolExecutor; -import java.util.concurrent.TimeUnit; - -/** - * @author code4crafer@gmail.com - * @since 0.1.0 - */ -public class ThreadUtils { - - public static ExecutorService newFixedThreadPool(int threadSize) { - if (threadSize <= 0) { - throw new IllegalArgumentException("ThreadSize must be greater than 0!"); - } - if (threadSize == 1) { - return MoreExecutors.sameThreadExecutor(); - - } - return new ThreadPoolExecutor(threadSize - 1, threadSize - 1, 0L, TimeUnit.MILLISECONDS, - new SynchronousQueue(), new ThreadPoolExecutor.CallerRunsPolicy()); - } -} diff --git a/webmagic-core/src/main/java/us/codecraft/webmagic/utils/UrlUtils.java b/webmagic-core/src/main/java/us/codecraft/webmagic/utils/UrlUtils.java index 456b3cc..60eacee 100644 --- 
a/webmagic-core/src/main/java/us/codecraft/webmagic/utils/UrlUtils.java +++ b/webmagic-core/src/main/java/us/codecraft/webmagic/utils/UrlUtils.java @@ -43,12 +43,22 @@ public class UrlUtils { if (url.startsWith("?")) url = base.getPath() + url; URL abs = new URL(base, url); - return abs.toExternalForm(); + return encodeIllegalCharacterInUrl(abs.toExternalForm()); } catch (MalformedURLException e) { return ""; } } + /** + * + * @param url + * @return + */ + public static String encodeIllegalCharacterInUrl(String url) { + //TODO more charator support + return url.replace(" ", "%20"); + } + public static String getHost(String url) { String host = url; int i = StringUtils.ordinalIndexOf(url, "/", 3); @@ -73,18 +83,37 @@ public class UrlUtils { return domain; } - private static Pattern patternForHref = Pattern.compile("(]*href=)[\"']{0,1}([^\"'<>\\s]*)[\"']{0,1}", Pattern.CASE_INSENSITIVE); + /** + * allow blank space in quote + */ + private static Pattern patternForHrefWithQuote = Pattern.compile("(]*href=)[\"']([^\"'<>]*)[\"']", Pattern.CASE_INSENSITIVE); + + /** + * disallow blank space without quote + */ + private static Pattern patternForHrefWithoutQuote = Pattern.compile("(]*href=)([^\"'<>\\s]+)", Pattern.CASE_INSENSITIVE); public static String fixAllRelativeHrefs(String html, String url) { + html = replaceByPattern(html, url, patternForHrefWithQuote); + html = replaceByPattern(html, url, patternForHrefWithoutQuote); + return html; + } + + public static String replaceByPattern(String html, String url, Pattern pattern) { StringBuilder stringBuilder = new StringBuilder(); - Matcher matcher = patternForHref.matcher(html); + Matcher matcher = pattern.matcher(html); int lastEnd = 0; + boolean modified = false; while (matcher.find()) { + modified = true; stringBuilder.append(StringUtils.substring(html, lastEnd, matcher.start())); stringBuilder.append(matcher.group(1)); stringBuilder.append("\"").append(canonicalizeUrl(matcher.group(2), url)).append("\""); lastEnd = 
matcher.end(); } + if (!modified) { + return html; + } stringBuilder.append(StringUtils.substring(html, lastEnd)); return stringBuilder.toString(); } diff --git a/webmagic-core/src/test/java/us/codecraft/webmagic/HtmlTest.java b/webmagic-core/src/test/java/us/codecraft/webmagic/HtmlTest.java index c900014..fa66c3a 100644 --- a/webmagic-core/src/test/java/us/codecraft/webmagic/HtmlTest.java +++ b/webmagic-core/src/test/java/us/codecraft/webmagic/HtmlTest.java @@ -1,6 +1,5 @@ package us.codecraft.webmagic; -import org.junit.Assert; import org.junit.Test; import us.codecraft.webmagic.selector.Html; @@ -14,7 +13,8 @@ public class HtmlTest { @Test public void testRegexSelector() { Html selectable = new Html("aaaaaaab"); - Assert.assertEquals("abbabbab", (selectable.regex("(.*)").replace("aa(a)", "$1bb").toString())); +// Assert.assertEquals("abbabbab", (selectable.regex("(.*)").replace("aa(a)", "$1bb").toString())); + System.out.println(selectable.regex("(.*)").replace("aa(a)", "$1bb").toString()); } diff --git a/webmagic-core/src/test/java/us/codecraft/webmagic/ResultItemsTest.java b/webmagic-core/src/test/java/us/codecraft/webmagic/ResultItemsTest.java new file mode 100644 index 0000000..0aa9e94 --- /dev/null +++ b/webmagic-core/src/test/java/us/codecraft/webmagic/ResultItemsTest.java @@ -0,0 +1,22 @@ +package us.codecraft.webmagic; + +import org.junit.Test; + + +import static org.assertj.core.api.Assertions.assertThat; + +/** + * @author code4crafter@gmail.com + */ +public class ResultItemsTest { + + @Test + public void testOrderOfEntries() throws Exception { + ResultItems resultItems = new ResultItems(); + resultItems.put("a", "a"); + resultItems.put("b", "b"); + resultItems.put("c", "c"); + assertThat(resultItems.getAll().keySet()).containsExactly("a","b","c"); + + } +} diff --git a/webmagic-core/src/test/java/us/codecraft/webmagic/SpiderTest.java b/webmagic-core/src/test/java/us/codecraft/webmagic/SpiderTest.java index 9d950ae..ba29387 100644 --- 
a/webmagic-core/src/test/java/us/codecraft/webmagic/SpiderTest.java +++ b/webmagic-core/src/test/java/us/codecraft/webmagic/SpiderTest.java @@ -37,7 +37,7 @@ public class SpiderTest { @Test public void testWaitAndNotify() throws InterruptedException { for (int i = 0; i < 10000; i++) { - System.out.println("round" + i); + System.out.println("round " + i); testRound(); } } diff --git a/webmagic-core/src/test/java/us/codecraft/webmagic/downloader/HttpClientDownloaderTest.java b/webmagic-core/src/test/java/us/codecraft/webmagic/downloader/HttpClientDownloaderTest.java index ac01926..ab84665 100644 --- a/webmagic-core/src/test/java/us/codecraft/webmagic/downloader/HttpClientDownloaderTest.java +++ b/webmagic-core/src/test/java/us/codecraft/webmagic/downloader/HttpClientDownloaderTest.java @@ -8,6 +8,8 @@ import us.codecraft.webmagic.Site; import us.codecraft.webmagic.Task; import us.codecraft.webmagic.selector.Html; +import java.io.UnsupportedEncodingException; + import static org.assertj.core.api.Assertions.assertThat; import static org.junit.Assert.assertTrue; @@ -28,10 +30,16 @@ public class HttpClientDownloaderTest { @Test public void testDownloader() { HttpClientDownloader httpClientDownloader = new HttpClientDownloader(); - Html html = httpClientDownloader.download("http://www.oschina.net"); + Html html = httpClientDownloader.download("https://github.com"); assertTrue(!html.getText().isEmpty()); } + @Test(expected = IllegalArgumentException.class) + public void testDownloaderInIllegalUrl() throws UnsupportedEncodingException { + HttpClientDownloader httpClientDownloader = new HttpClientDownloader(); + httpClientDownloader.download("http://www.oschina.net/>"); + } + @Test public void testCycleTriedTimes() { HttpClientDownloader httpClientDownloader = new HttpClientDownloader(); diff --git a/webmagic-core/src/test/java/us/codecraft/webmagic/selector/ExtractorsTest.java b/webmagic-core/src/test/java/us/codecraft/webmagic/selector/ExtractorsTest.java index 
b398007..e8da48d 100644 --- a/webmagic-core/src/test/java/us/codecraft/webmagic/selector/ExtractorsTest.java +++ b/webmagic-core/src/test/java/us/codecraft/webmagic/selector/ExtractorsTest.java @@ -29,6 +29,6 @@ public class ExtractorsTest { Assert.assertEquals("bb", and($("title"), regex("aa(bb)cc")).select(html2)); OrSelector or = or($("div h1 a", "innerHtml"), xpath("//title")); Assert.assertEquals("aabbcc", or.select(html)); - Assert.assertEquals("aabbcc", or.select(html2)); + Assert.assertEquals("aabbcc", or.select(html2)); } } diff --git a/webmagic-extension/src/test/java/us/codecraft/webmagic/selector/JsonPathSelectorTest.java b/webmagic-core/src/test/java/us/codecraft/webmagic/selector/JsonPathSelectorTest.java similarity index 100% rename from webmagic-extension/src/test/java/us/codecraft/webmagic/selector/JsonPathSelectorTest.java rename to webmagic-core/src/test/java/us/codecraft/webmagic/selector/JsonPathSelectorTest.java diff --git a/webmagic-core/src/test/java/us/codecraft/webmagic/selector/JsonTest.java b/webmagic-core/src/test/java/us/codecraft/webmagic/selector/JsonTest.java new file mode 100644 index 0000000..89afbb6 --- /dev/null +++ b/webmagic-core/src/test/java/us/codecraft/webmagic/selector/JsonTest.java @@ -0,0 +1,20 @@ +package us.codecraft.webmagic.selector; + +import org.junit.Test; + +import static org.assertj.core.api.Assertions.assertThat; + +/** + * @author code4crafter@gmai.com + * @since 0.5.0 + */ +public class JsonTest { + + private String text = "callback({\"name\":\"json\"})"; + + @Test + public void testRemovePadding() throws Exception { + String name = new Json(text).removePadding("callback").jsonPath("$.name").get(); + assertThat(name).isEqualTo("json"); + } +} diff --git a/webmagic-core/src/test/java/us/codecraft/webmagic/selector/SelectorTest.java b/webmagic-core/src/test/java/us/codecraft/webmagic/selector/SelectorTest.java new file mode 100644 index 0000000..249a837 --- /dev/null +++ 
b/webmagic-core/src/test/java/us/codecraft/webmagic/selector/SelectorTest.java @@ -0,0 +1,26 @@ +package us.codecraft.webmagic.selector; + +import org.junit.Test; + +import java.util.List; + +import static org.assertj.core.api.Assertions.assertThat; + +/** + * @author code4crafter@gmail.com + */ +public class SelectorTest { + + private String html = "
"; + + @Test + public void testChain() throws Exception { + Html selectable = new Html(html); + List<String> linksWithoutChain = selectable.links().all(); + Selectable xpath = selectable.xpath("//div"); + List<String> linksWithChainFirstCall = xpath.links().all(); + List<String> linksWithChainSecondCall = xpath.links().all(); + assertThat(linksWithoutChain).hasSameSizeAs(linksWithChainFirstCall); + assertThat(linksWithChainFirstCall).hasSameSizeAs(linksWithChainSecondCall); + } +} diff --git a/webmagic-core/src/test/java/us/codecraft/webmagic/utils/EnvironmentUtilTest.java b/webmagic-core/src/test/java/us/codecraft/webmagic/utils/EnvironmentUtilTest.java deleted file mode 100644 index cb620e7..0000000 --- a/webmagic-core/src/test/java/us/codecraft/webmagic/utils/EnvironmentUtilTest.java +++ /dev/null @@ -1,18 +0,0 @@ -package us.codecraft.webmagic.utils; - -import org.junit.Test; - -import static junit.framework.Assert.*; - -/** - * @author code4crafter@gmail.com - */ -public class EnvironmentUtilTest { - - @Test - public void test() { - assertTrue(EnvironmentUtil.useXsoup()); - EnvironmentUtil.setUseXsoup(false); - assertFalse(EnvironmentUtil.useXsoup()); - } -} diff --git a/webmagic-core/src/test/java/us/codecraft/webmagic/utils/UrlUtilsTest.java b/webmagic-core/src/test/java/us/codecraft/webmagic/utils/UrlUtilsTest.java index abe6adc..565fde4 100644 --- a/webmagic-core/src/test/java/us/codecraft/webmagic/utils/UrlUtilsTest.java +++ b/webmagic-core/src/test/java/us/codecraft/webmagic/utils/UrlUtilsTest.java @@ -3,6 +3,8 @@ package us.codecraft.webmagic.utils; import org.junit.Assert; import org.junit.Test; +import static org.assertj.core.api.Assertions.assertThat; + /** * @author code4crafter@gmail.com
* Date: 13-4-21 @@ -12,19 +14,44 @@ public class UrlUtilsTest { @Test public void testFixRelativeUrl() { - String fixrelativeurl = UrlUtils.canonicalizeUrl("aa", "http://www.dianping.com/sh/ss/com"); - System.out.println("fix: " + fixrelativeurl); - Assert.assertEquals("http://www.dianping.com/sh/ss/aa", fixrelativeurl); + String absoluteUrl = UrlUtils.canonicalizeUrl("aa", "http://www.dianping.com/sh/ss/com"); + assertThat(absoluteUrl).isEqualTo("http://www.dianping.com/sh/ss/aa"); - fixrelativeurl = UrlUtils.canonicalizeUrl("../aa", "http://www.dianping.com/sh/ss/com"); - Assert.assertEquals("http://www.dianping.com/sh/aa", fixrelativeurl); + absoluteUrl = UrlUtils.canonicalizeUrl("../aa", "http://www.dianping.com/sh/ss/com"); + assertThat(absoluteUrl).isEqualTo("http://www.dianping.com/sh/aa"); - fixrelativeurl = UrlUtils.canonicalizeUrl("..aa", "http://www.dianping.com/sh/ss/com"); - Assert.assertEquals("http://www.dianping.com/sh/ss/..aa", fixrelativeurl); - fixrelativeurl = UrlUtils.canonicalizeUrl("../../aa", "http://www.dianping.com/sh/ss/com/"); - Assert.assertEquals("http://www.dianping.com/sh/aa", fixrelativeurl); - fixrelativeurl = UrlUtils.canonicalizeUrl("../../aa", "http://www.dianping.com/sh/ss/com"); - Assert.assertEquals("http://www.dianping.com/aa", fixrelativeurl); + absoluteUrl = UrlUtils.canonicalizeUrl("..aa", "http://www.dianping.com/sh/ss/com"); + assertThat(absoluteUrl).isEqualTo("http://www.dianping.com/sh/ss/..aa"); + + absoluteUrl = UrlUtils.canonicalizeUrl("../../aa", "http://www.dianping.com/sh/ss/com/"); + assertThat(absoluteUrl).isEqualTo("http://www.dianping.com/sh/aa"); + + absoluteUrl = UrlUtils.canonicalizeUrl("../../aa", "http://www.dianping.com/sh/ss/com"); + assertThat(absoluteUrl).isEqualTo("http://www.dianping.com/aa"); + } + + @Test + public void testFixAllRelativeHrefs() { + String originHtml = ""; + String replacedHtml = UrlUtils.fixAllRelativeHrefs(originHtml, "http://www.dianping.com/"); + 
assertThat(replacedHtml).isEqualTo(""); + + originHtml = ""; + replacedHtml = UrlUtils.fixAllRelativeHrefs(originHtml, "http://www.dianping.com/"); + assertThat(replacedHtml).isEqualTo(""); + + originHtml = ""; + replacedHtml = UrlUtils.fixAllRelativeHrefs(originHtml, "http://www.dianping.com/"); + assertThat(replacedHtml).isEqualTo(""); + + originHtml = ""; + replacedHtml = UrlUtils.fixAllRelativeHrefs(originHtml, "http://www.dianping.com/"); + assertThat(replacedHtml).isEqualTo(""); + } + + @Test + public void test(){ + UrlUtils.canonicalizeUrl("start tag", "http://www.dianping.com/"); } @Test diff --git a/webmagic-core/src/test/resources/log4j.xml b/webmagic-core/src/test/resources/log4j.xml index 9084694..c2b5a2f 100644 --- a/webmagic-core/src/test/resources/log4j.xml +++ b/webmagic-core/src/test/resources/log4j.xml @@ -8,21 +8,11 @@ - - - - - - - - - - diff --git a/webmagic-extension/pom.xml b/webmagic-extension/pom.xml index a822077..5d93cdc 100644 --- a/webmagic-extension/pom.xml +++ b/webmagic-extension/pom.xml @@ -3,17 +3,13 @@ us.codecraft webmagic-parent - 0.4.3 + 0.5.0 4.0.0 webmagic-extension - - com.alibaba - fastjson - redis.clients jedis @@ -28,11 +24,6 @@ junit junit - - com.jayway.jsonpath - json-path - 0.8.1 - diff --git a/webmagic-extension/src/main/java/us/codecraft/webmagic/configurable/ConfigurablePageProcessor.java b/webmagic-extension/src/main/java/us/codecraft/webmagic/configurable/ConfigurablePageProcessor.java new file mode 100644 index 0000000..902dfdd --- /dev/null +++ b/webmagic-extension/src/main/java/us/codecraft/webmagic/configurable/ConfigurablePageProcessor.java @@ -0,0 +1,51 @@ +package us.codecraft.webmagic.configurable; + +import us.codecraft.webmagic.Page; +import us.codecraft.webmagic.Site; +import us.codecraft.webmagic.processor.PageProcessor; +import us.codecraft.webmagic.utils.Experimental; + +import java.util.List; + +/** + * @author code4crafter@gmail.com
+ */ +@Experimental +public class ConfigurablePageProcessor implements PageProcessor { + + private Site site; + + private List extractRules; + + public ConfigurablePageProcessor(Site site, List extractRules) { + this.site = site; + this.extractRules = extractRules; + } + + @Override + public void process(Page page) { + for (ExtractRule extractRule : extractRules) { + if (extractRule.isMulti()) { + List results = page.getHtml().selectDocumentForList(extractRule.getSelector()); + if (extractRule.isNotNull() && results.size() == 0) { + page.setSkip(true); + } else { + page.getResultItems().put(extractRule.getFieldName(), results); + } + } else { + String result = page.getHtml().selectDocument(extractRule.getSelector()); + if (extractRule.isNotNull() && result == null) { + page.setSkip(true); + } else { + page.getResultItems().put(extractRule.getFieldName(), result); + } + } + } + } + + @Override + public Site getSite() { + return site; + } + +} diff --git a/webmagic-extension/src/main/java/us/codecraft/webmagic/configurable/ExpressionType.java b/webmagic-extension/src/main/java/us/codecraft/webmagic/configurable/ExpressionType.java new file mode 100644 index 0000000..bd84be3 --- /dev/null +++ b/webmagic-extension/src/main/java/us/codecraft/webmagic/configurable/ExpressionType.java @@ -0,0 +1,11 @@ +package us.codecraft.webmagic.configurable; + +/** + * @author code4crafter@gmail.com + * @date 14-4-5 + */ +public enum ExpressionType { + + XPath, Regex, Css, JsonPath; + +} diff --git a/webmagic-extension/src/main/java/us/codecraft/webmagic/configurable/ExtractRule.java b/webmagic-extension/src/main/java/us/codecraft/webmagic/configurable/ExtractRule.java new file mode 100644 index 0000000..82337c4 --- /dev/null +++ b/webmagic-extension/src/main/java/us/codecraft/webmagic/configurable/ExtractRule.java @@ -0,0 +1,113 @@ +package us.codecraft.webmagic.configurable; + +import us.codecraft.webmagic.selector.JsonPathSelector; +import us.codecraft.webmagic.selector.Selector; + 
+import static us.codecraft.webmagic.selector.Selectors.*; + +/** + * @author code4crafter@gmail.com + * @date 14-4-5 + */ +public class ExtractRule { + + private String fieldName; + + private ExpressionType expressionType; + + private String expressionValue; + + private String[] expressionParams; + + private boolean multi = false; + + private volatile Selector selector; + + private boolean notNull = false; + + public String getFieldName() { + return fieldName; + } + + public void setFieldName(String fieldName) { + this.fieldName = fieldName; + } + + public ExpressionType getExpressionType() { + return expressionType; + } + + public void setExpressionType(ExpressionType expressionType) { + this.expressionType = expressionType; + } + + public String getExpressionValue() { + return expressionValue; + } + + public void setExpressionValue(String expressionValue) { + this.expressionValue = expressionValue; + } + + public String[] getExpressionParams() { + return expressionParams; + } + + public void setExpressionParams(String[] expressionParams) { + this.expressionParams = expressionParams; + } + + public boolean isMulti() { + return multi; + } + + public void setMulti(boolean multi) { + this.multi = multi; + } + + public Selector getSelector() { + if (selector == null) { + synchronized (this) { + if (selector == null) { + selector = compileSelector(); + } + } + } + return selector; + } + + private Selector compileSelector() { + switch (expressionType) { + case Css: + if (expressionParams.length >= 1) { + return $(expressionValue, expressionParams[0]); + } else { + return $(expressionValue); + } + case XPath: + return xpath(expressionValue); + case Regex: + if (expressionParams.length >= 1) { + return regex(expressionValue, Integer.parseInt(expressionParams[0])); + } else { + return regex(expressionValue); + } + case JsonPath: + return new JsonPathSelector(expressionValue); + default: + return xpath(expressionValue); + } + } + + public void setSelector(Selector 
selector) { + this.selector = selector; + } + + public boolean isNotNull() { + return notNull; + } + + public void setNotNull(boolean notNull) { + this.notNull = notNull; + } +} diff --git a/webmagic-extension/src/main/java/us/codecraft/webmagic/configurable/PropertyLoader.java b/webmagic-extension/src/main/java/us/codecraft/webmagic/configurable/PropertyLoader.java deleted file mode 100644 index bffbcf2..0000000 --- a/webmagic-extension/src/main/java/us/codecraft/webmagic/configurable/PropertyLoader.java +++ /dev/null @@ -1,18 +0,0 @@ -package us.codecraft.webmagic.configurable; - -import us.codecraft.webmagic.processor.PageProcessor; - -import java.util.Map; - -/** - * Inject property to object by {@link Inject} annotation. - * - * @author yihua.huang@dianping.com - */ -public class PropertyLoader { - - public T load(T object, Map properties) { - return object; - } - -} diff --git a/webmagic-extension/src/main/java/us/codecraft/webmagic/example/ConfigurableBlogPageProcessor.java b/webmagic-extension/src/main/java/us/codecraft/webmagic/example/ConfigurableBlogPageProcessor.java deleted file mode 100644 index 28d3ab0..0000000 --- a/webmagic-extension/src/main/java/us/codecraft/webmagic/example/ConfigurableBlogPageProcessor.java +++ /dev/null @@ -1,51 +0,0 @@ -package us.codecraft.webmagic.example; - -import java.util.List; -import us.codecraft.webmagic.Page; -import us.codecraft.webmagic.Site; -import us.codecraft.webmagic.Spider; -import us.codecraft.webmagic.configurable.Inject; -import us.codecraft.webmagic.processor.PageProcessor; - -/** - * @author code4crafter@gmail.com
- */ -public class ConfigurableBlogPageProcessor implements PageProcessor { - - private Site site = Site.me().setDomain("my.oschina.net"); - - @Inject("linkRegex") - private String linkRegex; - - @Inject("titleXpath") - private String titleXpath; - - @Inject("contentXpath") - private String contentXpath; - - @Inject("tagsXpath") - private String tagsXpath; - - @Override - public void process(Page page) { - List links = page.getHtml().links().regex(linkRegex).all(); - page.addTargetRequests(links); - page.putField("title", page.getHtml().xpath(titleXpath).toString()); - if (page.getResultItems().get("title") == null) { - //skip this page - page.setSkip(true); - } - page.putField("content", page.getHtml().smartContent().toString()); - page.putField("tags", page.getHtml().xpath(tagsXpath).all()); - } - - @Override - public Site getSite() { - return site; - - } - - public static void main(String[] args) { - Spider.create(new ConfigurableBlogPageProcessor()).addUrl("http://my.oschina.net/flashsword/blog").thread(2).run(); - } -} diff --git a/webmagic-extension/src/main/java/us/codecraft/webmagic/example/MonitorExample.java b/webmagic-extension/src/main/java/us/codecraft/webmagic/example/MonitorExample.java new file mode 100644 index 0000000..734f042 --- /dev/null +++ b/webmagic-extension/src/main/java/us/codecraft/webmagic/example/MonitorExample.java @@ -0,0 +1,26 @@ +package us.codecraft.webmagic.example; + +import us.codecraft.webmagic.Spider; +import us.codecraft.webmagic.monitor.SpiderMonitor; +import us.codecraft.webmagic.processor.example.GithubRepoPageProcessor; +import us.codecraft.webmagic.processor.example.OschinaBlogPageProcessor; + +/** + * @author code4crafer@gmail.com + * @since 0.5.0 + */ +public class MonitorExample { + + public static void main(String[] args) throws Exception { + + Spider oschinaSpider = Spider.create(new OschinaBlogPageProcessor()) + .addUrl("http://my.oschina.net/flashsword/blog"); + Spider githubSpider = Spider.create(new 
GithubRepoPageProcessor()) + .addUrl("https://github.com/code4craft"); + + SpiderMonitor.instance().register(oschinaSpider); + SpiderMonitor.instance().register(githubSpider); + oschinaSpider.start(); + githubSpider.start(); + } +} diff --git a/webmagic-extension/src/main/java/us/codecraft/webmagic/example/OschinaBlog.java b/webmagic-extension/src/main/java/us/codecraft/webmagic/example/OschinaBlog.java index e8ac20c..b527ea7 100644 --- a/webmagic-extension/src/main/java/us/codecraft/webmagic/example/OschinaBlog.java +++ b/webmagic-extension/src/main/java/us/codecraft/webmagic/example/OschinaBlog.java @@ -26,11 +26,11 @@ public class OschinaBlog { @ExtractBy(value = "//div[@class='BlogTags']/a/text()", multi = true) private List tags; - @Formatter("yyyy-MM-dd HH:mm") @ExtractBy("//div[@class='BlogStat']/regex('\\d+-\\d+-\\d+\\s+\\d+:\\d+')") private Date date; public static void main(String[] args) { + //results will be saved to "/data/webmagic/" in json format OOSpider.create(Site.me(), new JsonFilePageModelPipeline("/data/webmagic/"), OschinaBlog.class) .addUrl("http://my.oschina.net/flashsword/blog").run(); } diff --git a/webmagic-extension/src/main/java/us/codecraft/webmagic/example/PatternProcessorExample.java b/webmagic-extension/src/main/java/us/codecraft/webmagic/example/PatternProcessorExample.java new file mode 100644 index 0000000..8ecb08f --- /dev/null +++ b/webmagic-extension/src/main/java/us/codecraft/webmagic/example/PatternProcessorExample.java @@ -0,0 +1,66 @@ +package us.codecraft.webmagic.example; + +import org.apache.log4j.Logger; +import us.codecraft.webmagic.*; +import us.codecraft.webmagic.handler.CompositePageProcessor; +import us.codecraft.webmagic.handler.CompositePipeline; +import us.codecraft.webmagic.handler.PatternProcessor; +import us.codecraft.webmagic.handler.RequestMatcher; + +/** + * Created with IntelliJ IDEA. 
+ * User: Sebastian MA + * Date: April 04, 2014 + * Time: 21:23 + */ +public class PatternProcessorExample { + + private static Logger log = Logger.getLogger(PatternProcessorExample.class); + + public static void main(String... args) { + + // define a PatternProcessor which handles only GitHub repository pages, e.g. "https://github.com/{user}/{repo}" + PatternProcessor githubRepoProcessor = new PatternProcessor("https://github\\.com/[\\w\\-]+/[\\w\\-]+") { + + @Override + public RequestMatcher.MatchOther processPage(Page page) { + page.putField("reponame", page.getHtml().xpath("//h1[@class='entry-title public']/strong/a/text()").toString()); + return RequestMatcher.MatchOther.YES; + } + + @Override + public RequestMatcher.MatchOther processResult(ResultItems resultItems, Task task) { + log.info("Extracting from repo " + resultItems.getRequest()); + System.out.println("Repo name: "+resultItems.get("reponame")); + return RequestMatcher.MatchOther.YES; + } + }; + + PatternProcessor githubUserProcessor = new PatternProcessor("https://github\\.com/[\\w\\-]+") { + + @Override + public RequestMatcher.MatchOther processPage(Page page) { + log.info("Extracting from " + page.getUrl()); + page.addTargetRequests(page.getHtml().links().regex("https://github\\.com/[\\w\\-]+/[\\w\\-]+").all()); + page.addTargetRequests(page.getHtml().links().regex("https://github\\.com/[\\w\\-]+").all()); + page.putField("username", page.getHtml().xpath("//span[@class='vcard-fullname']/text()").toString()); + return RequestMatcher.MatchOther.YES; + } + + @Override + public RequestMatcher.MatchOther processResult(ResultItems resultItems, Task task) { + System.out.println("User name: "+resultItems.get("username")); + return RequestMatcher.MatchOther.YES; + } + }; + + CompositePageProcessor pageProcessor = new CompositePageProcessor(Site.me().setDomain("github.com").setRetryTimes(3)); + CompositePipeline pipeline = new CompositePipeline(); + + pageProcessor.setSubPageProcessors(githubRepoProcessor, githubUserProcessor); +
pipeline.setSubPipeline(githubRepoProcessor, githubUserProcessor); + + Spider.create(pageProcessor).addUrl("https://github.com/code4craft").thread(5).addPipeline(pipeline).runAsync(); + } + +} diff --git a/webmagic-extension/src/main/java/us/codecraft/webmagic/handler/CompositePageProcessor.java b/webmagic-extension/src/main/java/us/codecraft/webmagic/handler/CompositePageProcessor.java new file mode 100644 index 0000000..2073445 --- /dev/null +++ b/webmagic-extension/src/main/java/us/codecraft/webmagic/handler/CompositePageProcessor.java @@ -0,0 +1,58 @@ +package us.codecraft.webmagic.handler; + +import us.codecraft.webmagic.Page; +import us.codecraft.webmagic.Site; +import us.codecraft.webmagic.processor.PageProcessor; + +import java.util.ArrayList; +import java.util.List; + +/** + * @author code4crafter@gmail.com + * @date 14-4-5 + */ +public class CompositePageProcessor implements PageProcessor { + + private Site site; + + private List subPageProcessors = new ArrayList(); + + public CompositePageProcessor(Site site) { + this.site = site; + } + + @Override + public void process(Page page) { + for (SubPageProcessor subPageProcessor : subPageProcessors) { + if (subPageProcessor.match(page.getRequest())) { + SubPageProcessor.MatchOther matchOtherProcessorProcessor = subPageProcessor.processPage(page); + if (matchOtherProcessorProcessor == null || matchOtherProcessorProcessor != SubPageProcessor.MatchOther.YES) { + return; + } + } + } + } + + public CompositePageProcessor setSite(Site site) { + this.site = site; + return this; + } + + public CompositePageProcessor addSubPageProcessor(SubPageProcessor subPageProcessor) { + this.subPageProcessors.add(subPageProcessor); + return this; + } + + public CompositePageProcessor setSubPageProcessors(SubPageProcessor... 
subPageProcessors) { + this.subPageProcessors = new ArrayList(); + for (SubPageProcessor subPageProcessor : subPageProcessors) { + this.subPageProcessors.add(subPageProcessor); + } + return this; + } + + @Override + public Site getSite() { + return site; + } +} diff --git a/webmagic-extension/src/main/java/us/codecraft/webmagic/handler/CompositePipeline.java b/webmagic-extension/src/main/java/us/codecraft/webmagic/handler/CompositePipeline.java new file mode 100644 index 0000000..3f09eee --- /dev/null +++ b/webmagic-extension/src/main/java/us/codecraft/webmagic/handler/CompositePipeline.java @@ -0,0 +1,42 @@ +package us.codecraft.webmagic.handler; + +import us.codecraft.webmagic.ResultItems; +import us.codecraft.webmagic.Task; +import us.codecraft.webmagic.pipeline.Pipeline; + +import java.util.ArrayList; +import java.util.List; + +/** + * @author code4crafer@gmail.com + */ +public class CompositePipeline implements Pipeline { + + private List subPipelines = new ArrayList(); + + @Override + public void process(ResultItems resultItems, Task task) { + for (SubPipeline subPipeline : subPipelines) { + if (subPipeline.match(resultItems.getRequest())) { + RequestMatcher.MatchOther matchOtherProcessorProcessor = subPipeline.processResult(resultItems, task); + if (matchOtherProcessorProcessor == null || matchOtherProcessorProcessor != RequestMatcher.MatchOther.YES) { + return; + } + } + } + } + + public CompositePipeline addSubPipeline(SubPipeline subPipeline) { + this.subPipelines.add(subPipeline); + return this; + } + + public CompositePipeline setSubPipeline(SubPipeline... 
subPipelines) { + this.subPipelines = new ArrayList(); + for (SubPipeline subPipeline : subPipelines) { + this.subPipelines.add(subPipeline); + } + return this; + } + +} diff --git a/webmagic-extension/src/main/java/us/codecraft/webmagic/handler/PatternProcessor.java b/webmagic-extension/src/main/java/us/codecraft/webmagic/handler/PatternProcessor.java new file mode 100644 index 0000000..f9ef286 --- /dev/null +++ b/webmagic-extension/src/main/java/us/codecraft/webmagic/handler/PatternProcessor.java @@ -0,0 +1,13 @@ +package us.codecraft.webmagic.handler; + +/** + * @author code4crafer@gmail.com + */ +public abstract class PatternProcessor extends PatternRequestMatcher implements SubPipeline, SubPageProcessor { + /** + * @param pattern url pattern to handle + */ + public PatternProcessor(String pattern) { + super(pattern); + } +} diff --git a/webmagic-extension/src/main/java/us/codecraft/webmagic/handler/PatternRequestMatcher.java b/webmagic-extension/src/main/java/us/codecraft/webmagic/handler/PatternRequestMatcher.java new file mode 100644 index 0000000..9201a4c --- /dev/null +++ b/webmagic-extension/src/main/java/us/codecraft/webmagic/handler/PatternRequestMatcher.java @@ -0,0 +1,37 @@ +package us.codecraft.webmagic.handler; + +import us.codecraft.webmagic.Request; + +import java.util.regex.Pattern; + +/** + * Created with IntelliJ IDEA. + * User: Sebastian MA + * Date: April 03, 2014 + * Time: 10:00 + *

+ * A PatternHandler is in charge of both page extraction and data processing by implementing + * its two abstract methods. + */ +public abstract class PatternRequestMatcher implements RequestMatcher { + + /** + * match pattern. only matched page should be handled. + */ + protected String pattern; + + private Pattern patternCompiled; + + /** + * @param pattern url pattern to handle + */ + public PatternRequestMatcher(String pattern) { + this.pattern = pattern; + this.patternCompiled = Pattern.compile(pattern); + } + + @Override + public boolean match(Request request) { + return patternCompiled.matcher(request.getUrl()).matches(); + } +} diff --git a/webmagic-extension/src/main/java/us/codecraft/webmagic/handler/RequestMatcher.java b/webmagic-extension/src/main/java/us/codecraft/webmagic/handler/RequestMatcher.java new file mode 100644 index 0000000..31b9a78 --- /dev/null +++ b/webmagic-extension/src/main/java/us/codecraft/webmagic/handler/RequestMatcher.java @@ -0,0 +1,24 @@ +package us.codecraft.webmagic.handler; + +import us.codecraft.webmagic.Request; + +/** + * @author code4crafer@gmail.com + * @since 0.5.0 + */ +public interface RequestMatcher { + + /** + * Check whether to process the page.

+ * Please DO NOT change page status in this method. + * + * @param page + * + * @return + */ + public boolean match(Request page); + + public enum MatchOther { + YES, NO + } +} diff --git a/webmagic-extension/src/main/java/us/codecraft/webmagic/handler/SubPageProcessor.java b/webmagic-extension/src/main/java/us/codecraft/webmagic/handler/SubPageProcessor.java new file mode 100644 index 0000000..1b6e283 --- /dev/null +++ b/webmagic-extension/src/main/java/us/codecraft/webmagic/handler/SubPageProcessor.java @@ -0,0 +1,20 @@ +package us.codecraft.webmagic.handler; + +import us.codecraft.webmagic.Page; + +/** + * @author code4crafter@gmail.com + * @date 14-4-5 + */ +public interface SubPageProcessor extends RequestMatcher { + + /** + * process the page, extract urls to fetch, extract the data and store + * + * @param page + * + * @return whether continue to match + */ + public MatchOther processPage(Page page); + +} diff --git a/webmagic-extension/src/main/java/us/codecraft/webmagic/handler/SubPipeline.java b/webmagic-extension/src/main/java/us/codecraft/webmagic/handler/SubPipeline.java new file mode 100644 index 0000000..4045608 --- /dev/null +++ b/webmagic-extension/src/main/java/us/codecraft/webmagic/handler/SubPipeline.java @@ -0,0 +1,21 @@ +package us.codecraft.webmagic.handler; + +import us.codecraft.webmagic.ResultItems; +import us.codecraft.webmagic.Task; + +/** + * @author code4crafer@gmail.com + * @since 0.5.0 + */ +public interface SubPipeline extends RequestMatcher { + + /** + * process the page, extract urls to fetch, extract the data and store + * + * @param page + * @param task + * @return whether continue to match + */ + public MatchOther processResult(ResultItems resultItems, Task task); + +} diff --git a/webmagic-extension/src/main/java/us/codecraft/webmagic/model/ModelPageProcessor.java b/webmagic-extension/src/main/java/us/codecraft/webmagic/model/ModelPageProcessor.java index 8a40dae..6bfe88d 100644 --- 
a/webmagic-extension/src/main/java/us/codecraft/webmagic/model/ModelPageProcessor.java +++ b/webmagic-extension/src/main/java/us/codecraft/webmagic/model/ModelPageProcessor.java @@ -7,9 +7,7 @@ import us.codecraft.webmagic.processor.PageProcessor; import us.codecraft.webmagic.selector.Selector; import java.util.ArrayList; -import java.util.HashSet; import java.util.List; -import java.util.Set; import java.util.regex.Matcher; import java.util.regex.Pattern; @@ -25,8 +23,6 @@ class ModelPageProcessor implements PageProcessor { private Site site; - private Set targetUrlPatterns = new HashSet(); - public static ModelPageProcessor create(Site site, Class... clazzs) { ModelPageProcessor modelPageProcessor = new ModelPageProcessor(site); for (Class clazz : clazzs) { @@ -38,8 +34,6 @@ class ModelPageProcessor implements PageProcessor { public ModelPageProcessor addPageModel(Class clazz) { PageModelExtractor pageModelExtractor = PageModelExtractor.create(clazz); - targetUrlPatterns.addAll(pageModelExtractor.getTargetUrlPatterns()); - targetUrlPatterns.addAll(pageModelExtractor.getHelpUrlPatterns()); pageModelExtractorList.add(pageModelExtractor); return this; } @@ -55,11 +49,14 @@ class ModelPageProcessor implements PageProcessor { extractLinks(page, pageModelExtractor.getTargetUrlRegionSelector(), pageModelExtractor.getTargetUrlPatterns()); Object process = pageModelExtractor.process(page); if (process == null || (process instanceof List && ((List) process).size() == 0)) { - page.getResultItems().setSkip(true); + continue; } postProcessPageModel(pageModelExtractor.getClazz(), process); page.putField(pageModelExtractor.getClazz().getCanonicalName(), process); } + if (page.getResultItems().getAll().size() == 0) { + page.getResultItems().setSkip(true); + } } private void extractLinks(Page page, Selector urlRegionSelector, List urlPatterns) { @@ -67,7 +64,7 @@ class ModelPageProcessor implements PageProcessor { if (urlRegionSelector == null) { links = 
page.getHtml().links().all(); } else { - links = urlRegionSelector.selectList(page.getHtml().toString()); + links = page.getHtml().selectList(urlRegionSelector).links().all(); } for (String link : links) { for (Pattern targetUrlPattern : urlPatterns) { diff --git a/webmagic-extension/src/main/java/us/codecraft/webmagic/model/PageModelExtractor.java b/webmagic-extension/src/main/java/us/codecraft/webmagic/model/PageModelExtractor.java index 5e4da11..9816c71 100644 --- a/webmagic-extension/src/main/java/us/codecraft/webmagic/model/PageModelExtractor.java +++ b/webmagic-extension/src/main/java/us/codecraft/webmagic/model/PageModelExtractor.java @@ -9,6 +9,7 @@ import us.codecraft.webmagic.model.formatter.BasicTypeFormatter; import us.codecraft.webmagic.model.formatter.ObjectFormatter; import us.codecraft.webmagic.model.formatter.ObjectFormatters; import us.codecraft.webmagic.selector.*; +import us.codecraft.webmagic.utils.ClassUtils; import us.codecraft.webmagic.utils.ExtractorUtils; import java.lang.annotation.Annotation; @@ -53,7 +54,7 @@ class PageModelExtractor { this.clazz = clazz; initClassExtractors(); fieldExtractors = new ArrayList(); - for (Field field : clazz.getDeclaredFields()) { + for (Field field : ClassUtils.getFieldsIncludeSuperClass(clazz)) { field.setAccessible(true); FieldExtractor fieldExtractor = getAnnotationExtractBy(clazz, field); FieldExtractor fieldExtractorTmp = getAnnotationExtractCombo(clazz, field); @@ -76,9 +77,21 @@ class PageModelExtractor { } private void checkFormat(Field field, FieldExtractor fieldExtractor) { + //check custom formatter + Formatter formatter = field.getAnnotation(Formatter.class); + if (formatter != null && !formatter.formatter().equals(ObjectFormatter.class)) { + if (formatter != null) { + if (!formatter.formatter().equals(ObjectFormatter.class)) { + ObjectFormatter objectFormatter = initFormatter(formatter.formatter()); + objectFormatter.initParam(formatter.value()); + 
fieldExtractor.setObjectFormatter(objectFormatter); + return; + } + } + } if (!fieldExtractor.isMulti() && !String.class.isAssignableFrom(field.getType())) { Class fieldClazz = BasicTypeFormatter.detectBasicClass(field.getType()); - ObjectFormatter objectFormatter = getObjectFormatter(field, fieldClazz); + ObjectFormatter objectFormatter = getObjectFormatter(field, fieldClazz, formatter); if (objectFormatter == null) { throw new IllegalStateException("Can't find formatter for field " + field.getName() + " of type " + fieldClazz); } else { @@ -88,10 +101,9 @@ class PageModelExtractor { if (!List.class.isAssignableFrom(field.getType())) { throw new IllegalStateException("Field " + field.getName() + " must be list"); } - Formatter formatter = field.getAnnotation(Formatter.class); if (formatter != null) { if (!formatter.subClazz().equals(Void.class)) { - ObjectFormatter objectFormatter = getObjectFormatter(field, formatter.subClazz()); + ObjectFormatter objectFormatter = getObjectFormatter(field, formatter.subClazz(), formatter); if (objectFormatter == null) { throw new IllegalStateException("Can't find formatter for field " + field.getName() + " of type " + formatter.subClazz()); } else { @@ -102,14 +114,7 @@ class PageModelExtractor { } } - private ObjectFormatter getObjectFormatter(Field field, Class fieldClazz) { - Formatter formatter = field.getAnnotation(Formatter.class); - if (formatter != null) { - if (!formatter.formatter().equals(ObjectFormatter.class)) { - ObjectFormatter objectFormatter = initFormatter(formatter.formatter()); - objectFormatter.initParam(formatter.value()); - } - } + private ObjectFormatter getObjectFormatter(Field field, Class fieldClazz, Formatter formatter) { return initFormatter(ObjectFormatters.get(fieldClazz)); } @@ -340,9 +345,7 @@ class PageModelExtractor { private Object convert(String value, ObjectFormatter objectFormatter) { try { Object format = objectFormatter.format(value); - if (logger.isDebugEnabled()) { - 
logger.debug("String " + value + " is converted to " + format); - } + logger.debug("String {} is converted to {}", value, format); return format; } catch (Exception e) { logger.error("convert " + value + " to " + objectFormatter.clazz() + " error!", e); diff --git a/webmagic-extension/src/main/java/us/codecraft/webmagic/model/annotation/Formatter.java b/webmagic-extension/src/main/java/us/codecraft/webmagic/model/annotation/Formatter.java index e603c59..a3a56f8 100644 --- a/webmagic-extension/src/main/java/us/codecraft/webmagic/model/annotation/Formatter.java +++ b/webmagic-extension/src/main/java/us/codecraft/webmagic/model/annotation/Formatter.java @@ -21,7 +21,7 @@ public @interface Formatter { * * @return formatter params */ - String[] value(); + String[] value() default ""; /** * Specific the class of field of class of elements in collection for field.
diff --git a/webmagic-extension/src/main/java/us/codecraft/webmagic/model/formatter/DateFormatter.java b/webmagic-extension/src/main/java/us/codecraft/webmagic/model/formatter/DateFormatter.java index b0f6e77..6305d7b 100644 --- a/webmagic-extension/src/main/java/us/codecraft/webmagic/model/formatter/DateFormatter.java +++ b/webmagic-extension/src/main/java/us/codecraft/webmagic/model/formatter/DateFormatter.java @@ -10,7 +10,8 @@ import java.util.Date; */ public class DateFormatter implements ObjectFormatter { - private String[] datePatterns = new String[]{"yyyy-MM-dd HH:mm"}; + public static final String[] DEFAULT_PATTERN = new String[]{"yyyy-MM-dd HH:mm"}; + private String[] datePatterns = DEFAULT_PATTERN; @Override public Date format(String raw) throws Exception { @@ -24,6 +25,8 @@ public class DateFormatter implements ObjectFormatter { @Override public void initParam(String[] extra) { - datePatterns = extra; + if (extra != null && !(extra.length == 1 && extra[0].length() == 0)) { + datePatterns = extra; + } } } diff --git a/webmagic-extension/src/main/java/us/codecraft/webmagic/monitor/SpiderMonitor.java b/webmagic-extension/src/main/java/us/codecraft/webmagic/monitor/SpiderMonitor.java new file mode 100644 index 0000000..a870c1d --- /dev/null +++ b/webmagic-extension/src/main/java/us/codecraft/webmagic/monitor/SpiderMonitor.java @@ -0,0 +1,110 @@ +package us.codecraft.webmagic.monitor; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; +import us.codecraft.webmagic.Request; +import us.codecraft.webmagic.Spider; +import us.codecraft.webmagic.SpiderListener; +import us.codecraft.webmagic.utils.Experimental; + +import javax.management.*; +import java.lang.management.ManagementFactory; +import java.util.ArrayList; +import java.util.Collections; +import java.util.List; +import java.util.concurrent.atomic.AtomicBoolean; +import java.util.concurrent.atomic.AtomicInteger; + +/** + * @author code4crafer@gmail.com + * @since 0.5.0 + */ +@Experimental +public 
class SpiderMonitor { + + private static SpiderMonitor INSTANCE = new SpiderMonitor(); + + private AtomicBoolean started = new AtomicBoolean(false); + + private Logger logger = LoggerFactory.getLogger(getClass()); + + private MBeanServer mbeanServer; + + private String jmxServerName; + + private List spiderStatuses = new ArrayList(); + + protected SpiderMonitor() { + jmxServerName = "WebMagic"; + mbeanServer = ManagementFactory.getPlatformMBeanServer(); + } + + /** + * Register spider for monitor. + * + * @param spiders + * @return + */ + public synchronized SpiderMonitor register(Spider... spiders) throws JMException { + for (Spider spider : spiders) { + MonitorSpiderListener monitorSpiderListener = new MonitorSpiderListener(); + if (spider.getSpiderListeners() == null) { + List spiderListeners = new ArrayList(); + spiderListeners.add(monitorSpiderListener); + spider.setSpiderListeners(spiderListeners); + } else { + spider.getSpiderListeners().add(monitorSpiderListener); + } + SpiderStatusMXBean spiderStatusMBean = getSpiderStatusMBean(spider, monitorSpiderListener); + registerMBean(spiderStatusMBean); + spiderStatuses.add(spiderStatusMBean); + } + return this; + } + + protected SpiderStatusMXBean getSpiderStatusMBean(Spider spider, MonitorSpiderListener monitorSpiderListener) { + return new SpiderStatus(spider, monitorSpiderListener); + } + + public static SpiderMonitor instance() { + return INSTANCE; + } + + public class MonitorSpiderListener implements SpiderListener { + + private final AtomicInteger successCount = new AtomicInteger(0); + + private final AtomicInteger errorCount = new AtomicInteger(0); + + private List errorUrls = Collections.synchronizedList(new ArrayList()); + + @Override + public void onSuccess(Request request) { + successCount.incrementAndGet(); + } + + @Override + public void onError(Request request) { + errorUrls.add(request.getUrl()); + errorCount.incrementAndGet(); + } + + public AtomicInteger getSuccessCount() { + return successCount; 
+ } + + public AtomicInteger getErrorCount() { + return errorCount; + } + + public List getErrorUrls() { + return errorUrls; + } + } + + protected void registerMBean(SpiderStatusMXBean spiderStatus) throws MalformedObjectNameException, InstanceAlreadyExistsException, MBeanRegistrationException, NotCompliantMBeanException { + ObjectName objName = new ObjectName(jmxServerName + ":name=" + spiderStatus.getName()); + mbeanServer.registerMBean(spiderStatus, objName); + } + +} diff --git a/webmagic-extension/src/main/java/us/codecraft/webmagic/monitor/SpiderStatus.java b/webmagic-extension/src/main/java/us/codecraft/webmagic/monitor/SpiderStatus.java new file mode 100644 index 0000000..a87c040 --- /dev/null +++ b/webmagic-extension/src/main/java/us/codecraft/webmagic/monitor/SpiderStatus.java @@ -0,0 +1,91 @@ +package us.codecraft.webmagic.monitor; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; +import us.codecraft.webmagic.Spider; +import us.codecraft.webmagic.scheduler.MonitorableScheduler; + +import java.util.Date; +import java.util.List; + +/** + * @author code4crafer@gmail.com + * @since 0.5.0 + */ +public class SpiderStatus implements SpiderStatusMXBean { + + protected final Spider spider; + + protected Logger logger = LoggerFactory.getLogger(getClass()); + + protected final SpiderMonitor.MonitorSpiderListener monitorSpiderListener; + + public SpiderStatus(Spider spider, SpiderMonitor.MonitorSpiderListener monitorSpiderListener) { + this.spider = spider; + this.monitorSpiderListener = monitorSpiderListener; + } + + public String getName() { + return spider.getUUID(); + } + + public int getLeftPageCount() { + if (spider.getScheduler() instanceof MonitorableScheduler) { + return ((MonitorableScheduler) spider.getScheduler()).getLeftRequestsCount(spider); + } + logger.warn("Get leftPageCount fail, try to use a Scheduler implement MonitorableScheduler for monitor count!"); + return -1; + } + + public int getTotalPageCount() { + if (spider.getScheduler() 
instanceof MonitorableScheduler) { + return ((MonitorableScheduler) spider.getScheduler()).getTotalRequestsCount(spider); + } + logger.warn("Get totalPageCount fail, try to use a Scheduler implement MonitorableScheduler for monitor count!"); + return -1; + } + + @Override + public int getSuccessPageCount() { + return monitorSpiderListener.getSuccessCount().get(); + } + + @Override + public int getErrorPageCount() { + return monitorSpiderListener.getErrorCount().get(); + } + + public List getErrorPages() { + return monitorSpiderListener.getErrorUrls(); + } + + @Override + public String getStatus() { + return spider.getStatus().name(); + } + + @Override + public int getThread() { + return spider.getThreadAlive(); + } + + public void start() { + spider.start(); + } + + public void stop() { + spider.stop(); + } + + @Override + public Date getStartTime() { + return spider.getStartTime(); + } + + @Override + public int getPagePerSecond() { + int runSeconds = (int) ((System.currentTimeMillis() - getStartTime().getTime()) / 1000); + return runSeconds == 0 ? 0 : getSuccessPageCount() / runSeconds; + } + +} diff --git a/webmagic-extension/src/main/java/us/codecraft/webmagic/monitor/SpiderStatusMXBean.java b/webmagic-extension/src/main/java/us/codecraft/webmagic/monitor/SpiderStatusMXBean.java new file mode 100644 index 0000000..e49ff8f --- /dev/null +++ b/webmagic-extension/src/main/java/us/codecraft/webmagic/monitor/SpiderStatusMXBean.java @@ -0,0 +1,35 @@ +package us.codecraft.webmagic.monitor; + +import java.util.Date; +import java.util.List; + +/** + * @author code4crafer@gmail.com + * @since 0.5.0 + */ +public interface SpiderStatusMXBean { + + public String getName(); + + public String getStatus(); + + public int getThread(); + + public int getTotalPageCount(); + + public int getLeftPageCount(); + + public int getSuccessPageCount(); + + public int getErrorPageCount(); + + public List getErrorPages(); + + public void start(); + + public void stop(); + + public Date getStartTime(); + + public 
int getPagePerSecond(); +} diff --git a/webmagic-extension/src/main/java/us/codecraft/webmagic/scheduler/FileCacheQueueScheduler.java b/webmagic-extension/src/main/java/us/codecraft/webmagic/scheduler/FileCacheQueueScheduler.java index 38e8a79..4215ab8 100644 --- a/webmagic-extension/src/main/java/us/codecraft/webmagic/scheduler/FileCacheQueueScheduler.java +++ b/webmagic-extension/src/main/java/us/codecraft/webmagic/scheduler/FileCacheQueueScheduler.java @@ -23,7 +23,7 @@ import java.util.concurrent.atomic.AtomicInteger; * @author code4crafter@gmail.com
* @since 0.2.0 */ -public class FileCacheQueueScheduler implements Scheduler { +public class FileCacheQueueScheduler extends LocalDuplicatedRemoveScheduler { private Logger logger = LoggerFactory.getLogger(getClass()); @@ -145,18 +145,12 @@ public class FileCacheQueueScheduler implements Scheduler { } @Override - public synchronized void push(Request request, Task task) { + protected void pushWhenNoDuplicate(Request request, Task task) { if (!inited.get()) { init(task); } - if (logger.isDebugEnabled()) { - logger.debug("push to queue " + request.getUrl()); - } - if (urls.add(request.getUrl())) { - queue.add(request); - fileUrlWriter.println(request.getUrl()); - } - + queue.add(request); + fileUrlWriter.println(request.getUrl()); } @Override @@ -167,4 +161,9 @@ public class FileCacheQueueScheduler implements Scheduler { fileCursorWriter.println(cursor.incrementAndGet()); return queue.poll(); } + + @Override + public int getLeftRequestsCount(Task task) { + return queue.size(); + } } diff --git a/webmagic-extension/src/main/java/us/codecraft/webmagic/scheduler/RedisScheduler.java b/webmagic-extension/src/main/java/us/codecraft/webmagic/scheduler/RedisScheduler.java index cd90625..dc2ee2e 100644 --- a/webmagic-extension/src/main/java/us/codecraft/webmagic/scheduler/RedisScheduler.java +++ b/webmagic-extension/src/main/java/us/codecraft/webmagic/scheduler/RedisScheduler.java @@ -14,7 +14,7 @@ import us.codecraft.webmagic.Task; * @author code4crafter@gmail.com
* @since 0.2.0 */ -public class RedisScheduler implements Scheduler { +public class RedisScheduler extends DuplicatedRemoveScheduler implements MonitorableScheduler { private JedisPool pool; @@ -33,21 +33,39 @@ public class RedisScheduler implements Scheduler { } @Override - public synchronized void push(Request request, Task task) { + public void resetDuplicateCheck(Task task) { Jedis jedis = pool.getResource(); try { - // if cycleRetriedTimes is set, allow duplicated. - Object cycleRetriedTimes = request.getExtra(Request.CYCLE_TRIED_TIMES); - // use set to remove duplicate url - if (cycleRetriedTimes != null || !jedis.sismember(SET_PREFIX + task.getUUID(), request.getUrl())) { - // use list to store queue - jedis.rpush(QUEUE_PREFIX + task.getUUID(), request.getUrl()); - jedis.sadd(SET_PREFIX + task.getUUID(), request.getUrl()); - if (request.getExtras() != null) { - String field = DigestUtils.shaHex(request.getUrl()); - String value = JSON.toJSONString(request); - jedis.hset((ITEM_PREFIX + task.getUUID()), field, value); - } + jedis.del(getSetKey(task)); + } finally { + pool.returnResource(jedis); + } + } + + @Override + protected boolean isDuplicate(Request request, Task task) { + Jedis jedis = pool.getResource(); + try { + boolean isDuplicate = jedis.sismember(getSetKey(task), request.getUrl()); + if (!isDuplicate) { + jedis.sadd(getSetKey(task), request.getUrl()); + } + return isDuplicate; + } finally { + pool.returnResource(jedis); + } + + } + + @Override + protected void pushWhenNoDuplicate(Request request, Task task) { + Jedis jedis = pool.getResource(); + try { + jedis.rpush(getQueueKey(task), request.getUrl()); + if (request.getExtras() != null) { + String field = DigestUtils.shaHex(request.getUrl()); + String value = JSON.toJSONString(request); + jedis.hset((ITEM_PREFIX + task.getUUID()), field, value); } } finally { pool.returnResource(jedis); @@ -58,7 +76,7 @@ public class RedisScheduler implements Scheduler { public synchronized Request poll(Task 
task) { Jedis jedis = pool.getResource(); try { - String url = jedis.lpop(QUEUE_PREFIX + task.getUUID()); + String url = jedis.lpop(getQueueKey(task)); if (url == null) { return null; } @@ -75,4 +93,34 @@ public class RedisScheduler implements Scheduler { pool.returnResource(jedis); } } + + protected String getSetKey(Task task) { + return SET_PREFIX + task.getUUID(); + } + + protected String getQueueKey(Task task) { + return QUEUE_PREFIX + task.getUUID(); + } + + @Override + public int getLeftRequestsCount(Task task) { + Jedis jedis = pool.getResource(); + try { + Long size = jedis.llen(getQueueKey(task)); + return size.intValue(); + } finally { + pool.returnResource(jedis); + } + } + + @Override + public int getTotalRequestsCount(Task task) { + Jedis jedis = pool.getResource(); + try { + Long size = jedis.scard(getSetKey(task)); + return size.intValue(); + } finally { + pool.returnResource(jedis); + } + } } diff --git a/webmagic-extension/src/main/java/us/codecraft/webmagic/utils/ClassUtils.java b/webmagic-extension/src/main/java/us/codecraft/webmagic/utils/ClassUtils.java new file mode 100644 index 0000000..ed22a4e --- /dev/null +++ b/webmagic-extension/src/main/java/us/codecraft/webmagic/utils/ClassUtils.java @@ -0,0 +1,26 @@ +package us.codecraft.webmagic.utils; + +import java.lang.reflect.Field; +import java.util.LinkedHashSet; +import java.util.Set; + +/** + * @author code4crafter@gmail.com + * @since 0.5.0 + */ +public abstract class ClassUtils { + + public static Set getFieldsIncludeSuperClass(Class clazz) { + Set fields = new LinkedHashSet(); + Class current = clazz; + while (current != null) { + Field[] currentFields = current.getDeclaredFields(); + for (Field currentField : currentFields) { + fields.add(currentField); + } + current = current.getSuperclass(); + } + return fields; + } + +} diff --git a/webmagic-extension/src/main/java/us/codecraft/webmagic/utils/ExtractorUtils.java 
b/webmagic-extension/src/main/java/us/codecraft/webmagic/utils/ExtractorUtils.java index 0818fde..54a4439 100644 --- a/webmagic-extension/src/main/java/us/codecraft/webmagic/utils/ExtractorUtils.java +++ b/webmagic-extension/src/main/java/us/codecraft/webmagic/utils/ExtractorUtils.java @@ -37,12 +37,7 @@ public class ExtractorUtils { } private static Selector getXpathSelector(String value) { - Selector selector; - if (EnvironmentUtil.useXsoup()) { - selector = new XsoupSelector(value); - } else { - selector = new XpathSelector(value); - } + Selector selector = new XpathSelector(value); return selector; } diff --git a/webmagic-extension/src/main/java/us/codecraft/webmagic/utils/IPUtils.java b/webmagic-extension/src/main/java/us/codecraft/webmagic/utils/IPUtils.java new file mode 100644 index 0000000..3d41696 --- /dev/null +++ b/webmagic-extension/src/main/java/us/codecraft/webmagic/utils/IPUtils.java @@ -0,0 +1,36 @@ +package us.codecraft.webmagic.utils; + +import java.net.Inet6Address; +import java.net.InetAddress; +import java.net.NetworkInterface; +import java.net.SocketException; +import java.util.Enumeration; + +/** + * @author code4crafer@gmail.com + * @since 0.5.0 + */ +public abstract class IPUtils { + + public static String getFirstNoLoopbackIPAddresses() throws SocketException { + + Enumeration networkInterfaces = NetworkInterface.getNetworkInterfaces(); + + InetAddress localAddress = null; + while (networkInterfaces.hasMoreElements()) { + NetworkInterface networkInterface = networkInterfaces.nextElement(); + Enumeration inetAddresses = networkInterface.getInetAddresses(); + while (inetAddresses.hasMoreElements()) { + InetAddress address = inetAddresses.nextElement(); + if (!address.isLoopbackAddress() && !Inet6Address.class.isInstance(address)) { + return address.getHostAddress(); + } else if (!address.isLoopbackAddress()) { + localAddress = address; + } + } + } + + return localAddress.getHostAddress(); + } + +} diff --git 
a/webmagic-extension/src/main/resouces/log4j.xml b/webmagic-extension/src/main/resouces/log4j.xml new file mode 100644 index 0000000..c2b5a2f --- /dev/null +++ b/webmagic-extension/src/main/resouces/log4j.xml @@ -0,0 +1,21 @@ + + + + + + + + + + + + + + + + + + + + + diff --git a/webmagic-extension/src/main/resouces/spider-config-draft.xml b/webmagic-extension/src/main/resouces/spider-config-draft.xml new file mode 100644 index 0000000..85aee4d --- /dev/null +++ b/webmagic-extension/src/main/resouces/spider-config-draft.xml @@ -0,0 +1,29 @@ + + + + utf-8 + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/webmagic-extension/src/test/java/us/codecraft/webmagic/configurable/ConfigurablePageProcessorTest.java b/webmagic-extension/src/test/java/us/codecraft/webmagic/configurable/ConfigurablePageProcessorTest.java new file mode 100644 index 0000000..a35fffa --- /dev/null +++ b/webmagic-extension/src/test/java/us/codecraft/webmagic/configurable/ConfigurablePageProcessorTest.java @@ -0,0 +1,39 @@ +package us.codecraft.webmagic.configurable; + +import org.junit.Test; +import us.codecraft.webmagic.ResultItems; +import us.codecraft.webmagic.Site; +import us.codecraft.webmagic.Spider; +import us.codecraft.webmagic.downloader.MockGithubDownloader; + +import java.util.ArrayList; +import java.util.List; + +import static org.assertj.core.api.Assertions.assertThat; + +/** + * @author code4crafter@gmail.com + * @date 14-4-5 + */ +public class ConfigurablePageProcessorTest { + + @Test + public void test() throws Exception { + List extractRules = new ArrayList(); + ExtractRule extractRule = new ExtractRule(); + extractRule.setExpressionType(ExpressionType.XPath); + extractRule.setExpressionValue("//title"); + extractRule.setFieldName("title"); + extractRules.add(extractRule); + extractRule = new ExtractRule(); + extractRule.setExpressionType(ExpressionType.XPath); + 
extractRule.setExpressionValue("//ul[@class='pagehead-actions']/li[1]//a[@class='social-count js-social-count']/text()"); + extractRule.setFieldName("star"); + extractRules.add(extractRule); + ResultItems resultItems = Spider.create(new ConfigurablePageProcessor(Site.me(), extractRules)) + .setDownloader(new MockGithubDownloader()).get("https://github.com/code4craft/webmagic"); + assertThat(resultItems.getAll()).containsEntry("title", "code4craft/webmagic · GitHub"); + assertThat(resultItems.getAll()).containsEntry("star", " 86 "); + + } +} diff --git a/webmagic-extension/src/test/java/us/codecraft/webmagic/model/BaseRepo.java b/webmagic-extension/src/test/java/us/codecraft/webmagic/model/BaseRepo.java new file mode 100644 index 0000000..2d9cf94 --- /dev/null +++ b/webmagic-extension/src/test/java/us/codecraft/webmagic/model/BaseRepo.java @@ -0,0 +1,12 @@ +package us.codecraft.webmagic.model; + +import us.codecraft.webmagic.model.annotation.ExtractBy; + +/** + * @author code4crafter@gmail.com + */ +public class BaseRepo { + + @ExtractBy("//ul[@class='pagehead-actions']/li[1]//a[@class='social-count js-social-count']/text()") + protected int star; +} diff --git a/webmagic-extension/src/test/java/us/codecraft/webmagic/model/GithubRepo.java b/webmagic-extension/src/test/java/us/codecraft/webmagic/model/GithubRepo.java new file mode 100644 index 0000000..d825a1f --- /dev/null +++ b/webmagic-extension/src/test/java/us/codecraft/webmagic/model/GithubRepo.java @@ -0,0 +1,32 @@ +package us.codecraft.webmagic.model; + +import us.codecraft.webmagic.Site; +import us.codecraft.webmagic.model.annotation.ExtractBy; +import us.codecraft.webmagic.model.annotation.HelpUrl; +import us.codecraft.webmagic.model.annotation.TargetUrl; + +/** + * @author code4crafter@gmail.com
+ * @since 0.3.2 + */ +@TargetUrl("https://github.com/\\w+/\\w+") +@HelpUrl({"https://github.com/\\w+\\?tab=repositories", "https://github.com/\\w+", "https://github.com/explore/*"}) +public class GithubRepo extends BaseRepo{ + + @ExtractBy("//ul[@class='pagehead-actions']/li[2]//a[@class='social-count']/text()") + private int fork; + + public static void main(String[] args) { + OOSpider.create(Site.me().setSleepTime(100) + , new ConsolePageModelPipeline(), GithubRepo.class) + .addUrl("https://github.com/code4craft").thread(10).run(); + } + + public int getStar() { + return star; + } + + public int getFork() { + return fork; + } +} diff --git a/webmagic-extension/src/test/java/us/codecraft/webmagic/model/GithubRepoTest.java b/webmagic-extension/src/test/java/us/codecraft/webmagic/model/GithubRepoTest.java index 85b6858..d9501a2 100644 --- a/webmagic-extension/src/test/java/us/codecraft/webmagic/model/GithubRepoTest.java +++ b/webmagic-extension/src/test/java/us/codecraft/webmagic/model/GithubRepoTest.java @@ -5,7 +5,6 @@ import org.junit.Test; import us.codecraft.webmagic.downloader.MockGithubDownloader; import us.codecraft.webmagic.Site; import us.codecraft.webmagic.Task; -import us.codecraft.webmagic.example.GithubRepo; import us.codecraft.webmagic.pipeline.PageModelPipeline; /** diff --git a/webmagic-extension/src/test/java/us/codecraft/webmagic/model/ModelPageProcessorTest.java b/webmagic-extension/src/test/java/us/codecraft/webmagic/model/ModelPageProcessorTest.java new file mode 100644 index 0000000..74f3f6a --- /dev/null +++ b/webmagic-extension/src/test/java/us/codecraft/webmagic/model/ModelPageProcessorTest.java @@ -0,0 +1,45 @@ +package us.codecraft.webmagic.model; + +import org.junit.Test; +import us.codecraft.webmagic.Page; +import us.codecraft.webmagic.Request; +import us.codecraft.webmagic.model.annotation.ExtractBy; +import us.codecraft.webmagic.model.annotation.TargetUrl; +import us.codecraft.webmagic.selector.PlainText; + +import static 
org.assertj.core.api.Assertions.assertThat; + +/** + * @author code4crafter@gmail.com + * @date 14-4-4 + */ +public class ModelPageProcessorTest { + + @TargetUrl("http://codecraft.us/foo") + public static class ModelFoo { + + @ExtractBy(value = "//div/@foo", notNull = true) + private String foo; + + } + + @TargetUrl("http://codecraft.us/bar") + public static class ModelBar { + + @ExtractBy(value = "//div/@bar", notNull = true) + private String bar; + + } + + @Test + public void testMultiModel_should_not_skip_when_match() throws Exception { + Page page = new Page(); + page.setRawText("
"); + page.setRequest(new Request("http://codecraft.us/foo")); + page.setUrl(PlainText.create("http://codecraft.us/foo")); + ModelPageProcessor modelPageProcessor = ModelPageProcessor.create(null, ModelFoo.class, ModelBar.class); + modelPageProcessor.process(page); + assertThat(page.getResultItems().isSkip()).isFalse(); + + } +} diff --git a/webmagic-extension/src/test/java/us/codecraft/webmagic/monitor/CustomSpiderStatus.java b/webmagic-extension/src/test/java/us/codecraft/webmagic/monitor/CustomSpiderStatus.java new file mode 100644 index 0000000..75679da --- /dev/null +++ b/webmagic-extension/src/test/java/us/codecraft/webmagic/monitor/CustomSpiderStatus.java @@ -0,0 +1,19 @@ +package us.codecraft.webmagic.monitor; + +import us.codecraft.webmagic.Spider; + +/** + * @author code4crafer@gmail.com + */ +public class CustomSpiderStatus extends SpiderStatus implements CustomSpiderStatusMXBean { + + public CustomSpiderStatus(Spider spider, SpiderMonitor.MonitorSpiderListener monitorSpiderListener) { + super(spider, monitorSpiderListener); + } + + + @Override + public String getSchedulerName() { + return spider.getScheduler().getClass().getName(); + } +} diff --git a/webmagic-extension/src/test/java/us/codecraft/webmagic/monitor/CustomSpiderStatusMXBean.java b/webmagic-extension/src/test/java/us/codecraft/webmagic/monitor/CustomSpiderStatusMXBean.java new file mode 100644 index 0000000..5dd8ace --- /dev/null +++ b/webmagic-extension/src/test/java/us/codecraft/webmagic/monitor/CustomSpiderStatusMXBean.java @@ -0,0 +1,10 @@ +package us.codecraft.webmagic.monitor; + +/** + * @author code4crafer@gmail.com + */ +public interface CustomSpiderStatusMXBean extends SpiderStatusMXBean { + + public String getSchedulerName(); + +} diff --git a/webmagic-extension/src/test/java/us/codecraft/webmagic/monitor/SpiderMonitorTest.java b/webmagic-extension/src/test/java/us/codecraft/webmagic/monitor/SpiderMonitorTest.java new file mode 100644 index 0000000..3baa0d6 --- /dev/null +++ 
b/webmagic-extension/src/test/java/us/codecraft/webmagic/monitor/SpiderMonitorTest.java @@ -0,0 +1,31 @@ +package us.codecraft.webmagic.monitor; + +import org.junit.Test; +import us.codecraft.webmagic.Spider; +import us.codecraft.webmagic.processor.example.GithubRepoPageProcessor; +import us.codecraft.webmagic.processor.example.OschinaBlogPageProcessor; + +/** + * @author code4crafer@gmail.com + * @since 0.5.0 + */ +public class SpiderMonitorTest { + + @Test + public void testInherit() throws Exception { + SpiderMonitor spiderMonitor = new SpiderMonitor(){ + @Override + protected SpiderStatusMXBean getSpiderStatusMBean(Spider spider, MonitorSpiderListener monitorSpiderListener) { + return new CustomSpiderStatus(spider, monitorSpiderListener); + } + }; + + Spider oschinaSpider = Spider.create(new OschinaBlogPageProcessor()) + .addUrl("http://my.oschina.net/flashsword/blog").thread(2); + Spider githubSpider = Spider.create(new GithubRepoPageProcessor()) + .addUrl("https://github.com/code4craft"); + + spiderMonitor.register(oschinaSpider, githubSpider); + + } +} diff --git a/webmagic-extension/src/test/java/us/codecraft/webmagic/utils/IPUtilsTest.java b/webmagic-extension/src/test/java/us/codecraft/webmagic/utils/IPUtilsTest.java new file mode 100644 index 0000000..9d78fb9 --- /dev/null +++ b/webmagic-extension/src/test/java/us/codecraft/webmagic/utils/IPUtilsTest.java @@ -0,0 +1,14 @@ +package us.codecraft.webmagic.utils; + +import org.junit.Test; + +/** + * @author code4crafer@gmail.com + */ +public class IPUtilsTest { + + @Test + public void testGetFirstNoLoopbackIPAddresses() throws Exception { + System.out.println(IPUtils.getFirstNoLoopbackIPAddresses()); + } +} diff --git a/webmagic-extension/src/test/resouces/log4j.xml b/webmagic-extension/src/test/resouces/log4j.xml index a58e889..c2b5a2f 100644 --- a/webmagic-extension/src/test/resouces/log4j.xml +++ b/webmagic-extension/src/test/resouces/log4j.xml @@ -8,23 +8,13 @@ - - - - - - - - - - - + diff --git 
a/webmagic-lucene/README.md b/webmagic-lucene/README.md deleted file mode 100644 index 77050ab..0000000 --- a/webmagic-lucene/README.md +++ /dev/null @@ -1,3 +0,0 @@ -webmagic-lucene --------- -尝试将webmagic与lucene结合,打造一个搜索引擎。开发中,不作为webmagic主要模块。 \ No newline at end of file diff --git a/webmagic-lucene/pom.xml b/webmagic-lucene/pom.xml deleted file mode 100644 index 0d2cb84..0000000 --- a/webmagic-lucene/pom.xml +++ /dev/null @@ -1,46 +0,0 @@ - - - - webmagic-parent - us.codecraft - 0.4.3 - - 4.0.0 - - webmagic-lucene - - - - org.apache.lucene - lucene-analyzers-common - 4.4.0 - - - org.apache.lucene - lucene-queryparser - 4.4.0 - - - us.codecraft - webmagic-extension - ${project.version} - - - junit - junit - - - - - - - maven-deploy-plugin - - true - - - - - - - \ No newline at end of file diff --git a/webmagic-lucene/src/main/java/us/codecraft/webmagic/pipeline/LucenePipeline.java b/webmagic-lucene/src/main/java/us/codecraft/webmagic/pipeline/LucenePipeline.java deleted file mode 100644 index 6fe2702..0000000 --- a/webmagic-lucene/src/main/java/us/codecraft/webmagic/pipeline/LucenePipeline.java +++ /dev/null @@ -1,92 +0,0 @@ -package us.codecraft.webmagic.pipeline; - -import org.apache.lucene.analysis.Analyzer; -import org.apache.lucene.analysis.standard.StandardAnalyzer; -import org.apache.lucene.document.Document; -import org.apache.lucene.document.Field; -import org.apache.lucene.document.TextField; -import org.apache.lucene.index.DirectoryReader; -import org.apache.lucene.index.IndexWriter; -import org.apache.lucene.index.IndexWriterConfig; -import org.apache.lucene.queryparser.classic.ParseException; -import org.apache.lucene.queryparser.classic.QueryParser; -import org.apache.lucene.search.IndexSearcher; -import org.apache.lucene.search.Query; -import org.apache.lucene.search.ScoreDoc; -import org.apache.lucene.store.Directory; -import org.apache.lucene.store.RAMDirectory; -import org.apache.lucene.util.Version; -import us.codecraft.webmagic.ResultItems; 
-import us.codecraft.webmagic.Task; - -import java.io.IOException; -import java.util.ArrayList; -import java.util.List; -import java.util.Map; - -/** - * @author code4crafter@gmail.com
- * Date: 13-8-5
- * Time: 下午2:11
- */ -public class LucenePipeline implements Pipeline { - - private Directory directory; - - private Analyzer analyzer; - - private IndexWriterConfig config; - - private void init() throws IOException { - analyzer = new StandardAnalyzer(Version.LUCENE_44); - directory = new RAMDirectory(); - config = new IndexWriterConfig(Version.LUCENE_44, analyzer); - } - - public LucenePipeline() { - try { - init(); - } catch (IOException e) { - e.printStackTrace(); - } - } - - public List search(String fieldName, String value) throws IOException, ParseException { - List documents = new ArrayList(); - DirectoryReader ireader = DirectoryReader.open(directory); - IndexSearcher isearcher = new IndexSearcher(ireader); - // Parse a simple query that searches for "text": - QueryParser parser = new QueryParser(Version.LUCENE_44, fieldName, analyzer); - Query query = parser.parse(value); - ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs; - // Iterate through the results: - for (int i = 0; i < hits.length; i++) { - Document hitDoc = isearcher.doc(hits[i].doc); - documents.add(hitDoc); - } - ireader.close(); - return documents; - } - - @Override - public void process(ResultItems resultItems, Task task) { - if (resultItems.isSkip()){ - return; - } - Document doc = new Document(); - Map all = resultItems.getAll(); - if (all==null){ - return; - } - for (Map.Entry objectEntry : all.entrySet()) { - doc.add(new Field(objectEntry.getKey(), objectEntry.getValue().toString(), TextField.TYPE_STORED)); - } - try { - IndexWriter indexWriter = new IndexWriter(directory, config); - indexWriter.addDocument(doc); - indexWriter.close(); - } catch (IOException e) { - e.printStackTrace(); - } - } -} diff --git a/webmagic-lucene/src/main/test/java/us/codecraft/webmagic/lucene/OschinaBlog.java b/webmagic-lucene/src/main/test/java/us/codecraft/webmagic/lucene/OschinaBlog.java deleted file mode 100644 index b350370..0000000 --- 
a/webmagic-lucene/src/main/test/java/us/codecraft/webmagic/lucene/OschinaBlog.java +++ /dev/null @@ -1,61 +0,0 @@ -package us.codecraft.webmagic.lucene; - -import org.apache.lucene.document.Document; -import org.apache.lucene.queryparser.classic.ParseException; -import us.codecraft.webmagic.Site; -import us.codecraft.webmagic.model.annotation.ExtractBy; -import us.codecraft.webmagic.model.OOSpider; -import us.codecraft.webmagic.model.annotation.TargetUrl; -import us.codecraft.webmagic.pipeline.LucenePipeline; - -import java.io.IOException; -import java.util.List; - -/** - * @author code4crafter@gmail.com
- * Date: 13-8-2
- * Time: 上午7:52
- */ -@TargetUrl("http://my.oschina.net/flashsword/blog/\\d+") -public class OschinaBlog { - - @ExtractBy("//title") - private String title; - - @ExtractBy(value = "div.BlogContent", type = ExtractBy.Type.Css) - private String content; - - @Override - public String toString() { - return "OschinaBlog{" + - "title='" + title + '\'' + - ", content='" + content + '\'' + - '}'; - } - - public static void main(String[] args) { - LucenePipeline pipeline = new LucenePipeline(); - OOSpider.create(Site.me().addStartUrl("http://my.oschina.net/flashsword/blog"), OschinaBlog.class).pipeline(pipeline).runAsync(); - while (true) { - try { - List search = pipeline.search("title", "webmagic"); - System.out.println(search); - Thread.sleep(3000); - } catch (IOException e) { - e.printStackTrace(); - } catch (ParseException e) { - e.printStackTrace(); - } catch (InterruptedException e) { - e.printStackTrace(); - } - } - } - - public String getTitle() { - return title; - } - - public String getContent() { - return content; - } -} diff --git a/webmagic-panel/README.md b/webmagic-panel/README.md deleted file mode 100644 index 30ddd13..0000000 --- a/webmagic-panel/README.md +++ /dev/null @@ -1,20 +0,0 @@ -Worker: - -任务执行者,提供Http接口,监控运行状态,终止和开始job - -队列: - -仍然使用redis - -Panel: - -提供Web管理后台,管理 - - - -1. 新建任务 - 1. 通过脚本 - 2. 配置 - 3. 分配机器 -2. 已有任务 -3. 
任务查看 \ No newline at end of file diff --git a/webmagic-samples/pom.xml b/webmagic-samples/pom.xml index d3efdb4..4769c21 100644 --- a/webmagic-samples/pom.xml +++ b/webmagic-samples/pom.xml @@ -3,7 +3,7 @@ webmagic-parent us.codecraft - 0.4.3 + 0.5.0 4.0.0 @@ -34,6 +34,25 @@ true + + org.apache.maven.plugins + maven-dependency-plugin + + + copy-dependencies + package + + copy-dependencies + + + ${project.build.directory}/lib + false + false + true + + + + org.apache.maven.plugins maven-jar-plugin diff --git a/webmagic-samples/src/main/java/us/codecraft/webmagic/model/samples/BaiduNews.java b/webmagic-samples/src/main/java/us/codecraft/webmagic/model/samples/BaiduNews.java new file mode 100644 index 0000000..4795662 --- /dev/null +++ b/webmagic-samples/src/main/java/us/codecraft/webmagic/model/samples/BaiduNews.java @@ -0,0 +1,43 @@ +package us.codecraft.webmagic.model.samples; + +import us.codecraft.webmagic.Site; +import us.codecraft.webmagic.model.OOSpider; +import us.codecraft.webmagic.model.annotation.ExtractBy; + +/** + * @author code4crafter@gmail.com + * @date 14-4-9 + */ +public class BaiduNews { + + @ExtractBy("//h3[@class='c-title']/a/text()") + private String name; + + @ExtractBy("//div[@class='c-summary']/text()") + private String description; + + @Override + public String toString() { + return "BaiduNews{" + + "name='" + name + '\'' + + ", description='" + description + '\'' + + '}'; + } + + public static void main(String[] args) { + OOSpider ooSpider = OOSpider.create(Site.me().setSleepTime(0), BaiduNews.class); + //single download + BaiduNews baike = ooSpider.get("http://news.baidu.com/ns?tn=news&cl=2&rn=20&ct=1&fr=bks0000&ie=utf-8&word=httpclient"); + System.out.println(baike); + + ooSpider.close(); + } + + public String getName() { + return name; + } + + public String getDescription() { + return description; + } +} \ No newline at end of file diff --git a/webmagic-samples/src/main/java/us/codecraft/webmagic/model/samples/Kr36NewsModel.java 
b/webmagic-samples/src/main/java/us/codecraft/webmagic/model/samples/Kr36NewsModel.java index 936f132..a1ef3fd 100644 --- a/webmagic-samples/src/main/java/us/codecraft/webmagic/model/samples/Kr36NewsModel.java +++ b/webmagic-samples/src/main/java/us/codecraft/webmagic/model/samples/Kr36NewsModel.java @@ -1,14 +1,19 @@ package us.codecraft.webmagic.model.samples; import us.codecraft.webmagic.Site; +import us.codecraft.webmagic.Spider; import us.codecraft.webmagic.Task; import us.codecraft.webmagic.model.OOSpider; +import us.codecraft.webmagic.monitor.SpiderMonitor; import us.codecraft.webmagic.pipeline.PageModelPipeline; import us.codecraft.webmagic.model.annotation.ExtractBy; import us.codecraft.webmagic.model.annotation.ExtractByUrl; import us.codecraft.webmagic.model.annotation.HelpUrl; import us.codecraft.webmagic.model.annotation.TargetUrl; +import javax.management.JMException; +import java.io.IOException; + /** * @author code4crafter@gmail.com
*/ @@ -25,14 +30,17 @@ public class Kr36NewsModel { @ExtractByUrl private String url; - public static void main(String[] args) { + public static void main(String[] args) throws IOException, JMException { //Just for benchmark - OOSpider.create(Site.me().addStartUrl("http://www.36kr.com/").setSleepTime(0), new PageModelPipeline() { + Spider thread = OOSpider.create(Site.me().addStartUrl("http://www.36kr.com/").setSleepTime(0), new PageModelPipeline() { @Override public void process(Object o, Task task) { } - },Kr36NewsModel.class).thread(20).run(); + }, Kr36NewsModel.class).thread(20); + thread.start(); + SpiderMonitor spiderMonitor = SpiderMonitor.instance(); + spiderMonitor.register(thread); } public String getTitle() { diff --git a/webmagic-samples/src/main/java/us/codecraft/webmagic/model/samples/News163.java b/webmagic-samples/src/main/java/us/codecraft/webmagic/model/samples/News163.java index e9dfb26..45bee2f 100644 --- a/webmagic-samples/src/main/java/us/codecraft/webmagic/model/samples/News163.java +++ b/webmagic-samples/src/main/java/us/codecraft/webmagic/model/samples/News163.java @@ -3,7 +3,6 @@ package us.codecraft.webmagic.model.samples; import us.codecraft.webmagic.MultiPageModel; import us.codecraft.webmagic.Site; import us.codecraft.webmagic.model.OOSpider; -import us.codecraft.webmagic.model.annotation.ComboExtract; import us.codecraft.webmagic.model.annotation.ExtractBy; import us.codecraft.webmagic.model.annotation.ExtractByUrl; import us.codecraft.webmagic.model.annotation.TargetUrl; @@ -26,9 +25,8 @@ public class News163 implements MultiPageModel { @ExtractByUrl(value = "http://news\\.163\\.com/\\d+/\\d+/\\d+/\\w+_(\\d+)\\.html", notNull = false) private String page; - @ComboExtract(value = {@ExtractBy("//div[@class=\"ep-pages\"]//a/@href"), - @ExtractBy(value = "http://news\\.163\\.com/\\d+/\\d+/\\d+/\\w+_(\\d+)\\.html", type = ExtractBy.Type.Regex)}, - multi = true, notNull = false) + @ExtractBy(value = 
"//div[@class=\"ep-pages\"]//a/regex('http://news\\.163\\.com/\\d+/\\d+/\\d+/\\w+_(\\d+)\\.html',1)" + , multi = true, notNull = false) private List otherPage; @ExtractBy("//h1[@id=\"h1title\"]/text()") @@ -74,8 +72,8 @@ public class News163 implements MultiPageModel { } public static void main(String[] args) { - OOSpider.create(Site.me().addStartUrl("http://news.163.com/13/0802/05/958I1E330001124J_2.html"), News163.class) - .scheduler(new RedisScheduler("localhost")).clearPipeline().pipeline(new MultiPagePipeline()).pipeline(new ConsolePipeline()).run(); + OOSpider.create(Site.me(), News163.class).addUrl("http://news.163.com/13/0802/05/958I1E330001124J_2.html") + .scheduler(new RedisScheduler("localhost")).addPipeline(new MultiPagePipeline()).addPipeline(new ConsolePipeline()).run(); } } diff --git a/webmagic-samples/src/main/java/us/codecraft/webmagic/model/samples/QQMeishi.java b/webmagic-samples/src/main/java/us/codecraft/webmagic/model/samples/QQMeishi.java new file mode 100644 index 0000000..f4f8591 --- /dev/null +++ b/webmagic-samples/src/main/java/us/codecraft/webmagic/model/samples/QQMeishi.java @@ -0,0 +1,27 @@ +package us.codecraft.webmagic.model.samples; + +import us.codecraft.webmagic.Site; +import us.codecraft.webmagic.model.ConsolePageModelPipeline; +import us.codecraft.webmagic.model.OOSpider; +import us.codecraft.webmagic.model.annotation.ExtractBy; +import us.codecraft.webmagic.model.annotation.TargetUrl; + +/** + * @author code4crafter@gmail.com + * @date 14-4-11 + */ +@TargetUrl("http://meishi.qq.com/beijing/c/all[\\-p2]*") +@ExtractBy(value = "//ul[@id=\"promos_list2\"]/li",multi = true) +public class QQMeishi { + + @ExtractBy("//div[@class=info]/a[@class=title]/h4/text()") + private String shopName; + + @ExtractBy("//div[@class=info]/a[@class=title]/text()") + private String promo; + + public static void main(String[] args) { + OOSpider.create(Site.me(), new ConsolePageModelPipeline(), 
QQMeishi.class).addUrl("http://meishi.qq.com/beijing/c/all").thread(4).run(); + } + +} diff --git a/webmagic-samples/src/main/java/us/codecraft/webmagic/samples/AngularJSProcessor.java b/webmagic-samples/src/main/java/us/codecraft/webmagic/samples/AngularJSProcessor.java new file mode 100644 index 0000000..ab560e4 --- /dev/null +++ b/webmagic-samples/src/main/java/us/codecraft/webmagic/samples/AngularJSProcessor.java @@ -0,0 +1,48 @@ +package us.codecraft.webmagic.samples; + +import org.apache.commons.collections.CollectionUtils; +import us.codecraft.webmagic.Page; +import us.codecraft.webmagic.Site; +import us.codecraft.webmagic.Spider; +import us.codecraft.webmagic.processor.PageProcessor; +import us.codecraft.webmagic.selector.JsonPathSelector; + +import java.util.List; + +/** + * @author code4crafter@gmail.com + * @since 0.5.0 + */ +public class AngularJSProcessor implements PageProcessor { + + private Site site = Site.me(); + + private static final String ARITICALE_URL = "http://angularjs\\.cn/api/article/\\w+"; + + private static final String LIST_URL = "http://angularjs\\.cn/api/article/latest.*"; + + @Override + public void process(Page page) { + if (page.getUrl().regex(LIST_URL).match()) { + List ids = new JsonPathSelector("$.data[*]._id").selectList(page.getRawText()); + if (CollectionUtils.isNotEmpty(ids)) { + for (String id : ids) { + page.addTargetRequest("http://angularjs.cn/api/article/" + id); + } + } + } else { + page.putField("title", new JsonPathSelector("$.data.title").select(page.getRawText())); + page.putField("content", new JsonPathSelector("$.data.content").select(page.getRawText())); + } + + } + + @Override + public Site getSite() { + return site; + } + + public static void main(String[] args) { + Spider.create(new AngularJSProcessor()).addUrl("http://angularjs.cn/api/article/latest?p=1&s=20").run(); + } +} diff --git a/webmagic-samples/src/main/java/us/codecraft/webmagic/samples/SinaBlogProcesser.java 
b/webmagic-samples/src/main/java/us/codecraft/webmagic/samples/SinaBlogProcesser.java deleted file mode 100644 index dcb6eff..0000000 --- a/webmagic-samples/src/main/java/us/codecraft/webmagic/samples/SinaBlogProcesser.java +++ /dev/null @@ -1,37 +0,0 @@ -package us.codecraft.webmagic.samples; - -import us.codecraft.webmagic.Page; -import us.codecraft.webmagic.Site; -import us.codecraft.webmagic.Spider; -import us.codecraft.webmagic.processor.PageProcessor; - -/** - * @author code4crafter@gmail.com
- */ -public class SinaBlogProcesser implements PageProcessor { - - private Site site; - - @Override - public void process(Page page) { - page.addTargetRequests(page.getHtml().xpath("//div[@class='articalfrontback SG_j_linedot1 clearfix']").links().all()); - page.putField("title", page.getHtml().xpath("//div[@class='articalTitle']/h2")); - page.putField("content",page.getHtml().xpath("//div[@id='articlebody']//div[@class='articalContent']")); - page.putField("id",page.getUrl().regex("http://blog\\.sina\\.com\\.cn/s/blog_(\\w+)")); - page.putField("date",page.getHtml().xpath("//div[@id='articlebody']//span[@class='time SG_txtc']").regex("\\((.*)\\)")); -// page.putField("tags",page.getHtml().xpath("//td[@class='blog_tag']/h3/a")); - } - - @Override - public Site getSite() { - if (site==null){ - site = Site.me().setDomain("blog.sina.com.cn").addStartUrl("http://blog.sina.com.cn/s/blog_4701280b0102egl0.html").setSleepTime(3000). - setUserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.65 Safari/537.31"); - } - return site; - } - - public static void main(String[] args) { - Spider.create(new SinaBlogProcesser()).run(); - } -} diff --git a/webmagic-samples/src/main/java/us/codecraft/webmagic/samples/SinaBlogProcessor.java b/webmagic-samples/src/main/java/us/codecraft/webmagic/samples/SinaBlogProcessor.java new file mode 100644 index 0000000..2872e02 --- /dev/null +++ b/webmagic-samples/src/main/java/us/codecraft/webmagic/samples/SinaBlogProcessor.java @@ -0,0 +1,48 @@ +package us.codecraft.webmagic.samples; + +import us.codecraft.webmagic.Page; +import us.codecraft.webmagic.Site; +import us.codecraft.webmagic.Spider; +import us.codecraft.webmagic.processor.PageProcessor; + +/** + * @author code4crafter@gmail.com
+ */ +public class SinaBlogProcessor implements PageProcessor { + + public static final String URL_LIST = "http://blog\\.sina\\.com\\.cn/s/articlelist_1487828712_0_\\d+\\.html"; + + public static final String URL_POST = "http://blog\\.sina\\.com\\.cn/s/blog_\\w+\\.html"; + + private Site site = Site + .me() + .setDomain("blog.sina.com.cn") + .setSleepTime(3000) + .setUserAgent( + "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.65 Safari/537.31"); + + @Override + public void process(Page page) { + //列表页 + if (page.getUrl().regex(URL_LIST).match()) { + page.addTargetRequests(page.getHtml().xpath("//div[@class=\"articleList\"]").links().regex(URL_POST).all()); + page.addTargetRequests(page.getHtml().links().regex(URL_LIST).all()); + //文章页 + } else { + page.putField("title", page.getHtml().xpath("//div[@class='articalTitle']/h2")); + page.putField("content", page.getHtml().xpath("//div[@id='articlebody']//div[@class='articalContent']")); + page.putField("date", + page.getHtml().xpath("//div[@id='articlebody']//span[@class='time SG_txtc']").regex("\\((.*)\\)")); + } + } + + @Override + public Site getSite() { + return site; + } + + public static void main(String[] args) { + Spider.create(new SinaBlogProcessor()).addUrl("http://blog.sina.com.cn/s/articlelist_1487828712_0_1.html") + .run(); + } +} diff --git a/webmagic-samples/src/main/java/us/codecraft/webmagic/samples/formatter/StringTemplateFormatter.java b/webmagic-samples/src/main/java/us/codecraft/webmagic/samples/formatter/StringTemplateFormatter.java new file mode 100644 index 0000000..7b38125 --- /dev/null +++ b/webmagic-samples/src/main/java/us/codecraft/webmagic/samples/formatter/StringTemplateFormatter.java @@ -0,0 +1,26 @@ +package us.codecraft.webmagic.samples.formatter; + +import us.codecraft.webmagic.model.formatter.ObjectFormatter; + +/** + * @author yihua.huang@dianping.com + */ +public class StringTemplateFormatter implements ObjectFormatter { + + 
private String template; + + @Override + public String format(String raw) throws Exception { + return String.format(template, raw); + } + + @Override + public Class clazz() { + return String.class; + } + + @Override + public void initParam(String[] extra) { + template = extra[0]; + } +} diff --git a/webmagic-samples/src/main/java/us/codecraft/webmagic/samples/scheduler/ZipCodePageProcessor.java b/webmagic-samples/src/main/java/us/codecraft/webmagic/samples/scheduler/ZipCodePageProcessor.java index ddbaa08..3f2de70 100644 --- a/webmagic-samples/src/main/java/us/codecraft/webmagic/samples/scheduler/ZipCodePageProcessor.java +++ b/webmagic-samples/src/main/java/us/codecraft/webmagic/samples/scheduler/ZipCodePageProcessor.java @@ -9,8 +9,9 @@ import us.codecraft.webmagic.processor.PageProcessor; import us.codecraft.webmagic.scheduler.PriorityScheduler; import java.util.List; +import java.util.regex.Matcher; +import java.util.regex.Pattern; -import static us.codecraft.webmagic.selector.Selectors.regex; import static us.codecraft.webmagic.selector.Selectors.xpath; /** @@ -19,16 +20,16 @@ import static us.codecraft.webmagic.selector.Selectors.xpath; public class ZipCodePageProcessor implements PageProcessor { private Site site = Site.me().setCharset("gb2312") - .setSleepTime(100).addStartUrl("http://www.ip138.com/post/"); + .setSleepTime(100); @Override public void process(Page page) { if (page.getUrl().toString().equals("http://www.ip138.com/post/")) { processCountry(page); - } else if (page.getUrl().regex("http://www\\.ip138\\.com/post/\\w+[/]?$").toString() != null) { - processProvince(page); - } else { + } else if (page.getUrl().regex("http://www\\.ip138\\.com/\\d{6}[/]?$").toString() != null) { processDistrict(page); + } else { + processProvince(page); } } @@ -45,28 +46,26 @@ public class ZipCodePageProcessor implements PageProcessor { private void processProvince(Page page) { //这里仅靠xpath没法精准定位,所以使用正则作为筛选,不符合正则的会被过滤掉 - List districts = 
page.getHtml().xpath("//body/table/tbody/tr/td").regex(".*http://www\\.ip138\\.com/post/\\w+/\\w+.*").all(); + List districts = page.getHtml().xpath("//body/table/tbody/tr[@bgcolor=\"#ffffff\"]").all(); + Pattern pattern = Pattern.compile("([^<>]+).*?href=\"(.*?)\"",Pattern.DOTALL); for (String district : districts) { - String link = xpath("//@href").select(district); - String title = xpath("/text()").select(district); - Request request = new Request(link).setPriority(1).putExtra("province", page.getRequest().getExtra("province")).putExtra("district", title); - page.addTargetRequest(request); + Matcher matcher = pattern.matcher(district); + while (matcher.find()) { + String title = matcher.group(1); + String link = matcher.group(2); + Request request = new Request(link).setPriority(1).putExtra("province", page.getRequest().getExtra("province")).putExtra("district", title); + page.addTargetRequest(request); + } } } private void processDistrict(Page page) { String province = page.getRequest().getExtra("province").toString(); String district = page.getRequest().getExtra("district").toString(); - List counties = page.getHtml().xpath("//body/table/tbody/tr").regex(".*\\d+.*").all(); - String regex = "]*>([^<>]+)]*>([^<>]+)]*>([^<>]+)]*>([^<>]+)"; - for (String county : counties) { - String county0 = regex(regex, 1).select(county); - String county1 = regex(regex, 2).select(county); - String zipCode = regex(regex, 3).select(county); - page.putField("result", StringUtils.join(new String[]{province, district, - county0, county1, zipCode}, "\t")); - } - List links = page.getHtml().links().regex("http://www\\.ip138\\.com/post/\\w+/\\w+").all(); + String zipCode = page.getHtml().regex("

邮编:(\\d+)

").toString(); + page.putField("result", StringUtils.join(new String[]{province, district, + zipCode}, "\t")); + List links = page.getHtml().links().regex("http://www\\.ip138\\.com/\\d{6}[/]?$").all(); for (String link : links) { page.addTargetRequest(new Request(link).setPriority(2).putExtra("province", province).putExtra("district", district)); } @@ -79,11 +78,8 @@ public class ZipCodePageProcessor implements PageProcessor { } public static void main(String[] args) { - Spider.create(new ZipCodePageProcessor()).scheduler(new PriorityScheduler()).run(); + Spider spider = Spider.create(new ZipCodePageProcessor()).scheduler(new PriorityScheduler()).addUrl("http://www.ip138.com/post/"); - PriorityScheduler scheduler = new PriorityScheduler(); - Spider spider = Spider.create(new ZipCodePageProcessor()).scheduler(scheduler); - scheduler.push(new Request("http://www.baidu.com/s?wd=webmagic&f=12&rsp=0&oq=webmagix&tn=baiduhome_pg&ie=utf-8"),spider); spider.run(); } } diff --git a/webmagic-samples/src/test/java/us/codecraft/webmagic/processor/SinablogProcessorTest.java b/webmagic-samples/src/test/java/us/codecraft/webmagic/processor/SinablogProcessorTest.java index 026f8d5..d7cd5d5 100644 --- a/webmagic-samples/src/test/java/us/codecraft/webmagic/processor/SinablogProcessorTest.java +++ b/webmagic-samples/src/test/java/us/codecraft/webmagic/processor/SinablogProcessorTest.java @@ -5,7 +5,7 @@ import org.junit.Test; import us.codecraft.webmagic.Spider; import us.codecraft.webmagic.pipeline.FilePipeline; import us.codecraft.webmagic.pipeline.JsonFilePipeline; -import us.codecraft.webmagic.samples.SinaBlogProcesser; +import us.codecraft.webmagic.samples.SinaBlogProcessor; import us.codecraft.webmagic.scheduler.FileCacheQueueScheduler; import java.io.IOException; @@ -20,7 +20,7 @@ public class SinablogProcessorTest { @Ignore @Test public void test() throws IOException { - SinaBlogProcesser sinaBlogProcesser = new SinaBlogProcesser(); + SinaBlogProcessor sinaBlogProcessor = new 
SinaBlogProcessor(); //pipeline是抓取结束后的处理 //默认放到/data/webmagic/ftl/[domain]目录下 JsonFilePipeline pipeline = new JsonFilePipeline("/data/webmagic/"); @@ -29,7 +29,7 @@ public class SinablogProcessorTest { //ConsolePipeline输出结果到控制台 //FileCacheQueueSchedular保存url,支持断点续传,临时文件输出到/data/temp/webmagic/cache目录 //Spider.run()执行 - Spider.create(sinaBlogProcesser).pipeline(new FilePipeline()).pipeline(pipeline).scheduler(new FileCacheQueueScheduler("/data/temp/webmagic/cache/")). + Spider.create(sinaBlogProcessor).pipeline(new FilePipeline()).pipeline(pipeline).scheduler(new FileCacheQueueScheduler("/data/temp/webmagic/cache/")). run(); } } diff --git a/webmagic-saxon/pom.xml b/webmagic-saxon/pom.xml index 5c41d0b..a444d39 100644 --- a/webmagic-saxon/pom.xml +++ b/webmagic-saxon/pom.xml @@ -3,7 +3,7 @@ webmagic-parent us.codecraft - 0.4.3 + 0.5.0 4.0.0 @@ -15,9 +15,15 @@ webmagic-core ${project.version} + + net.sourceforge.htmlcleaner + htmlcleaner + 2.5 + net.sf.saxon Saxon-HE + 9.5.1-1 junit @@ -36,4 +42,4 @@ - \ No newline at end of file + diff --git a/webmagic-saxon/src/test/java/us/codecraft/webmagic/selector/XpathSelectorTest.java b/webmagic-saxon/src/test/java/us/codecraft/webmagic/selector/XpathSelectorTest.java index 895ec4b..728bd69 100644 --- a/webmagic-saxon/src/test/java/us/codecraft/webmagic/selector/XpathSelectorTest.java +++ b/webmagic-saxon/src/test/java/us/codecraft/webmagic/selector/XpathSelectorTest.java @@ -1350,7 +1350,7 @@ public class XpathSelectorTest { + "\n" + "\n" + " \n" + " \n" + " \n" + "\n"; String text2 = "
aaa
"; XpathSelector xpathSelector = new XpathSelector( - "//div[@id='main']/div[@class='blog_main']/div[1][@class='blog_title']/h3/a"); + "//div[@id='main']/div[@class='blog_main']/div[@class='blog_title']/h3/a/text()"); String select = xpathSelector.select(text); Assert.assertEquals("jsoup 解析页面商品信息", select); } diff --git a/webmagic-scripts/README.md b/webmagic-scripts/README.md old mode 100644 new mode 100755 diff --git a/webmagic-scripts/deploy.sh b/webmagic-scripts/deploy.sh old mode 100644 new mode 100755 diff --git a/webmagic-scripts/pom.xml b/webmagic-scripts/pom.xml old mode 100644 new mode 100755 index 9fb2b31..2b846d8 --- a/webmagic-scripts/pom.xml +++ b/webmagic-scripts/pom.xml @@ -3,7 +3,7 @@ webmagic-parent us.codecraft - 0.4.3 + 0.5.0 4.0.0 @@ -16,6 +16,10 @@ jruby 1.7.6
+ org.python + jython + 2.5.3 + commons-cli commons-cli @@ -40,25 +44,6 @@ - - org.apache.maven.plugins - maven-dependency-plugin - - - copy-dependencies - package - - copy-dependencies - - - ${project.build.directory}/lib - false - false - true - - - - maven-compiler-plugin diff --git a/webmagic-scripts/src/main/java/us/codecraft/webmagic/scripts/Language.java b/webmagic-scripts/src/main/java/us/codecraft/webmagic/scripts/Language.java old mode 100644 new mode 100755 index c7ddcda..2f9d22d --- a/webmagic-scripts/src/main/java/us/codecraft/webmagic/scripts/Language.java +++ b/webmagic-scripts/src/main/java/us/codecraft/webmagic/scripts/Language.java @@ -7,7 +7,9 @@ public enum Language { JavaScript("javascript","js/defines.js",""), - JRuby("jruby","ruby/defines.rb",""); + JRuby("jruby","ruby/defines.rb",""), + + Jython("jython","python/defines.py",""); private String engineName; diff --git a/webmagic-scripts/src/main/java/us/codecraft/webmagic/scripts/ScriptConsole.java b/webmagic-scripts/src/main/java/us/codecraft/webmagic/scripts/ScriptConsole.java old mode 100644 new mode 100755 diff --git a/webmagic-scripts/src/main/java/us/codecraft/webmagic/scripts/ScriptEnginePool.java b/webmagic-scripts/src/main/java/us/codecraft/webmagic/scripts/ScriptEnginePool.java old mode 100644 new mode 100755 diff --git a/webmagic-scripts/src/main/java/us/codecraft/webmagic/scripts/ScriptProcessor.java b/webmagic-scripts/src/main/java/us/codecraft/webmagic/scripts/ScriptProcessor.java old mode 100644 new mode 100755 index 5801851..1822318 --- a/webmagic-scripts/src/main/java/us/codecraft/webmagic/scripts/ScriptProcessor.java +++ b/webmagic-scripts/src/main/java/us/codecraft/webmagic/scripts/ScriptProcessor.java @@ -1,6 +1,8 @@ package us.codecraft.webmagic.scripts; import org.apache.commons.io.IOUtils; +import org.jruby.RubyHash; +import org.python.core.PyDictionary; import us.codecraft.webmagic.Page; import us.codecraft.webmagic.Site; import 
us.codecraft.webmagic.processor.PageProcessor; @@ -10,6 +12,8 @@ import javax.script.ScriptEngine; import javax.script.ScriptException; import java.io.IOException; import java.io.InputStream; +import java.util.Iterator; +import java.util.Map; /** * @author code4crafter@gmail.com @@ -50,20 +54,35 @@ public class ScriptProcessor implements PageProcessor { context.setAttribute("page", page, ScriptContext.ENGINE_SCOPE); context.setAttribute("config", site, ScriptContext.ENGINE_SCOPE); try { - engine.eval(defines + "\n" + script, context); -// switch (language) { -// case JavaScript: -// NativeObject o = (NativeObject) engine.get("result"); -// if (o != null) { -// for (Map.Entry objectObjectEntry : o.entrySet()) { -// page.getResultItems().put(objectObjectEntry.getKey().toString(), objectObjectEntry.getValue()); + switch (language) { + case JavaScript: + engine.eval(defines + "\n" + script, context); +// NativeObject o = (NativeObject) engine.get("result"); +// if (o != null) { +// for (Object o1 : o.getIds()) { +// String key = String.valueOf(o1); +// page.getResultItems().put(key, NativeObject.getProperty(o, key)); +// } // } -// } -// break; -// case JRuby: -// Object o1 = engine.get("result"); -// break; -// } + break; + case JRuby: + RubyHash oRuby = (RubyHash) engine.eval(defines + "\n" + script, context); + Iterator itruby = oRuby.entrySet().iterator(); + while (itruby.hasNext()) { + Map.Entry pairs = (Map.Entry) itruby.next(); + page.getResultItems().put(pairs.getKey().toString(), pairs.getValue()); + } + break; + case Jython: + engine.eval(defines + "\n" + script, context); + PyDictionary oJython = (PyDictionary) engine.get("result"); + Iterator it = oJython.entrySet().iterator(); + while (it.hasNext()) { + Map.Entry pairs = (Map.Entry) it.next(); + page.getResultItems().put(pairs.getKey().toString(), pairs.getValue()); + } + break; + } } catch (ScriptException e) { e.printStackTrace(); } @@ -72,6 +91,7 @@ public class ScriptProcessor implements PageProcessor 
{ } } + @Override public Site getSite() { return site; diff --git a/webmagic-scripts/src/main/java/us/codecraft/webmagic/scripts/ScriptProcessorBuilder.java b/webmagic-scripts/src/main/java/us/codecraft/webmagic/scripts/ScriptProcessorBuilder.java old mode 100644 new mode 100755 diff --git a/webmagic-scripts/src/main/resources/js/defines.js b/webmagic-scripts/src/main/resources/js/defines.js old mode 100644 new mode 100755 diff --git a/webmagic-scripts/src/main/resources/js/github.js b/webmagic-scripts/src/main/resources/js/github.js old mode 100644 new mode 100755 diff --git a/webmagic-scripts/src/main/resources/js/oschina.js b/webmagic-scripts/src/main/resources/js/oschina.js old mode 100644 new mode 100755 index 305682e..02191c3 --- a/webmagic-scripts/src/main/resources/js/oschina.js +++ b/webmagic-scripts/src/main/resources/js/oschina.js @@ -9,3 +9,4 @@ var config = { title = $("div.BlogTitle h1"), content = $("div.BlogContent") urls("http://my\\.oschina\\.net/flashsword/blog/\\d+") +config; diff --git a/webmagic-scripts/src/main/resources/log4j.xml b/webmagic-scripts/src/main/resources/log4j.xml old mode 100644 new mode 100755 diff --git a/webmagic-scripts/src/main/resources/python/defines.py b/webmagic-scripts/src/main/resources/python/defines.py new file mode 100755 index 0000000..913a4b4 --- /dev/null +++ b/webmagic-scripts/src/main/resources/python/defines.py @@ -0,0 +1,13 @@ +def xpath(str): + return page.getHtml().xpath(str).toString() + +def css(str): + return page.getHtml().css(str).toString() + +def urls(str): + links=page.getHtml().links().regex(str).all() + page.addTargetRequests(links); + +def tomap(key,value): + return "hello world" + diff --git a/webmagic-scripts/src/main/resources/python/oschina.py b/webmagic-scripts/src/main/resources/python/oschina.py new file mode 100755 index 0000000..51a188b --- /dev/null +++ b/webmagic-scripts/src/main/resources/python/oschina.py @@ -0,0 +1,4 @@ +title=xpath("div[@class=BlogTitle]") 
+urls="http://my\\.oschina\\.net/flashsword/blog/\\d+" + +result={"title":title,"urls":urls} diff --git a/webmagic-scripts/src/main/resources/ruby/defines.rb b/webmagic-scripts/src/main/resources/ruby/defines.rb old mode 100644 new mode 100755 diff --git a/webmagic-scripts/src/main/resources/ruby/github.rb b/webmagic-scripts/src/main/resources/ruby/github.rb old mode 100644 new mode 100755 diff --git a/webmagic-scripts/src/main/resources/ruby/oschina.rb b/webmagic-scripts/src/main/resources/ruby/oschina.rb index cbced0b..dbea13b 100644 --- a/webmagic-scripts/src/main/resources/ruby/oschina.rb +++ b/webmagic-scripts/src/main/resources/ruby/oschina.rb @@ -1,3 +1,6 @@ +urls "http://my\\.oschina\\.net/flashsword/blog/\\d+" title = css "div.BlogTitle h1" content = css "div.BlogContent" -urls "http://my\\.oschina\\.net/flashsword/blog/\\d+" \ No newline at end of file + +return {"title"=>title,"content"=>content} + diff --git a/webmagic-scripts/src/test/java/us/codecraft/webmagic/scripts/ScriptProcessorTest.java b/webmagic-scripts/src/test/java/us/codecraft/webmagic/scripts/ScriptProcessorTest.java old mode 100644 new mode 100755 index ec3f674..ffeb9c9 --- a/webmagic-scripts/src/test/java/us/codecraft/webmagic/scripts/ScriptProcessorTest.java +++ b/webmagic-scripts/src/test/java/us/codecraft/webmagic/scripts/ScriptProcessorTest.java @@ -1,5 +1,6 @@ package us.codecraft.webmagic.scripts; +import org.junit.Ignore; import org.junit.Test; import us.codecraft.webmagic.Spider; @@ -7,6 +8,7 @@ import us.codecraft.webmagic.Spider; * @author code4crafter@gmail.com * @since 0.4.1 */ +@Ignore public class ScriptProcessorTest { @Test @@ -22,4 +24,12 @@ public class ScriptProcessorTest { pageProcessor.getSite().setSleepTime(0); Spider.create(pageProcessor).addUrl("http://my.oschina.net/flashsword/blog").setSpawnUrl(false).run(); } + + + @Test + public void testPythonProcessor() { + ScriptProcessor pageProcessor = 
ScriptProcessorBuilder.custom().language(Language.Jython).scriptFromClassPathFile("python/oschina.py").build(); + pageProcessor.getSite().setSleepTime(0); + Spider.create(pageProcessor).addUrl("http://my.oschina.net/flashsword/blog").setSpawnUrl(false).run(); + } } diff --git a/webmagic-scripts/src/test/resouces/log4j.xml b/webmagic-scripts/src/test/resouces/log4j.xml old mode 100644 new mode 100755 diff --git a/webmagic-selenium/pom.xml b/webmagic-selenium/pom.xml index 4b06443..038b371 100644 --- a/webmagic-selenium/pom.xml +++ b/webmagic-selenium/pom.xml @@ -3,7 +3,7 @@ webmagic-parent us.codecraft - 0.4.3 + 0.5.0 4.0.0 @@ -37,4 +37,4 @@ - \ No newline at end of file + diff --git a/webmagic-selenium/src/main/java/us/codecraft/webmagic/downloader/selenium/WebDriverPool.java b/webmagic-selenium/src/main/java/us/codecraft/webmagic/downloader/selenium/WebDriverPool.java index 98b93a9..f628ede 100644 --- a/webmagic-selenium/src/main/java/us/codecraft/webmagic/downloader/selenium/WebDriverPool.java +++ b/webmagic-selenium/src/main/java/us/codecraft/webmagic/downloader/selenium/WebDriverPool.java @@ -87,4 +87,5 @@ class WebDriverPool { webDriver.quit(); } } + } diff --git a/webmagic-selenium/src/test/java/us/codecraft/webmagic/samples/HuabanProcessor.java b/webmagic-selenium/src/test/java/us/codecraft/webmagic/samples/HuabanProcessor.java index 1696a3f..2854a76 100644 --- a/webmagic-selenium/src/test/java/us/codecraft/webmagic/samples/HuabanProcessor.java +++ b/webmagic-selenium/src/test/java/us/codecraft/webmagic/samples/HuabanProcessor.java @@ -22,7 +22,7 @@ public class HuabanProcessor implements PageProcessor { public void process(Page page) { page.addTargetRequests(page.getHtml().links().regex("http://huaban\\.com/.*").all()); if (page.getUrl().toString().contains("pins")) { - page.putField("img", page.getHtml().xpath("//div[@id='pin_img']/img/@src").toString()); + page.putField("img", page.getHtml().xpath("//div[@id='pin_img']/a/img/@src").toString()); } else { 
             page.getResultItems().setSkip(true);
         }
@@ -30,16 +30,17 @@ public class HuabanProcessor implements PageProcessor {
     @Override
     public Site getSite() {
-        if (site == null) {
-            site = Site.me().setDomain("huaban.com").addStartUrl("http://huaban.com/").setSleepTime(0);
+        if (null == site) {
+            site = Site.me().setDomain("huaban.com").setSleepTime(0);
         }
         return site;
     }
     public static void main(String[] args) {
         Spider.create(new HuabanProcessor()).thread(5)
-                .pipeline(new FilePipeline("/data/webmagic/test/"))
-                .downloader(new SeleniumDownloader("/Users/yihua/Downloads/chromedriver"))
+                .addPipeline(new FilePipeline("/data/webmagic/test/"))
+                .setDownloader(new SeleniumDownloader("/Users/yihua/Downloads/chromedriver"))
+                .addUrl("http://huaban.com/")
                 .runAsync();
     }
 }
diff --git a/zh_docs/README.md b/zh_docs/README.md
index c58469a..b336367 100644
--- a/zh_docs/README.md
+++ b/zh_docs/README.md
@@ -1,9 +1,13 @@
-webmagic
----------
+![logo](https://raw.github.com/code4craft/webmagic/master/asserts/logo.jpg)
+
+
 [![Build Status](https://travis-ci.org/code4craft/webmagic.png?branch=master)](https://travis-ci.org/code4craft/webmagic)
+
 [Readme in English](https://github.com/code4craft/webmagic/tree/master/en_docs)
+[User manual](https://github.com/code4craft/webmagic/blob/master/user-manual.md)
+
 >webmagic is an open-source Java vertical crawler framework. Its goal is to simplify crawler development so that developers can focus on their own logic. The core of webmagic is very simple, yet it covers the whole crawling workflow, which also makes it good material for learning crawler development. The author spent a year developing vertical crawlers at a previous company, and webmagic was created to remove some of the repetitive work involved.
 >Web crawling is a technique, and webmagic aims to lower the cost of implementing it. Out of respect for content providers, however, webmagic does not do anything to circumvent blocking, including captcha cracking, proxy switching, and automatic login.
@@ -25,22 +29,37 @@ python爬虫 **scrapy** [https://github.com/scrapy/scrapy](https://github.com/sc
 Java crawler **Spiderman** [https://gitcafe.com/laiweiwei/Spiderman](https://gitcafe.com/laiweiwei/Spiderman)
+webmagic on github: [https://github.com/code4craft/webmagic](https://github.com/code4craft/webmagic).
+
 ## Quick start
 ### Using maven
 webmagic uses maven to manage dependencies; add the corresponding dependencies to your project to use webmagic:
-    <dependency>
-        <groupId>us.codecraft</groupId>
-        <artifactId>webmagic-core</artifactId>
-        <version>0.4.2</version>
-    </dependency>
-    <dependency>
-        <groupId>us.codecraft</groupId>
-        <artifactId>webmagic-extension</artifactId>
-        <version>0.4.2</version>
-    </dependency>
+```xml
+<dependency>
+    <groupId>us.codecraft</groupId>
+    <artifactId>webmagic-core</artifactId>
+    <version>0.4.3</version>
+</dependency>
+<dependency>
+    <groupId>us.codecraft</groupId>
+    <artifactId>webmagic-extension</artifactId>
+    <version>0.4.3</version>
+</dependency>
+```
+
+WebMagic uses slf4j-log4j12 as its slf4j implementation. If you provide your own slf4j implementation, exclude this dependency from your project.
+
+```xml
+<exclusions>
+    <exclusion>
+        <groupId>org.slf4j</groupId>
+        <artifactId>slf4j-log4j12</artifactId>
+    </exclusion>
+</exclusions>
+```
 #### Project structure
@@ -68,11 +87,7 @@ webmagic还包含两个可用的扩展包,因为这两个包都依赖了比较
 ### Without maven
-If you do not use maven, you can download this binary package (thanks to [oschina](http://www.oschina.net/)):
-
-    git clone http://git.oschina.net/flashsword20/webmagic-bin.git
-
-The **bin/lib** directory contains every jar the project depends on; import them directly in your IDE.
+The project's **lib** directory contains every dependency jar; import them directly in your IDE.
 ### Your first crawler
@@ -80,32 +95,34 @@ webmagic还包含两个可用的扩展包,因为这两个包都依赖了比较
 PageProcessor is part of webmagic-core; implement a custom PageProcessor to define your own crawler logic. The following code crawls osc blogs:
-    public class OschinaBlogPageProcesser implements PageProcessor {
+```java
+public class OschinaBlogPageProcesser implements PageProcessor {
-        private Site site = Site.me().setDomain("my.oschina.net")
-           .addStartUrl("http://my.oschina.net/flashsword/blog");
+    private Site site = Site.me().setDomain("my.oschina.net");
-        @Override
-        public void process(Page page) {
-            List<String> links = page.getHtml().links().regex("http://my\\.oschina\\.net/flashsword/blog/\\d+").all();
-            page.addTargetRequests(links);
-            page.putField("title", page.getHtml().xpath("//div[@class='BlogEntity']/div[@class='BlogTitle']/h1").toString());
-            page.putField("content", page.getHtml().$("div.content").toString());
-            page.putField("tags",page.getHtml().xpath("//div[@class='BlogTags']/a/text()").all());
-        }
-
-        @Override
-        public Site getSite() {
-            return site;
-
-        }
-
-        public static void main(String[] args) {
-            Spider.create(new OschinaBlogPageProcesser())
-                 .pipeline(new ConsolePipeline()).run();
-        }
+    @Override
+    public void process(Page page) {
+        List<String> links = page.getHtml().links().regex("http://my\\.oschina\\.net/flashsword/blog/\\d+").all();
+        page.addTargetRequests(links);
+        page.putField("title", page.getHtml().xpath("//div[@class='BlogEntity']/div[@class='BlogTitle']/h1").toString());
+        page.putField("content", page.getHtml().$("div.content").toString());
+        page.putField("tags",page.getHtml().xpath("//div[@class='BlogTags']/a/text()").all());
     }
+    @Override
+    public Site getSite() {
+        return site;
+
+    }
+
+    public static void main(String[] args) {
+        Spider.create(new OschinaBlogPageProcesser()).addUrl("http://my.oschina.net/flashsword/blog")
+             .addPipeline(new ConsolePipeline()).run();
+    }
+}
+```
+
+
 Here page.addTargetRequests() adds URLs to crawl and page.putField() stores extraction results. page.getHtml().xpath() extracts results according to a rule, and extraction supports chained calls. At the end of a chain, toString() converts the result to a single String and all() converts it to a list of Strings.
 Spider is the entry class of the crawler. Pipeline is the interface for outputting and persisting results; ConsolePipeline here prints results to the console.
@@ -116,24 +133,26 @@ Spider是爬虫的入口类。Pipeline是结果输出和持久化的接口,这
 webmagic-extension supports writing crawlers with annotations: adding annotations to a POJO is enough to build a crawler. The following code also crawls oschina blogs and does exactly the same as OschinaBlogPageProcesser:
-    @TargetUrl("http://my.oschina.net/flashsword/blog/\\d+")
-    public class OschinaBlog {
+```java
+@TargetUrl("http://my.oschina.net/flashsword/blog/\\d+")
+public class OschinaBlog {
-        @ExtractBy("//title")
-        private String title;
+    @ExtractBy("//title")
+    private String title;
-        @ExtractBy(value = "div.BlogContent",type = ExtractBy.Type.Css)
-        private String content;
+    @ExtractBy(value = "div.BlogContent",type = ExtractBy.Type.Css)
+    private String content;
-        @ExtractBy(value = "//div[@class='BlogTags']/a/text()", multi = true)
-        private List<String> tags;
+    @ExtractBy(value = "//div[@class='BlogTags']/a/text()", multi = true)
+    private List<String> tags;
-        public static void main(String[] args) {
-            OOSpider.create(
-                Site.me().addStartUrl("http://my.oschina.net/flashsword/blog"),
-                new ConsolePageModelPipeline(), OschinaBlog.class).run();
-        }
-    }
+    public static void main(String[] args) {
+        OOSpider.create(
+            Site.me(),
+            new ConsolePageModelPipeline(), OschinaBlog.class).addUrl("http://my.oschina.net/flashsword/blog").run();
+    }
+}
+```
 This example defines a Model class whose fields 'title', 'content', and 'tags' are the attributes to extract. The class can be reused in a Pipeline.
@@ -145,10 +164,43 @@ webmagic-extension包括了注解方式编写爬虫的方法,只需基于一
 The webmagic-samples directory contains examples of custom PageProcessors that extract data from different sites.
-The author also has a project, [JobHunter](http://git.oschina.net/flashsword20/jobhunter), that uses webmagic for extraction and persists the results to a database. It integrates Spring, defines a custom Pipeline, and uses mybatis for persistence.
+For a usage example, see [oschina openapi application: blog migration](http://my.oschina.net/oscfox/blog/194507)
+
 ### License
 webmagic is released under the [Apache 2.0 License](http://opensource.org/licenses/Apache-2.0)
+### Contributors:
+The following people have contributed code or issues to WebMagic:
+
+* [ccliangbo](https://github.com/ccliangbo)
+* [yuany](https://github.com/yuany)
+* [yxssfxwzy](https://github.com/yxssfxwzy)
+* [linkerlin](https://github.com/linkerlin)
+* [d0ngw](https://github.com/d0ngw)
+* [xuchaoo](https://github.com/xuchaoo)
+* [supermicah](https://github.com/supermicah)
+* [SimpleExpress](https://github.com/SimpleExpress)
+* [aruanruan](https://github.com/aruanruan)
+* [l1z2g9](https://github.com/l1z2g9)
+* [zhegexiaohuozi](https://github.com/zhegexiaohuozi)
+* [ywooer](https://github.com/ywooer)
+* [yyw258520](https://github.com/yyw258520)
+* [perfecking](https://github.com/perfecking)
+* [lidongyang](http://my.oschina.net/lidongyang)
+* [seveniu](https://github.com/seveniu)
+* [sebastian1118](https://github.com/sebastian1118)
+
+### Mailing lists:
+
+Gmail:
+[https://groups.google.com/forum/#!forum/webmagic-java](https://groups.google.com/forum/#!forum/webmagic-java)
+
+QQ:
+[http://list.qq.com/cgi-bin/qf_invite?id=023a01f505246785f77c5a5a9aff4e57ab20fcdde871e988](http://list.qq.com/cgi-bin/qf_invite?id=023a01f505246785f77c5a5a9aff4e57ab20fcdde871e988)
+
+### QQ group:
+
+373225642
diff --git a/zh_docs/user-manual-new.md b/zh_docs/user-manual-new.md
new file mode 100644
index 0000000..a8ae5c2
--- /dev/null
+++ b/zh_docs/user-manual-new.md
@@ -0,0 +1,410 @@
+WebMagic in Action
+========
+
+WebMagic is a simple, flexible crawler framework that is easy to extend. Besides making it easy to write a crawler, WebMagic provides multithreading and basic distributed crawling.
+
+You can use WebMagic directly to develop crawlers, or customize it to fit the needs of a complex project.
+
+## 1. Using WebMagic in your project
+
+WebMagic consists mainly of two jars: `webmagic-core-{version}.jar` and `webmagic-extension-{version}.jar`. Add both as dependencies of your project to use WebMagic.
+
+### 1.1 With Maven
+
+WebMagic is built with Maven, and Maven is the recommended way to install it. Add the following coordinates to your project:
+
+```xml
+<dependency>
+    <groupId>us.codecraft</groupId>
+    <artifactId>webmagic-extension</artifactId>
+    <version>0.4.3</version>
+</dependency>
+```
+
+WebMagic uses slf4j-log4j12 as its slf4j implementation. If you provide your own slf4j implementation, exclude this dependency from your project.
+
+```xml
+<dependency>
+    <groupId>us.codecraft</groupId>
+    <artifactId>webmagic-extension</artifactId>
+    <version>0.4.3</version>
+    <exclusions>
+        <exclusion>
+            <groupId>org.slf4j</groupId>
+            <artifactId>slf4j-log4j12</artifactId>
+        </exclusion>
+    </exclusions>
+</dependency>
+```
+
+### 1.2 Without Maven
+
+If you do not use Maven, you can download a version that bundles the binary jars (thanks to [oschina](http://www.oschina.net/)):
+
+    git clone http://git.oschina.net/flashsword20/webmagic.git
+
+The **lib** directory contains every jar the project depends on; add these jars to Libraries in your IDE.
+
+![import jars](http://static.oschina.net/uploads/space/2014/0403/143318_gBQE_190591.jpeg)
+
+### 1.3 Your first project
+
+Once the WebMagic dependency is in your project, you can start writing your first crawler. Here is an example that crawls Github information:
+
+```java
+import us.codecraft.webmagic.Page;
+import us.codecraft.webmagic.Site;
+import us.codecraft.webmagic.Spider;
+import us.codecraft.webmagic.processor.PageProcessor;
+
+public class GithubRepoPageProcessor implements PageProcessor {
+
+    private Site site = Site.me().setRetryTimes(3).setSleepTime(100);
+
+    @Override
+    public void process(Page page) {
+        page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
+        page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
+        page.putField("name", page.getHtml().xpath("//h1[@class='entry-title public']/strong/a/text()").toString());
+        if (page.getResultItems().get("name")==null){
+            //skip this page
+            page.setSkip(true);
+        }
+        page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));
+    }
+
+    @Override
+    public Site getSite() {
+        return site;
+    }
+
+    public static void main(String[] args) {
+        Spider.create(new GithubRepoPageProcessor()).addUrl("https://github.com/code4craft").thread(5).run();
+    }
+}
+```
+
+Click the main method and choose "Run"; the crawler should now be working.
+ +![runlog](http://static.oschina.net/uploads/space/2014/0403/103741_3Gf5_190591.png) + +
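The `author` field in the example above comes from applying the pattern `https://github\\.com/(\\w+)/.*` to the page URL. The sketch below tries that same pattern with plain `java.util.regex`, outside WebMagic, to show which URLs yield an author. The class `AuthorRegexSketch` and its method `author` are hypothetical names used only for illustration, and returning `null` for a non-matching URL is a choice of this sketch, not a statement about what WebMagic's `regex(...)` returns.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone helper for illustration; not part of WebMagic.
class AuthorRegexSketch {

    // Same pattern the example applies to page.getUrl().
    static final Pattern AUTHOR = Pattern.compile("https://github\\.com/(\\w+)/.*");

    // Returns the first path segment after github.com (capture group 1),
    // or null when the URL does not match the pattern.
    static String author(String url) {
        Matcher m = AUTHOR.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // A repository URL has a "/" after the user name, so it matches.
        System.out.println(author("https://github.com/code4craft/webmagic"));
        // A bare profile URL has no "/" after the user name, so it does not match.
        System.out.println(author("https://github.com/code4craft"));
    }
}
```

In this sketch the start URL `https://github.com/code4craft` produces no author because the pattern requires a slash after the user name; only repository-style URLs do.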
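+
+## 2. Downloading and compiling the source
+
+WebMagic is a pure Java project. If you are familiar with Maven, downloading and compiling the source is straightforward. If you are not, this section also explains how to import the project into Eclipse.
+
+### 2.1 Downloading the source
+
+WebMagic currently has two repositories:
+
+* [https://github.com/code4craft/webmagic](https://github.com/code4craft/webmagic)
+
+The github repository holds the latest version, and all issues and pull requests live there. If you like the project, don't forget to give it a star!
+
+* [http://git.oschina.net/flashsword20/webmagic](http://git.oschina.net/flashsword20/webmagic)
+
+This repository contains all compiled dependency jars and only holds stable versions of the project; the latest version is still updated on github. oschina is more stable within China and serves mainly as a mirror.
+
+For either repository, run
+
+    git clone https://github.com/code4craft/webmagic.git
+
+or
+
+    git clone http://git.oschina.net/flashsword20/webmagic.git
+
+to download the latest code.
+
+If you are not familiar with git itself, see @黄勇's [从 Git@OSC 下载 Smart 源码](http://my.oschina.net/huangyong/blog/200075) (Downloading the Smart source from Git@OSC)
+
+### 2.2 Importing the project
+
+Intellij Idea ships with Maven support by default; choose Maven project when importing.
+
+#### 2.2.1 Using the m2e plugin
+
+Eclipse users are advised to install the m2e plugin, available at [https://www.eclipse.org/m2e/download/](https://www.eclipse.org/m2e/download/)
+
+After installing it, choose Maven->Existing Maven Projects under File->Import to import the project.
+
+![m2e-import](http://static.oschina.net/uploads/space/2014/0403/104427_eNuc_190591.png)
+
+When the project selection dialog appears, click finish.
+
+![m2e-import2](http://static.oschina.net/uploads/space/2014/0403/104735_6vwG_190591.png)
+
+#### 2.2.2 Using the Maven Eclipse plugin
+
+Without the m2e plugin, importing is still easy as long as Maven is installed. In the project root, run:
+
+    mvn eclipse:eclipse
+
+This generates Eclipse configuration files for the Maven project structure. Then choose General->Existing Projects into Workspace under File->Import to import the project.
+
+![eclipse-import-1](http://static.oschina.net/uploads/space/2014/0403/100025_DAcy_190591.png)
+
+When the project selection dialog appears, click finish.
+
+![eclipse-import-2](http://static.oschina.net/uploads/space/2014/0403/100227_73DJ_190591.png)
+
+### 2.3 Compiling and running the source
+
+After a successful import there should be no compile errors. You can now run the example bundled with the webmagic-core project: "us.codecraft.webmagic.processor.example.GithubRepoPageProcessor".
+
+As before, console output like the following means the source compiled and ran successfully.
+
+![runlog](http://static.oschina.net/uploads/space/2014/0403/103741_3Gf5_190591.png)
+
+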
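+
+## 3. A basic crawler
+
+### 3.1 Implementing PageProcessor
+
+In WebMagic, a basic crawler takes a single class that implements the `PageProcessor` interface. This class holds essentially all the code you need to write to crawl a site.
+
+Taking the earlier `GithubRepoPageProcessor` as an example, I divide the customization of a PageProcessor into three parts: crawler configuration, extraction of page elements, and link discovery.
+
+```java
+public class GithubRepoPageProcessor implements PageProcessor {
+
+    // Part 1: site-level crawl configuration, including encoding, crawl interval, retry count, etc.
+    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);
+
+    @Override
+    // process is the core interface for custom crawler logic; write the extraction logic here
+    public void process(Page page) {
+        // Part 2: define how to extract page information and store it
+        page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
+        page.putField("name", page.getHtml().xpath("//h1[@class='entry-title public']/strong/a/text()").toString());
+        if (page.getResultItems().get("name") == null) {
+            //skip this page
+            page.setSkip(true);
+        }
+        page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));
+
+        // Part 3: discover follow-up URLs on the page to crawl
+        page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
+    }
+
+    @Override
+    public Site getSite() {
+        return site;
+    }
+
+    public static void main(String[] args) {
+
+        Spider.create(new GithubRepoPageProcessor())
+                // start crawling from "https://github.com/code4craft"
+                .addUrl("https://github.com/code4craft")
+                // crawl with 5 threads
+                .thread(5)
+                // start the crawler
+                .run();
+    }
+}
+```
+
+#### 3.1.1 Crawler configuration
+
+Part 1 covers crawler configuration: encoding, crawl interval, timeout, retry count, and so on, as well as parameters for imitating a browser such as User Agent and cookie, plus proxy settings. These are covered in chapter 5, "Configuring the crawler". For now we use a simple setup: 3 retries and a crawl interval of one second.
+
+#### 3.1.2 Extracting page elements
+
+Part 2 is the core of the crawler: given a downloaded Html page, how do you extract the information you want? WebMagic mainly uses three extraction techniques: XPath, regular expressions, and CSS selectors.
+
+1. XPath
+
+    XPath was originally a query language for selecting elements in XML, but it is convenient for Html as well. For example:
+
+    ```java
+    page.getHtml().xpath("//h1[@class='entry-title public']/strong/a/text()")
+    ```
+    This code uses XPath. It means: "find every h1 element whose class attribute is 'entry-title public', go to the a child of its strong child, and extract the text of that a node".
+The corresponding Html looks like this:
+
+    ![xpath-html](http://static.oschina.net/uploads/space/2014/0404/104607_Aqq8_190591.png)
+
+2. CSS selectors
+
+    CSS selectors are a language similar to XPath. Anyone who has done front-end development will recognize what $('h1.entry-title') means. Objectively, they are somewhat simpler to write than XPath, but more cumbersome for complex extraction rules.
+
+3. Regular expressions
+
+    Regular expressions are a general-purpose language for extracting text.
+
+    ```java
+    page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
+    ```
+
+    This code uses a regular expression; it matches all links of the form "https://github.com/code4craft/webmagic".
+
+The detailed usage of XPath, CSS selectors, and regular expressions is covered in chapter 4, "Extraction tools in detail".
+
+#### 3.1.3 Link discovery
+
+With the page-processing logic in place, the crawler is nearly done.
+
+One problem remains: a site has many pages and we cannot list them all up front, so discovering follow-up links is an indispensable part of a crawler.
+
+```java
+page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
+```
+
+This code has two parts: `page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all()` gets all links that match the regular expression "(https://github\\.com/\\w+/\\w+)", and `page.addTargetRequests()` adds those links to the queue of pages to crawl.
+
+### 3.2 Using the chained Selectable API
+
+The chained API around `Selectable` is a core feature of WebMagic. With the Selectable interface you can extract page elements through chained calls without worrying about the details of extraction.
+
+As the example above shows, page.getHtml() returns an `Html` object, which implements the `Selectable` interface. The interface has several important methods, which I divide into two groups: extraction and result retrieval.
+
+#### 3.2.1 Extraction API:
+
+| Method | Description | Example |
+| ------------ | ------------- | ------------ |
+| xpath(String xpath) | Select with XPath | html.xpath("//div[@class='title']") |
+| \$(String selector) | Select with a Css selector | html.\$("div.title") |
+| \$(String selector,String attr) | Select with a Css selector | html.\$("div.title","text") |
+| css(String selector) | Same as $(); select with a Css selector | html.css("div.title") |
+| links() | Select all links | html.links() |
+| regex(String regex) | Extract with a regular expression | html.regex("\(.\*?)\") |
+| regex(String regex,int group) | Extract with a regular expression and specify the capture group | html.regex("\(.\*?)\",1) |
+| replace(String regex, String replacement) | Replace content | html.replace("\","")|
+
+Each of these extraction methods returns a `Selectable`, which means extraction supports chained calls. The following example shows how to use the chained API.
+
+Suppose I want to crawl all Java projects on github. These projects appear in the search results at [https://github.com/search?l=Java&p=1&q=stars%3A%3E1&s=stars&type=Repositories](https://github.com/search?l=Java&p=1&q=stars%3A%3E1&s=stars&type=Repositories).
+
+To keep the crawl from ranging too widely, I only take links from the pagination section. This is a fairly complex rule, so how should it be written?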
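+
+![selectable-chain-ui](http://static.oschina.net/uploads/space/2014/0404/151454_2T01_190591.png)
+
+First, the html structure of the page looks like this:
+
+![selectable-chain](http://static.oschina.net/uploads/space/2014/0404/151632_88Oq_190591.png)
+
+So I can first extract this div with a CSS selector and then take all of its links. To be safe, I also use a regular expression to constrain the format of the extracted URLs. The final code looks like this:
+
+```java
+List<String> urls = page.getHtml().css("div.pagination").links().regex(".*/search/\\?l=java.*").all();
+```
+
+Then we can add these URLs to the crawl list:
+
+```java
+List<String> urls = page.getHtml().css("div.pagination").links().regex(".*/search/\\?l=java.*").all();
+page.addTargetRequests(urls);
+```
+
+Fairly simple, isn't it? Beyond discovering links, chained extraction with Selectable can do much more, which chapter 9, "Examples", returns to.
+
+#### 3.2.2 Result retrieval API:
+
+When a chain of calls ends, we usually want a string result, and this is where the result retrieval API comes in. Any extraction rule, whether XPath, CSS selector, or regular expression, may match multiple elements. WebMagic unifies these cases, and different methods return one element or several.
+
+| Method | Description | Example |
+| ------------ | ------------- | ------------ |
+| get() | Returns a single String result | String link= html.links().get()|
+| toString() | Same as get(); returns a single String result | String link= html.links().toString()|
+| all() | Returns all extracted results | List links= html.links().all()|
+| match() | Whether there is any match | if (html.links().match()){ xxx; }|
+
+For example, when we know the page yields only one result, selectable.get() or selectable.toString() returns it.
+
+selectable.toString() reuses the toString() method to make output and integration with some frameworks more convenient, since in most cases we only need a single element.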
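+
+selectable.all() returns all elements.
+
+Now look back at the GithubRepoPageProcessor in 3.1; it should be clearer. Run the main method and the crawl results are printed to the console.
+
+### 3.3 Saving results
+
+The crawler is written, but one question remains: how do I save the crawled results? The WebMagic component that saves results is called a `Pipeline`. Printing results to the console is itself done by a built-in Pipeline called `ConsolePipeline`. To save the results in Json format instead, I only need to switch the Pipeline implementation to "JsonFilePipeline".
+
+```java
+    public static void main(String[] args) {
+
+        Spider.create(new GithubRepoPageProcessor())
+                // start crawling from "https://github.com/code4craft"
+                .addUrl("https://github.com/code4craft")
+                .addPipeline(new JsonFilePipeline("D:\\webmagic\\"))
+                // crawl with 5 threads
+                .thread(5)
+                // start the crawler
+                .run();
+    }
+```
+
+The downloaded files are then saved in the webmagic directory on drive D.
+
+By customizing the Pipeline we can also save results to files, databases, and so on. This is covered in chapter 7, "Handling extraction results".
+
+At this point we have written a basic crawler with some customization.
+
+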
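The link-discovery pattern from 3.1.3, `(https://github\\.com/\\w+/\\w+)`, can be tried in isolation before wiring it into a crawler. The sketch below does this with plain `java.util.regex` and no WebMagic classes. The class `LinkRegexSketch` and its method `extract` are hypothetical names, and the find-then-take-group-1 behaviour is an assumption of this sketch about how the pattern is applied, not something the manual confirms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone helper for illustration; not part of WebMagic.
class LinkRegexSketch {

    // Same pattern as in section 3.1.3.
    static final Pattern REPO_LINK = Pattern.compile("(https://github\\.com/\\w+/\\w+)");

    // For each candidate URL, keeps capture group 1 of the first match and
    // drops candidates without a match (assumed find-style semantics).
    static List<String> extract(List<String> candidates) {
        List<String> result = new ArrayList<>();
        for (String url : candidates) {
            Matcher m = REPO_LINK.matcher(url);
            if (m.find()) {
                result.add(m.group(1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
                "https://github.com/code4craft/webmagic",        // user/repo: kept
                "https://github.com/code4craft",                 // user only: dropped
                "https://github.com/code4craft/webmagic/issues"); // kept, cut down to user/repo
        System.out.println(extract(links));
    }
}
```

Note that `\w` does not match `-`, so under these assumptions a link such as `https://github.com/code4craft/webmagic-core` would be cut at the hyphen. Whether that matters depends on the repository names being crawled.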
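+
+## 4. Extraction tools in detail
+
+### 4.1 XPath
+
+### 4.2 CSS selectors
+
+### 4.3 Regular expressions
+
+### 4.4 JsonPath
+
+## 5. Configuring the crawler
+
+### 5.1 Crawl rate
+
+### 5.2 Encoding
+
+### 5.3 Proxy
+
+### 5.4 Setting cookie/UA and other http headers
+
+### 5.5 Retry mechanism
+
+### 5.6 Multithreading
+
+## 6. Starting and stopping the crawler
+
+### 6.1 Starting the crawler
+
+### 6.2 Stopping the crawler
+
+### 6.3 Setting the run time
+
+### 6.4 Scheduled crawling
+
+## 7. Handling extraction results
+
+### 7.1 Output to the console
+
+### 7.2 Saving to a file
+
+### 7.3 JSON output
+
+### 7.4 Custom persistence (mysql/mongodb…)
+
+## 8. Managing URLs
+
+### 8.1 Adding URLs manually
+
+### 8.2 Storing information in the URL
+
+### 8.3 URL management approaches
+
+### 8.4 Managing the crawler's URLs yourself
+
+## 9. Examples
+
+### 9.1 Basic list + detail page crawling
+
+### 9.2 Crawling dynamic pages
+
+### 9.3 Paginated crawling
+
+### 9.4 Scheduled crawling
\ No newline at end of file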