Commit Graph

599 Commits (884f51ba3bf336cc79b1487ca4faef644fe4bd76)

Author SHA1 Message Date
yihua.huang 5706bb90af update xsoup to 0.3.1 2016-01-20 12:59:11 +08:00
yihua.huang 7586e3d75c add some test for github repo downloader 2016-01-19 08:05:53 +08:00
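x1ny 90e14b31b0 Fix FileCacheQueueScheduler preventing the program from exiting normally and leaving streams unclosed
FileCacheQueueScheduler starts a thread that runs periodically to save data, but the thread is not stopped after the crawler finishes, so the program cannot exit; the IO streams are also never closed.

Solution:
Make FileCacheQueueScheduler implement the Closable interface, and stop the thread and close the streams in its close method.
Add a call that closes the scheduler to Spider's close method.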
2015-11-12 23:10:20 +08:00
yihua.huang 56e0cd513a compile error fix 2015-04-15 23:21:06 +08:00
yihua.huang c5740b1840 change assert #200 2015-04-15 08:32:08 +08:00
yihua.huang 67eb632f4d test for issue #200 2015-04-15 08:31:45 +08:00
高军 590561a6e4 Fix bug where site.setHttpProxy() has no effect 2015-03-09 15:54:15 +08:00
edwardsbean 19474e4716 add SimpleProxyPool and IProxyPool 2015-02-28 17:50:10 +08:00
edwardsbean 4978665633 add retry sleep time 2015-01-21 13:30:02 +08:00
yihua.huang 8ffc1a7093 add NPE check for POST method 2015-01-13 14:10:00 +08:00
zhugw bc666e927d Update Site.java
The javadoc of setCycleRetryTimes says: Set cycleRetryTimes times when download fail, 0 by default. Only work in RedisScheduler.
However, the source code does not appear to enforce any such restriction to RedisScheduler, so is this javadoc out of date?
2014-09-12 12:42:57 +08:00
yihua.huang 147401ce5e remove duplicate setPath in ProxyPool 2014-09-09 22:58:44 +08:00
yihua.huang e7668e01b8 fix SourceRegion error and add some tests on it #144 2014-08-21 14:29:06 +08:00
yihua.huang 4446669c24 fix test 2014-08-18 10:54:24 +08:00
yihua.huang 9866297ec4 Disable jsoup entity escape by Default. Set Html.DISABLE_HTML_ENTITY_ESCAPE to false to enable it. #149 2014-08-14 08:04:56 +08:00
yihua.huang 4e6e946dd7 more friendly exception message in PlainText #144 2014-08-13 10:02:16 +08:00
yihua.huang af9939622b move thread package out of selector (because it is add by mistake at the beginning) 2014-06-25 18:19:50 +08:00
yihua.huang eae37c868b new sample 2014-06-10 17:38:54 +08:00
yihua.huang b3a282e58d some fix for tests #130 2014-06-10 00:05:30 +08:00
yihua.huang 074d767f45 Merge branch 'proxy' of github.com:yxssfxwzy/webmagic into yxssfxwzy-proxy 2014-06-09 23:51:36 +08:00
zwf 2f89cfc31a add test and fix bug of proxy module 2014-06-09 13:32:02 +08:00
yihua.huang eb89d66566 fix test 2014-06-04 22:28:27 +08:00
yihua.huang 5e8ca02ec6 contributor 2014-06-04 22:26:56 +08:00
yihua.huang 8c33be48a6 Merge branch 'stable' of github.com:code4craft/webmagic 2014-06-04 17:37:45 +08:00
yihua.huang 5f8c3fd5c5 update version 2014-06-04 17:33:30 +08:00
yihua.huang 7a64847a3c Bugfix: selector does not works well in element #113 2014-06-03 20:03:33 +08:00
yihua.huang 8d67fd0357 change back return proxy from spider to httpclientdownloader #128 2014-05-28 08:08:51 +08:00
yihua.huang 40bf8ca58f change return proxy from spider to httpclientdownloader #128 2014-05-28 07:57:42 +08:00
yihua.huang 1f21d9cc14 spell mistake fix #128 2014-05-28 07:29:19 +08:00
Yihua Huang e310139d00 Merge pull request #128 from yxssfxwzy/proxy
Management of multiple proxies
2014-05-28 07:22:08 +08:00
yihua.huang b165090434 Bugfix:Type convert error in JsonPathSelector #129 2014-05-27 21:19:22 +08:00
yihua.huang a5d1b56e44 fix ut #113 2014-05-27 18:07:53 +08:00
yihua.huang 3939074a23 Bugfix: nodes() only return the first element #113 2014-05-27 17:53:06 +08:00
yihua.huang 41c2ea9498 refactor of selectable cont' #113
1. remove lazy init of Html
2. rename strings to sourceTexts for better meaning
3. make getSourceTexts abstract and DO NOT always store strings
4. instead store parsed elements of document in HtmlNode
2014-05-27 17:34:19 +08:00
yihua.huang f9825c214a refactor selectable for html fragment #113 2014-05-27 16:00:51 +08:00
yihua.huang 03d26c169b Enhance auto charset detect #126
1. Only read from content once to fix stream closed exception
2. invite moco as server test
2014-05-26 17:45:30 +08:00
zwf c146e2c7b4 add proxy pool 2014-05-19 15:59:31 +08:00
yihua.huang 21982d3460 remove cpdetector temporary #126 2014-05-14 23:52:27 +08:00
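fengwuze fcbfb75608 Refactor the code block that automatically detects the charset from the web page, extracting it into a separate method. 2014-05-14 19:14:42 +08:00
fengwuze 95494d3c4d Add logic for handling meta.
Remaining issues:
3. When the web page does not specify an encoding, cpdetector is needed, but cpdetector is currently not available in the Maven Central repository, and it is unclear how to resolve this.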
2014-05-14 14:53:54 +08:00
yihua.huang dde2d89bbe Ignore content in json when bracket when remove padding #124 2014-05-08 23:37:18 +08:00
ywooer 259f0a16c5 Update FilePipeline.java 2014-05-06 18:33:00 +08:00
ywooer 26d38851b5 add charset to Writer 2014-05-06 18:28:50 +08:00
yihua.huang 7668731f08 update version to snapshot 2014-05-05 07:03:55 +08:00
yihua.huang 182dd51689 Merge branch 'stable' of github.com:code4craft/webmagic 2014-05-03 06:19:11 +08:00
yihua.huang 81e6e772ac versions back to 0.5.1 2014-05-03 06:18:57 +08:00
yihua.huang feb604da87 Merge branch 'stable' of github.com:code4craft/webmagic 2014-05-03 06:14:54 +08:00
yihua.huang 358e906379 [maven-release-plugin] prepare for next development iteration 2014-05-03 00:00:13 +08:00
yihua.huang 470750fc0d [maven-release-plugin] prepare release WebMagic-0.5.1 2014-05-02 23:59:55 +08:00
yihua.huang 01aec7e1ab extension point of geturl #118 2014-05-02 23:23:23 +08:00
yihua.huang ec1c2e8cbc test and so on 2014-05-02 23:19:11 +08:00
yihua.huang 4f22f1210e some bug fix #118 2014-05-02 20:38:49 +08:00
yihua.huang 56f033ce8d set setDuplicateRemover for chain api #118 2014-05-02 20:21:23 +08:00
yihua.huang d1140b9e29 add bloom filter for scheduler #118 2014-05-02 20:20:22 +08:00
yihua.huang 8e4814bdc5 fix path separator 2014-05-02 17:06:34 +08:00
yihua.huang e8d4a9be2b fix remove duplicate error #117 2014-04-29 20:32:06 +08:00
yihua.huang 04ade75606 Merge branch 'stable' of github.com:code4craft/webmagic
Conflicts:
	README.md
	pom.xml
	webmagic-avalon/pom.xml
	webmagic-core/pom.xml
	webmagic-extension/pom.xml
	webmagic-lucene/pom.xml
	webmagic-samples/pom.xml
	webmagic-saxon/pom.xml
	webmagic-scripts/pom.xml
	webmagic-selenium/pom.xml
2014-04-27 15:03:15 +08:00
yihua.huang a08d8cb167 update version 2014-04-27 14:59:48 +08:00
yihua.huang 42a2676e8c update version 2014-04-27 14:56:21 +08:00
yihua.huang c25b32f1ca [maven-release-plugin] prepare for next development iteration 2014-04-27 12:52:27 +08:00
yihua.huang 7ff83bb11a [maven-release-plugin] prepare release WebMagic-0.5.0 2014-04-27 12:52:12 +08:00
yihua.huang 1104122979 more abstraction in scheduler 2014-04-27 09:30:01 +08:00
yihua.huang 2770811a10 update monitor example 2014-04-26 11:24:22 +08:00
yihua.huang 5ecd909ef2 add timeout for wait/notify #111 2014-04-25 19:41:36 +08:00
yihua.huang c7afdb516e remove thread utils #110 2014-04-25 18:44:45 +08:00
yihua.huang 17e95f2a7f comments 2014-04-25 18:39:01 +08:00
yihua.huang 05eb7831b6 refactor and comments #110 2014-04-25 18:27:40 +08:00
yihua.huang 375e64e845 more monitor status 2014-04-25 18:10:14 +08:00
yihua.huang 018061d2cd fix error in thread pool 2014-04-25 18:01:02 +08:00
yihua.huang cdc423f2bf log 2014-04-25 17:41:41 +08:00
yihua.huang c6661899fd new thread pool #110 2014-04-25 17:33:48 +08:00
yihua.huang 179baa7a22 return when page is null 2014-04-25 16:07:41 +08:00
yihua.huang 0336f4cdb4 remove IllegalStateException when download error for less error log 2014-04-25 16:06:29 +08:00
yihua.huang 11ba5beb42 [refactor]move monitor to webmagic-extension #98 2014-04-25 13:17:13 +08:00
yihua.huang d61f65cef8 update mbean to mxbean #98 2014-04-25 11:31:43 +08:00
yihua.huang ad6a273b12 update test url 2014-04-25 11:28:35 +08:00
yihua.huang 30af23d003 split monitor to server and client mode #98 2014-04-25 11:25:52 +08:00
yihua.huang ced79630d3 specify jndi and jmx #98 2014-04-25 11:11:15 +08:00
yihua.huang 95d3802e77 add formdata support for post request #108 2014-04-24 11:48:58 +08:00
yihua.huang f49bb877c8 clean some code #109 2014-04-24 11:38:13 +08:00
yihua.huang e1aaf1dd11 fix mistake of guava Table #109 2014-04-24 11:05:49 +08:00
yihua.huang 8ba2da146c request method #108 and more cookie #109 config 2014-04-24 10:51:37 +08:00
yihua.huang b06aa489fb [BugFix]Only one url from sourceRegion can be extracted #107 2014-04-18 17:48:26 +08:00
Bo LIANG 08fa3b01c1 when download error, throw an exception instead of calling onError and returning peacefully. #105 2014-04-17 17:53:12 +08:00
yihua.huang 27b37e8164 extension point and sample for JMX support #98 2014-04-17 08:12:37 +08:00
yihua.huang a5db6cf292 some monitor and JMX support #98 2014-04-17 00:35:09 +08:00
yihua.huang f39aa435cf add null check #104 2014-04-16 19:46:32 +08:00
yihua.huang 42bbe40a37 [Bugfix]Urls will be lost when call setScheduler() #104 2014-04-16 19:45:17 +08:00
Bo LIANG 163773af6b combine two try-catch block into one, make it cleaner. 2014-04-16 16:05:08 +08:00
yihua.huang ec446277b1 some refactor in httpclientdownloader 2014-04-15 15:30:37 +08:00
yihua.huang a03f6a8431 eclipse project 2014-04-15 07:44:43 +08:00
yihua.huang 4a035e729a extension point for LocalDuplicatedRemovedScheduler #95 2014-04-13 23:31:13 +08:00
yihua.huang b249e49748 [Bugfix]loop error when add TargetRequest #99 2014-04-13 23:04:09 +08:00
Yihua Huang da2f023c12 Merge pull request #96 from ouyanghuangzheng/master
Modified several comments in Spider and site
2014-04-13 13:12:12 +08:00
yihua.huang f7950ebcab fix tests 2014-04-13 13:00:31 +08:00
愤怒的番茄 32ba1b8889 Fix several comment issues 2014-04-13 12:41:15 +08:00
yihua.huang 84b897f83b update AngularJSProcessor 2014-04-13 12:20:57 +08:00
yihua.huang 03c251237b add Json parse support 2014-04-13 10:23:00 +08:00
愤怒的番茄 644e8d1f72 Sync with the official source code 2014-04-12 22:32:22 +08:00
yihua.huang 969ad1766b change logger style to slf4j for cleaner code 2014-04-06 21:32:20 +08:00
yihua.huang 9b2cb43f47 ConfigurablePageProcessor #91 2014-04-05 23:40:10 +08:00
Bo LIANG b043ac76d6 change the formatter of log.
To use slf4j, we should insert {} into the formatter string.
2014-04-05 11:31:56 +08:00
yihua.huang 7aaf837e15 change logger to slf4j style for performance #84 2014-04-04 20:10:00 +08:00
yihua.huang f9b157951d Merge branch 'master' of github.com:code4craft/webmagic 2014-04-04 20:01:14 +08:00
yihua.huang 22c394e629 [doc] 2014-04-04 20:00:58 +08:00
Bo LIANG 762a3973fd Modify the log levels of LocalDuplicatedRemovedScheduler.java
The old version will print a debug level log each time the push method is
called. So sometimes, when a html page have multiple links for the same
page, the debug log will appears more than once. Also, when we meet a
duplicate URL, it will also print a log, which will be confusing.
I change the level of it to trace. And each time a URL is really push into
queue, print a debug level log.
2014-04-04 15:53:46 +08:00
yihua.huang a1c7e826f7 fix dep of slf4j-log4j12 2014-04-03 23:04:31 +08:00
yihua.huang 01848301d4 encode illegal characters in url #80 2014-04-01 22:14:30 +08:00
yihua.huang 2780423e60 enable blank space in quotes in UrlUtils.fixAllRelativeHrefs #80 2014-04-01 20:35:11 +08:00
yihua.huang 97b6f46280 Bugfix: break loop in addTargetRequests #81 2014-04-01 20:12:25 +08:00
yihua.huang 8d8194bee4 Change HashMap to LinkedHashMap in ResultItems for same order of input and output #76 2014-03-25 08:23:20 +08:00
yihua.huang 8b35d79569 Do not cache document in Selectable for selected Html element #73 2014-03-19 22:19:06 +08:00
yihua.huang 6201fd6966 add worker as container 2014-03-17 23:01:58 +08:00
yihua.huang 6c11718566 Clean project structure #70 2014-03-14 23:24:38 +08:00
yihua.huang 9606a173cd fix ZipCodePageProcessor 2014-03-13 22:55:50 +08:00
yihua.huang 4f68368db0 Merge branch 'master' of git.oschina.net:flashsword20/webmagic
Conflicts:
	webmagic-core/src/main/java/us/codecraft/webmagic/selector/RegexSelector.java
2014-03-13 08:09:37 +08:00
yihua.huang 98e2bba099 Merge branch 'master' of github.com:code4craft/webmagic
Conflicts:
	README.md
	pom.xml
	webmagic-core/pom.xml
	webmagic-extension/pom.xml
	webmagic-scripts/pom.xml
2014-03-13 08:07:33 +08:00
yihua.huang 757cc9b942 [maven-release-plugin] prepare for next development iteration 2014-03-13 07:49:51 +08:00
yihua.huang 63ffb5c792 [maven-release-plugin] prepare release webmaigc-0.4.3 2014-03-13 07:49:27 +08:00
yihua.huang 66d4d3c192 Merge branch 'master' into 0.4.x 2014-03-13 07:12:29 +08:00
yihua.huang af07280176 remove defend code for httpclient 4.3.1 because it is fixed in 4.3.3 #59 2014-03-13 07:11:56 +08:00
yihua.huang d5a978e00f update version back to 0.4.3 2014-03-13 06:55:05 +08:00
yihua.huang 55368919df add attribute 'text' support for CssSelector #66 2014-03-11 13:18:34 +08:00
yihua.huang 88b50d4182 bugfix: cycleTry will not work when spawnUrl is set to false #62 2014-03-04 07:33:07 +08:00
yihua.huang 2768a1cae4 add test for cycleTriedTimes and fix cycleTriedTimes inc error #60 2014-03-01 15:10:38 +08:00
yihua.huang bbd0d7e600 update httpclient version to 4.3.3 #59 2014-02-28 21:17:02 +08:00
yihua.huang 571061454a #58 add CYCLE_TRIED_TIMES support to QueueScheduler and PriorityScheduler 2014-02-27 23:54:30 +08:00
yihua.huang 0e98183f74 Change log4j to slf4j #55 2014-02-12 09:35:57 +08:00
yihua.huang fa33b15843 property loader 2014-02-11 23:07:31 +08:00
yihua.huang af809c4d55 update version to 0.5.0-snapshot 2014-02-11 22:16:01 +08:00
Almark Ming 2b46b11e55 Update RegexSelector.java
Optimize regex format check

Conflicts:
	webmagic-core/src/main/java/us/codecraft/webmagic/selector/RegexSelector.java
2013-12-21 08:38:17 +08:00
yihua.huang 2a8e1b654d Merge branch 'master' of git.oschina.net:flashsword20/webmagic into osc
Conflicts:
	pom.xml
2013-12-21 07:59:28 +08:00
Almark Ming 91ed66ecac Update RegexSelector.java 2013-12-17 16:57:22 +08:00
Almark Ming 83926970b2 Check valid left parenthesis 2013-12-17 16:55:53 +08:00
yihua.huang b51fb2696b update ut for cookie 2013-12-06 00:30:01 +08:00
yihua.huang ff2f588c41 #48 nullpointer exception 2013-12-04 22:11:20 +08:00
yihua.huang fc97cb58c5 update lib and version 2013-12-04 00:04:29 +08:00
yihua.huang 7c41bec92f Merge branch 'master' of github.com:code4craft/webmagic
Conflicts:
	README.md
	webmagic-samples/pom.xml
	webmagic-selenium/pom.xml
2013-12-03 23:50:26 +08:00
yihua.huang d274310cb2 [maven-release-plugin] prepare for next development iteration 2013-12-03 23:35:06 +08:00
yihua.huang e8c32a32dc [maven-release-plugin] prepare release webmagic-0.4.2 2013-12-03 23:34:57 +08:00
yihua.huang 6a828e923c #46 Downloader thread hang up when timeout 2013-12-03 09:59:54 +08:00
shijinping 9a524aa364 Fetch the httpClient again inside the double-check 2013-11-28 14:38:30 +08:00
yihua.huang fd23cb6dc0 Merge branch 'master' of github.com:code4craft/webmagic
Conflicts:
	README.md
	pom.xml
	webmagic-samples/pom.xml
	webmagic-selenium/pom.xml
2013-11-28 13:40:24 +08:00
yihua.huang e7083dc39d [maven-release-plugin] prepare for next development iteration 2013-11-28 13:04:32 +08:00
yihua.huang ae623567b3 [maven-release-plugin] prepare release webmagic-0.4.1 2013-11-28 13:04:22 +08:00
yihua.huang 59ad4cad27 #42 Add jsonpath in annotation mode for json result 2013-11-28 08:25:16 +08:00
yihua.huang c2d6d495b3 #41 add getThreadAlive(),getStatus,getPageCount() to spider 2013-11-28 07:59:24 +08:00
yihua.huang cf62d707e0 #36 Spider does not exit when success 2013-11-27 23:33:18 +08:00
yihua.huang a01312930a #39 Parsing html after page.getHtml() 2013-11-27 22:01:34 +08:00
yihua.huang f63d33b457 update some comments 2013-11-27 21:06:53 +08:00
yihua.huang 04fcf3193f #38 Change algorithm of SmartContentSelector 2013-11-23 13:56:55 +08:00
yihua.huang 296a68920e fix javadoc and add setPipelines() for spider 2013-11-14 13:23:29 +08:00
yihua.huang 47a0360783 #35 add status code to page 2013-11-12 11:51:34 +08:00
yihua.huang bc5c30de17 update scripts 2013-11-12 08:20:59 +08:00
yihua.huang f9daae39cf [maven-release-plugin] prepare for next development iteration 2013-11-11 14:33:11 +08:00
yihua.huang fdb9441519 [maven-release-plugin] prepare release webmagic-0.4.0 2013-11-11 14:33:01 +08:00
yihua.huang 1d75ae7f5b rollback version to 0.4.0 because not deploy success 2013-11-11 11:52:56 +08:00
yihua.huang df8ca8ad09 add scripts 2013-11-10 22:30:48 +08:00
yihua.huang e40b48e77b Merge tag 'webmagic-0.4.0' of github.com:code4craft/webmagic
[maven-release-plugin]  copy for tag webmagic-0.4.0

Conflicts:
	pom.xml
	webmagic-core/pom.xml
	webmagic-extension/pom.xml
2013-11-06 22:48:26 +08:00
yihua.huang 775eb9732f [maven-release-plugin] prepare for next development iteration 2013-11-06 22:17:58 +08:00
yihua.huang 0b4fadc24d [maven-release-plugin] prepare release webmagic-0.4.0 2013-11-06 22:17:47 +08:00
yihua.huang fe6d9bb2e2 get keep-alive rework 2013-11-06 21:53:39 +08:00
yihua.huang fd6d2fd6f8 try to keepalive TCP connection 2013-11-06 21:19:14 +08:00
yihua.huang 425df08523 update version to 0.4.0 2013-11-06 12:50:45 +08:00
yihua.huang e046bb0723 remove useless code 2013-11-06 12:48:14 +08:00
yihua.huang 6e32a19f80 update api for direct download 2013-11-06 12:46:50 +08:00
yihua.huang 807aefe9df change EntityUtil to IOUtil because some encoding error 2013-11-06 07:37:34 +08:00
yihua.huang 00b0a751b4 #33 ignore 'content-encoding' when redirect 2013-11-06 06:57:58 +08:00
yihua.huang 8f774afc84 add direct download 2013-11-06 06:41:04 +08:00
yihua.huang c18b603399 optimize long compare 2013-11-04 07:09:44 +08:00
yihua.huang ed3f3583cc downloader refactor 2013-11-04 01:03:23 +08:00
yihua.huang a37f40e6e6 add cookie support 2013-11-04 00:59:48 +08:00
yihua.huang 3c6fced48e update connection client 2013-11-04 00:53:01 +08:00
yihua.huang 09153ff715 #22 http proxy support #32 update httpclient to 4.3.1 2013-11-04 00:47:09 +08:00
yihua.huang edfc319c45 update httpclient to 4.3.1 2013-11-04 00:06:30 +08:00
yihua.huang 160a149b05 todo bugfix 2013-11-03 23:10:09 +08:00
yihua.huang 583a0eba8c #29 refactor some method name 2013-11-03 20:24:26 +08:00
yihua.huang 6fa82a418b #29 seed urls with more information 2013-11-03 20:20:50 +08:00
yihua.huang 1446ada732 some refactor 2013-10-31 22:50:22 +08:00
yihua.huang 84976c81ec remove useless code 2013-10-31 22:48:18 +08:00
yihua.huang b4fcf41168 add exit when complete option 2013-10-31 22:41:02 +08:00
yihua.huang 352887870c remove shutdown call 2013-10-31 22:22:14 +08:00
yihua.huang a3f9ad198f refactor multi thread code in Spider 2013-10-31 21:52:43 +08:00
yihua.huang 7fb44d2eec #30 reuse PoolingClientConnectionManager for HttpClientDownloader 2013-10-14 23:22:04 +08:00
yihua.huang 5a226387e0 #27 nullpointer fix 2013-10-11 11:32:44 +08:00
yihua.huang 16e12e3bc9 #27 customize http header for downloader 2013-10-11 08:37:21 +08:00
yihua.huang 1a2c84ea78 #27 add timeout config to site 2013-10-11 07:36:16 +08:00
yihua.huang 372cc0ad06 update jar 2013-09-23 13:21:40 +08:00
yihua.huang 4acbc19cee [maven-release-plugin] prepare for next development iteration 2013-09-23 13:12:32 +08:00
yihua.huang cc3b787991 [maven-release-plugin] prepare release webmagic-0.3.2 2013-09-23 13:12:19 +08:00
yihua.huang b131878123 add example 2013-09-23 13:01:28 +08:00
yihua.huang 95ab4edec3 some bugfix 2013-09-23 08:38:54 +08:00
yihua.huang fba330872b fix a thread pool exception 2013-09-22 23:57:15 +08:00
yihua.huang 3c79d031bd fix thread pool 2013-09-22 22:52:52 +08:00
yihua.huang a2fba8caa2 update to 0.3.1 2013-09-09 12:48:01 +08:00
yihua.huang fb693a4ac4 [maven-release-plugin] prepare for next development iteration 2013-09-08 22:25:07 +08:00
yihua.huang bfaaa042b9 [maven-release-plugin] prepare release webmagic-parent-0.3.1 2013-09-08 22:24:48 +08:00
yihua.huang c17a31a21d fix null pointer exception #26 2013-09-08 21:09:49 +08:00
yihua.huang d2e0f0cd33 #25 use URL api in UrlUtils.canonicalizeUrl() 2013-09-06 21:35:23 +08:00
yihua.huang ef4cf49fee add stop method to spider #24 2013-09-06 21:17:36 +08:00
yihua.huang 58150a090d update jar 2013-09-05 20:56:25 +08:00
yihua.huang 57556ab879 merge 2013-09-05 20:53:15 +08:00
yihua.huang 692de76f86 fix issue #21 charset detect error 2013-09-04 15:27:51 +08:00
yihua.huang e7bf425df4 [maven-release-plugin] prepare for next development iteration 2013-09-04 10:51:01 +08:00
yihua.huang 77ff252316 [maven-release-plugin] prepare release webmagic-0.3.0 2013-09-04 10:50:50 +08:00
yihua.huang 1fc8e104ab add cycle retry 2013-09-04 10:32:13 +08:00
yihua.huang d141541ef3 add retry 2013-09-04 09:57:19 +08:00
yihua.huang a1ef2523cc update xsoup version 2013-09-04 09:38:40 +08:00
yihua.huang aefd0569a5 update version 2013-09-04 09:36:56 +08:00
yihua.huang 194518fd82 add switch 2013-09-04 08:21:34 +08:00
yihua.huang 326b97c65a update 2013-09-04 00:15:54 +08:00
yihua.huang 2c3574537a refactor in selectors 2013-09-02 14:14:24 +08:00
yihua.huang 85b7cf1563 complete test 2013-09-02 13:52:41 +08:00
yihua.huang d7cd9e5747 update pom 2013-09-02 11:56:01 +08:00
yihua.huang 55d4a76ab7 newselectors 2013-09-02 08:21:32 +08:00
yihua.huang d7abbd0e4b fix compile error 2013-08-25 16:31:00 +08:00
yihua.huang 5e9e8b2541 add TextContentSelector 2013-08-25 16:30:38 +08:00
yihua.huang 0cc0ccee35 add charset specific for easy call of HttpClientDownloader 2013-08-25 15:41:43 +08:00
yihua.huang 91dcccf7b5 add a sample 2013-08-21 21:55:15 +08:00
yihua.huang ad66d33f38 [maven-release-plugin] prepare for next development iteration 2013-08-20 23:39:59 +08:00
yihua.huang 9dc6b11954 [maven-release-plugin] prepare release webmagic-parent-0.2.1 2013-08-20 23:37:55 +08:00
yihua.huang 4f62dfc8a4 release 2013-08-20 23:37:20 +08:00
yihua.huang 74c940c758 [maven-release-plugin] prepare for next development iteration 2013-08-20 23:19:58 +08:00
yihua.huang a4bb4e3429 [maven-release-plugin] prepare release webmagic-parent-0.2.1 2013-08-20 23:19:27 +08:00
yihua.huang 194f16aa75 update 2013-08-20 23:16:43 +08:00
yihua.huang 0f0f1a9bcd release notes 2013-08-20 22:51:30 +08:00
yihua.huang c1471718df extractors 2013-08-20 22:44:53 +08:00
yihua.huang 20705b34ac add more option to extractors 2013-08-20 22:13:30 +08:00
yihua.huang c70ed57025 remove PriorityScheduler to core 2013-08-20 21:55:58 +08:00
yihua.huang 7003426898 update pom 2013-08-20 21:52:39 +08:00
yihua.huang 606417fdc7 update pom 2013-08-19 09:55:49 +08:00
yihua.huang d460e136ef update version 2013-08-19 09:52:15 +08:00
yihua.huang c79d6ecf09 complete all comments 2013-08-17 23:30:49 +08:00
yihua.huang 90bbe9b951 webmagic-core 2013-08-17 23:24:04 +08:00
yihua.huang 17f8ead28f update comments for selector 2013-08-17 21:33:54 +08:00
yihua.huang 77e6ca2945 update comments 2013-08-17 21:26:44 +08:00
yihua.huang 5073258237 closable 2013-08-17 21:19:24 +08:00
yihua.huang d01c0eb8ce update comments of spider 2013-08-17 21:15:36 +08:00
yihua.huang 5f1f4cbc46 update comments 2013-08-17 20:41:29 +08:00
yihua.huang 1148450ff9 update filecache to more useful 2013-08-17 18:12:47 +08:00
yihua.huang 3ba7a76f44 add combo extract to replace Extract2 Extract3... 2013-08-17 17:23:11 +08:00
yihua.huang 5cb45af3a4 +doc 2013-08-17 12:10:34 +08:00
yihua.huang ef673b985e add a method for httpclientdownloader 2013-08-14 13:32:23 +08:00
yihua.huang 067f3ea0cb add some null pointer check for httpclientdownloader 2013-08-14 13:30:09 +08:00
yihua.huang 9e82256ce3 update docs 2013-08-12 10:08:20 +08:00
yihua.huang 0a902b441c update docs 2013-08-12 09:55:17 +08:00
yihua.huang 0f2c5b5723 update redisscheduler 2013-08-11 18:28:12 +08:00
yihua.huang 787b952932 release notes and docs 2013-08-11 10:21:26 +08:00
yihua.huang 8b15f3c63d add test 2013-08-10 20:33:47 +08:00
yihua.huang ade5714d50 add https support 2013-08-10 18:52:27 +08:00
yihua.huang 21eca688e9 complete docs 2013-08-09 20:56:33 +08:00
yihua.huang 17d2d98cec remove invalid @date 2013-08-09 20:43:06 +08:00
yihua.huang 268bd8d0c4 remove saxon to extension 2013-08-07 23:04:10 +08:00
yihua.huang cff943f698 fix path format error 2013-08-07 13:05:12 +08:00
yihua.huang 5ef231a768 update version 2013-08-07 12:48:32 +08:00
yihua.huang 570533cce5 update readme 2013-08-07 09:45:38 +08:00
yihua.huang 36494bcfa5 add xpath2.0 api 2013-08-06 23:01:43 +08:00
yihua.huang 5c96407a3d fix a null domain error 2013-08-06 22:43:31 +08:00
yihua.huang c7005a0227 json fix 2013-08-06 22:36:37 +08:00
yihua.huang e5f4b3916f change file dir 2013-08-06 22:26:39 +08:00
yihua.huang 7d277e84d4 update lucene pipeline 2013-08-06 21:47:44 +08:00
yihua.huang b40cca1122 move model package to plugin 2013-08-06 20:41:35 +08:00
yihua.huang 4eb3d60083 fix nullpointer exception 2013-08-05 22:06:39 +08:00
yihua.huang b0af45f4bb complete redis support 2013-08-05 21:44:29 +08:00
yihua.huang f3a29d9315 fix pagedmodel bug 2013-08-05 21:03:47 +08:00
yihua.huang 629f8ac2d1 add extractors chain 2013-08-05 20:45:34 +08:00
yihua.huang 27ce3fc176 lazy init 2013-08-05 19:36:49 +08:00
yihua.huang dc9f574e27 update request 2013-08-05 18:17:52 +08:00
yihua.huang d56c681be1 add priority to request 2013-08-05 18:08:28 +08:00
yihua.huang 971e7b6ce2 add core 2013-08-05 13:53:13 +08:00
yihua.huang 619a12b303 add paged support 2013-08-04 21:22:15 +08:00
yihua.huang a5c85c3c8b add annotation ExtractByRaw 2013-08-04 15:12:06 +08:00
yihua.huang 1a50c64e33 update name 2013-08-04 10:05:03 +08:00
yihua.huang a3a868f584 rename 2013-08-04 09:55:50 +08:00
yihua.huang 04a7fa037a update pipeline 2013-08-04 09:53:01 +08:00
yihua.huang 21cae2ff2e update package 2013-08-04 07:53:28 +08:00
yihua.huang cfb8990453 update author 2013-08-04 03:04:30 +08:00
yihua.huang b393e38320 add multi entity extract 2013-08-03 20:42:29 +08:00
yihua.huang bfadac756a fix an attribute bug 2013-08-03 18:36:03 +08:00
yihua.huang 145628557d update afterextract api 2013-08-03 18:01:17 +08:00
yihua.huang aca165b132 add and or selector 2013-08-03 17:38:36 +08:00
yihua.huang 69245e8c03 fix Class.assinable bug 2013-08-03 17:17:59 +08:00
yihua.huang 65518f7672 add list support 2013-08-03 17:01:25 +08:00
yihua.huang d4de60a562 skip test 2013-08-03 16:35:12 +08:00
yihua.huang d26cd82d59 rename package 2013-08-03 16:29:50 +08:00
yihua.huang f84b53514f complete objectpipeline 2013-08-03 15:55:54 +08:00
yihua.huang 866ab0a056 update email 2013-08-03 14:01:18 +08:00
yihua.huang 7c9e9ce869 xpath2.0 2013-08-03 07:28:46 +08:00
yihua.huang 7f27c28d4c simplify api 2013-08-02 23:45:13 +08:00
yihua.huang d7899e94ae test saxon and invite XPath2.0 support 2013-08-02 23:39:34 +08:00
yihua.huang 3fe3d8f044 update 2013-08-02 13:51:42 +08:00
yihua.huang 516ff3310d add failfast 2013-08-02 08:20:55 +08:00
yihua.huang 7a4dbb1f15 invite notnull 2013-08-02 08:09:37 +08:00
yihua.huang 06a39af0f3 add setter support 2013-08-02 07:32:37 +08:00
yihua.huang abba3b7bff add extract by url 2013-08-02 06:59:25 +08:00
yihua.huang f08ffc34fd rename 2013-08-02 06:33:48 +08:00
yihua.huang c5cf05640a processor 2013-08-01 22:53:44 +08:00
yihua.huang 50edd22ef6 add annotation 2013-08-01 22:40:57 +08:00
yihua.huang 7020b8648d fix a thread problem 2013-07-30 21:39:43 +08:00
yihua.huang 52fd5cfc1c fix encoding 2013-07-30 15:24:59 +08:00