Commit Graph

773 Commits (b9544424f240d127ee71ce088caa8d5332b67749)

Author SHA1 Message Date
zhugw eb3c78b9d8 Update FileCacheQueueScheduler.java
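Isn't this more rigorous? Otherwise, when the spider is restarted after an interruption, the (first) entry URL is still added to the queue and written to the file again.
There is another problem, though. Suppose the first pass has crawled everything (detected via spider.getStatus==Stopped), the program sleeps for 24 hours, and then crawls again (by recursively calling the crawl method).
Unlike a restart after an interruption, here lineReader==cursor, so the queue is empty at initialization, and the entry URL is already in the urls set, so the crawl threads finish immediately. That makes it impossible to pick up content newly added to the site.
Option 1:
Once the crawl is judged complete, immediately overwrite the cursor file. On the second crawl the cursor is 0, so every URL in urls.txt is put into the queue, and new URLs can be discovered through them.
Option 2:
An optimization of option 1. Option 1 meets the business requirement but does a lot of wasted work, such as downloading, extracting, and persisting every old target URL again. New content generally appears under a HelpUrl, for example when a page gains a new post or a listing gains a few more pages. So on the second and later crawls, only the HelpUrls need to be put into the queue.

I would appreciate feedback: is my understanding above correct, is there a case I have not considered, or is there a simpler solution? Thanks!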
2014-09-14 16:20:03 +08:00
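Note on commit eb3c78b9d8: the three start-up behaviours discussed in that message (resume from the cursor, re-queue everything, re-queue only help URLs) can be compared in a minimal sketch. The class RecrawlSketch, its method names, and the isHelpUrl predicate are hypothetical illustrations, not part of WebMagic's FileCacheQueueScheduler; the list stands in for urls.txt and the integer for the cursor file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch; not WebMagic's actual scheduler code.
class RecrawlSketch {

    // Resume after an interruption: only URLs recorded after the cursor are still pending.
    // When cursor == urls.size() (a finished crawl), the result is empty.
    static List<String> resume(List<String> urls, int cursor) {
        int from = Math.min(cursor, urls.size());
        return new ArrayList<>(urls.subList(from, urls.size()));
    }

    // Option 1: treat the cursor as reset to 0, so every recorded URL is queued again.
    static List<String> recrawlAll(List<String> urls) {
        return resume(urls, 0);
    }

    // Option 2: queue only the URLs that the caller classifies as help (listing) pages.
    static List<String> recrawlHelpOnly(List<String> urls, Predicate<String> isHelpUrl) {
        List<String> pending = new ArrayList<>();
        for (String url : urls) {
            if (isHelpUrl.test(url)) {
                pending.add(url);
            }
        }
        return pending;
    }
}
```

With a finished crawl, resume returns an empty list, which matches the "threads finish immediately" symptom; recrawlHelpOnly avoids re-downloading old target pages but depends on how reliably help URLs can be told apart from target URLs.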
Yihua Huang 3a9c1d3002 Merge pull request #159 from zhugw/patch-3
Update Site.java
2014-09-12 13:09:50 +08:00
zhugw bc666e927d Update Site.java
The javadoc for setCycleRetryTimes says: Set cycleRetryTimes times when download fail, 0 by default. Only work in RedisScheduler.
But reading the source, there does not appear to be any restriction that limits it to RedisScheduler. Is this javadoc out of date?
2014-09-12 12:42:57 +08:00
yihua.huang 42a30074c9 update urls.contains to DuplicateRemover in FileCacheQueueScheduler #157 2014-09-12 07:52:38 +08:00
Yihua Huang 689e89a9b2 Merge pull request #157 from zhugw/patch-1
Update FileCacheQueueScheduler.java
2014-09-12 07:37:56 +08:00
zhugw 1db940a088 Update FileCacheQueueScheduler.java
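While using it I found duplicate URLs in urls.txt. Tracing the source showed that at initialization the file is loaded and all of its URLs are read into a set, but when a URL to be crawled is added later there is no check for whether it is already in that set (that is, in the file), which leads to duplicate URLs in the file. I changed the source accordingly; could the author please review it.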
2014-09-11 15:46:09 +08:00
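Note on commit 1db940a088: the fix described there amounts to checking the in-memory set before a URL is queued and appended to the file. A minimal sketch of that check follows, assuming an in-memory list in place of urls.txt; the class DedupFileQueueSketch and its members are hypothetical and do not reproduce WebMagic's FileCacheQueueScheduler.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch; not WebMagic's actual scheduler code.
class DedupFileQueueSketch {
    private final Set<String> seen = new HashSet<>();          // URLs already recorded
    private final List<String> fileLines = new ArrayList<>();  // stands in for urls.txt
    private final Queue<String> queue = new ArrayDeque<>();    // URLs waiting to be crawled

    // Simulates initialization: previously persisted URLs are loaded into the set.
    DedupFileQueueSketch(List<String> persisted) {
        for (String url : persisted) {
            if (seen.add(url)) {
                fileLines.add(url);
            }
        }
    }

    // Queue and record a URL only if it has not been seen before.
    // Returns true when the URL was new.
    boolean push(String url) {
        if (!seen.add(url)) {
            return false; // already in the set, so already in the file
        }
        fileLines.add(url);
        queue.add(url);
        return true;
    }

    List<String> fileLines() {
        return fileLines;
    }
}
```

Set.add returns false for an element that is already present, so the membership test and the insertion happen in one call and the file never receives a second copy.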
yihua.huang 147401ce5e remove duplicate setPath in ProxyPool 2014-09-09 22:58:44 +08:00
yihua.huang 3734865a6a fix package name =.= 2014-08-21 14:39:44 +08:00
yihua.huang e7668e01b8 fix SourceRegion error and add some tests on it #144 2014-08-21 14:29:06 +08:00
yihua.huang 4e5ba02020 fix test cont' 2014-08-18 11:08:17 +08:00
yihua.huang 4446669c24 fix test 2014-08-18 10:54:24 +08:00
yihua.huang 9866297ec4 Disable jsoup entity escape by Default. Set Html.DISABLE_HTML_ENTITY_ESCAPE to false to enable it. #149 2014-08-14 08:04:56 +08:00
yihua.huang 4e6e946dd7 more friendly exception message in PlainText #144 2014-08-13 10:02:16 +08:00
yihua.huang ebb931e0bf update assertj to test scope 2014-06-25 19:01:27 +08:00
yihua.huang af9939622b move thread package out of selector (because it is add by mistake at the beginning) 2014-06-25 18:19:50 +08:00
yihua.huang 2fd8f05fe2 change path seperator for varient OS #139 2014-06-25 14:55:23 +08:00
yihua.huang eae37c868b new sample 2014-06-10 17:38:54 +08:00
yihua.huang b3a282e58d some fix for tests #130 2014-06-10 00:05:30 +08:00
yihua.huang b75e64a61b t push origin masterMerge branch 'yxssfxwzy-proxy' 2014-06-09 23:51:47 +08:00
yihua.huang 074d767f45 Merge branch 'proxy' of github.com:yxssfxwzy/webmagic into yxssfxwzy-proxy 2014-06-09 23:51:36 +08:00
zwf 2f89cfc31a add test and fix bug of proxy module 2014-06-09 13:32:02 +08:00
yihua.huang 4efd471840 remove duplicate jar 2014-06-04 22:46:03 +08:00
yihua.huang 435922f00d Merge branch 'stable' of github.com:code4craft/webmagic 2014-06-04 22:33:58 +08:00
yihua.huang eb89d66566 fix test 2014-06-04 22:28:27 +08:00
yihua.huang 2a15bc0289 contributor 2014-06-04 22:27:16 +08:00
yihua.huang 5e8ca02ec6 contributor 2014-06-04 22:26:56 +08:00
yihua.huang baeb919cbe update bin 2014-06-04 17:38:49 +08:00
yihua.huang 8c33be48a6 Merge branch 'stable' of github.com:code4craft/webmagic 2014-06-04 17:37:45 +08:00
yihua.huang db0195babb update version in docs 2014-06-04 17:35:31 +08:00
yihua.huang 5f8c3fd5c5 update version 2014-06-04 17:33:30 +08:00
yihua.huang 0e9042eefa update pom 2014-06-04 17:17:48 +08:00
yihua.huang 03170178c4 update pom 2014-06-04 17:01:37 +08:00
yihua.huang c83b74f0f4 update pom for deploy 2014-06-04 16:55:34 +08:00
yihua.huang 7a64847a3c Bugfix: selector does not works well in element #113 2014-06-03 20:03:33 +08:00
yihua.huang 8d67fd0357 change back return proxy from spider to httpclientdownloader #128 2014-05-28 08:08:51 +08:00
yihua.huang 40bf8ca58f change return proxy from spider to httpclientdownloader #128 2014-05-28 07:57:42 +08:00
yihua.huang 1f21d9cc14 spell mistake fix #128 2014-05-28 07:29:19 +08:00
Yihua Huang e310139d00 Merge pull request #128 from yxssfxwzy/proxy
Management of multiple proxies
2014-05-28 07:22:08 +08:00
yihua.huang b165090434 Bugfix:Type convert error in JsonPathSelector #129 2014-05-27 21:19:22 +08:00
yihua.huang 95bdb30296 update xsoup version to release #113 2014-05-27 20:46:48 +08:00
yihua.huang a5d1b56e44 fix ut #113 2014-05-27 18:07:53 +08:00
yihua.huang 3939074a23 Bugfix: nodes() only return the first element #113 2014-05-27 17:53:06 +08:00
yihua.huang 41c2ea9498 refactor of selectable cont' #113
1. remove lazy init of Html
2. rename strings to sourceTexts for better meaning
3. make getSourceTexts abstract and DO NOT always store strings
4. instead store parsed elements of document in HtmlNode
2014-05-27 17:34:19 +08:00
yihua.huang f9825c214a refactor selectable for html fragment #113 2014-05-27 16:00:51 +08:00
yihua.huang 03d26c169b Enhance auto charset detect #126
1. Only read from content once to fix stream closed exception
2. invite moco as server test
2014-05-26 17:45:30 +08:00
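Note on commit 03d26c169b: item 1 refers to reading the response content only once. One common way to do that is to buffer the stream into a byte array and decode from the buffer as often as needed. The sketch below shows that general pattern; the class ReadOnceSketch is hypothetical and is not taken from WebMagic's downloader.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical sketch; not WebMagic's actual downloader code.
class ReadOnceSketch {

    // Drain the stream exactly once and close it; later steps work on the byte array.
    static byte[] readAll(InputStream in) {
        try (InputStream source = in) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = source.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decoding from the buffer can be repeated with different charsets
    // (for example once to sniff a meta tag, once with the detected charset).
    static String decode(byte[] body, Charset charset) {
        return new String(body, charset);
    }
}
```

Because the bytes are held in memory, a first decode used only for charset sniffing does not consume anything the final decode needs.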
zwf c146e2c7b4 add proxy pool 2014-05-19 15:59:31 +08:00
zwf 07ea04223f change_gitignore 2014-05-19 15:56:22 +08:00
yihua.huang 21982d3460 remove cpdetector temporary #126 2014-05-14 23:52:27 +08:00
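fengwuze fcbfb75608 Extract the code block that automatically obtains the charset from the web page into a separate method. 2014-05-14 19:14:42 +08:00
fengwuze 95494d3c4d Add logic for handling meta.
Outstanding:
3. When the page does not specify an encoding, cpdetector is needed, but cpdetector is currently not in the Maven Central repository, and it is unclear how to resolve this.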
2014-05-14 14:53:54 +08:00
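Note on commit 95494d3c4d: handling meta here presumably means reading the charset declared in an HTML meta tag. A minimal sketch of one way to do that with a regular expression follows; the class MetaCharsetSketch and its pattern are assumptions for illustration and are not the code from this commit.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the code added in the commit above.
class MetaCharsetSketch {
    // Matches both <meta charset="utf-8"> and
    // <meta http-equiv="Content-Type" content="text/html; charset=gb2312">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([\\w-]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the declared charset name, or empty when no meta tag declares one.
    static Optional<String> fromMeta(String html) {
        Matcher m = META_CHARSET.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

An empty result corresponds to the outstanding case in the message, where the page declares no encoding and some content-based detector would be needed instead.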