zhugw
|
eb3c78b9d8
|
Update FileCacheQueueScheduler.java
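Isn't this more rigorous? Otherwise, when the crawl is restarted after an interruption, the (first) entry URL is still added to the queue and written to the file again.
However, another problem remains: once the first pass has crawled everything (detected via spider.getStatus==Stopped), the program sleeps for 24 hours and then crawls again (by recursively calling the crawl method).
Unlike a restart after an interruption, here lineReader==cursor, so the queue is empty at initialization and the entry URL is already in the urls set, which makes the crawl thread finish immediately. As a result, there is no way to crawl content newly added to the site.
Option 1:
As soon as the crawl is detected as finished, overwrite the cursor file. On the second crawl the cursor is 0, so every URL in urls.txt is put into the queue, and new URLs can be discovered through them.
Option 2:
An optimization of option 1. Option 1 meets the business requirement but does a lot of wasted work, such as downloading, extracting, and persisting every old target URL again. New content generally shows up under a HelpUrl, for example a page gains a new post or a few more pages appear. So on the second and later passes, only the HelpUrls need to be put into the queue.
I would appreciate feedback: is my understanding above correct, are there cases I have not considered, or is there a simpler solution? Thanks!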
|
2014-09-14 16:20:03 +08:00 |
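The two options discussed in the commit message above can be sketched with a small in-memory model. This is only an illustration under stated assumptions: the class CursorQueueSketch, its fields, and its method names are hypothetical and are not WebMagic's FileCacheQueueScheduler. A list stands in for urls.txt and an int stands in for the cursor file.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical model of a cursor-based URL cache; not the real scheduler. */
class CursorQueueSketch {
    private final List<String> urls = new ArrayList<>(); // stands in for urls.txt
    private final Set<String> seen = new HashSet<>();    // duplicate check
    private final Queue<String> queue = new ArrayDeque<>();
    private int cursor = 0;   // stands in for the cursor file
    private int requeued = 0; // re-seeded URLs at the head of the queue

    /** Records and queues a URL only if it has not been seen before. */
    boolean push(String url) {
        if (!seen.add(url)) {
            return false;
        }
        urls.add(url);
        queue.add(url);
        return true;
    }

    /** Takes the next URL, or returns null when the queue is empty. */
    String poll() {
        String url = queue.poll();
        if (url == null) {
            return null;
        }
        if (requeued > 0) {
            requeued--; // a re-seeded URL: the cursor already covers it
        } else {
            cursor++;
        }
        return url;
    }

    /** Restart after an interruption: queue only entries past the cursor. */
    void resume() {
        queue.clear();
        requeued = 0;
        queue.addAll(urls.subList(cursor, urls.size()));
    }

    /** Option 1: reset the cursor so every recorded URL is queued again. */
    void recrawlAll() {
        cursor = 0;
        resume();
    }

    /**
     * Option 2: queue only the URLs accepted by the given predicate.
     * Assumes the previous pass has already drained the queue.
     */
    void recrawlHelpUrls(Predicate<String> isHelpUrl) {
        queue.clear();
        requeued = 0;
        for (String url : urls) {
            if (isHelpUrl.test(url)) {
                queue.add(url);
                requeued++;
            }
        }
    }

    int pending() {
        return queue.size();
    }
}
```

In this model, calling resume() after a completed pass leaves the queue empty, which mirrors the lineReader==cursor situation described in the message. recrawlAll() corresponds to option 1 and recrawlHelpUrls(...) to option 2, where the predicate deciding what counts as a HelpUrl is supplied by the caller.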
yihua.huang
|
42a30074c9
|
update urls.contains to DuplicateRemover in FileCacheQueueScheduler #157
|
2014-09-12 07:52:38 +08:00 |
zhugw
|
1db940a088
|
Update FileCacheQueueScheduler.java
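While using it I found duplicate URLs in urls.txt. Tracing the source code, I found that after the file is loaded at initialization all URLs are read into a set, but when a URL to be crawled is added later there is no check for whether it is already in that set (that is, in the file), which leads to duplicate URLs in the file. I modified the source accordingly; please review.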
|
2014-09-11 15:46:09 +08:00 |
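The duplicate check described in the commit message above might look roughly like the following. This is a minimal sketch, not the actual patch: the class DedupUrlWriterSketch and its method names are hypothetical, and a generic Writer stands in for the urls.txt file.

```java
import java.io.PrintWriter;
import java.io.Writer;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: appends a URL only if it is not already recorded. */
class DedupUrlWriterSketch {
    private final Set<String> known = new HashSet<>();
    private final PrintWriter out;

    /**
     * @param alreadyInFile URLs read from the file at initialization
     * @param target        where new URLs are appended (one per line)
     */
    DedupUrlWriterSketch(Collection<String> alreadyInFile, Writer target) {
        known.addAll(alreadyInFile);
        out = new PrintWriter(target, true); // auto-flush on println
    }

    /** Returns true if the URL was new and has been written. */
    boolean append(String url) {
        // Set.add returns false when the element is already present,
        // so the membership check and the insertion happen in one step.
        if (!known.add(url)) {
            return false;
        }
        out.println(url);
        return true;
    }
}
```

The point of the sketch is that the set loaded from the file is consulted on every later add, so a URL that is already in the file is never written a second time.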
yihua.huang
|
3734865a6a
|
fix package name =.=
|
2014-08-21 14:39:44 +08:00 |
yihua.huang
|
e7668e01b8
|
fix SourceRegion error and add some tests on it #144
|
2014-08-21 14:29:06 +08:00 |
yihua.huang
|
4e5ba02020
|
fix test cont'
|
2014-08-18 11:08:17 +08:00 |
yihua.huang
|
2fd8f05fe2
|
change path separator for various OS #139
|
2014-06-25 14:55:23 +08:00 |
yihua.huang
|
8c33be48a6
|
Merge branch 'stable' of github.com:code4craft/webmagic
|
2014-06-04 17:37:45 +08:00 |
yihua.huang
|
5f8c3fd5c5
|
update version
|
2014-06-04 17:33:30 +08:00 |
yihua.huang
|
928f98dd93
|
auto create folder in JsonFilePipeline #122
|
2014-05-08 15:12:17 +08:00 |
yihua.huang
|
7fbe18b8c0
|
implementation of PageMapper #120
|
2014-05-05 08:01:39 +08:00 |
yihua.huang
|
5dc9fe95a9
|
interface of PageMapper #120
|
2014-05-05 07:43:32 +08:00 |
yihua.huang
|
7668731f08
|
update version to snapshot
|
2014-05-05 07:03:55 +08:00 |
yihua.huang
|
182dd51689
|
Merge branch 'stable' of github.com:code4craft/webmagic
|
2014-05-03 06:19:11 +08:00 |
yihua.huang
|
81e6e772ac
|
versions back to 0.5.1
|
2014-05-03 06:18:57 +08:00 |
yihua.huang
|
feb604da87
|
Merge branch 'stable' of github.com:code4craft/webmagic
|
2014-05-03 06:14:54 +08:00 |
yihua.huang
|
358e906379
|
[maven-release-plugin] prepare for next development iteration
|
2014-05-03 00:00:13 +08:00 |
yihua.huang
|
470750fc0d
|
[maven-release-plugin] prepare release WebMagic-0.5.1
|
2014-05-02 23:59:55 +08:00 |
yihua.huang
|
186b90512e
|
refactor redisscheduler #118
|
2014-05-02 20:24:15 +08:00 |
yihua.huang
|
d1140b9e29
|
add bloom filter for scheduler #118
|
2014-05-02 20:20:22 +08:00 |
yihua.huang
|
e8d4a9be2b
|
fix remove duplicate error #117
|
2014-04-29 20:32:06 +08:00 |
yihua.huang
|
04ade75606
|
Merge branch 'stable' of github.com:code4craft/webmagic
Conflicts:
README.md
pom.xml
webmagic-avalon/pom.xml
webmagic-core/pom.xml
webmagic-extension/pom.xml
webmagic-lucene/pom.xml
webmagic-samples/pom.xml
webmagic-saxon/pom.xml
webmagic-scripts/pom.xml
webmagic-selenium/pom.xml
|
2014-04-27 15:03:15 +08:00 |
yihua.huang
|
a08d8cb167
|
update version
|
2014-04-27 14:59:48 +08:00 |
yihua.huang
|
42a2676e8c
|
update version
|
2014-04-27 14:56:21 +08:00 |
yihua.huang
|
c25b32f1ca
|
[maven-release-plugin] prepare for next development iteration
|
2014-04-27 12:52:27 +08:00 |
yihua.huang
|
7ff83bb11a
|
[maven-release-plugin] prepare release WebMagic-0.5.0
|
2014-04-27 12:52:12 +08:00 |
yihua.huang
|
1104122979
|
more abstraction in scheduler
|
2014-04-27 09:30:01 +08:00 |
yihua.huang
|
b0fb1c3e10
|
remove copy-dependencies plugin for m2e error
|
2014-04-27 08:22:33 +08:00 |
yihua.huang
|
94a67165e1
|
remove jmx server for simplify #98
|
2014-04-26 20:17:52 +08:00 |
yihua.huang
|
86a45a6643
|
change SpiderMonitor to singleton #98
|
2014-04-26 18:14:25 +08:00 |
yihua.huang
|
ab4d36806e
|
clean code
|
2014-04-26 11:45:21 +08:00 |
yihua.huang
|
04fde8203b
|
add control for monitor
|
2014-04-26 11:44:14 +08:00 |
yihua.huang
|
2770811a10
|
update monitor example
|
2014-04-26 11:24:22 +08:00 |
yihua.huang
|
17e95f2a7f
|
comments
|
2014-04-25 18:39:01 +08:00 |
yihua.huang
|
375e64e845
|
more monitor status
|
2014-04-25 18:10:14 +08:00 |
yihua.huang
|
c6661899fd
|
new thread pool #110
|
2014-04-25 17:33:48 +08:00 |
yihua.huang
|
179baa7a22
|
return when page is null
|
2014-04-25 16:07:41 +08:00 |
yihua.huang
|
4738ae2d14
|
change url find to match #94
|
2014-04-25 16:04:41 +08:00 |
yihua.huang
|
f973889cda
|
refactor subpageprossor etc. #94
|
2014-04-25 15:48:05 +08:00 |
yihua.huang
|
acb63d55d7
|
some check and example #98
|
2014-04-25 13:26:08 +08:00 |
yihua.huang
|
11ba5beb42
|
[refactor]move monitor to webmagic-extension #98
|
2014-04-25 13:17:13 +08:00 |
yihua.huang
|
b06aa489fb
|
[BugFix]Only one url from sourceRegion can be extracted #107
|
2014-04-18 17:48:26 +08:00 |
yihua.huang
|
023c2ac84e
|
spider config draft
|
2014-04-17 16:44:32 +08:00 |
yihua.huang
|
a5db6cf292
|
some monitor and JMX support #98
|
2014-04-17 00:35:09 +08:00 |
yihua.huang
|
aae1ab2cd6
|
fix compile error
|
2014-04-16 18:14:13 +08:00 |
yihua.huang
|
1fbfc92de2
|
Inherit support of Field annotation in Model #103
|
2014-04-16 18:13:44 +08:00 |
yihua.huang
|
a03f6a8431
|
eclipse project
|
2014-04-15 07:44:43 +08:00 |
yihua.huang
|
3a79b1b64a
|
[Bugfix]formatter property does not work when field is String#100
|
2014-04-13 23:02:34 +08:00 |
Yihua Huang
|
cc9d319fd9
|
Merge pull request #94 from sebastian1118/master
update:PatternHandler
|
2014-04-13 13:16:20 +08:00 |
yihua.huang
|
03c251237b
|
add Json parse support
|
2014-04-13 10:23:00 +08:00 |
Tian
|
99e12aafaa
|
update:PatternHandler
|
2014-04-13 10:14:39 +08:00 |
yihua.huang
|
c1e7207869
|
add FileCacheQueueScheduler support for cycleRetryTimes
|
2014-04-07 11:00:09 +08:00 |
yihua.huang
|
969ad1766b
|
change logger style to slf4j for cleaner code
|
2014-04-06 21:32:20 +08:00 |
yihua.huang
|
9b2cb43f47
|
ConfigurablePageProcessor #91
|
2014-04-05 23:40:10 +08:00 |
Bo LIANG
|
159eeea2f5
|
Remove unused variable to make the project cleaner.
|
2014-04-05 18:32:12 +08:00 |
yihua.huang
|
c143fc662c
|
add SubPageProcessor #86
|
2014-04-05 18:17:48 +08:00 |
Yihua Huang
|
474f785dab
|
Merge pull request #86 from sebastian1118/master
new feature: PatternProcessor
|
2014-04-04 23:41:27 +08:00 |
Tian
|
38a12f8641
|
new feature: PatternProcessor
|
2014-04-04 22:02:52 +08:00 |
yihua.huang
|
dafd0b5875
|
[BugFix]multi model in one pageprocessor will be skipped #85
|
2014-04-04 20:36:31 +08:00 |
yihua.huang
|
a1c7e826f7
|
fix dep of slf4j-log4j12
|
2014-04-03 23:04:31 +08:00 |
yihua.huang
|
f3c2503a29
|
add warning of slf4j #78
|
2014-04-01 07:42:23 +08:00 |
yihua.huang
|
8958d774f2
|
add default values for @Formatter
|
2014-03-24 13:52:17 +08:00 |
yihua.huang
|
6c11718566
|
Clean project structure #70
|
2014-03-14 23:24:38 +08:00 |
yihua.huang
|
98e2bba099
|
Merge branch 'master' of github.com:code4craft/webmagic
Conflicts:
README.md
pom.xml
webmagic-core/pom.xml
webmagic-extension/pom.xml
webmagic-scripts/pom.xml
|
2014-03-13 08:07:33 +08:00 |
yihua.huang
|
757cc9b942
|
[maven-release-plugin] prepare for next development iteration
|
2014-03-13 07:49:51 +08:00 |
yihua.huang
|
63ffb5c792
|
[maven-release-plugin] prepare release webmaigc-0.4.3
|
2014-03-13 07:49:27 +08:00 |
yihua.huang
|
d5a978e00f
|
update version back to 0.4.3
|
2014-03-13 06:55:05 +08:00 |
yihua.huang
|
0e98183f74
|
Change log4j to slf4j #55
|
2014-02-12 09:35:57 +08:00 |
yihua.huang
|
fa33b15843
|
property loader
|
2014-02-11 23:07:31 +08:00 |
yihua.huang
|
362fdd0662
|
Merge branch 'master' of github.com:code4craft/webmagic
|
2014-02-11 22:23:56 +08:00 |
yihua.huang
|
af809c4d55
|
update version to 0.5.0-snapshot
|
2014-02-11 22:16:01 +08:00 |
jon
|
a722f9bb66
|
Fix the NullPointerException thrown because fileCursor in FileCacheQueueScheduler was not initialized when the file was opened again
|
2014-01-08 21:24:58 +08:00 |
yihua.huang
|
12a6390cbd
|
update spring4 configuration
|
2013-12-18 01:02:59 +08:00 |
yihua.huang
|
fc97cb58c5
|
update lib and version
|
2013-12-04 00:04:29 +08:00 |
yihua.huang
|
d274310cb2
|
[maven-release-plugin] prepare for next development iteration
|
2013-12-03 23:35:06 +08:00 |
yihua.huang
|
e8c32a32dc
|
[maven-release-plugin] prepare release webmagic-0.4.2
|
2013-12-03 23:34:57 +08:00 |
yihua.huang
|
486d9d276f
|
#45 Remove multi in ExtractBy
|
2013-11-28 18:23:51 +08:00 |
yihua.huang
|
e7083dc39d
|
[maven-release-plugin] prepare for next development iteration
|
2013-11-28 13:04:32 +08:00 |
yihua.huang
|
ae623567b3
|
[maven-release-plugin] prepare release webmagic-0.4.1
|
2013-11-28 13:04:22 +08:00 |
yihua.huang
|
18a3af4a0a
|
add more sample for jsonpath #42
|
2013-11-28 09:58:22 +08:00 |
yihua.huang
|
59ad4cad27
|
#42 Add jsonpath in annotation mode for json result
|
2013-11-28 08:25:16 +08:00 |
yihua.huang
|
cf62d707e0
|
#36 Spider does not exit when success
|
2013-11-27 23:33:18 +08:00 |
yihua.huang
|
a01312930a
|
#39 Parsing html after page.getHtml()
|
2013-11-27 22:01:34 +08:00 |
yihua.huang
|
f9daae39cf
|
[maven-release-plugin] prepare for next development iteration
|
2013-11-11 14:33:11 +08:00 |
yihua.huang
|
fdb9441519
|
[maven-release-plugin] prepare release webmagic-0.4.0
|
2013-11-11 14:33:01 +08:00 |
yihua.huang
|
1d75ae7f5b
|
rollback version to 0.4.0 because not deploy success
|
2013-11-11 11:52:56 +08:00 |
yihua.huang
|
b838c4e433
|
#34 Close reader in FileCacheQueueScheduler
|
2013-11-08 14:59:09 +08:00 |
yihua.huang
|
775eb9732f
|
[maven-release-plugin] prepare for next development iteration
|
2013-11-06 22:17:58 +08:00 |
yihua.huang
|
0b4fadc24d
|
[maven-release-plugin] prepare release webmagic-0.4.0
|
2013-11-06 22:17:47 +08:00 |
yihua.huang
|
fd6d2fd6f8
|
try to keepalive TCP connection
|
2013-11-06 21:19:14 +08:00 |
yihua.huang
|
425df08523
|
update version to 0.4.0
|
2013-11-06 12:50:45 +08:00 |
yihua.huang
|
e046bb0723
|
remove useless code
|
2013-11-06 12:48:14 +08:00 |
yihua.huang
|
6e32a19f80
|
update api for direct download
|
2013-11-06 12:46:50 +08:00 |
yihua.huang
|
807aefe9df
|
change EntityUtil to IOUtil because some encoding error
|
2013-11-06 07:37:34 +08:00 |
yihua.huang
|
8f774afc84
|
add direct download
|
2013-11-06 06:41:04 +08:00 |
yihua.huang
|
2e496402dc
|
add more warning for 0.3.3
|
2013-10-24 13:16:48 +08:00 |
yihua.huang
|
1a2c84ea78
|
#27 add timeout config to site
|
2013-10-11 07:36:16 +08:00 |
yihua.huang
|
3b00190f99
|
api without implementation for #28: add specific url crawl
|
2013-10-10 00:40:44 +08:00 |
yihua.huang
|
4acbc19cee
|
[maven-release-plugin] prepare for next development iteration
|
2013-09-23 13:12:32 +08:00 |
yihua.huang
|
cc3b787991
|
[maven-release-plugin] prepare release webmagic-0.3.2
|
2013-09-23 13:12:19 +08:00 |
yihua.huang
|
6f18eec77e
|
fix a test error
|
2013-09-23 13:07:33 +08:00 |
yihua.huang
|
b131878123
|
add example
|
2013-09-23 13:01:28 +08:00 |
yihua.huang
|
95ab4edec3
|
some bugfix
|
2013-09-23 08:38:54 +08:00 |
yihua.huang
|
250cc5e662
|
change formatter to class
|
2013-09-23 08:17:21 +08:00 |
yihua.huang
|
b18216245b
|
add type convert
|
2013-09-23 07:53:33 +08:00 |
yihua.huang
|
fb693a4ac4
|
[maven-release-plugin] prepare for next development iteration
|
2013-09-08 22:25:07 +08:00 |
yihua.huang
|
bfaaa042b9
|
[maven-release-plugin] prepare release webmagic-parent-0.3.1
|
2013-09-08 22:24:48 +08:00 |
yihua.huang
|
d7c7a78177
|
complete test cases
|
2013-09-08 22:19:02 +08:00 |
yihua.huang
|
c17a31a21d
|
fix null pointer exception #26
|
2013-09-08 21:09:49 +08:00 |
yihua.huang
|
e7bf425df4
|
[maven-release-plugin] prepare for next development iteration
|
2013-09-04 10:51:01 +08:00 |
yihua.huang
|
77ff252316
|
[maven-release-plugin] prepare release webmagic-0.3.0
|
2013-09-04 10:50:50 +08:00 |
yihua.huang
|
d141541ef3
|
add retry
|
2013-09-04 09:57:19 +08:00 |
yihua.huang
|
aefd0569a5
|
update version
|
2013-09-04 09:36:56 +08:00 |
yihua.huang
|
194518fd82
|
add switch
|
2013-09-04 08:21:34 +08:00 |
yihua.huang
|
326b97c65a
|
update
|
2013-09-04 00:15:54 +08:00 |
yihua.huang
|
d7cd9e5747
|
update pom
|
2013-09-02 11:56:01 +08:00 |
yihua.huang
|
478ace7e97
|
add FilePageModelPipeline
|
2013-08-22 07:29:18 +08:00 |
yihua.huang
|
ad66d33f38
|
[maven-release-plugin] prepare for next development iteration
|
2013-08-20 23:39:59 +08:00 |
yihua.huang
|
9dc6b11954
|
[maven-release-plugin] prepare release webmagic-parent-0.2.1
|
2013-08-20 23:37:55 +08:00 |
yihua.huang
|
4f62dfc8a4
|
release
|
2013-08-20 23:37:20 +08:00 |
yihua.huang
|
74c940c758
|
[maven-release-plugin] prepare for next development iteration
|
2013-08-20 23:19:58 +08:00 |
yihua.huang
|
a4bb4e3429
|
[maven-release-plugin] prepare release webmagic-parent-0.2.1
|
2013-08-20 23:19:27 +08:00 |
yihua.huang
|
194f16aa75
|
update
|
2013-08-20 23:16:43 +08:00 |
yihua.huang
|
09ffd468c0
|
fix comments
|
2013-08-20 22:53:16 +08:00 |
yihua.huang
|
c70ed57025
|
remove PriorityScheduler to core
|
2013-08-20 21:55:58 +08:00 |
yihua.huang
|
7003426898
|
update pom
|
2013-08-20 21:52:39 +08:00 |
yihua.huang
|
606417fdc7
|
update pom
|
2013-08-19 09:55:49 +08:00 |
yihua.huang
|
d460e136ef
|
update version
|
2013-08-19 09:52:15 +08:00 |
yihua.huang
|
c79d6ecf09
|
complete all comments
|
2013-08-17 23:30:49 +08:00 |
yihua.huang
|
5073258237
|
closable
|
2013-08-17 21:19:24 +08:00 |
yihua.huang
|
5f1f4cbc46
|
update comments
|
2013-08-17 20:41:29 +08:00 |
yihua.huang
|
6cc1d62a08
|
bugfix: rawhtml do not work
|
2013-08-17 19:42:51 +08:00 |
yihua.huang
|
a994b1c9fd
|
complete extension comments in en
|
2013-08-17 19:35:45 +08:00 |
yihua.huang
|
c59c1fe80d
|
update comments
|
2013-08-17 19:19:27 +08:00 |
yihua.huang
|
59aad6a7f4
|
comments in english
|
2013-08-17 18:33:05 +08:00 |
yihua.huang
|
e566a53936
|
update ignore test
|
2013-08-17 18:13:13 +08:00 |
yihua.huang
|
1148450ff9
|
update filecache to more useful
|
2013-08-17 18:12:47 +08:00 |
yihua.huang
|
3ba7a76f44
|
add combo extract to replace Extract2 Extract3...
|
2013-08-17 17:23:11 +08:00 |
yihua.huang
|
5cb45af3a4
|
+doc
|
2013-08-17 12:10:34 +08:00 |
yihua.huang
|
a339e4ab5c
|
add jsonpathselector
|
2013-08-12 13:36:44 +08:00 |
yihua.huang
|
9e82256ce3
|
update docs
|
2013-08-12 10:08:20 +08:00 |
yihua.huang
|
f21097421b
|
add new constructor to redisscheduler
|
2013-08-11 18:53:13 +08:00 |
yihua.huang
|
0f2c5b5723
|
update redisscheduler
|
2013-08-11 18:28:12 +08:00 |
yihua.huang
|
19229dd855
|
add JsonFilePageModelPipeline
|
2013-08-10 08:27:14 +08:00 |
yihua.huang
|
21eca688e9
|
complete docs
|
2013-08-09 20:56:33 +08:00 |
yihua.huang
|
17d2d98cec
|
remove invalid @date
|
2013-08-09 20:43:06 +08:00 |
yihua.huang
|
fcfa2c30c7
|
complete docs
|
2013-08-09 20:36:27 +08:00 |
yihua.huang
|
c78de7bcbb
|
update notnull default to false
|
2013-08-08 13:10:05 +08:00 |
yihua.huang
|
521fbad987
|
move xpath2.0 support to separate package
|
2013-08-07 23:21:28 +08:00 |
yihua.huang
|
268bd8d0c4
|
remove saxon to extension
|
2013-08-07 23:04:10 +08:00 |