yihua.huang
|
7ffc6998ef
|
add isExtractLinks to OOSpider #575
|
2017-05-27 16:20:06 +08:00 |
soul
|
bc828e1384
|
Fix bug where the formatter was initialized without its parameters being passed
|
2017-05-25 12:17:10 +08:00 |
yihua.huang
|
d38d51dfcb
|
fix javadoc
|
2017-04-15 12:24:50 +08:00 |
yihua.huang
|
1b04a7f2b3
|
#527 move logic check from downloader to spider
|
2017-04-09 09:23:10 +08:00 |
yihua.huang
|
c13110c4cb
|
fix samples
|
2017-03-21 07:53:43 +08:00 |
yihua.huang
|
d87c73b472
|
change check-and-set to atomic sadd for redis DuplicateRemover #368
|
2017-03-01 22:24:34 +08:00 |
yihua.huang
|
3e633c6871
|
version
|
2017-01-21 11:51:14 +08:00 |
yihua.huang
|
f45e2f118b
|
for release
|
2017-01-21 11:38:36 +08:00 |
Jsbd
|
6d78d51fc0
|
Merge branch 'master' into master
|
2016-12-27 14:15:40 +08:00 |
yihua.huang
|
00dfebbceb
|
#424 remove guava dep and add fix docs
|
2016-12-18 10:45:50 +08:00 |
yihua.huang
|
a960a39c44
|
fix compile error for example change
|
2016-12-18 08:32:14 +08:00 |
yihua.huang
|
a3ee9e3d08
|
fix example
|
2016-12-18 08:18:26 +08:00 |
Jsbd
|
1b886d48a2
|
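Add a PhantomJSDownloader constructor that supports a custom crawl.js path, because when another project depends on this jar, runtime.exec() cannot use the crawl.js inside the jar when it runs the phantomjs command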
|
2016-12-08 14:29:42 +08:00 |
Jsbd
|
d1f2e65e5d
|
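Add a PhantomJSDownloader constructor that supports a custom crawl.js path, because when another project depends on this jar, runtime.exec() cannot use the crawl.js inside the jar when it runs the phantomjs command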
|
2016-12-08 14:28:48 +08:00 |
Jsbd
|
ebc61363c8
|
Add a new constructor to PhantomJSDownloader that supports a custom phantomjs command
example:
* phantomjs.exe: supports Windows environments
* phantomjs --ignore-ssl-errors=yes: ignores some errors when the crawled address is https
* /usr/local/bin/phantomjs: absolute path of the command, which avoids an IOException caused by system environment variables
|
2016-12-02 10:17:46 +08:00 |
yihua.huang
|
b92e6b04f0
|
#400 Fix NPE caused by setting a custom DuplicateRemover on FileCacheQueueScheduler
|
2016-11-25 08:30:24 +08:00 |
Yihua Huang
|
1491033534
|
Merge pull request #377 from jerry-sc/monitor-bug
fix the monitor bug where the spider terminates when a seed url contains a port
|
2016-11-19 13:01:30 +08:00 |
Jerry
|
e56b8c3efc
|
fix the monitor bug where the spider terminates when a seed url contains a port
|
2016-09-22 22:36:18 +08:00 |
郭玉昆
|
700898fe8a
|
fixed #301: fix the problem with extracting JSON data using annotations
|
2016-08-29 17:07:37 +08:00 |
Salon.sai
|
f89a6a6826
|
add: redis scheduler with priority
|
2016-07-05 16:29:01 +08:00 |
Yihua Huang
|
37cb43b667
|
Merge pull request #176 from lavenderx/master
add PhantomJSDownloader
|
2016-05-08 20:36:17 +08:00 |
yihua.huang
|
7edfa26f90
|
complete javadoc
|
2016-01-21 18:34:07 +08:00 |
yihua.huang
|
8b90b91e33
|
complete some javadoc
|
2016-01-21 18:14:10 +08:00 |
yihua.huang
|
9c5716a543
|
complete javadoc
|
2016-01-21 18:05:12 +08:00 |
yihua.huang
|
7586e3d75c
|
add some test for github repo downloader
|
2016-01-19 08:05:53 +08:00 |
Yihua Huang
|
cfde3b7657
|
Merge pull request #237 from SpenceZhou/master
Update RedisScheduler.java
|
2015-12-02 22:17:00 +08:00 |
SpenceZhou
|
165e5a72eb
|
Update RedisScheduler.java
Fix bug in RedisScheduler when getting the total crawl count
|
2015-12-02 17:10:42 +08:00 |
x1ny
|
90e14b31b0
|
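Fix FileCacheQueueScheduler preventing the program from exiting normally and leaving streams unclosed
FileCacheQueueScheduler starts a thread that runs periodically to save data, but the thread is not shut down after the spider finishes, so the program cannot exit. The IO streams are not closed either.
Solution:
Make FileCacheQueueScheduler implement the Closeable interface and shut down the thread and the streams in its close method.
Add a call that closes the scheduler to Spider's close method.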
|
2015-11-12 23:10:20 +08:00 |
edwardsbean
|
74962d69b9
|
fix bug: MultiPagePipeline and DoubleKeyMap concurrency bug
|
2015-02-13 15:03:13 +08:00 |
dolphineor
|
7628dc6b63
|
move PhantomJSDownloader to webmagic-extension
|
2014-11-26 19:29:30 +08:00 |
yihua.huang
|
8551b668a0
|
remove commented code
|
2014-09-29 14:51:36 +08:00 |
zhugw
|
eb3c78b9d8
|
Update FileCacheQueueScheduler.java
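Isn't this more rigorous? Otherwise, when restarting after an interruption, the (first) entry URL would still be added to the queue and written to the file.
But there is now another problem: suppose the first pass has crawled everything (determined via spider.getStatus==Stopped), sleeps for 24 hours, and then crawls again (by recursively calling the crawl method).
Unlike a restart after an interruption, here lineReader==cursor, so the queue is empty at initialization and the entry URL is already in the urls set, which makes the crawl thread finish immediately. That leaves no way to crawl content newly added to the site.
Solution 1:
After determining that the crawl is complete, immediately overwrite the cursor file. On the second crawl the cursor is 0, so all URLs in urls.txt are put into the queue, and new URLs can be discovered through them.
Solution 2:
An optimization of Solution 1. Solution 1 meets the business requirement but does a lot of wasted work, such as downloading, extracting, and persisting all the old target URLs again. New content usually shows up under HelpUrl, for example a new post on a page or a few more pages of content. So on the second and later crawls, only the HelpUrl entries need to be put into the queue.
I would appreciate feedback: is my understanding above correct, are there cases I have not considered, or is there a simpler solution? Thanks!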
|
2014-09-14 16:20:03 +08:00 |
yihua.huang
|
42a30074c9
|
update urls.contains to DuplicateRemover in FileCacheQueueScheduler #157
|
2014-09-12 07:52:38 +08:00 |
zhugw
|
1db940a088
|
Update FileCacheQueueScheduler.java
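While using it I found duplicate URLs in urls.txt. Tracing the source code, I found that after the file is loaded at initialization, all URLs are read into a set, but when URLs to crawl are added later there is no check for whether they already exist in that set (that is, in the file), which leads to duplicate URLs in the file. I modified the source accordingly and would appreciate the author's review.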
|
2014-09-11 15:46:09 +08:00 |
yihua.huang
|
3734865a6a
|
fix package name =.=
|
2014-08-21 14:39:44 +08:00 |
yihua.huang
|
e7668e01b8
|
fix SourceRegion error and add some tests on it #144
|
2014-08-21 14:29:06 +08:00 |
yihua.huang
|
4e5ba02020
|
fix test cont'
|
2014-08-18 11:08:17 +08:00 |
yihua.huang
|
2fd8f05fe2
|
change path separator for various OS #139
|
2014-06-25 14:55:23 +08:00 |
yihua.huang
|
928f98dd93
|
auto create folder in JsonFilePipeline #122
|
2014-05-08 15:12:17 +08:00 |
yihua.huang
|
7fbe18b8c0
|
implementation of PageMapper #120
|
2014-05-05 08:01:39 +08:00 |
yihua.huang
|
5dc9fe95a9
|
interface of PageMapper #120
|
2014-05-05 07:43:32 +08:00 |
yihua.huang
|
186b90512e
|
refactor redisscheduler #118
|
2014-05-02 20:24:15 +08:00 |
yihua.huang
|
d1140b9e29
|
add bloom filter for scheduler #118
|
2014-05-02 20:20:22 +08:00 |
yihua.huang
|
e8d4a9be2b
|
fix remove duplicate error #117
|
2014-04-29 20:32:06 +08:00 |
yihua.huang
|
1104122979
|
more abstraction in scheduler
|
2014-04-27 09:30:01 +08:00 |
yihua.huang
|
b0fb1c3e10
|
remove copy-dependencies plugin for m2e error
|
2014-04-27 08:22:33 +08:00 |
yihua.huang
|
94a67165e1
|
remove jmx server for simplicity #98
|
2014-04-26 20:17:52 +08:00 |
yihua.huang
|
86a45a6643
|
change SpiderMonitor to singleton #98
|
2014-04-26 18:14:25 +08:00 |
yihua.huang
|
ab4d36806e
|
clean code
|
2014-04-26 11:45:21 +08:00 |
yihua.huang
|
04fde8203b
|
add control for monitor
|
2014-04-26 11:44:14 +08:00 |
yihua.huang
|
2770811a10
|
update monitor example
|
2014-04-26 11:24:22 +08:00 |
yihua.huang
|
17e95f2a7f
|
comments
|
2014-04-25 18:39:01 +08:00 |
yihua.huang
|
375e64e845
|
more monitor status
|
2014-04-25 18:10:14 +08:00 |
yihua.huang
|
c6661899fd
|
new thread pool #110
|
2014-04-25 17:33:48 +08:00 |
yihua.huang
|
179baa7a22
|
return when page is null
|
2014-04-25 16:07:41 +08:00 |
yihua.huang
|
4738ae2d14
|
change url find to match #94
|
2014-04-25 16:04:41 +08:00 |
yihua.huang
|
f973889cda
|
refactor SubPageProcessor etc. #94
|
2014-04-25 15:48:05 +08:00 |
yihua.huang
|
acb63d55d7
|
some check and example #98
|
2014-04-25 13:26:08 +08:00 |
yihua.huang
|
11ba5beb42
|
[refactor]move monitor to webmagic-extension #98
|
2014-04-25 13:17:13 +08:00 |
yihua.huang
|
b06aa489fb
|
[BugFix]Only one url from sourceRegion can be extracted #107
|
2014-04-18 17:48:26 +08:00 |
yihua.huang
|
023c2ac84e
|
spider config draft
|
2014-04-17 16:44:32 +08:00 |
yihua.huang
|
a5db6cf292
|
some monitor and JMX support #98
|
2014-04-17 00:35:09 +08:00 |
yihua.huang
|
aae1ab2cd6
|
fix compile error
|
2014-04-16 18:14:13 +08:00 |
yihua.huang
|
1fbfc92de2
|
Inherit support of Field annotation in Model #103
|
2014-04-16 18:13:44 +08:00 |
yihua.huang
|
3a79b1b64a
|
[Bugfix]formatter property does not work when field is String #100
|
2014-04-13 23:02:34 +08:00 |
Yihua Huang
|
cc9d319fd9
|
Merge pull request #94 from sebastian1118/master
update:PatternHandler
|
2014-04-13 13:16:20 +08:00 |
yihua.huang
|
03c251237b
|
add Json parse support
|
2014-04-13 10:23:00 +08:00 |
Tian
|
99e12aafaa
|
update:PatternHandler
|
2014-04-13 10:14:39 +08:00 |
yihua.huang
|
c1e7207869
|
add FileCacheQueueScheduler support for cycleRetryTimes
|
2014-04-07 11:00:09 +08:00 |
yihua.huang
|
969ad1766b
|
change logger style to slf4j for cleaner code
|
2014-04-06 21:32:20 +08:00 |
yihua.huang
|
9b2cb43f47
|
ConfigurablePageProcessor #91
|
2014-04-05 23:40:10 +08:00 |
Bo LIANG
|
159eeea2f5
|
Remove unused variable to make the project cleaner.
|
2014-04-05 18:32:12 +08:00 |
yihua.huang
|
c143fc662c
|
add SubPageProcessor #86
|
2014-04-05 18:17:48 +08:00 |
Yihua Huang
|
474f785dab
|
Merge pull request #86 from sebastian1118/master
new feature: PatternProcessor
|
2014-04-04 23:41:27 +08:00 |
Tian
|
38a12f8641
|
new feature: PatternProcessor
|
2014-04-04 22:02:52 +08:00 |
yihua.huang
|
dafd0b5875
|
[BugFix]multi model in one pageprocessor will be skipped #85
|
2014-04-04 20:36:31 +08:00 |
yihua.huang
|
8958d774f2
|
add default values for @Formatter
|
2014-03-24 13:52:17 +08:00 |
yihua.huang
|
6c11718566
|
Clean project structure #70
|
2014-03-14 23:24:38 +08:00 |
yihua.huang
|
0e98183f74
|
Change log4j to slf4j #55
|
2014-02-12 09:35:57 +08:00 |
yihua.huang
|
fa33b15843
|
property loader
|
2014-02-11 23:07:31 +08:00 |
yihua.huang
|
362fdd0662
|
Merge branch 'master' of github.com:code4craft/webmagic
|
2014-02-11 22:23:56 +08:00 |
yihua.huang
|
af809c4d55
|
update version to 0.5.0-snapshot
|
2014-02-11 22:16:01 +08:00 |
jon
|
a722f9bb66
|
Fix NullPointerException thrown because fileCursor in FileCacheQueueScheduler is not initialized when the file is opened again
|
2014-01-08 21:24:58 +08:00 |
yihua.huang
|
486d9d276f
|
#45 Remove multi in ExtractBy
|
2013-11-28 18:23:51 +08:00 |
yihua.huang
|
18a3af4a0a
|
add more sample for jsonpath #42
|
2013-11-28 09:58:22 +08:00 |
yihua.huang
|
59ad4cad27
|
#42 Add jsonpath in annotation mode for json result
|
2013-11-28 08:25:16 +08:00 |
yihua.huang
|
cf62d707e0
|
#36 Spider does not exit when success
|
2013-11-27 23:33:18 +08:00 |
yihua.huang
|
a01312930a
|
#39 Parsing html after page.getHtml()
|
2013-11-27 22:01:34 +08:00 |
yihua.huang
|
b838c4e433
|
#34 Close reader in FileCacheQueueScheduler
|
2013-11-08 14:59:09 +08:00 |
yihua.huang
|
fd6d2fd6f8
|
try to keepalive TCP connection
|
2013-11-06 21:19:14 +08:00 |
yihua.huang
|
e046bb0723
|
remove useless code
|
2013-11-06 12:48:14 +08:00 |
yihua.huang
|
6e32a19f80
|
update api for direct download
|
2013-11-06 12:46:50 +08:00 |
yihua.huang
|
807aefe9df
|
change EntityUtil to IOUtil because of some encoding errors
|
2013-11-06 07:37:34 +08:00 |
yihua.huang
|
8f774afc84
|
add direct download
|
2013-11-06 06:41:04 +08:00 |
yihua.huang
|
2e496402dc
|
add more warning for 0.3.3
|
2013-10-24 13:16:48 +08:00 |
yihua.huang
|
1a2c84ea78
|
#27 add timeout config to site
|
2013-10-11 07:36:16 +08:00 |
yihua.huang
|
3b00190f99
|
api without implementation for #28: add specific url crawl
|
2013-10-10 00:40:44 +08:00 |
yihua.huang
|
6f18eec77e
|
fix a test error
|
2013-09-23 13:07:33 +08:00 |
yihua.huang
|
b131878123
|
add example
|
2013-09-23 13:01:28 +08:00 |
yihua.huang
|
95ab4edec3
|
some bugfix
|
2013-09-23 08:38:54 +08:00 |