yihua.huang
507556d0aa
fix test: ProxyTest.testProxy() does not load existing proxy config
2016-11-19 12:54:39 +08:00
Jerry
e56b8c3efc
fix the monitor bug where the spider terminates when a seed URL contains a port
2016-09-22 22:36:18 +08:00
yihua.huang
448e528140
update StringUtils to apache lang3 #314
2016-05-24 13:33:17 +08:00
yihua.huang
3e33959b7a
#319 fix javadoc
2016-05-24 13:17:35 +08:00
yihua.huang
8730e3e97a
Merge branch 'fix' of git://github.com/kapsterio/webmagic into kapsterio-fix
2016-05-08 20:46:22 +08:00
yihua.huang
2400ff7e1a
resolve conflict
2016-05-08 20:31:43 +08:00
yihua.huang
b7f3c4bba0
Merge branch 'master' of git://github.com/hepan/webmagic into hepan-master
2016-05-08 20:27:47 +08:00
yihua.huang
d8f978fd20
fix test in JsonPathSelectorTest #289
2016-05-08 19:32:03 +08:00
yihua.huang
61c28a0130
refactor on proxypool
2016-05-08 17:53:15 +08:00
yihua.huang
b871b210c5
Merge branch 'proxy-strategy' of github.com:EdwardsBean/webmagic into EdwardsBean-proxy-strategy
2016-05-08 17:53:02 +08:00
yihua.huang
b5413368de
update ut
2016-05-08 16:23:41 +08:00
Jon
83c27ebbc4
Add IP proxy authentication
2016-05-08 16:17:58 +08:00
yihua.huang
ca072c5575
fix URL regex in GithubRepoPageProcessor #305
2016-05-08 12:09:45 +08:00
hepan
89c6e52863
Add username/password authentication for proxies
2016-04-13 15:16:57 +08:00
Linker Lin
047cb8ff8f
updated versions to 0.5.4-SNAPSHOT
2016-04-01 14:51:59 +08:00
zhangheng09
6b179c3d55
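This change is motivated by two points: 1) a proxy should be returned to the proxy pool as early as possible, right after the HTTP request completes; 2) HTTP proxies are the concern of HttpClientDownloader and should not be handled by Spider, since Spider does not know that its downloader is an HttpClientDownloader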
2016-03-12 20:09:41 +08:00
zhangheng09
5f106c9c69
When page is null, the response status is abnormal and an exception should be thrown; otherwise both the onSuccess and onError methods of SpiderListener are executed
2016-03-12 20:03:27 +08:00
yihua.huang
c0b8e8f8ae
remove .classpath .project
2016-01-22 14:58:22 +08:00
yihua.huang
a8e6de4b90
Merge branch 'master' of git.oschina.net:flashsword20/webmagic
2016-01-22 10:16:58 +08:00
yihua.huang
0fd4623f0a
Merge branch 'osc'
2016-01-21 19:33:30 +08:00
yihua.huang
ce5495ecd5
remove useless files
2016-01-21 19:31:50 +08:00
yihua.huang
8265c7dade
remove submodules for release
2016-01-21 19:25:13 +08:00
yihua.huang
7edfa26f90
complete javadoc
2016-01-21 18:34:07 +08:00
yihua.huang
8b90b91e33
complete some javadoc
2016-01-21 18:14:10 +08:00
yihua.huang
2b556cf053
update version to 0.5.3-SNAPSHOT
2016-01-21 18:05:56 +08:00
yihua.huang
9c5716a543
complete javadoc
2016-01-21 18:05:12 +08:00
yihua.huang
db3cbf6ca5
update version to 0.5.3-SNAPSHOT
2016-01-21 17:58:36 +08:00
yihua.huang
81ce1ffc5f
fix ignore
2016-01-21 12:36:49 +08:00
yihua.huang
93764fa2c9
ignore some test
2016-01-21 12:28:32 +08:00
yihua.huang
5706bb90af
update xsoup to 0.3.1
2016-01-20 12:59:11 +08:00
yihua.huang
7586e3d75c
add some test for github repo downloader
2016-01-19 08:05:53 +08:00
x1ny
90e14b31b0
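Fix FileCacheQueueScheduler preventing the program from exiting normally and leaving streams unclosed
...
FileCacheQueueScheduler starts a thread that runs periodically to save data, but the thread is not stopped after the crawl finishes, so the program cannot exit; the I/O streams are also never closed.
Solution:
Make FileCacheQueueScheduler implement the Closeable interface and shut down the thread and the streams in its close method.
Close the scheduler in Spider's close method.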
2015-11-12 23:10:20 +08:00
yihua.huang
56e0cd513a
compile error fix
2015-04-15 23:21:06 +08:00
yihua.huang
c5740b1840
change assert #200
2015-04-15 08:32:08 +08:00
yihua.huang
67eb632f4d
test for issue #200
2015-04-15 08:31:45 +08:00
高军
590561a6e4
Fix bug where site.setHttpProxy() has no effect
2015-03-09 15:54:15 +08:00
edwardsbean
19474e4716
add SimpleProxyPool and IProxyPool
2015-02-28 17:50:10 +08:00
edwardsbean
4978665633
add retry sleep time
2015-01-21 13:30:02 +08:00
yihua.huang
8ffc1a7093
add NPE check for POST method
2015-01-13 14:10:00 +08:00
zhugw
bc666e927d
Update Site.java
...
The javadoc of setCycleRetryTimes says: Set cycleRetryTimes times when download fail, 0 by default. Only work in RedisScheduler.
However, the source code does not appear to enforce this restriction (that it only works with RedisScheduler). Is this javadoc out of date?
2014-09-12 12:42:57 +08:00
yihua.huang
147401ce5e
remove duplicate setPath in ProxyPool
2014-09-09 22:58:44 +08:00
yihua.huang
e7668e01b8
fix SourceRegion error and add some tests on it #144
2014-08-21 14:29:06 +08:00
yihua.huang
4446669c24
fix test
2014-08-18 10:54:24 +08:00
yihua.huang
9866297ec4
Disable jsoup entity escape by Default. Set Html.DISABLE_HTML_ENTITY_ESCAPE to false to enable it. #149
2014-08-14 08:04:56 +08:00
yihua.huang
4e6e946dd7
more friendly exception message in PlainText #144
2014-08-13 10:02:16 +08:00
yihua.huang
af9939622b
move thread package out of selector (because it was added there by mistake at the beginning)
2014-06-25 18:19:50 +08:00
yihua.huang
eae37c868b
new sample
2014-06-10 17:38:54 +08:00
yihua.huang
b3a282e58d
some fix for tests #130
2014-06-10 00:05:30 +08:00
yihua.huang
074d767f45
Merge branch 'proxy' of github.com:yxssfxwzy/webmagic into yxssfxwzy-proxy
2014-06-09 23:51:36 +08:00
zwf
2f89cfc31a
add test and fix bug of proxy module
2014-06-09 13:32:02 +08:00
yihua.huang
eb89d66566
fix test
2014-06-04 22:28:27 +08:00
yihua.huang
5e8ca02ec6
contributor
2014-06-04 22:26:56 +08:00
yihua.huang
8c33be48a6
Merge branch 'stable' of github.com:code4craft/webmagic
2014-06-04 17:37:45 +08:00
yihua.huang
5f8c3fd5c5
update version
2014-06-04 17:33:30 +08:00
yihua.huang
7a64847a3c
Bugfix: selector does not work well in element #113
2014-06-03 20:03:33 +08:00
yihua.huang
8d67fd0357
change back return proxy from spider to httpclientdownloader #128
2014-05-28 08:08:51 +08:00
yihua.huang
40bf8ca58f
change return proxy from spider to httpclientdownloader #128
2014-05-28 07:57:42 +08:00
yihua.huang
1f21d9cc14
spelling mistake fix #128
2014-05-28 07:29:19 +08:00
Yihua Huang
e310139d00
Merge pull request #128 from yxssfxwzy/proxy
...
Management of multiple proxies
2014-05-28 07:22:08 +08:00
yihua.huang
b165090434
Bugfix:Type convert error in JsonPathSelector #129
2014-05-27 21:19:22 +08:00
yihua.huang
a5d1b56e44
fix ut #113
2014-05-27 18:07:53 +08:00
yihua.huang
3939074a23
Bugfix: nodes() only return the first element #113
2014-05-27 17:53:06 +08:00
yihua.huang
41c2ea9498
refactor of selectable cont' #113
...
1. remove lazy init of Html
2. rename strings to sourceTexts for better meaning
3. make getSourceTexts abstract and DO NOT always store strings
4. instead store parsed elements of document in HtmlNode
2014-05-27 17:34:19 +08:00
yihua.huang
f9825c214a
refactor selectable for html fragment #113
2014-05-27 16:00:51 +08:00
yihua.huang
03d26c169b
Enhance auto charset detect #126
...
1. Only read from content once to fix stream closed exception
2. introduce moco for server tests
2014-05-26 17:45:30 +08:00
zwf
c146e2c7b4
add proxy pool
2014-05-19 15:59:31 +08:00
yihua.huang
21982d3460
remove cpdetector temporary #126
2014-05-14 23:52:27 +08:00
fengwuze
fcbfb75608
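Refactor the code block that automatically detects the character encoding from the web page, extracting it into a separate method.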
2014-05-14 19:14:42 +08:00
fengwuze
95494d3c4d
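Add logic for handling meta tags.
...
Remaining issues:
3. When the web page does not specify an encoding, cpdetector is needed, but cpdetector is currently not available in the Maven Central repository, and it is unclear how to resolve this.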
2014-05-14 14:53:54 +08:00
yihua.huang
dde2d89bbe
Ignore content in json when bracket when remove padding #124
2014-05-08 23:37:18 +08:00
ywooer
259f0a16c5
Update FilePipeline.java
2014-05-06 18:33:00 +08:00
ywooer
26d38851b5
add charset to Writer
2014-05-06 18:28:50 +08:00
yihua.huang
7668731f08
update version to snapshot
2014-05-05 07:03:55 +08:00
yihua.huang
182dd51689
Merge branch 'stable' of github.com:code4craft/webmagic
2014-05-03 06:19:11 +08:00
yihua.huang
81e6e772ac
versions back to 0.5.1
2014-05-03 06:18:57 +08:00
yihua.huang
feb604da87
Merge branch 'stable' of github.com:code4craft/webmagic
2014-05-03 06:14:54 +08:00
yihua.huang
358e906379
[maven-release-plugin] prepare for next development iteration
2014-05-03 00:00:13 +08:00
yihua.huang
470750fc0d
[maven-release-plugin] prepare release WebMagic-0.5.1
2014-05-02 23:59:55 +08:00
yihua.huang
01aec7e1ab
extension point of geturl #118
2014-05-02 23:23:23 +08:00
yihua.huang
ec1c2e8cbc
test and so on
2014-05-02 23:19:11 +08:00
yihua.huang
4f22f1210e
some bug fix #118
2014-05-02 20:38:49 +08:00
yihua.huang
56f033ce8d
set setDuplicateRemover for chain api #118
2014-05-02 20:21:23 +08:00
yihua.huang
d1140b9e29
add bloom filter for scheduler #118
2014-05-02 20:20:22 +08:00
yihua.huang
8e4814bdc5
fix path separator
2014-05-02 17:06:34 +08:00
yihua.huang
e8d4a9be2b
fix remove duplicate error #117
2014-04-29 20:32:06 +08:00
yihua.huang
04ade75606
Merge branch 'stable' of github.com:code4craft/webmagic
...
Conflicts:
README.md
pom.xml
webmagic-avalon/pom.xml
webmagic-core/pom.xml
webmagic-extension/pom.xml
webmagic-lucene/pom.xml
webmagic-samples/pom.xml
webmagic-saxon/pom.xml
webmagic-scripts/pom.xml
webmagic-selenium/pom.xml
2014-04-27 15:03:15 +08:00
yihua.huang
a08d8cb167
update version
2014-04-27 14:59:48 +08:00
yihua.huang
42a2676e8c
update version
2014-04-27 14:56:21 +08:00
yihua.huang
c25b32f1ca
[maven-release-plugin] prepare for next development iteration
2014-04-27 12:52:27 +08:00
yihua.huang
7ff83bb11a
[maven-release-plugin] prepare release WebMagic-0.5.0
2014-04-27 12:52:12 +08:00
yihua.huang
1104122979
more abstraction in scheduler
2014-04-27 09:30:01 +08:00
yihua.huang
2770811a10
update monitor example
2014-04-26 11:24:22 +08:00
yihua.huang
5ecd909ef2
add timeout for wait/notify #111
2014-04-25 19:41:36 +08:00
yihua.huang
c7afdb516e
remove thread utils #110
2014-04-25 18:44:45 +08:00
yihua.huang
17e95f2a7f
comments
2014-04-25 18:39:01 +08:00
yihua.huang
05eb7831b6
refactor and comments #110
2014-04-25 18:27:40 +08:00
yihua.huang
375e64e845
more monitor status
2014-04-25 18:10:14 +08:00
yihua.huang
018061d2cd
fix error in thread pool
2014-04-25 18:01:02 +08:00
yihua.huang
cdc423f2bf
log
2014-04-25 17:41:41 +08:00
yihua.huang
c6661899fd
new thread pool #110
2014-04-25 17:33:48 +08:00