Yihua Huang
3a9c1d3002
Merge pull request #159 from zhugw/patch-3
...
Update Site.java
2014-09-12 13:09:50 +08:00
zhugw
bc666e927d
Update Site.java
...
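The javadoc for setCycleRetryTimes says: Set cycleRetryTimes times when download fail, 0 by default. Only work in RedisScheduler.
However, the source code does not appear to enforce that restriction, i.e. that it can only be used with RedisScheduler. So I would like to ask: is this javadoc outdated?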
2014-09-12 12:42:57 +08:00
yihua.huang
42a30074c9
update urls.contains to DuplicateRemover in FileCacheQueueScheduler #157
2014-09-12 07:52:38 +08:00
Yihua Huang
689e89a9b2
Merge pull request #157 from zhugw/patch-1
...
Update FileCacheQueueScheduler.java
2014-09-12 07:37:56 +08:00
zhugw
1db940a088
Update FileCacheQueueScheduler.java
...
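While using it, I found duplicate URLs in the urls.txt file. Tracing the source code showed that when the file is loaded at initialization, all of its URLs are read into a set, but when URLs to be crawled are added later, nothing checks whether they already exist in that set (i.e. in the file), which leads to duplicate URLs in the file. I have modified the source accordingly and would appreciate the author's review.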
2014-09-11 15:46:09 +08:00
yihua.huang
147401ce5e
remove duplicate setPath in ProxyPool
2014-09-09 22:58:44 +08:00
yihua.huang
3734865a6a
fix package name =.=
2014-08-21 14:39:44 +08:00
yihua.huang
e7668e01b8
fix SourceRegion error and add some tests on it #144
2014-08-21 14:29:06 +08:00
yihua.huang
4e5ba02020
fix test cont'
2014-08-18 11:08:17 +08:00
yihua.huang
4446669c24
fix test
2014-08-18 10:54:24 +08:00
yihua.huang
9866297ec4
Disable jsoup entity escape by Default. Set Html.DISABLE_HTML_ENTITY_ESCAPE to false to enable it. #149
2014-08-14 08:04:56 +08:00
yihua.huang
4e6e946dd7
more friendly exception message in PlainText #144
2014-08-13 10:02:16 +08:00
yihua.huang
ebb931e0bf
update assertj to test scope
2014-06-25 19:01:27 +08:00
yihua.huang
af9939622b
move thread package out of selector (because it was added by mistake at the beginning)
2014-06-25 18:19:50 +08:00
yihua.huang
2fd8f05fe2
change path separator for various OSes #139
2014-06-25 14:55:23 +08:00
yihua.huang
eae37c868b
new sample
2014-06-10 17:38:54 +08:00
yihua.huang
b3a282e58d
some fix for tests #130
2014-06-10 00:05:30 +08:00
yihua.huang
b75e64a61b
Merge branch 'yxssfxwzy-proxy'
2014-06-09 23:51:47 +08:00
yihua.huang
074d767f45
Merge branch 'proxy' of github.com:yxssfxwzy/webmagic into yxssfxwzy-proxy
2014-06-09 23:51:36 +08:00
zwf
2f89cfc31a
add test and fix bug of proxy module
2014-06-09 13:32:02 +08:00
yihua.huang
4efd471840
remove duplicate jar
2014-06-04 22:46:03 +08:00
yihua.huang
435922f00d
Merge branch 'stable' of github.com:code4craft/webmagic
2014-06-04 22:33:58 +08:00
yihua.huang
eb89d66566
fix test
2014-06-04 22:28:27 +08:00
yihua.huang
2a15bc0289
contributor
2014-06-04 22:27:16 +08:00
yihua.huang
5e8ca02ec6
contributor
2014-06-04 22:26:56 +08:00
yihua.huang
baeb919cbe
update bin
2014-06-04 17:38:49 +08:00
yihua.huang
8c33be48a6
Merge branch 'stable' of github.com:code4craft/webmagic
2014-06-04 17:37:45 +08:00
yihua.huang
db0195babb
update version in docs
2014-06-04 17:35:31 +08:00
yihua.huang
5f8c3fd5c5
update version
2014-06-04 17:33:30 +08:00
yihua.huang
0e9042eefa
update pom
2014-06-04 17:17:48 +08:00
yihua.huang
03170178c4
update pom
2014-06-04 17:01:37 +08:00
yihua.huang
c83b74f0f4
update pom for deploy
2014-06-04 16:55:34 +08:00
yihua.huang
7a64847a3c
Bugfix: selector does not work well in element #113
2014-06-03 20:03:33 +08:00
yihua.huang
8d67fd0357
change back return proxy from spider to httpclientdownloader #128
2014-05-28 08:08:51 +08:00
yihua.huang
40bf8ca58f
change return proxy from spider to httpclientdownloader #128
2014-05-28 07:57:42 +08:00
yihua.huang
1f21d9cc14
spell mistake fix #128
2014-05-28 07:29:19 +08:00
Yihua Huang
e310139d00
Merge pull request #128 from yxssfxwzy/proxy
...
Management of multiple proxies
2014-05-28 07:22:08 +08:00
yihua.huang
b165090434
Bugfix: Type conversion error in JsonPathSelector #129
2014-05-27 21:19:22 +08:00
yihua.huang
95bdb30296
update xsoup version to release #113
2014-05-27 20:46:48 +08:00
yihua.huang
a5d1b56e44
fix ut #113
2014-05-27 18:07:53 +08:00
yihua.huang
3939074a23
Bugfix: nodes() only returns the first element #113
2014-05-27 17:53:06 +08:00
yihua.huang
41c2ea9498
refactor of selectable cont' #113
...
1. remove lazy init of Html
2. rename strings to sourceTexts for better meaning
3. make getSourceTexts abstract and DO NOT always store strings
4. instead store parsed elements of document in HtmlNode
2014-05-27 17:34:19 +08:00
yihua.huang
f9825c214a
refactor selectable for html fragment #113
2014-05-27 16:00:51 +08:00
yihua.huang
03d26c169b
Enhance auto charset detect #126
...
1. Only read from content once to fix stream closed exception
2. introduce moco for server testing
2014-05-26 17:45:30 +08:00
zwf
c146e2c7b4
add proxy pool
2014-05-19 15:59:31 +08:00
zwf
07ea04223f
change_gitignore
2014-05-19 15:56:22 +08:00
yihua.huang
21982d3460
remove cpdetector temporarily #126
2014-05-14 23:52:27 +08:00
fengwuze
fcbfb75608
Refactor the code block that automatically obtains the charset from the web page, extracting it into a separate method.
2014-05-14 19:14:42 +08:00
fengwuze
95494d3c4d
Add logic for handling meta.
...
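Remaining issues:
3. When the web page does not specify an encoding, cpdetector is needed, but cpdetector is currently not available in the Maven Central repository, and it is unclear how to resolve this.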
2014-05-14 14:53:54 +08:00
yihua.huang
dde2d89bbe
Ignore brackets within JSON content when removing padding #124
2014-05-08 23:37:18 +08:00