yihua.huang
42a30074c9
update urls.contains to DuplicateRemover in FileCacheQueueScheduler #157
2014-09-12 07:52:38 +08:00
Yihua Huang
689e89a9b2
Merge pull request #157 from zhugw/patch-1
...
Update FileCacheQueueScheduler.java
2014-09-12 07:37:56 +08:00
zhugw
1db940a088
Update FileCacheQueueScheduler.java
...
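While using it, I found that the urls.txt file contained duplicate URLs. Tracing the source code showed that after the file is loaded at initialization, all URLs are read into a set, but when URLs to be crawled are added later, there is no check for whether they already exist in that set (i.e., in the file), which leads to duplicate URLs in the file. I have modified the source accordingly and would appreciate the author's review.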
2014-09-11 15:46:09 +08:00
yihua.huang
147401ce5e
remove duplicate setPath in ProxyPool
2014-09-09 22:58:44 +08:00
yihua.huang
3734865a6a
fix package name =.=
2014-08-21 14:39:44 +08:00
yihua.huang
e7668e01b8
fix SourceRegion error and add some tests on it #144
2014-08-21 14:29:06 +08:00
yihua.huang
4e5ba02020
fix test cont'
2014-08-18 11:08:17 +08:00
yihua.huang
4446669c24
fix test
2014-08-18 10:54:24 +08:00
yihua.huang
9866297ec4
Disable jsoup entity escape by Default. Set Html.DISABLE_HTML_ENTITY_ESCAPE to false to enable it. #149
2014-08-14 08:04:56 +08:00
yihua.huang
4e6e946dd7
more friendly exception message in PlainText #144
2014-08-13 10:02:16 +08:00
yihua.huang
ebb931e0bf
update assertj to test scope
2014-06-25 19:01:27 +08:00
yihua.huang
af9939622b
move thread package out of selector (because it was added there by mistake at the beginning)
2014-06-25 18:19:50 +08:00
yihua.huang
2fd8f05fe2
change path separator for different OSes #139
2014-06-25 14:55:23 +08:00
yihua.huang
eae37c868b
new sample
2014-06-10 17:38:54 +08:00
yihua.huang
b3a282e58d
some fix for tests #130
2014-06-10 00:05:30 +08:00
yihua.huang
b75e64a61b
Merge branch 'yxssfxwzy-proxy'
2014-06-09 23:51:47 +08:00
yihua.huang
074d767f45
Merge branch 'proxy' of github.com:yxssfxwzy/webmagic into yxssfxwzy-proxy
2014-06-09 23:51:36 +08:00
zwf
2f89cfc31a
add test and fix bug of proxy module
2014-06-09 13:32:02 +08:00
yihua.huang
eb89d66566
fix test
2014-06-04 22:28:27 +08:00
yihua.huang
2a15bc0289
contributor
2014-06-04 22:27:16 +08:00
yihua.huang
5e8ca02ec6
contributor
2014-06-04 22:26:56 +08:00
yihua.huang
db0195babb
update version in docs
2014-06-04 17:35:31 +08:00
yihua.huang
5f8c3fd5c5
update version
2014-06-04 17:33:30 +08:00
yihua.huang
0e9042eefa
update pom
2014-06-04 17:17:48 +08:00
yihua.huang
03170178c4
update pom
2014-06-04 17:01:37 +08:00
yihua.huang
c83b74f0f4
update pom for deploy
2014-06-04 16:55:34 +08:00
yihua.huang
7a64847a3c
Bugfix: selector does not work well in element #113
2014-06-03 20:03:33 +08:00
yihua.huang
8d67fd0357
change back return proxy from spider to httpclientdownloader #128
2014-05-28 08:08:51 +08:00
yihua.huang
40bf8ca58f
change return proxy from spider to httpclientdownloader #128
2014-05-28 07:57:42 +08:00
yihua.huang
1f21d9cc14
spell mistake fix #128
2014-05-28 07:29:19 +08:00
Yihua Huang
e310139d00
Merge pull request #128 from yxssfxwzy/proxy
...
Management of multiple proxies
2014-05-28 07:22:08 +08:00
yihua.huang
b165090434
Bugfix: type conversion error in JsonPathSelector #129
2014-05-27 21:19:22 +08:00
yihua.huang
95bdb30296
update xsoup version to release #113
2014-05-27 20:46:48 +08:00
yihua.huang
a5d1b56e44
fix ut #113
2014-05-27 18:07:53 +08:00
yihua.huang
3939074a23
Bugfix: nodes() only returns the first element #113
2014-05-27 17:53:06 +08:00
yihua.huang
41c2ea9498
refactor of selectable cont' #113
...
1. remove lazy init of Html
2. rename strings to sourceTexts for better meaning
3. make getSourceTexts abstract and DO NOT always store strings
4. instead store parsed elements of document in HtmlNode
2014-05-27 17:34:19 +08:00
yihua.huang
f9825c214a
refactor selectable for html fragment #113
2014-05-27 16:00:51 +08:00
yihua.huang
03d26c169b
Enhance auto charset detect #126
...
1. Only read from content once to fix stream closed exception
2. introduce moco as the server for tests
2014-05-26 17:45:30 +08:00
zwf
c146e2c7b4
add proxy pool
2014-05-19 15:59:31 +08:00
zwf
07ea04223f
change_gitignore
2014-05-19 15:56:22 +08:00
yihua.huang
21982d3460
remove cpdetector temporary #126
2014-05-14 23:52:27 +08:00
fengwuze
fcbfb75608
Refactor the code block that automatically detects the charset from the web page, extracting it into a separate method.
2014-05-14 19:14:42 +08:00
fengwuze
95494d3c4d
Add logic for handling the meta tag.
...
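Remaining issues:
3. When the web page does not specify an encoding, cpdetector is needed, but cpdetector is currently not available in the Maven Central repository, and it is unclear how to resolve this.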
2014-05-14 14:53:54 +08:00
yihua.huang
dde2d89bbe
Ignore brackets inside the json content when removing padding #124
2014-05-08 23:37:18 +08:00
Yihua Huang
2913da4763
Merge pull request #123 from gsh199449/master
...
Update JsonFilePipeline.java #122
2014-05-08 15:20:02 +08:00
yihua.huang
928f98dd93
auto create folder in JsonFilePipeline #122
2014-05-08 15:12:17 +08:00
GaoShen
5883ed93d7
Update JsonFilePipeline.java
...
JsonFilePipeline can automatically create folders that do not yet exist
2014-05-08 15:08:55 +08:00
Yihua Huang
4e65dac249
Merge pull request #121 from ywooer/master
...
Create a Writer with the specified encoding
2014-05-06 20:14:35 +08:00
ywooer
259f0a16c5
Update FilePipeline.java
2014-05-06 18:33:00 +08:00
ywooer
26d38851b5
add charset to Writer
2014-05-06 18:28:50 +08:00