yihua.huang
969ad1766b
change logger style to slf4j for cleaner code
2014-04-06 21:32:20 +08:00
yihua.huang
9b2cb43f47
ConfigurablePageProcessor #91
2014-04-05 23:40:10 +08:00
Bo LIANG
b043ac76d6
change the format strings of the log statements.
To use slf4j, we should put {} placeholders in the format string.
2014-04-05 11:31:56 +08:00
yihua.huang
7aaf837e15
change logger to slf4j style for performance #84
2014-04-04 20:10:00 +08:00
yihua.huang
f9b157951d
Merge branch 'master' of github.com:code4craft/webmagic
2014-04-04 20:01:14 +08:00
yihua.huang
22c394e629
[doc]
2014-04-04 20:00:58 +08:00
Bo LIANG
762a3973fd
Modify the log levels of LocalDuplicatedRemovedScheduler.java
The old version prints a debug-level log each time the push method is
called. So when an HTML page has multiple links to the same page, the
debug log appears more than once. It also prints a log whenever it meets
a duplicate URL, which is confusing.
I changed the level of that log to trace. A debug-level log is now printed
only when a URL is actually pushed into the queue.
2014-04-04 15:53:46 +08:00
yihua.huang
a1c7e826f7
fix dep of slf4j-log4j12
2014-04-03 23:04:31 +08:00
yihua.huang
01848301d4
encode illegal characters in url #80
2014-04-01 22:14:30 +08:00
yihua.huang
2780423e60
enable blank space in quotes in UrlUtils.fixAllRelativeHrefs #80
2014-04-01 20:35:11 +08:00
yihua.huang
97b6f46280
Bugfix: break loop in addTargetRequests #81
2014-04-01 20:12:25 +08:00
yihua.huang
8d8194bee4
Change HashMap to LinkedHashMap in ResultItems for same order of input and output #76
2014-03-25 08:23:20 +08:00
yihua.huang
8b35d79569
Do not cache document in Selectable for selected Html element #73
2014-03-19 22:19:06 +08:00
yihua.huang
6201fd6966
add worker as container
2014-03-17 23:01:58 +08:00
yihua.huang
6c11718566
Clean project structure #70
2014-03-14 23:24:38 +08:00
yihua.huang
9606a173cd
fix ZipCodePageProcessor
2014-03-13 22:55:50 +08:00
yihua.huang
af07280176
remove defensive code for httpclient 4.3.1 because it is fixed in 4.3.3 #59
2014-03-13 07:11:56 +08:00
yihua.huang
55368919df
add attribute 'text' support for CssSelector #66
2014-03-11 13:18:34 +08:00
yihua.huang
88b50d4182
bugfix: cycleTry will not work when spawnUrl is set to false #62
2014-03-04 07:33:07 +08:00
yihua.huang
2768a1cae4
add test for cycleTriedTimes and fix cycleTriedTimes increment error #60
2014-03-01 15:10:38 +08:00
yihua.huang
bbd0d7e600
update httpclient version to 4.3.3 #59
2014-02-28 21:17:02 +08:00
yihua.huang
571061454a
#58 add CYCLE_TRIED_TIMES support to QueueScheduler and PriorityScheduler
2014-02-27 23:54:30 +08:00
yihua.huang
0e98183f74
Change log4j to slf4j #55
2014-02-12 09:35:57 +08:00
yihua.huang
fa33b15843
property loader
2014-02-11 23:07:31 +08:00
yihua.huang
af809c4d55
update version to 0.5.0-snapshot
2014-02-11 22:16:01 +08:00
Almark Ming
2b46b11e55
Update RegexSelector.java
Optimize regex format check
Conflicts:
webmagic-core/src/main/java/us/codecraft/webmagic/selector/RegexSelector.java
2013-12-21 08:38:17 +08:00
yihua.huang
b51fb2696b
update ut for cookie
2013-12-06 00:30:01 +08:00
yihua.huang
ff2f588c41
#48 NullPointerException
2013-12-04 22:11:20 +08:00
yihua.huang
6a828e923c
#46 Downloader thread hangs on timeout
2013-12-03 09:59:54 +08:00
shijinping
9a524aa364
fetch the httpClient again inside the double-check
2013-11-28 14:38:30 +08:00
yihua.huang
59ad4cad27
#42 Add jsonpath in annotation mode for json result
2013-11-28 08:25:16 +08:00
yihua.huang
c2d6d495b3
#41 add getThreadAlive(), getStatus(), getPageCount() to spider
2013-11-28 07:59:24 +08:00
yihua.huang
cf62d707e0
#36 Spider does not exit when success
2013-11-27 23:33:18 +08:00
yihua.huang
a01312930a
#39 Parsing html after page.getHtml()
2013-11-27 22:01:34 +08:00
yihua.huang
f63d33b457
update some comments
2013-11-27 21:06:53 +08:00
yihua.huang
04fcf3193f
#38 Change algorithm of SmartContentSelector
2013-11-23 13:56:55 +08:00
yihua.huang
296a68920e
fix javadoc and add setPipelines() for spider
2013-11-14 13:23:29 +08:00
yihua.huang
47a0360783
#35 add status code to page
2013-11-12 11:51:34 +08:00
yihua.huang
bc5c30de17
update scripts
2013-11-12 08:20:59 +08:00
yihua.huang
df8ca8ad09
add scripts
2013-11-10 22:30:48 +08:00
yihua.huang
fe6d9bb2e2
get keep-alive rework
2013-11-06 21:53:39 +08:00
yihua.huang
fd6d2fd6f8
try to keep the TCP connection alive
2013-11-06 21:19:14 +08:00
yihua.huang
e046bb0723
remove useless code
2013-11-06 12:48:14 +08:00
yihua.huang
6e32a19f80
update api for direct download
2013-11-06 12:46:50 +08:00
yihua.huang
807aefe9df
change EntityUtil to IOUtil because of some encoding errors
2013-11-06 07:37:34 +08:00
yihua.huang
00b0a751b4
#33 ignore 'content-encoding' when redirect
2013-11-06 06:57:58 +08:00
yihua.huang
8f774afc84
add direct download
2013-11-06 06:41:04 +08:00
yihua.huang
c18b603399
optimize long compare
2013-11-04 07:09:44 +08:00
yihua.huang
ed3f3583cc
downloader refactor
2013-11-04 01:03:23 +08:00
yihua.huang
a37f40e6e6
add cookie support
2013-11-04 00:59:48 +08:00