I built a simple crawler with HttpHelper to fetch https://www.bitstamp.net/api/v2/order_book/btcusd/, but the request fails with "The server committed a protocol violation. Section=ResponseHeader Detail=CR must be followed by LF". It turns out the site uses a third-party cookie verification mechanism (Incapsula). Incapsula's technical support explains it as follows: "Section=ResponseHeader Detail=CR must be followed by LF error occurs as a response to our cookie classification method.
Basically, CR tells the cursor to move to the first position on the same line, while LF tells the cursor to move to the next line. Combining them together (<CR><LF>) makes the same effect as “Enter” does. The request/status line and other header fields must each end with <CR><LF>.
The cookies that Incapsula sends are "broken" on purpose, and they include content whose purpose is to test how the client responds to an irregular cookie - as part of our classification process. While browsers are capable of handling such cookies, most bots aren't, and this is what serves as a first line of defense against them."
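As an aside, to make the CR/LF rule quoted above concrete, here is a minimal sketch of the kind of check a strict header parser performs. It is only my own illustration, not the actual parser in .NET or HttpHelper; the names `HeaderCheck` and `FindBareCr` are hypothetical, and the sample header bytes are made up.

```csharp
using System;
using System.Text;

static class HeaderCheck
{
    // Returns the index of the first CR (0x0D) that is not immediately
    // followed by LF (0x0A), or -1 if every CR is part of a CRLF pair.
    public static int FindBareCr(byte[] rawHeaders)
    {
        for (int i = 0; i < rawHeaders.Length; i++)
        {
            if (rawHeaders[i] != (byte)'\r') continue;
            bool followedByLf = i + 1 < rawHeaders.Length
                                && rawHeaders[i + 1] == (byte)'\n';
            if (!followedByLf) return i;
        }
        return -1;
    }

    static void Main()
    {
        // Hypothetical header blocks, invented for illustration only.
        byte[] wellFormed = Encoding.ASCII.GetBytes(
            "HTTP/1.1 200 OK\r\nSet-Cookie: a=1\r\n\r\n");
        byte[] malformed = Encoding.ASCII.GetBytes(
            "HTTP/1.1 200 OK\r\nSet-Cookie: a=1\rb=2\r\n\r\n");

        // Every CR is followed by LF, so no violation is reported.
        Console.WriteLine(FindBareCr(wellFormed));
        // The CR inside the cookie value stands alone, so its index is reported.
        Console.WriteLine(FindBareCr(malformed));
    }
}
```

A strict client rejects the second response as soon as it meets the lone CR, which matches the wording of the exception; a lenient client, such as a browser, tolerates it.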
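The gist, as I understand it, is that they deliberately send an irregular cookie header (judging by the error message, one containing a CR that is not followed by LF) to tell browsers apart from crawlers: browsers can cope with it, while crawlers generally do not expect such control characters. Someone online suggested adding the following to app.config: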
<system.net>
  <settings>
    <httpWebRequest useUnsafeHeaderParsing="true"/>
  </settings>
</system.net>
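I tried it, but it does not seem to help. I also hardcoded a cookie I had captured into HttpItem; that worked for a while, but recently it stopped working. I would like to ask moderator Sufei or any other expert how I should set my request parameters, or whether there is another way to get past the cookie verification.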