Web crawling is one of the hottest topics around right now, and in crawling technology Python holds most of the territory. Can the .NET family crawl too? Absolutely. C# still counts as fairly popular, while VB.NET has lost most of its jobs in China to C#; yet judging by the July programming-language popularity ranking, its worldwide standing is not bad at all, a mere 0.3% below C# and well ahead of JS and PHP. So let's pick up the two veterans of the .NET family, C# and VB.NET, and walk step by step through writing your first solid crawler. This article is aimed at readers with some grounding who already know the basic syntax; consider it a modest starting point for discussion.

Web crawlers

What is a crawler? Put simply, it is a technique for fetching the HTML source of a web page, typically used for talking to data interfaces, harvesting information, and so on. Getting the page source is only step one; you still have to extract the useful data from what you fetched, such as the page title, the page content, images, and so on. To sum up, crawling really consists of just two steps: fetch the page source > parse it and pull out the useful data. Below are three different fetching approaches, each suited to different scenarios.

Plain fetching of the page source

After opening a page, most browsers offer a "View source" item on the right-click menu. Inside that long stretch of HTML you can find the page title, the content, the images and everything else the page displays. When you need to harvest content from many URLs on a site, opening each page one by one, right-clicking to view the source and copying what you need by hand is asking far too much; that is when a crawler should fetch the data in bulk for you, and fetching each URL's source is the first step. Under .NET there are several ways to get a page's source, all of them by issuing HTTP requests: in the common case you can use the system's built-in XMLHTTP object, the WebClient class or the HttpWebRequest class, and if you are up to it you can roll your own with raw sockets. HttpWebRequest is the recommended route here. Below is the C# function that fetches page source via HttpWebRequest:

public string GetHtmlStr(string url)
{
    try
    {
        Uri uri = new Uri(url);
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
        // Set only the value here; the header name "User-Agent:" must not be part of it
        request.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705)";
        request.Accept = "*/*";
        HttpWebResponse response = (HttpWebResponse)request.GetResponse();
        Stream s = response.GetResponseStream();
        // The encoding must match the target page, or the result is garbled
        StreamReader sr = new StreamReader(s, System.Text.Encoding.GetEncoding("utf-8"));
        string html = sr.ReadToEnd();
        s.Close();
        response.Close();
        return html;
    }
    catch (Exception)
    {
        return "/error/";
    }
}

And here is the promised VB.NET version:

Public Function GetHtmlStr(ByVal url As String) As String
    Try
        Dim uri As Uri = New Uri(url)
        Dim request As HttpWebRequest = CType(WebRequest.Create(uri), HttpWebRequest)
        ' Set only the value here; the header name "User-Agent:" must not be part of it
        request.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705)"
        request.Accept = "*/*"
        Dim response As HttpWebResponse = CType(request.GetResponse(), HttpWebResponse)
        Dim s As Stream = response.GetResponseStream()
        ' The encoding must match the target page, or the result is garbled
        Dim sr As StreamReader = New StreamReader(s, System.Text.Encoding.GetEncoding("utf-8"))
        Dim html As String = sr.ReadToEnd()
        s.Close()
        response.Close()
        Return html
    Catch ex As Exception
        Return "/error/"
    End Try
End Function
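Before moving on, here is a minimal C# usage sketch; the URL is just a placeholder:

// Fetch a page and check for the error sentinel before using the result
string html = GetHtmlStr("https://example.com/");
if (html != "/error/")
{
    // work with the page source here
}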
This page-fetching function is fully self-contained, so calling it is as simple as the sketch above: GetHtmlStr("some-url"). If an error occurs, it returns the string /error/. Two things deserve special attention here: the utf-8 encoding in the code (a wrong encoding turns the result into garbage) and the namespace imports the code relies on (System.Net and System.IO). Quite a few readers will also find that the code above fails on some https URLs, typically when certificate validation goes wrong. Don't worry: a small modification adds support.

Imports System.Net.Security
Imports System.Security.Authentication
Imports System.Security.Cryptography.X509Certificates

' Accept any certificate, including expired or untrusted ones
Public Function CheckValidationResult(ByVal sender As Object, ByVal certificate As X509Certificate, ByVal chain As X509Chain, ByVal errors As SslPolicyErrors) As Boolean
    Return True
End Function

Then, inside the GetHtmlStr function above, add the following line right after the HttpWebRequest is created:

ServicePointManager.ServerCertificateValidationCallback = New System.Net.Security.RemoteCertificateValidationCallback(AddressOf CheckValidationResult)

With that change in place, every https URL can be fetched normally; even https sites whose SSL certificate has problems (expired, untrusted) are fetched without trouble.
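For readers following along in C#, here is a minimal sketch of the same certificate-validation tweak; it simply mirrors the VB.NET helper above:

using System.Net;
using System.Net.Security;
using System.Security.Cryptography.X509Certificates;

// Accept any certificate, including expired or untrusted ones
public bool CheckValidationResult(object sender, X509Certificate certificate, X509Chain chain, SslPolicyErrors errors)
{
    return true;
}

As in the VB.NET version, register it inside GetHtmlStr right after the HttpWebRequest is created:

ServicePointManager.ServerCertificateValidationCallback = new RemoteCertificateValidationCallback(CheckValidationResult);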
To sum up briefly: the fetching approach above suits ordinary http or https URLs that can simply be opened and read.

GET and POST with cookies attached

What we have covered so far only grabs a page's source, which is fairly basic. Next come the GET and POST variants that carry a cookie. Implementing POST and GET with cookies attached lets you crawl one layer deeper: pages that only display after logging into an account can all be fetched normally. The VB.NET code:

Public Function GetHtmlStr(ByVal url As String, cookies As String) As String
    Try
        ' Load the caller-supplied cookie string into a CookieContainer
        Dim ck As New CookieContainer
        ck.SetCookies(New Uri(url), cookies)
        Dim uri As Uri = New Uri(url)
        Dim request As HttpWebRequest = CType(WebRequest.Create(uri), HttpWebRequest)
        request.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705)"
        request.Accept = "*/*"
        request.CookieContainer = ck
        Dim response As HttpWebResponse = CType(request.GetResponse(), HttpWebResponse)
        Dim s As Stream = response.GetResponseStream()
        Dim sr As StreamReader = New StreamReader(s, System.Text.Encoding.GetEncoding("utf-8"))
        Dim html As String = sr.ReadToEnd()
        s.Close()
        response.Close()
        Return html
    Catch ex As Exception
        Return "/error/"
    End Try
End Function

The C# code:

public string GetHtmlStr(string url, string cookies)
{
    try
    {
        // Load the caller-supplied cookie string into a CookieContainer
        CookieContainer ck = new CookieContainer();
        ck.SetCookies(new Uri(url), cookies);
        Uri uri = new Uri(url);
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
        request.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705)";
        request.Accept = "*/*";
        request.CookieContainer = ck;
        HttpWebResponse response = (HttpWebResponse)request.GetResponse();
        Stream s = response.GetResponseStream();
        StreamReader sr = new StreamReader(s, System.Text.Encoding.GetEncoding("utf-8"));
        string html = sr.ReadToEnd();
        s.Close();
        response.Close();
        return html;
    }
    catch (Exception)
    {
        return "/error/";
    }
}
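A quick usage sketch; the URL and cookie values here are made up for illustration, and the cookie string format is explained just below:

// Hypothetical: fetch a members-only page with two cookies attached
string html = GetHtmlStr("https://example.com/member", "name=123,pass=123");
if (html != "/error/")
{
    // parse the HTML here
}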
As the code shows, this is simply the plain fetching version with a cookie object wired in. The cookies parameter of GetHtmlStr uses the format cookiename=value, with multiple cookies separated by commas, e.g. name=123,pass=123. With this in hand, you can crawl pages that require a login before they are accessible. Oh, and here is the promised VB.NET version of the POST method:

Private Function HttpPost(Url As String, postDataStr As String) As String
    Dim request As HttpWebRequest = DirectCast(WebRequest.Create(Url), HttpWebRequest)
    request.Method = "POST"
    request.ContentType = "application/x-www-form-urlencoded"
    request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"
    ' ContentLength must be computed with the same encoding used to write the body
    request.ContentLength = Encoding.GetEncoding("gb2312").GetByteCount(postDataStr)
    Dim myRequestStream As Stream = request.GetRequestStream()
    Dim myStreamWriter As New StreamWriter(myRequestStream, Encoding.GetEncoding("gb2312"))
    myStreamWriter.Write(postDataStr)
    myStreamWriter.Close()
    Dim response As HttpWebResponse = DirectCast(request.GetResponse(), HttpWebResponse)
    Dim myResponseStream As Stream = response.GetResponseStream()
    ' gb2312 here matches the target site; switch to utf-8 as needed
    Dim myStreamReader As New StreamReader(myResponseStream, Encoding.GetEncoding("gb2312"))
    Dim retString As String = myStreamReader.ReadToEnd()
    myStreamReader.Close()
    myResponseStream.Close()
    Return retString
End Function

The C# version:

private string HttpPost(string Url, string postDataStr)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(Url);
    request.Method = "POST";
    request.ContentType = "application/x-www-form-urlencoded";
    request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36";
    // ContentLength must be computed with the same encoding used to write the body
    request.ContentLength = Encoding.GetEncoding("gb2312").GetByteCount(postDataStr);
    Stream myRequestStream = request.GetRequestStream();
    StreamWriter myStreamWriter = new StreamWriter(myRequestStream, Encoding.GetEncoding("gb2312"));
    myStreamWriter.Write(postDataStr);
    myStreamWriter.Close();
    HttpWebResponse response = (HttpWebResponse)request.GetResponse();
    Stream myResponseStream = response.GetResponseStream();
    // gb2312 here matches the target site; switch to utf-8 as needed
    StreamReader myStreamReader = new StreamReader(myResponseStream, Encoding.GetEncoding("gb2312"));
    string retString = myStreamReader.ReadToEnd();
    myStreamReader.Close();
    myResponseStream.Close();
    return retString;
}
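A quick usage sketch; the URL and form fields are hypothetical, and real form data should be URL-encoded:

// Hypothetical login POST; the field names depend entirely on the target site
string result = HttpPost("https://example.com/login", "user=demo&pwd=123456");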
A small homework assignment: the cookie handling for this HttpPost function is left for you to add, following the cookie-enabled GET version above.

Browser-based fetching

Some pages display content that is generated dynamically by JavaScript, and neither of the two fetching approaches above can retrieve it properly. Does that mean there is truly nothing to be done? Not at all. We can use the WebBrowser control that ships with .NET (it wraps IE) to let the page's JS run, and then grab the result. Add a WebBrowser control to your form; the code below navigates a WebBrowser control to a URL and waits for it to finish loading (i.e., for the JS to finish rendering the content).

Public Sub WebBrowserOpenURL(W As WebBrowser, S As String)
    Try
        W.Navigate(S)
        ' Pump UI messages until the page (and its JS) has finished loading
        While (W.ReadyState <> WebBrowserReadyState.Complete)
            Application.DoEvents()
        End While
    Catch ex As Exception
    End Try
End Sub

And the C# version:

public void WebBrowserOpenURL(WebBrowser W, string S)
{
    try
    {
        W.Navigate(S);
        // Pump UI messages until the page (and its JS) has finished loading
        while (W.ReadyState != WebBrowserReadyState.Complete)
            Application.DoEvents();
    }
    catch (Exception)
    {
    }
}
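A minimal usage sketch, assuming a form that hosts a WebBrowser control named webBrowser1 (the URL is a placeholder):

// Navigate, wait for the JS to render, then read the full HTML back
WebBrowserOpenURL(webBrowser1, "https://example.com/js-rendered-page");
string html = webBrowser1.DocumentText;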
When this custom procedure finishes, reading the webBrowser1.DocumentText property returns everything rendered in the control, exactly as the sketch above does. If your situation allows it, a Chromium-based control (CefSharp, for example) gives better compatibility and performance.

Other notes

The three fetching approaches above each suit different scenarios, but in every case watch out for these three points:

1) The browser UA, i.e. the user-agent string. Some pages only display properly for mobile clients.

2) The page encoding. A wrong encoding setting produces garbled results (see the sketch at the end of this article).

3) Follow each site's terms of use and the relevant laws and regulations; don't crawl indiscriminately.

The code above works both in WinForms and in web back-end programming. Space is limited, so this brief introduction stops here; the next installment will cover how to parse the fetched HTML and extract the data you want.
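To make points 1) and 2) concrete, here is a hedged sketch of a request that presents a mobile user agent and decodes the response with an explicit charset; the UA string, URL and charset are illustrative only and should be adjusted to the target site:

// Fetch a page as a mobile client and decode it with the charset the page uses
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("https://example.com/m");
request.UserAgent = "Mozilla/5.0 (iPhone; CPU iPhone OS 12_0 like Mac OS X) AppleWebKit/605.1.15";
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
using (StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.GetEncoding("gb2312")))
{
    string html = reader.ReadToEnd();
}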