C#: Parsing an HtmlDocument with XPath, and recursively collecting the XPath of every tag on a page (source code included)

站长苏飞 · posted on 2013-3-11 11:07:24

Before we get into HTML XPath, let's download the DLL first.
Download: http://htmlagilitypack.codeplex.com/
Just click the download button on that page.

Next, add a reference to the DLL in your project.

After that you can call it directly. Take a look at the code:

[C#]

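            // HtmlDocument object used to access the HTML document
            HtmlAgilityPack.HtmlDocument hd = new HtmlAgilityPack.HtmlDocument();
            // Load the HTML text (strhtml holds the page source)
            hd.LoadHtml(strhtml);
            // SelectSingleNode returns null when nothing matches, so check before reading OuterHtml
            HtmlAgilityPack.HtmlNode node = hd.DocumentNode.SelectSingleNode("//*[@id='e_font']");
            string str = node != null ? node.OuterHtml : "";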

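This gives you the HTML of a single tag.
OuterHtml returns the HTML including the tag itself, while InnerHtml returns only the HTML contained inside the tag.
Keep this difference in mind.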
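
To make the difference concrete, here is a minimal sketch. The class name OuterInnerDemo and the method Describe are made up for illustration and are not part of HtmlAgilityPack; only HtmlDocument, LoadHtml, SelectSingleNode, OuterHtml and InnerHtml come from the library.

```csharp
using HtmlAgilityPack;

// Hypothetical helper, for illustration only.
public static class OuterInnerDemo
{
    // Returns { OuterHtml, InnerHtml } for the first element with the given id,
    // or null when no element matches.
    public static string[] Describe(string html, string id)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // Same XPath shape as in the post: any element whose id equals the argument.
        HtmlNode node = doc.DocumentNode.SelectSingleNode("//*[@id='" + id + "']");
        if (node == null)
        {
            return null;
        }
        return new[] { node.OuterHtml, node.InnerHtml };
    }
}
```

For example, OuterInnerDemo.Describe("<div id='e_font'><b>hi</b></div>", "e_font") should give an InnerHtml of <b>hi</b>, while the OuterHtml also includes the surrounding div tag.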
If you want the XPath for a piece of HTML, it is this part:

//*[@id='e_font']


Getting it is easy: you only need to install Firebug.

Enter inspect mode, select the content you want, then right-click and copy its XPath.
Then pass it to SelectSingleNode() and you're done.
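
One thing to watch for when pasting a browser-copied path: browsers usually insert a tbody element into tables in their live DOM, while HtmlAgilityPack, as far as I know, parses the source as written and does not add it. A copied path containing tbody may therefore match nothing. A minimal sketch to check a path against the raw HTML (CopiedXPathDemo and Matches are hypothetical names, not library API):

```csharp
using HtmlAgilityPack;

// Hypothetical helper, for illustration only.
public static class CopiedXPathDemo
{
    // Returns true when the XPath matches at least one node in the raw HTML.
    public static bool Matches(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // SelectSingleNode returns null when nothing matches.
        return doc.DocumentNode.SelectSingleNode(xpath) != null;
    }
}
```

With the source <html><body><table><tr><td>cell</td></tr></table></body></html>, the path /html/body/table/tr/td should match, while /html/body/table/tbody/tr/td (what a browser tool would typically produce) should not. If a copied path returns null, removing the tbody step is the first thing to try.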
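Here is what a few of the methods and properties mean.
Methods

SelectNodes returns a collection of nodes
SelectSingleNode returns a single node
SetAttributeValue sets an attribute value on a tag, for example SetAttributeValue("name","xpath-89"); changes the value of the name attribute to xpath-89
Properties

OuterHtml returns the HTML including the tag itself
InnerHtml returns all the HTML contained inside the tag
XPath returns the node's XPath
Attributes gives access to the tag's attributes, for example Attributes["name"].Value reads the value of name
You can also add an attribute, for example: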

[C#]

hd.DocumentNode.SelectSingleNode(item.Key).Attributes.Add("xpathid", "xpath_1" );
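
The methods and properties listed above can be combined roughly as follows. This is only a sketch: AttributeDemo, TagLinks and ReadAttribute are names invented for the example, and the "xpathid" attribute is just the one used in the line above. SelectNodes, SetAttributeValue, GetAttributeValue and XPath are the library members.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, for illustration only.
public static class AttributeDemo
{
    // Tags every <a> element with an "xpathid" attribute and
    // returns the XPath of each tagged node.
    public static List<string> TagLinks(HtmlDocument doc)
    {
        var paths = new List<string>();
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a");
        if (links == null)
        {
            return paths;
        }
        int i = 0;
        foreach (HtmlNode link in links)
        {
            // Adds the attribute if it is missing, otherwise overwrites its value.
            link.SetAttributeValue("xpathid", "xpath_" + i);
            i++;
            paths.Add(link.XPath);
        }
        return paths;
    }

    // Reads an attribute without risking a null reference:
    // the second argument is returned when the attribute is absent.
    public static string ReadAttribute(HtmlNode node, string name)
    {
        return node.GetAttributeValue(name, "");
    }
}
```

GetAttributeValue is used for reading because Attributes["name"] is null when the attribute does not exist, so Attributes["name"].Value would throw in that case.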

Below is a method I wrote that recursively collects the XPath of every node in an HTML page. Take a look:

[C#]

        // Each entry holds a Key (the XPath) and a Value (meant for the whole node; left empty here)
        public List<ObjXpath> XpathList = new List<ObjXpath>();
        // Your HTML source goes here; see my HttpHelper class for how to fetch it
        public string strhtml = "";
        private int Index = 0;

        // Start processing the nodes
        private void SartNode()
        {
            // HtmlDocument object used to access the HTML document
            HtmlAgilityPack.HtmlDocument hd = new HtmlAgilityPack.HtmlDocument();
            // Load the HTML text
            hd.LoadHtml(strhtml);
            HtmlNodeCollection htmllist = hd.DocumentNode.ChildNodes;
            Index = 0;
            XpathList.Clear();
            // Setxpath records the children of the node it is given, so the
            // top-level nodes themselves (e.g. /html[1]) are not added to the list
            foreach (HtmlNode em in htmllist)
            {
                Setxpath(em);
            }
        }
        /// <summary>
        /// Recursively walk the HTML DOM and record each node's XPath
        /// </summary>
        /// <param name="node">The node to process</param>
        private void Setxpath(HtmlNode node)
        {
            foreach (HtmlNode item in node.ChildNodes)
            {
                // Skip #text and #comment nodes, whose XPath contains '#'
                if (item.XPath.Contains("#"))
                {
                    continue;
                }
                XpathList.Add(new ObjXpath() { id = Index.ToString(), Key = item.XPath, Value = "" });
                Index++;
                // Recurse only when there are children to visit
                if (item.ChildNodes.Count > 0)
                {
                    Setxpath(item);
                }
            }
        }

    public class ObjXpath
    {
        public string id { get; set; }
        public string Key { get; set; }
        public string Value { get; set; }
    }

After this runs, XpathList holds all the XPath values. Give it a try if you're interested.
Let's look at the result first.
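
For comparison, the same kind of list can be sketched without writing the recursion by hand, using the library's Descendants() method together with LINQ. This is an alternative, not the method used in this post; XPathLister and ListElementXPaths are names invented for the example.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, for illustration only.
public static class XPathLister
{
    // Collects the XPath of every element node in document order,
    // leaving out #text and #comment nodes.
    public static List<string> ListElementXPaths(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode
                  .Descendants()                                  // every node below the document root
                  .Where(n => n.NodeType == HtmlNodeType.Element) // keep elements only
                  .Select(n => n.XPath)
                  .ToList();
    }
}
```

Unlike the recursive version, this sketch also includes the top-level elements such as /html[1], so the two lists are not identical.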

OK, here is the complete code:

[C#]

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.Text.RegularExpressions;
using System.Threading;
using HtmlAgilityPack;
using System.IO;
using System.Runtime.Serialization.Json;

namespace AutoXpathTools
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        #region Private fields and methods

        // Delegate that takes a string
        private delegate void SetListBox(string str);

        // Key (the XPath), Value (the whole node)
        List<ObjXpath> XpathList = new List<ObjXpath>();
        private int Index = 0;
        // HtmlDocument object used to access the HTML document
        HtmlAgilityPack.HtmlDocument hd = new HtmlAgilityPack.HtmlDocument();

        #endregion

        // Fetch the page and list the XPath of every node
        private void btnGetXpath_Click(object sender, EventArgs e)
        {
            try
            {
                HttpHelper http = new HttpHelper();
                HttpItem item = new HttpItem() { URL = textBox1.Text.Trim(), IsToLower = false, Encoding = "gbk" };
                txtXml.Text = http.GetHtml(item);
                if (!string.IsNullOrWhiteSpace(txtXml.Text) && txtXml.Text.Trim().ToLower() != "error")
                {
                    // Load the HTML text
                    hd.LoadHtml(txtXml.Text);

                    // Read the text on the UI thread so the worker never touches txtXml
                    string html = txtXml.Text;
                    Thread pingTask = new Thread(new ThreadStart(delegate
                    {
                        // Code executed on the worker thread
                        SartNode(html);
                    }));
                    pingTask.Start();
                }
                else
                {
                    txtXml.Text = "Could not get any content from the URL: " + textBox1.Text.Trim();
                }
            }
            catch (Exception ex)
            {
                txtXml.Text = ex.Message.Trim();
            }
        }
       

        // Start processing the nodes
        private void SartNode(string strhtml)
        {
            // HtmlDocument object used to access the HTML document
            HtmlAgilityPack.HtmlDocument hd = new HtmlAgilityPack.HtmlDocument();
            // Load the HTML text
            hd.LoadHtml(strhtml);
            HtmlNodeCollection htmllist = hd.DocumentNode.ChildNodes;
            Index = 0;
            XpathList.Clear();
            foreach (HtmlNode em in htmllist)
            {
                Setxpath(em);
            }
        }
        /// <summary>
        /// Recursively walk the HTML DOM and record each node's XPath
        /// </summary>
        /// <param name="node">The node to process</param>
        private void Setxpath(HtmlNode node)
        {
            foreach (HtmlNode item in node.ChildNodes)
            {
                // Skip #text and #comment nodes, whose XPath contains '#'
                if (item.XPath.Contains("#"))
                {
                    continue;
                }
                XpathList.Add(new ObjXpath() { id = Index.ToString(), Key = item.XPath, Value = "" });
                UIContorol(item.XPath);
                Index++;
                // Recurse only when there are children to visit
                if (item.ChildNodes.Count > 0)
                {
                    Setxpath(item);
                }
            }
        }
      
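        // Update the controls through the SetListBox delegate, because this
        // method is called from the worker thread
        private void UIContorol(string str)
        {
            if (listBox1.InvokeRequired)
            {
                listBox1.Invoke(new SetListBox(UIContorol), str);
                return;
            }
            listBox1.Items.Add(str);
            toolStripStatusLabel1.Text = str;
        }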

        private void listBox1_SelectedValueChanged(object sender, EventArgs e)
        {
            if (listBox1.SelectedItem != null)
            {
                txtPath.Text = listBox1.SelectedItem.ToString().Trim();
            }
        }

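        private void button3_Click(object sender, EventArgs e)
        {
            // SelectSingleNode returns null when the XPath matches nothing
            HtmlNode node = hd.DocumentNode.SelectSingleNode(txtPath.Text.Trim());
            txtContents.Text = node != null ? node.OuterHtml : "";
        }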
      
        private void Form1_Load(object sender, EventArgs e)
        {
            //HttpItem item = new HttpItem()
            //{
            //    URL = "http://www.diandian.com/login",
            //    Method = "post",
            //    Cookie = "dtid=ZfXUVo1IsplHR4mHW1HYmgKbY4GJa003; kvf=1358855337188; alf=1; dru=1356356040; _l5=y",
            //    ContentType = "application/x-www-form-urlencoded",
            //    Postdata = "account=xinsuilie1998@163.com&password=wjlove520&nextUrl=&lcallback=&persistent=1",
            //    Referer = "http://www.diandian.com/logout?formKey=e4714d863c862a84fafd83d98e5ecb22"
            //};
            //HttpHelper http = new HttpHelper();
            //string html = http.GetHtml(item);
            //string cookie = item.Cookie;
            //item = new HttpItem() { URL = "http://www.diandian.com/home", Cookie = cookie };
            //html = http.GetHtml(item);
        }
    }
    public class ObjXpath
    {
        public string id { get; set; }
        public string Key { get; set; }
        public string Value { get; set; }
    }
}

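That's all for now. You can download my source code and try it yourself.
Download the package:

AutoXpathTools.zip (76.32 KB, downloads: 1510)
If you find this useful, please recommend it. Thanks, everyone.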

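站长苏飞 · posted on 2013-9-5 17:52:56

天山明月 posted on 2013-9-5 17:02
If the data is generated dynamically via AJAX, can it still be retrieved?

That isn't really related to this post. It is an HttpHelper question: it concerns fetching the page, not parsing it.

torank · posted on 2023-6-24 21:15:15

I happen to be looking into downloading web pages right now.

912288184 · posted on 2017-11-8 14:24:36

A tool based on picking elements in the browser would be nice. ~~ Still, this is pretty good.

李兔子pxn · posted on 2017-10-30 14:03:30

I read this over several times. Great post, it deserves support.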

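精彩 · posted on 2014-12-15 06:56:51

Great, I've been studying exactly this for the past couple of days.

sandy1231 · posted on 2014-12-11 20:32:11

站长苏飞 posted on 2014-12-10 08:24
You can't generalize about everything. If you don't use recursion, come up with a better method that can finish within 1 second.

In memory. Recursing 10,000 times versus Fo ...

True, but this takes about as long as a single HTTP request. Apart from recursion, there doesn't seem to be any other way to get all the nodes, is there?

站长苏飞 · posted on 2014-12-10 08:24:18

sandy1231 posted on 2014-12-9 21:57
Recursion is very inefficient and uses a lot of CPU, which is especially noticeable with multiple threads.

You can't generalize about everything. If you don't use recursion, come up with a better method that can finish within 1 second.

In memory, how much difference do you think there is between recursing 10,000 times and running a for or do loop 10,000 times? When the CPU is slow, it depends on the situation; recursion is not necessarily the cause.

sandy1231 · posted on 2014-12-9 21:57:46

Recursion is very inefficient and uses a lot of CPU, which is especially noticeable with multiple threads.

ed2000de · posted on 2014-5-31 11:04:10

Learned a lot, still studying...

南方 · posted on 2014-1-24 19:48:07

Learned a lot, still studying...

天山明月 · posted on 2013-9-6 09:13:54

站长苏飞 posted on 2013-9-5 18:45
Request the JSON URL directly rather than the web page address,

and that will do. HttpHelper can also fetch JSON data directly; specific cases need specific ...

This is exactly the part that confuses me. I did get the JSON data, but it doesn't match what is displayed.
After analyzing it carefully, I found that the site also uses DWR push.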
