Barış Kısır

Advanced Web Scraping: Parsing HTML with HtmlAgilityPack

01 May 2017

The Art of Programmatic Web Extraction

In an era where data is the primary currency, the ability to programmatically extract information from web resources is a vital skill. When a target platform lacks a structured API, developers must resort to DOM (Document Object Model) parsing—commonly known as web scraping—to isolate and retrieve specific data points.

Introducing HtmlAgilityPack

HtmlAgilityPack (HAP) is the de facto standard HTML-parsing library for .NET developers. Unlike strict XML parsers, HAP is designed to handle “real-world” HTML, which is often malformed or follows non-standard conventions. It exposes an intuitive API that supports both XPath expressions and LINQ queries over the node tree.
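
To illustrate the two selector styles, here is a minimal sketch (the markup and the class name "headline" are invented for demonstration) that selects the same element once via XPath and once via LINQ:

// Requires: using System; using System.Linq; using HtmlAgilityPack;
var document = new HtmlDocument();
document.LoadHtml("<html><body><div class='headline'>Hello, HAP</div></body></html>");

// XPath: the first div whose class attribute equals "headline"
var viaXPath = document.DocumentNode.SelectSingleNode("//div[@class='headline']");

// LINQ: the same element, expressed as a query over the node's descendants
var viaLinq = document.DocumentNode.Descendants("div")
    .FirstOrDefault(d => d.Attributes["class"]?.Value == "headline");

Console.WriteLine(viaXPath?.InnerText); // Hello, HAP
Console.WriteLine(viaLinq?.InnerText);  // Hello, HAP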

Installation via NuGet

Install-Package HtmlAgilityPack

Implementation Strategy: Scraping a Live Broadcast Schedule

Let’s demonstrate how to scrape a television broadcast schedule (e.g., NTV) to extract program titles and their respective airtimes.

Step 1: Establishing the Object Model

We start by defining a hierarchical POCO structure to represent the channel and its discrete television programs.

public class TVChannel
{
    public string Name { get; set; }
    public List<BroadcastProgram> Schedule { get; set; } = new List<BroadcastProgram>();
}

public class BroadcastProgram
{
    public string Title { get; set; }
    public string AirTime { get; set; }
}

Step 2: Orchestrating the Extraction Logic

The following implementation uses HtmlDocument to load the raw HTML, then combines LINQ-based filtering with relative XPath queries to navigate the DOM tree efficiently.

// Requires: using System.Net; using System.Text; using System.Linq; using HtmlAgilityPack;
public TVChannel ExtractBroadcastSchedule(string endpointUrl)
{
    using (var webClient = new WebClient { Encoding = Encoding.UTF8 })
    {
        string rawHtml = webClient.DownloadString(endpointUrl);
        var document = new HtmlDocument();
        document.LoadHtml(rawHtml);

        // Target the specific unordered list (ul) containing the schedule
        var programNodes = document.DocumentNode.Descendants("ul")
            .FirstOrDefault(n => n.HasAttributes && n.Attributes["class"]?.Value == "programmes")
            ?.SelectNodes("li");

        var channel = new TVChannel { Name = "NTV" };

        if (programNodes != null)
        {
            foreach (var node in programNodes)
            {
                // Utilize XPath via SelectSingleNode for precise element targeting
                var anchor = node.SelectSingleNode("a");
                if (anchor == null) continue;

                channel.Schedule.Add(new BroadcastProgram
                {
                    AirTime = anchor.Descendants("span").FirstOrDefault(s => s.Attributes["class"]?.Value == "tv-hour")?.InnerText.Trim(),
                    Title = anchor.Descendants("span").FirstOrDefault(s => s.Attributes["class"]?.Value == "programmeTitle")?.InnerText.Trim()
                });
            }
        }

        return channel;
    }
}
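
A brief usage example (ScheduleScraper and the URL are placeholders; substitute the class that hosts the method above and the real schedule page):

var scraper = new ScheduleScraper(); // hypothetical host class for ExtractBroadcastSchedule
TVChannel channel = scraper.ExtractBroadcastSchedule("https://example.com/tv-schedule");

foreach (var programme in channel.Schedule)
{
    Console.WriteLine($"{programme.AirTime} - {programme.Title}");
}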

Strategic Best Practices

  1. Resilience: Web structures are volatile. Always implement defensive coding (null checks) and error handling to manage unexpected DOM changes.
  2. Performance: For large-scale scraping tasks, consider using HttpClient for asynchronous downloads, or HtmlWeb for a more integrated HAP experience (see the sketch after this list).
  3. Ethics and compliance: Always respect a website’s robots.txt and terms of service before initiating automated scraping activities.
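
The asynchronous approach from item 2 could look roughly like this (a sketch, not a drop-in replacement: it assumes the same TVChannel model, and ParseSchedule is a hypothetical helper wrapping the parsing logic shown earlier):

// Requires: using System.Net.Http; using System.Threading.Tasks; using HtmlAgilityPack;
public async Task<TVChannel> ExtractBroadcastScheduleAsync(string endpointUrl)
{
    using (var httpClient = new HttpClient())
    {
        // Download the page without blocking the calling thread
        string rawHtml = await httpClient.GetStringAsync(endpointUrl);

        var document = new HtmlDocument();
        document.LoadHtml(rawHtml);

        // Alternatively, HtmlWeb bundles the download-and-parse steps:
        // var document = new HtmlWeb().Load(endpointUrl);

        // Hypothetical helper that applies the same LINQ/XPath extraction shown earlier
        return ParseSchedule(document);
    }
}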