
Get specific content from website via C#

Tags: html, c#

For a non-commercial, private school project I'm creating a piece of software that searches for lyrics based on which song is currently playing on Spotify. I have to do this in C# (a requirement), but I can use other languages as well if I want to.

I've found a few sites that I can fetch the lyrics from. I have already succeeded in fetching the entire HTML source, but after that I'm not sure what to do. I asked my teacher, and she told me to use XML (which I also found complicated :p), so I've read quite a bit about it and searched for examples, but I haven't found anything that seems applicable to my case.

Time for some code.

Let's say I wanted to fetch the lyrics from musixmatch.com:

(Human-readable altered) HTML:

<span data-reactid="199">
    <p class="mxm-lyrics__content" data-reactid="200">First line of the lyrics!
        These words will never be ignored
        I don't want a battle
    </p>
    <!-- react-empty: 201 -->
    <div data-reactid="202">
        <div class="inline_video_ad_container_container" data-reactid="203">
            <div id="inline_video_ad_container" data-reactid="204">
                <div class="" style="line-height:0;" data-reactid="205">
                    <div id="div_gpt_ad_outofpage_musixmatch_desktop_lyrics" data-reactid="206">
                        <script type="text/javascript">
                            //Really nice google ad JS which I have removed;
                        </script>
                    </div>
                </div>
            </div>
        </div>
        <p class="mxm-lyrics__content" data-reactid="207">But I got a war
            More fancy lyrics
            And lines
            That I want to fetch
            And display
            Tralala
            lala
            Trouble!
        </p>
    </div>
</span>

Note that the first three lines of the lyrics are in the top <p>, with the rest in the bottom <p>. Also note that the two <p> tags have the same class. The full HTML source can be found here: view-source:https://www.musixmatch.com/lyrics/Bullet-for-My-Valentine/You-Want-a-Battle-Here%E2%80%99s-a-War The snippet starts at around line 97.

So in this specific example there are the lyrics, plus quite a bit of markup that I don't need. So far I've tried fetching and parsing the HTML with the following C#:

string source = "https://www.musixmatch.com/lyrics/Bullet-for-My-Valentine/You-Want-a-Battle-Here’s-a-War";

    // The HtmlWeb class is a utility class to get the HTML over HTTP
    HtmlWeb htmlWeb = new HtmlWeb();

    // Creates an HtmlDocument object from an URL
    HtmlAgilityPack.HtmlDocument document = htmlWeb.Load(source);

    // Targets a specific node
    HtmlNode someNode = document.GetElementbyId("mxm - lyrics__content");

    if (someNode != null)
    {
        Console.WriteLine(someNode);
    } else
    {
        Console.WriteLine("Nope");
    }

    foreach (var node in document.DocumentNode.SelectNodes("//span/div[@id='site']/p[@class='mxm-lyrics__content']"))
    {
        // here is your text: node.InnerText    "//div[@class='sideInfoPlayer']/span[@class='wrap']"
        Console.WriteLine(node.InnerText);
    }

    Console.ReadKey();

Fetching the entire HTML works, but extracting the lyrics doesn't, and that is where I'm stuck. Since the lyrics on this page aren't in an element with an id, I can't just use GetElementbyId. Can somebody point me in the right direction? I want to support multiple sites, so I'll have to do this a few times for different sites.

MagicLegend asked Oct 29 '22 16:10


1 Answer

One possible solution:

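using System;
using System.Linq;
using HtmlAgilityPack;

var htmlWeb = new HtmlWeb();
var documentNode = htmlWeb.Load(source).DocumentNode; // source: the URL string from the question

// Option 1: LINQ over every <p> whose class attribute contains the lyrics class
var findclasses = documentNode.Descendants("p")
    .Where(d => d.Attributes["class"]?.Value.Contains("mxm-lyrics__content") == true);
// Option 2: the equivalent XPath query (note: SelectNodes returns null when nothing matches)
// var findclasses = documentNode.SelectNodes("//p[contains(@class,'mxm-lyrics__content')]");

// Join the text of all matching paragraphs, one per line
var text = string.Join(Environment.NewLine, findclasses.Select(x => x.InnerText));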
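
To try the class-based XPath without hitting the network, here is a minimal sketch that parses a cut-down copy of the markup from the question. It assumes the HtmlAgilityPack NuGet package is installed; LyricsExtractor and Extract are hypothetical names made up for this illustration, not part of the library. The null check is there because SelectNodes returns null instead of an empty collection when nothing matches, and HtmlEntity.DeEntitize is used because InnerText leaves HTML entities encoded.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

static class LyricsExtractor
{
    // Hypothetical helper: returns the text of every node matched by xpath,
    // one node per line, or an empty string if nothing matched.
    public static string Extract(string html, string xpath)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when there is no match.
        var nodes = document.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
            return string.Empty;

        return string.Join(
            Environment.NewLine,
            nodes.Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim()));
    }
}

static class Program
{
    static void Main()
    {
        // Cut-down version of the markup shown in the question.
        const string html = @"
<span>
  <p class=""mxm-lyrics__content"">First line of the lyrics!</p>
  <div>
    <script type=""text/javascript"">// ad code</script>
    <p class=""mxm-lyrics__content"">But I got a war</p>
  </div>
</span>";

        string lyrics = LyricsExtractor.Extract(
            html, "//p[contains(@class,'mxm-lyrics__content')]");

        Console.WriteLine(lyrics);
    }
}
```

Passing the XPath in as a parameter means that supporting another lyrics site should only require a different expression for that site's markup; the expression above is specific to the musixmatch snippet in the question.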
Artiom answered Nov 15 '22 06:11