R code : downloading website images with rvest

  • 15 Oct 2020 6:00 AM
    Reply # 9304004 on 9301839
    Deleted user
    Penelope Bilton wrote:
    <snipped>
    Example:

    tt <- "https://www.gofundme.com/f/baba039s-brave-battle-i-can-breathe"


    > html_nodes(read_html(tt),".a-image--background")

    {xml_nodeset (1)}
    [1] <div class="a-image a-image--background" style="background-image:url(https://images.gofundme.com/OBX ...

    I can't get any further than this. I don't know how to extract the information from the html_nodes object.

    ***

    Hello,

    I think the R package xml2 will help here.

    First save the html_nodes to an object:

    > library(rvest)  # read_html(), html_nodes()
    > library(xml2)   # xml_attrs(), xml_attr()
    > nodes <- html_nodes(read_html(tt),".a-image--background")

    Then look at its attributes:

    > xml_attrs(nodes)

    [[1]]
                            class
    "a-image a-image--background"
                            style
    "background-image:url(https://images.gofundme.com/OBX8u6ExqYkPs9mGp_zXCI-VYY4=/720x405/https://d2g8igdw686xgo.cloudfront.net/28360424_15211039920_r.jpeg)"

    It seems you want the content under attribute "style":

    > xml_attr(nodes,"style")

    [1] "background-image:url(https://images.gofundme.com/OBX8u6ExqYkPs9mGp_zXCI-VYY4=/720x405/https://d2g8igdw686xgo.cloudfront.net/28360424_15211039920_r.jpeg)"

    This is just text and you can proceed to extract information with your favourite regexp method. Does this help?
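
As a rough sketch of that regexp step, using base R only: the style string below is copied from the xml_attr() output above, and the variable names (style, img_url, orig_url, dest) are illustrative. The pattern assumes the style attribute holds a single url(...) entry, and the last step assumes the resized image address ends with the original image address, as it appears to in this example.

```r
# Style string copied from the xml_attr(nodes, "style") output above
style <- "background-image:url(https://images.gofundme.com/OBX8u6ExqYkPs9mGp_zXCI-VYY4=/720x405/https://d2g8igdw686xgo.cloudfront.net/28360424_15211039920_r.jpeg)"

# Keep only what sits between "url(" and the closing ")"
img_url <- sub(".*url\\((.*)\\).*", "\\1", style)

# Drop surrounding quotes, in case the page writes url('...') or url("...")
img_url <- gsub("^['\"]|['\"]$", "", img_url)

# Assumption: the part from the last "https://" onward is the original image
orig_url <- sub(".*(https://.*)$", "\\1", img_url)

# File name to save under, taken from the end of the URL
dest <- basename(img_url)

# Uncomment to fetch the image (needs network access):
# download.file(img_url, destfile = dest, mode = "wb")
```

With several matching nodes, xml_attr() returns a character vector, and sub(), gsub() and basename() are all vectorised, so the same lines should work on the whole vector; download.file() would then need a loop or Map().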

    Regards,

    Jason


    Last modified: 15 Oct 2020 6:01 AM | Deleted user
  • 14 Oct 2020 7:10 AM
    Message # 9301839

    Hi R users,

    I am trying to download images from the crowdfunding platform GoFundMe as part of a research project at Otago University into medical crowdfunding for New Zealanders.

    I have been able to download images in the main body of the text, but not in the updates section.

    I have been using ScrapeMate Beta to identify the CSS selector for images (.a-image--background) in the updates section, but the code I use isn't giving me what I want. Example:

    > tt <- "https://www.gofundme.com/f/baba039s-brave-battle-i-can-breathe"


    > html_nodes(read_html(tt),".a-image--background")

    {xml_nodeset (1)}
    [1] <div class="a-image a-image--background" style="background-image:url(https://images.gofundme.com/OBX ...

    I can't get any further than this. I don't know how to extract the information from the html_nodes object.

    When I clicked on the photo in the first update I got this web address: https://www.gofundme.com/f/baba039s-brave-battle-i-can-breathe/update/18051468/gallery/0. But I can't find anything in the HTML source that matches it, or any way to extract this web address from the source.

    Thanks, Penny.
