Statistical Society of Australia - R code : downloading website images with rvest

R code : downloading website images with rvest

15 Oct 2020 7:00 AM

Reply # 9304004 on 9301839

Deleted user

Penelope Bilton wrote:
<snipped>
Example:

tt <- "https://www.gofundme.com/f/baba039s-brave-battle-i-can-breathe"

> html_nodes(read_html(tt),".a-image--background")

{xml_nodeset (1)}
[1] <div class="a-image a-image--background" style="background-image:url(https://images.gofundme.com/OBX ...

I can't get any further than this. I don't know how to extract the information from the html_nodes object.

***

Hello,

I think the R package xml2 will help here.

First save the html_nodes to an object:

> library(rvest); library(xml2)
> nodes <- html_nodes(read_html(tt),".a-image--background")

Then look at its attributes:

> xml_attrs(nodes)

[[1]]
class
"a-image a-image--background"
style
"background-image:url(https://images.gofundme.com/OBX8u6ExqYkPs9mGp_zXCI-VYY4=/720x405/https://d2g8igdw686xgo.cloudfront.net/28360424_15211039920_r.jpeg)"

It seems you want the content under attribute "style":

> xml_attr(nodes,"style")

[1] "background-image:url(https://images.gofundme.com/OBX8u6ExqYkPs9mGp_zXCI-VYY4=/720x405/https://d2g8igdw686xgo.cloudfront.net/28360424_15211039920_r.jpeg)"

This is just text and you can proceed to extract information with your favourite regexp method. Does this help?
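
For instance, here is a minimal sketch of the regexp step in base R, assuming the style string always has the form background-image:url(...) as in the output above. The variable names style, img_url and dest are only illustrative, and the download line is left commented out because it needs network access.

```r
# The text returned by xml_attr(nodes, "style") in the output above,
# pasted in as a literal so the example runs without fetching the page.
style <- "background-image:url(https://images.gofundme.com/OBX8u6ExqYkPs9mGp_zXCI-VYY4=/720x405/https://d2g8igdw686xgo.cloudfront.net/28360424_15211039920_r.jpeg)"

# Keep only what sits between "url(" and the closing ")".
# Optional single or double quotes around the URL are dropped.
# sub() is vectorised, so this also works on a vector of style strings.
img_url <- sub(".*url\\(['\"]?([^'\")]+)['\"]?\\).*", "\\1", style)

# Use the last path component as the local file name (an assumption;
# any file name would do).
dest <- basename(img_url)

# Uncomment to save the image; mode = "wb" keeps binary files intact on Windows.
# download.file(img_url, destfile = dest, mode = "wb")
```

With more than one matching node, xml_attr(nodes, "style") returns a character vector and the same sub() call should give one URL per element.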

Regards,

Jason

Last modified: 15 Oct 2020 7:01 AM | Deleted user

14 Oct 2020 8:10 AM

Message # 9301839

Penelope Bilton

Hi R users,

I am trying to download images from the crowdfunding platform GoFundMe, as part of a research project at Otago University into medical crowdfunding for New Zealanders.

I have been able to download images in the main body of the text, but not in the updates section.

I have been using ScrapeMate Beta to identify the CSS selector for images (.a-image--background) in the updates section, but the code I use isn't giving me what I want. Example:

> tt <- "https://www.gofundme.com/f/baba039s-brave-battle-i-can-breathe"

> html_nodes(read_html(tt),".a-image--background")

{xml_nodeset (1)}
[1] <div class="a-image a-image--background" style="background-image:url(https://images.gofundme.com/OBX ...

I can't get any further than this. I don't know how to extract the information from the html_nodes object.

When I clicked on the photo in the first update I got this web address, https://www.gofundme.com/f/baba039s-brave-battle-i-can-breathe/update/18051468/gallery/0. But I can't find any code in the source code to match this, or any way to extract this web address from the html source code.

thanks, Penny.

Statistical Society of Australia (SSA)

PO Box 213

Belconnen ACT 2616 Australia

02 6251 3647

www.statsoc.org.au

ABN 82 853 491 081

Please direct enquiries to:

the SSA Team via email at

contact@statsoc.org.au

@StatSocAus
