from dsc80_utils import *
Goal: Access information about HDSI faculty members from the HDSI Faculty page.
Let's start by making a GET request to the HDSI Faculty page and see what the resulting HTML looks like.
import requests
fac_response = requests.get('https://datascience.ucsd.edu/faculty/')
fac_response
<Response [200]>
fac_text = fac_response.text
len(fac_text)
277166
print(fac_text[:10000])
<!DOCTYPE html> <html lang="en-US"> <head> <meta charset="UTF-8"> <meta name="viewport" content="width=device-width, initial-scale=1.0" /> <meta http-equiv="X-UA-Compatible" content="IE=edge"> <link rel="profile" href="https://gmpg.org/xfn/11" /> <title>Faculty – Halıcıoğlu Data Science Institute – UC San Diego</title> <script> /* You can add more configuration options to webfontloader by previously defining the WebFontConfig with your options */ if ( typeof WebFontConfig === "undefined" ) { WebFontConfig = new Object(); } WebFontConfig['google'] = {families: ['Jost:400,700', 'Roboto:400,500']}; (function() { var wf = document.createElement( 'script' ); wf.src = 'https://ajax.googleapis.com/ajax/libs/webfont/1.5.3/webfont.js'; wf.type = 'text/javascript'; wf.async = 'true'; var s = document.getElementsByTagName( 'script' )[0]; s.parentNode.insertBefore( wf, s ); })(); </script> <meta name='robots' content='max-image-preview:large' /> <link rel='dns-prefetch' href='//kit.fontawesome.com' /> <link rel='dns-prefetch' href='//platform-api.sharethis.com' /> <link rel='dns-prefetch' href='//fonts.googleapis.com' /> <link rel='dns-prefetch' href='//use.fontawesome.com' /> <link rel="alternate" type="application/rss+xml" title="Halıcıoğlu Data Science Institute - UC San Diego » Feed" href="https://datascience.ucsd.edu/feed/" /> <link rel="alternate" type="text/calendar" title="Halıcıoğlu Data Science Institute - UC San Diego » iCal Feed" href="https://datascience.ucsd.edu/events/?ical=1" /> <!-- This site uses the Google Analytics by MonsterInsights plugin v8.23.1 - Using Analytics tracking - https://www.monsterinsights.com/ --> <script src="//www.googletagmanager.com/gtag/js?id=G-ZWZDLH05C7" data-cfasync="false" data-wpfc-render="false" type="text/javascript" async></script> <script data-cfasync="false" data-wpfc-render="false" type="text/javascript"> var mi_version = '8.23.1'; var mi_track_user = true; var mi_no_track_reason = ''; var disableStrs = [ 
'ga-disable-G-ZWZDLH05C7', ]; /* Function to detect opted out users */ function __gtagTrackerIsOptedOut() { for (var index = 0; index < disableStrs.length; index++) { if (document.cookie.indexOf(disableStrs[index] + '=true') > -1) { return true; } } return false; } /* Disable tracking if the opt-out cookie exists. */ if (__gtagTrackerIsOptedOut()) { for (var index = 0; index < disableStrs.length; index++) { window[disableStrs[index]] = true; } } /* Opt-out function */ function __gtagTrackerOptout() { for (var index = 0; index < disableStrs.length; index++) { document.cookie = disableStrs[index] + '=true; expires=Thu, 31 Dec 2099 23:59:59 UTC; path=/'; window[disableStrs[index]] = true; } } if ('undefined' === typeof gaOptout) { function gaOptout() { __gtagTrackerOptout(); } } window.dataLayer = window.dataLayer || []; window.MonsterInsightsDualTracker = { helpers: {}, trackers: {}, }; if (mi_track_user) { function __gtagDataLayer() { dataLayer.push(arguments); } function __gtagTracker(type, name, parameters) { if (!parameters) { parameters = {}; } if (parameters.send_to) { __gtagDataLayer.apply(null, arguments); return; } if (type === 'event') { parameters.send_to = monsterinsights_frontend.v4_id; var hookName = name; if (typeof parameters['event_category'] !== 'undefined') { hookName = parameters['event_category'] + ':' + name; } if (typeof MonsterInsightsDualTracker.trackers[hookName] !== 'undefined') { MonsterInsightsDualTracker.trackers[hookName](parameters); } else { __gtagDataLayer('event', name, parameters); } } else { __gtagDataLayer.apply(null, arguments); } } __gtagTracker('js', new Date()); __gtagTracker('set', { 'developer_id.dZGIzZG': true, }); __gtagTracker('config', 'G-ZWZDLH05C7', {"forceSSL":"true","link_attribution":"true"} ); window.gtag = __gtagTracker; (function () { /* https://developers.google.com/analytics/devguides/collection/analyticsjs/ */ /* ga and __gaTracker compatibility shim. 
*/ var noopfn = function () { return null; }; var newtracker = function () { return new Tracker(); }; var Tracker = function () { return null; }; var p = Tracker.prototype; p.get = noopfn; p.set = noopfn; p.send = function () { var args = Array.prototype.slice.call(arguments); args.unshift('send'); __gaTracker.apply(null, args); }; var __gaTracker = function () { var len = arguments.length; if (len === 0) { return; } var f = arguments[len - 1]; if (typeof f !== 'object' || f === null || typeof f.hitCallback !== 'function') { if ('send' === arguments[0]) { var hitConverted, hitObject = false, action; if ('event' === arguments[1]) { if ('undefined' !== typeof arguments[3]) { hitObject = { 'eventAction': arguments[3], 'eventCategory': arguments[2], 'eventLabel': arguments[4], 'value': arguments[5] ? arguments[5] : 1, } } } if ('pageview' === arguments[1]) { if ('undefined' !== typeof arguments[2]) { hitObject = { 'eventAction': 'page_view', 'page_path': arguments[2], } } } if (typeof arguments[2] === 'object') { hitObject = arguments[2]; } if (typeof arguments[5] === 'object') { Object.assign(hitObject, arguments[5]); } if ('undefined' !== typeof arguments[1].hitType) { hitObject = arguments[1]; if ('pageview' === hitObject.hitType) { hitObject.eventAction = 'page_view'; } } if (hitObject) { action = 'timing' === arguments[1].hitType ? 
'timing_complete' : hitObject.eventAction; hitConverted = mapArgs(hitObject); __gtagTracker('event', action, hitConverted); } } return; } function mapArgs(args) { var arg, hit = {}; var gaMap = { 'eventCategory': 'event_category', 'eventAction': 'event_action', 'eventLabel': 'event_label', 'eventValue': 'event_value', 'nonInteraction': 'non_interaction', 'timingCategory': 'event_category', 'timingVar': 'name', 'timingValue': 'value', 'timingLabel': 'event_label', 'page': 'page_path', 'location': 'page_location', 'title': 'page_title', 'referrer' : 'page_referrer', }; for (arg in args) { if (!(!args.hasOwnProperty(arg) || !gaMap.hasOwnProperty(arg))) { hit[gaMap[arg]] = args[arg]; } else { hit[arg] = args[arg]; } } return hit; } try { f.hitCallback(); } catch (ex) { } }; __gaTracker.create = newtracker; __gaTracker.getByName = newtracker; __gaTracker.getAll = function () { return []; }; __gaTracker.remove = noopfn; __gaTracker.loaded = true; window['__gaTracker'] = __gaTracker; })(); } else { console.log(""); (function () { function __gtagTracker() { return null; } window['__gtagTracker'] = __gtagTracker; window['gtag'] = __gtagTracker; })(); } </script> <!-- / Google Analytics by MonsterInsights --> <script type="text/javascript"> /* <![CDATA[ */ window._wpemojiSettings = {"baseUrl":"https:\/\/s.w.org\/images\/core\/emoji\/14.0.0\/72x72\/","ext":".png","svgUrl":"https:\/\/s.w.org\/images\/core\/emoji\/14.0.0\/svg\/","svgExt":".svg","source":{"concatemoji":"https:\/\/datascience.ucsd.edu\/wp-includes\/js\/wp-emoji-release.min.js?ver=6.4.3"}}; /*! 
This file is auto-generated */ !function(i,n){var o,s,e;function c(e){try{var t={supportTests:e,timestamp:(new Date).valueOf()};sessionStorage.setItem(o,JSON.stringify(t))}catch(e){}}function p(e,t,n){e.clearRect(0,0,e.canvas.width,e.canvas.height),e.fillText(t,0,0);var t=new Uint32Array(e.getImageData(0,0,e.canvas.width,e.canvas.height).data),r=(e.clearRect(0,0,e.canvas.width,e.canvas.height),e.fillText(n,0,0),new Uint32Array(e.getImageData(0,0,e.canvas.width,e.canvas.height).data));return t.every(function(e,t){return e===r[t]})}function u(e,t,n){switch(t){case"flag":return n(e,"\ud83c\udff3\ufe0f\u200d\u26a7\ufe0f","\ud83c\udff3\ufe0f\u200b\u26a7\ufe0f")?!1:!n(e,"\ud83c\uddfa\ud83c\uddf3","\ud83c\uddfa\u200b\ud83c\uddf3")&&!n(e,"\ud83c\udff4\udb40\udc67\udb40\udc62\udb40\udc65\udb40\udc6e\udb40\udc67\udb40\udc7f","\ud83c\udff4\u200b\udb40\udc67\u200b\udb40\udc62\u200b\udb40\udc65\u200b\udb40\udc6e\u200b\udb40\udc67\u200b\udb40\udc7f");case"emoji":return!n(e,"\ud83e\udef1\ud83c\udffb\u200d\ud83e\udef2\ud83c\udfff","\ud83e\udef1\ud83c\udffb\u200b\ud83e\udef2\ud83c\udfff")}return!1}function f(e,t,n){var r="undefined"!=typeof WorkerGlobalScope&&self instanceof WorkerGlobalScope?new OffscreenC
Wow, that is gross looking! 😰
Many sites have a robots.txt file in their root directory, which contains a policy that allows or disallows automatic access to their site.
If you make too many requests:
For instance, here's the content of a very basic webpage.
!cat data/lec10_ex1.html
<html> <head> <title>Page title</title> </head> <body> <h1>This is a heading</h1> <p>This is a paragraph.</p> <p>This is <b>another</b> paragraph.</p> </body> </html>
Using IPython.display.HTML, we can render it directly in our notebook.
from IPython.display import HTML
HTML(filename=Path('data') / 'lec10_ex1.html')
This is a paragraph.
This is another paragraph.
HTML document: The totality of markup that makes up a webpage.
Document Object Model (DOM): The internal representation of an HTML document as a hierarchical tree structure.
HTML element: An object in the DOM, such as a paragraph, header, or title.
HTML tags: Markers that denote the start and end of an element, such as <p> and </p>.
Element | Description |
---|---|
<html> | the document |
<head> | the header |
<body> | the body |
<div> | a logical division of the document |
<span> | an inline logical division |
<p> | a paragraph |
<a> | an anchor (hyperlink) |
<h1>, <h2>, ... | header(s) |
<img> | an image |
There are many, many more. See this article for examples.
Tags can have attributes, which further specify how to display information on a webpage.
For instance, <img> tags have src and alt attributes (among others):
<img src="king-selfie.png" alt="A photograph of King Triton." width=500>
Click this link to access past exams.
Hyperlinks have href attributes:
Click <a href="https://practice.dsc80.com">this link</a> to access past exams.
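To make the idea of attributes concrete, here is a minimal sketch that lists the attributes of the two tags above. It uses Python's standard-library html.parser rather than any tool introduced in this lecture, and the class name AttributeCollector is a hypothetical helper made up for illustration.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Records (tag name, attribute dict) for every opening tag it sees."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.seen.append((tag, dict(attrs)))


# The two snippets shown above, joined into one string.
snippet = (
    '<img src="king-selfie.png" alt="A photograph of King Triton." width=500>'
    'Click <a href="https://practice.dsc80.com">this link</a> to access past exams.'
)

collector = AttributeCollector()
collector.feed(snippet)
collector.close()

for tag, attributes in collector.seen:
    print(tag, attributes)
```

Each tag's attributes come back as name/value pairs, which is the same information a scraper reads when it looks up src or href.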
What do you think this webpage looks like?
!cat data/lec10_ex2.html
<html> <head> <title>Project 4A and 4B - DSC 80, Winter 2024</title> <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0-alpha1/dist/css/bootstrap.min.css" rel="stylesheet" /> </head> <body> <h1>Project Overview</h1> <img src="../imgs/platter.png" width="200" alt="My dinner last night." /> <p> When the project is released, you can start it by <a href="https://github.com/dsc-courses/dsc80-2024-wi/" >public GitHub repo</a >. </p> <center> <h3> Note that you'll have to submit your notebook as a PDF and a link to your website. </h3> </center> </body> </html>
<div> tag
<div style="background-color:lightblue">
<h3>This is a heading</h3>
<p>This is a paragraph.</p>
</div>
The <div> tag defines a division or a "section" of an HTML document.
Think of a <div> as a "cell" in a Jupyter Notebook.
The <div> element is often used as a container for other HTML elements to style them with CSS or to perform operations involving them using JavaScript.
<div> elements often have attributes, which are important when scraping!
Under the document object model (DOM), HTML documents are trees. In DOM trees, child nodes are ordered.
What does the DOM tree look like for this document?
To start, we'll work with the source code for an HTML page with the DOM tree shown below:
The string html_string contains an HTML "document".
html_string = '''
<html>
<body>
<div id="content">
<h1>Heading here</h1>
<p>My First paragraph</p>
<p>My <em>second</em> paragraph</p>
<hr>
</div>
<div id="nav">
<ul>
<li>item 1</li>
<li>item 2</li>
<li>item 3</li>
</ul>
</div>
</body>
</html>
'''.strip()
HTML(html_string)
My First paragraph
My second paragraph
BeautifulSoup objects
bs4.BeautifulSoup takes in a string or file-like object representing HTML (markup) and returns a parsed document.
import bs4
bs4.BeautifulSoup?
Normally, we pass the result of a GET request to bs4.BeautifulSoup, but here we will pass our hand-crafted html_string.
soup = bs4.BeautifulSoup(html_string)
soup
<html> <body> <div id="content"> <h1>Heading here</h1> <p>My First paragraph</p> <p>My <em>second</em> paragraph</p> <hr/> </div> <div id="nav"> <ul> <li>item 1</li> <li>item 2</li> <li>item 3</li> </ul> </div> </body> </html>
type(soup)
bs4.BeautifulSoup
BeautifulSoup objects have several useful attributes, e.g. text:
soup
<html> <body> <div id="content"> <h1>Heading here</h1> <p>My First paragraph</p> <p>My <em>second</em> paragraph</p> <hr/> </div> <div id="nav"> <ul> <li>item 1</li> <li>item 2</li> <li>item 3</li> </ul> </div> </body> </html>
print(soup.text)
Heading here My First paragraph My second paragraph item 1 item 2 item 3
descendants
The descendants attribute traverses a BeautifulSoup tree using depth-first traversal.
Why depth-first? Elements closer to one another on a page are more likely to be related than elements further away.
soup.descendants
<generator object Tag.descendants at 0x133e79ac0>
for child in soup.descendants:
    # print(child) # What would happen if we ran this instead?
    if isinstance(child, str):
        continue
    print(child.name)
html body div h1 p p em hr div ul li li li
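One way to convince yourself that this order is depth-first: in the HTML source, opening tags appear in exactly the order a depth-first (pre-order) traversal visits the elements. The sketch below checks that with the standard-library html.parser instead of BeautifulSoup. TagOrder is a hypothetical helper name, and html_string is repeated here so the block runs on its own.

```python
from html.parser import HTMLParser

html_string = '''
<html>
<body>
<div id="content">
<h1>Heading here</h1>
<p>My First paragraph</p>
<p>My <em>second</em> paragraph</p>
<hr>
</div>
<div id="nav">
<ul>
<li>item 1</li>
<li>item 2</li>
<li>item 3</li>
</ul>
</div>
</body>
</html>
'''.strip()


class TagOrder(HTMLParser):
    """Collects tag names in the order their opening tags appear in the source."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        self.names.append(tag)


parser = TagOrder()
parser.feed(html_string)
parser.close()

print(' '.join(parser.names))
```

The printed sequence matches the tag names produced by looping over soup.descendants above.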
Practically speaking, you will not use the descendants attribute (or the related children attribute) directly very often. Instead, you will use the following methods:
soup.find(tag), which finds the first instance of a tag (the first one on the page, i.e. the first one that DFS sees). More generally: soup.find(name=None, attrs={}, recursive=True, text=None, **kwargs).
soup.find_all(tag) will find all instances of a tag.
find finds tags!
soup.find('div')
<div id="content"> <h1>Heading here</h1> <p>My First paragraph</p> <p>My <em>second</em> paragraph</p> <hr/> </div>
Let's try and find the <div> element that has an id attribute equal to 'nav'.
len(soup.find_all('div'))
2
soup
<html> <body> <div id="content"> <h1>Heading here</h1> <p>My First paragraph</p> <p>My <em>second</em> paragraph</p> <hr/> </div> <div id="nav"> <ul> <li>item 1</li> <li>item 2</li> <li>item 3</li> </ul> </div> </body> </html>
soup.find('div', attrs={'id': 'nav'})
<div id="nav"> <ul> <li>item 1</li> <li>item 2</li> <li>item 3</li> </ul> </div>
find will return the first occurrence of a tag, regardless of its depth in the tree.
soup
<html> <body> <div id="content"> <h1>Heading here</h1> <p>My First paragraph</p> <p>My <em>second</em> paragraph</p> <hr/> </div> <div id="nav"> <ul> <li>item 1</li> <li>item 2</li> <li>item 3</li> </ul> </div> </body> </html>
# The <li> elements are nested deep in the tree, but find still locates the first one.
soup.find('li')
<li>item 1</li>
find_all
find_all returns a list of all matches.
soup.find_all('div')
[<div id="content"> <h1>Heading here</h1> <p>My First paragraph</p> <p>My <em>second</em> paragraph</p> <hr/> </div>, <div id="nav"> <ul> <li>item 1</li> <li>item 2</li> <li>item 3</li> </ul> </div>]
soup.find_all('li')
[<li>item 1</li>, <li>item 2</li>, <li>item 3</li>]
soup.find('ucsd')
soup.find_all('li')[0]
<li>item 1</li>
[x.text for x in soup.find_all('li')]
['item 1', 'item 2', 'item 3']
The text attribute of a tag element gets the text between the opening and closing tags.
The attrs attribute of a tag element lists all of its attributes.
The get method of a tag element gets the value of an attribute.
soup
<html> <body> <div id="content"> <h1>Heading here</h1> <p>My First paragraph</p> <p>My <em>second</em> paragraph</p> <hr/> </div> <div id="nav"> <ul> <li>item 1</li> <li>item 2</li> <li>item 3</li> </ul> </div> </body> </html>
soup.find('p')
<p>My First paragraph</p>
soup.find('p').text
'My First paragraph'
soup.find('div')
<div id="content"> <h1>Heading here</h1> <p>My First paragraph</p> <p>My <em>second</em> paragraph</p> <hr/> </div>
soup.find('div').text
'\nHeading here\nMy First paragraph\nMy second paragraph\n\n'
soup.find('div').attrs
{'id': 'content'}
# soup.find('img').get('src') # this would find you the link to an image, for example.
[div.attrs for div in soup.find_all('div')]
[{'id': 'content'}, {'id': 'nav'}]
soup.find('div').get('id')
'content'
The get method must be called directly on the node that contains the attribute you're looking for.
soup
<html> <body> <div id="content"> <h1>Heading here</h1> <p>My First paragraph</p> <p>My <em>second</em> paragraph</p> <hr/> </div> <div id="nav"> <ul> <li>item 1</li> <li>item 2</li> <li>item 3</li> </ul> </div> </body> </html>
# While there are multiple 'id' attributes, none of them are in the <html> tag at the top.
soup.get('id')
soup.find('div').attrs['id']
'content'
soup.find('div').get('id')
'content'
Consider the following HTML document, which represents a webpage containing the top few songs with the most streams on Spotify today in Canada.
<head>
<title>3*Canada-2022-06-04</title>
</head>
<body>
<h1>Spotify Top 3 - Canada</h1>
<table>
<tr class='heading'>
<th>Rank</th>
<th>Artist(s)</th>
<th>Song</th>
</tr>
<tr class=1>
<td>1</td>
<td>Harry Styles</td>
<td>As It Was</td>
</tr>
<tr class=2>
<td>2</td>
<td>Jack Harlow</td>
<td>First Class</td>
</tr>
<tr class=3>
<td>3</td>
<td>Kendrick Lamar</td>
<td>N95</td>
</tr>
</table>
</body>
Part 1: How many leaf nodes are there in the DOM tree of the previous document — that is, how many nodes have no children?
Answer: 14.
Part 2: What does the following line of code evaluate to?
len(soup.find_all("td"))
Answer: 9.
Part 3: What does the following line of code evaluate to?
soup.find("tr").get("class")
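Answer: ['heading']. BeautifulSoup treats class as a multi-valued attribute, so get returns a list rather than the string 'heading'.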
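The answers to Parts 1 and 2 can be checked mechanically. The sketch below uses the standard-library html.parser rather than BeautifulSoup, and ElementStats is a hypothetical helper name. It rests on two assumptions: only elements count as nodes (text is ignored, which is what makes the Part 1 answer 14), and every element has an explicit closing tag, as is the case in this document.

```python
from collections import Counter
from html.parser import HTMLParser

document = '''
<head>
<title>3*Canada-2022-06-04</title>
</head>
<body>
<h1>Spotify Top 3 - Canada</h1>
<table>
<tr class='heading'>
<th>Rank</th>
<th>Artist(s)</th>
<th>Song</th>
</tr>
<tr class=1>
<td>1</td>
<td>Harry Styles</td>
<td>As It Was</td>
</tr>
<tr class=2>
<td>2</td>
<td>Jack Harlow</td>
<td>First Class</td>
</tr>
<tr class=3>
<td>3</td>
<td>Kendrick Lamar</td>
<td>N95</td>
</tr>
</table>
</body>
'''


class ElementStats(HTMLParser):
    """Counts elements per tag name and elements that contain no child elements."""

    def __init__(self):
        super().__init__()
        self.child_counts = []       # one counter per currently open element
        self.leaves = 0
        self.tag_counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        if self.child_counts:
            self.child_counts[-1] += 1   # the enclosing element gains a child
        self.child_counts.append(0)

    def handle_endtag(self, tag):
        # An element that closes without having gained a child is a leaf.
        if self.child_counts.pop() == 0:
            self.leaves += 1


stats = ElementStats()
stats.feed(document)
stats.close()

print(stats.leaves)            # leaf elements (Part 1)
print(stats.tag_counts['td'])  # number of <td> elements (Part 2)
```

The leaves are the one title, the one h1, the three th elements, and the nine td elements.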
Consider quotes.toscrape.com.
Goal: Extract quotes (and relevant metadata) into a DataFrame.
Specifically, let's try to make a DataFrame that looks like the one below:
 | quote | author | author_url | tags |
---|---|---|---|---|
0 | “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” | Albert Einstein | https://quotes.toscrape.com/author/Albert-Einstein | change,deep-thoughts,thinking,world |
1 | “It is our choices, Harry, that show what we truly are, far more than our abilities.” | J.K. Rowling | https://quotes.toscrape.com/author/J-K-Rowling | abilities,choices |
2 | “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.” | Albert Einstein | https://quotes.toscrape.com/author/Albert-Einstein | inspirational,life,live,miracle,miracles |
Eventually, we will create a single function, make_quote_df, which takes in an integer n and returns a DataFrame with the quotes on the first n pages of quotes.toscrape.com.
To do this, we will define several helper functions:
download_page(i), which downloads a single page (page i) and returns a BeautifulSoup object of the response.
process_quote(div), which takes in a <div> tree corresponding to a single quote and returns a dictionary containing all of the relevant information for that quote.
process_page(divs), which takes in a list of <div> trees corresponding to a single page and returns a DataFrame containing all of the relevant information for all quotes on that page.
Key principle: some of our helper functions will make requests, and others will parse, but none will do both!
def download_page(i):
    url = f'https://quotes.toscrape.com/page/{i}'
    # requests.get returns a Response object; its text attribute holds the HTML.
    response = requests.get(url)
    return bs4.BeautifulSoup(response.text)
In make_quote_df, we will call download_page repeatedly: once for i=1, once for i=2, ..., and once for i=n. For now, we will work with just page 1 (chosen arbitrarily).
soup = download_page(1)
Let's look at the page's source code (right click the page and click "Inspect" in Chrome) to find where the quotes in the page are located.
soup.find_all('div', attrs={'class': 'quote'})
[<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span> <span>by <small class="author" itemprop="author">Albert Einstein</small> <a href="/author/Albert-Einstein">(about)</a> </span> <div class="tags"> Tags: <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/> <a class="tag" href="/tag/change/page/1/">change</a> <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a> <a class="tag" href="/tag/thinking/page/1/">thinking</a> <a class="tag" href="/tag/world/page/1/">world</a> </div> </div>, <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span> <span>by <small class="author" itemprop="author">J.K. Rowling</small> <a href="/author/J-K-Rowling">(about)</a> </span> <div class="tags"> Tags: <meta class="keywords" content="abilities,choices" itemprop="keywords"/> <a class="tag" href="/tag/abilities/page/1/">abilities</a> <a class="tag" href="/tag/choices/page/1/">choices</a> </div> </div>, <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. 
The other is as though everything is a miracle.”</span> <span>by <small class="author" itemprop="author">Albert Einstein</small> <a href="/author/Albert-Einstein">(about)</a> </span> <div class="tags"> Tags: <meta class="keywords" content="inspirational,life,live,miracle,miracles" itemprop="keywords"/> <a class="tag" href="/tag/inspirational/page/1/">inspirational</a> <a class="tag" href="/tag/life/page/1/">life</a> <a class="tag" href="/tag/live/page/1/">live</a> <a class="tag" href="/tag/miracle/page/1/">miracle</a> <a class="tag" href="/tag/miracles/page/1/">miracles</a> </div> </div>, <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span> <span>by <small class="author" itemprop="author">Jane Austen</small> <a href="/author/Jane-Austen">(about)</a> </span> <div class="tags"> Tags: <meta class="keywords" content="aliteracy,books,classic,humor" itemprop="keywords"/> <a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a> <a class="tag" href="/tag/books/page/1/">books</a> <a class="tag" href="/tag/classic/page/1/">classic</a> <a class="tag" href="/tag/humor/page/1/">humor</a> </div> </div>, <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span> <span>by <small class="author" itemprop="author">Marilyn Monroe</small> <a href="/author/Marilyn-Monroe">(about)</a> </span> <div class="tags"> Tags: <meta class="keywords" content="be-yourself,inspirational" itemprop="keywords"/> <a class="tag" href="/tag/be-yourself/page/1/">be-yourself</a> <a class="tag" href="/tag/inspirational/page/1/">inspirational</a> </div> </div>, <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“Try 
not to become a man of success. Rather become a man of value.”</span> <span>by <small class="author" itemprop="author">Albert Einstein</small> <a href="/author/Albert-Einstein">(about)</a> </span> <div class="tags"> Tags: <meta class="keywords" content="adulthood,success,value" itemprop="keywords"/> <a class="tag" href="/tag/adulthood/page/1/">adulthood</a> <a class="tag" href="/tag/success/page/1/">success</a> <a class="tag" href="/tag/value/page/1/">value</a> </div> </div>, <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.”</span> <span>by <small class="author" itemprop="author">André Gide</small> <a href="/author/Andre-Gide">(about)</a> </span> <div class="tags"> Tags: <meta class="keywords" content="life,love" itemprop="keywords"/> <a class="tag" href="/tag/life/page/1/">life</a> <a class="tag" href="/tag/love/page/1/">love</a> </div> </div>, <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“I have not failed. I've just found 10,000 ways that won't work.”</span> <span>by <small class="author" itemprop="author">Thomas A. 
Edison</small> <a href="/author/Thomas-A-Edison">(about)</a> </span> <div class="tags"> Tags: <meta class="keywords" content="edison,failure,inspirational,paraphrased" itemprop="keywords"/> <a class="tag" href="/tag/edison/page/1/">edison</a> <a class="tag" href="/tag/failure/page/1/">failure</a> <a class="tag" href="/tag/inspirational/page/1/">inspirational</a> <a class="tag" href="/tag/paraphrased/page/1/">paraphrased</a> </div> </div>, <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“A woman is like a tea bag; you never know how strong it is until it's in hot water.”</span> <span>by <small class="author" itemprop="author">Eleanor Roosevelt</small> <a href="/author/Eleanor-Roosevelt">(about)</a> </span> <div class="tags"> Tags: <meta class="keywords" content="misattributed-eleanor-roosevelt" itemprop="keywords"/> <a class="tag" href="/tag/misattributed-eleanor-roosevelt/page/1/">misattributed-eleanor-roosevelt</a> </div> </div>, <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“A day without sunshine is like, you know, night.”</span> <span>by <small class="author" itemprop="author">Steve Martin</small> <a href="/author/Steve-Martin">(about)</a> </span> <div class="tags"> Tags: <meta class="keywords" content="humor,obvious,simile" itemprop="keywords"/> <a class="tag" href="/tag/humor/page/1/">humor</a> <a class="tag" href="/tag/obvious/page/1/">obvious</a> <a class="tag" href="/tag/simile/page/1/">simile</a> </div> </div>]
divs = soup.find_all('div', class_='quote')
divs[0]
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span> <span>by <small class="author" itemprop="author">Albert Einstein</small> <a href="/author/Albert-Einstein">(about)</a> </span> <div class="tags"> Tags: <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/> <a class="tag" href="/tag/change/page/1/">change</a> <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a> <a class="tag" href="/tag/thinking/page/1/">thinking</a> <a class="tag" href="/tag/world/page/1/">world</a> </div> </div>
From this <div>, we can extract the quote, author name, author's URL, and tags.
divs[0]
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span> <span>by <small class="author" itemprop="author">Albert Einstein</small> <a href="/author/Albert-Einstein">(about)</a> </span> <div class="tags"> Tags: <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/> <a class="tag" href="/tag/change/page/1/">change</a> <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a> <a class="tag" href="/tag/thinking/page/1/">thinking</a> <a class="tag" href="/tag/world/page/1/">world</a> </div> </div>
# The quote.
divs[0].find('span', class_='text').text
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
# The author.
divs[0].find('small', class_='author').text
'Albert Einstein'
# The URL for the author.
'https://quotes.toscrape.com' + divs[0].find('a').get('href')
'https://quotes.toscrape.com/author/Albert-Einstein'
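String concatenation works here because every author link on this page is a root-relative path. As a hedged alternative (not what this lecture uses), the standard library's urllib.parse.urljoin resolves an href against the URL of the page it came from, and it also copes with hrefs that are already absolute. The variable names below are made up for illustration.

```python
from urllib.parse import urljoin

# URL of the page the link was found on, and the href taken from the <a> tag.
page_url = 'https://quotes.toscrape.com/page/1'
relative_href = '/author/Albert-Einstein'

author_url = urljoin(page_url, relative_href)
print(author_url)

# An href that is already absolute is returned unchanged.
absolute_href = 'https://quotes.toscrape.com/author/J-K-Rowling'
unchanged = urljoin(page_url, absolute_href)
print(unchanged)
```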
# The quote's tags.
divs[0].find('meta', class_='keywords').get('content')
'change,deep-thoughts,thinking,world'
Let's implement our next function, process_quote, which takes in a <div> corresponding to a single quote and returns a dictionary containing the quote's information.
Why use a dictionary? Passing pd.DataFrame a list of dictionaries is an easy way to create a DataFrame.
def process_quote(div):
    quote = div.find('span', class_='text').text
    author = div.find('small', class_='author').text
    author_url = 'https://quotes.toscrape.com' + div.find('a').get('href')
    tags = div.find('meta', class_='keywords').get('content')
    return {'quote': quote, 'author': author, 'author_url': author_url, 'tags': tags}
process_quote(divs[0])
{'quote': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'author_url': 'https://quotes.toscrape.com/author/Albert-Einstein', 'tags': 'change,deep-thoughts,thinking,world'}
Our last helper function will take in a list of <div>s, call process_quote on each <div> in the list, and return a DataFrame.
def process_page(divs):
    return pd.DataFrame([process_quote(div) for div in divs])
process_page(divs)
 | quote | author | author_url | tags |
---|---|---|---|---|
0 | “The world as we have created it is a process ... | Albert Einstein | https://quotes.toscrape.com/author/Albert-Eins... | change,deep-thoughts,thinking,world |
1 | “It is our choices, Harry, that show what we t... | J.K. Rowling | https://quotes.toscrape.com/author/J-K-Rowling | abilities,choices |
2 | “There are only two ways to live your life. On... | Albert Einstein | https://quotes.toscrape.com/author/Albert-Eins... | inspirational,life,live,miracle,miracles |
... | ... | ... | ... | ... |
7 | “I have not failed. I've just found 10,000 way... | Thomas A. Edison | https://quotes.toscrape.com/author/Thomas-A-Ed... | edison,failure,inspirational,paraphrased |
8 | “A woman is like a tea bag; you never know how... | Eleanor Roosevelt | https://quotes.toscrape.com/author/Eleanor-Roo... | misattributed-eleanor-roosevelt |
9 | “A day without sunshine is like, you know, nig... | Steve Martin | https://quotes.toscrape.com/author/Steve-Martin | humor,obvious,simile |
10 rows × 4 columns
def make_quote_df(n):
    '''Returns a DataFrame containing the quotes on the first n pages of https://quotes.toscrape.com/.'''
    dfs = []
    for i in range(1, n+1):
        # Download page i and create a BeautifulSoup object.
        soup = download_page(i)
        # Create DataFrame using the information in that page.
        divs = soup.find_all('div', class_='quote')
        df = process_page(divs)
        # Append DataFrame to dfs.
        dfs.append(df)
    # Stitch all DataFrames together.
    return pd.concat(dfs, ignore_index=True)
quotes = make_quote_df(3)
# quotes.head()
quotes
 | quote | author | author_url | tags |
---|---|---|---|---|
0 | “The world as we have created it is a process ... | Albert Einstein | https://quotes.toscrape.com/author/Albert-Eins... | change,deep-thoughts,thinking,world |
1 | “It is our choices, Harry, that show what we t... | J.K. Rowling | https://quotes.toscrape.com/author/J-K-Rowling | abilities,choices |
2 | “There are only two ways to live your life. On... | Albert Einstein | https://quotes.toscrape.com/author/Albert-Eins... | inspirational,life,live,miracle,miracles |
... | ... | ... | ... | ... |
27 | “It is impossible to live without failing at s... | J.K. Rowling | https://quotes.toscrape.com/author/J-K-Rowling | |
28 | “Logic will get you from A to Z; imagination w... | Albert Einstein | https://quotes.toscrape.com/author/Albert-Eins... | imagination |
29 | “One good thing about music, when it hits you,... | Bob Marley | https://quotes.toscrape.com/author/Bob-Marley | music |
30 rows × 4 columns
quotes[quotes['author'] == 'Albert Einstein']
quote | author | author_url | tags | |
---|---|---|---|---|
0 | “The world as we have created it is a process ... | Albert Einstein | https://quotes.toscrape.com/author/Albert-Eins... | change,deep-thoughts,thinking,world |
2 | “There are only two ways to live your life. On... | Albert Einstein | https://quotes.toscrape.com/author/Albert-Eins... | inspirational,life,live,miracle,miracles |
5 | “Try not to become a man of success. Rather be... | Albert Einstein | https://quotes.toscrape.com/author/Albert-Eins... | adulthood,success,value |
12 | “If you can't explain it to a six year old, yo... | Albert Einstein | https://quotes.toscrape.com/author/Albert-Eins... | simplicity,understand |
26 | “If you want your children to be intelligent, ... | Albert Einstein | https://quotes.toscrape.com/author/Albert-Eins... | children,fairy-tales |
28 | “Logic will get you from A to Z; imagination w... | Albert Einstein | https://quotes.toscrape.com/author/Albert-Eins... | imagination |
The elements in the 'tags' column are all strings, but they look like lists. This is not ideal, as we will see shortly.
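One reason strings are awkward here shows up when searching for a tag. The sketch below uses a few rows copied from the output above. The names `toy_quotes`, `substring_hits`, and `exact_hits`, and the search term 'thin', are made up for illustration. Splitting on commas is one possible fix, not necessarily the one used later.

```python
import pandas as pd

# Toy rows copied from the output above; the real `quotes` DataFrame has more columns.
toy_quotes = pd.DataFrame({
    'author': ['Albert Einstein', 'J.K. Rowling', 'Bob Marley'],
    'tags': ['change,deep-thoughts,thinking,world', 'abilities,choices', 'music'],
})

# Substring matching on the raw string can give false positives:
# 'thinking' contains 'thin', even though no quote is tagged 'thin'.
substring_hits = toy_quotes['tags'].str.contains('thin')

# Splitting on commas turns each string into an actual list of tags,
# so membership checks compare whole tags rather than substrings.
tag_lists = toy_quotes['tags'].str.split(',')
exact_hits = tag_lists.apply(lambda tags: 'thin' in tags)

print(substring_hits.tolist())
print(exact_hits.tolist())
```

Note that an empty string, like the 'tags' value in row 27, splits into a list containing one empty string rather than an empty list.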
Let's try to extract a list of HDSI faculty from datascience.ucsd.edu/faculty.
fac_response = requests.get('https://datascience.ucsd.edu/faculty/')
fac_response
<Response [200]>
soup = bs4.BeautifulSoup(fac_response.text)
It's not easy to identify which <div>s we want. The Inspect tool makes this easier, but it's good to verify that find_all is finding the right number of elements.
divs = soup.find_all('div', class_='vc_grid-item')
len(divs)
63
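Each faculty <div> carries several classes, yet searching for class_='vc_grid-item' still finds it, because Beautiful Soup compares class_ against each individual class of an element. A minimal sketch of that behavior, using a made-up HTML snippet rather than the real page (the names `toy_soup`, `toy_divs`, and `toy_names` are hypothetical):

```python
import bs4

# Hypothetical HTML: two grid items with multiple classes, plus one unrelated div.
toy_html = '''
<div class="vc_grid-item vc_clearfix"><h4>Person A</h4></div>
<div class="vc_grid-item vc_visible-item"><h4>Person B</h4></div>
<div class="footer"><h4>Not a person</h4></div>
'''
toy_soup = bs4.BeautifulSoup(toy_html, 'html.parser')

# class_ matches if ANY of an element's classes equals the given string.
toy_divs = toy_soup.find_all('div', class_='vc_grid-item')
toy_names = [div.find('h4').text for div in toy_divs]

print(len(toy_divs), toy_names)
```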
Within each of these <div>s, we need to extract the faculty member's name. It seems like names are stored as text within the <h4> tag.
divs[0]
<div class="vc_grid-item vc_clearfix col_1-5 vc_grid-item-zone-c-bottom vc_visible-item vc_grid-term-council vc_grid-term-faculty vc_grid-term-faculty-fellows"> <a class="anchor-link" id="ilkay-altintas" name="ilkay-altintas"></a> <div class="vc_grid-item-mini vc_clearfix"><div class="vc_gitem-animated-block"> <div class="vc_gitem-zone vc_gitem-zone-a vc-gitem-zone-height-mode-auto vc-gitem-zone-height-mode-auto-1-1" style="background-image: url(https://datascience.ucsd.edu/wp-content/uploads/2022/10/Ilkay-Altintas-2.jpg) !important;"><a class="vc_gitem-link vc-zone-link" href="https://datascience.ucsd.edu/people/ilkay-altintas/"></a><img alt="Ilkay Altintas" class="vc_gitem-zone-img" decoding="async" src="https://datascience.ucsd.edu/wp-content/uploads/2022/10/Ilkay-Altintas-2.jpg"/> <div class="vc_gitem-zone-mini"></div> </div> </div><div class="vc_gitem-zone vc_gitem-zone-c"> <div class="vc_gitem-zone-mini"> <div class="vc_gitem_row vc_row vc_gitem-row-position-top"> <div class="vc_col-sm-12 vc_gitem-col vc_gitem-col-align-"> <div class="vc_custom_heading vc_gitem-post-data vc_gitem-post-data-source-post_title"> <h4 style="text-align: left"><a href="https://datascience.ucsd.edu/people/ilkay-altintas/">Ilkay Altintas</a></h4> </div> <div class="vc_gitem-align-left fields"> <div class="field pendari_people_title">SDSC Chief Data Science Officer & HDSI Founding Faculty Fellow</div> </div> <div class="excerpt"> </div> <div class="terms"> Council Faculty Faculty Fellows </div> </div> </div> </div> </div></div></div>
divs[39].find('h4').text
'Suraj Rampure'
We can also extract job titles:
divs[39].find('div', class_='pendari_people_title').text
'Lecturer'
Let's create a DataFrame consisting of names and job titles for each faculty member.
names = [div.find('h4').text for div in divs]
names[:10]
['Ilkay Altintas', 'Tiffany Amariuta', 'Mikio Aoi', 'Ery Arias-Castro', 'Vineet Bafna', 'Mikhail Belkin', 'Jelena Bradic', 'Henrik Christensen', 'Alex Cloninger', 'Anders Dale']
titles = [div.find(class_='field').text for div in divs]
titles[:10]
['SDSC Chief Data Science Officer & HDSI Founding Faculty Fellow', 'Assistant Professor', 'Assistant Professor', 'Professor', 'Professor', 'Professor', 'Professor', 'Distinguished Scientist, Professor', 'Associate Professor', 'Distinguished Scientist, Professor']
faculty = pd.DataFrame({
'name': names,
'title': titles,
})
faculty.head()
name | title | |
---|---|---|
0 | Ilkay Altintas | SDSC Chief Data Science Officer & HDSI Foundin... |
1 | Tiffany Amariuta | Assistant Professor |
2 | Mikio Aoi | Assistant Professor |
3 | Ery Arias-Castro | Professor |
4 | Vineet Bafna | Professor |
Now we have a DataFrame!
faculty[faculty['title'].str.contains('Teaching') | faculty['title'].str.contains('Lecturer')]
name | title | |
---|---|---|
12 | Justin Eldridge | Assistant Teaching Professor |
13 | Shannon Ellis | Associate Teaching Professor |
27 | Marina Langlois | Lecturer |
... | ... | ... |
39 | Suraj Rampure | Lecturer |
47 | Jack Silberman | Lecturer |
51 | Janine Tiefenbruck | Lecturer |
9 rows × 2 columns
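Since str.contains treats its pattern as a regular expression by default, the two conditions above can likely be collapsed into one pattern using `|`. A small sketch on made-up rows (`toy_faculty` is a hypothetical stand-in for `faculty`):

```python
import pandas as pd

# Hypothetical stand-in for the `faculty` DataFrame.
toy_faculty = pd.DataFrame({
    'name': ['Justin Eldridge', 'Marina Langlois', 'Mikio Aoi'],
    'title': ['Assistant Teaching Professor', 'Lecturer', 'Assistant Professor'],
})

# Two separate substring checks combined with an element-wise "or".
two_calls = (toy_faculty['title'].str.contains('Teaching')
             | toy_faculty['title'].str.contains('Lecturer'))

# One regular expression: '|' means "either side may match".
one_pattern = toy_faculty['title'].str.contains('Teaching|Lecturer')

print(toy_faculty[one_pattern])
```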
What if we want to get faculty members' pictures?
len(divs)
63
names.index('Marina Langlois')
27
divs[27].find('img').get('src')
'https://datascience.ucsd.edu/wp-content/uploads/2022/10/Marina-Langlois-1.jpg'
from IPython.display import Image, display
def show_picture(name):
    idx = faculty[faculty['name'].str.lower().str.contains(name)].index[0]
    display(Image(divs[idx].find('img').get('src')))
show_picture('suraj')
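Note that show_picture lowercases the stored names but not its argument, so a call like show_picture('Suraj') would find no rows and fail on .index[0]. A possible variation on the lookup step is sketched below. The helper `find_index` and the DataFrame `toy_faculty` are hypothetical and only cover the search, not the image display.

```python
import pandas as pd

# Hypothetical stand-in for the `faculty` DataFrame.
toy_faculty = pd.DataFrame({'name': ['Marina Langlois', 'Suraj Rampure']})

def find_index(name, df=toy_faculty):
    '''Returns the index label of the first row whose name contains `name`, ignoring case.'''
    # case=False ignores capitalization on both sides;
    # regex=False treats `name` as literal text rather than a pattern.
    matches = df[df['name'].str.contains(name, case=False, regex=False)]
    if matches.empty:
        raise ValueError(f'No faculty member matching {name!r}')
    return matches.index[0]

print(find_index('SURAJ'))
```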
Consider the following HTML document, which represents a webpage containing the top few songs with the most streams on Spotify today in Canada.
<head>
<title>3*Canada-2022-06-04</title>
</head>
<body>
<h1>Spotify Top 3 - Canada</h1>
<table>
<tr class='heading'>
<th>Rank</th>
<th>Artist(s)</th>
<th>Song</th>
</tr>
<tr class=1>
<td>1</td>
<td>Harry Styles</td>
<td>As It Was</td>
</tr>
<tr class=2>
<td>2</td>
<td>Jack Harlow</td>
<td>First Class</td>
</tr>
<tr class=3>
<td>3</td>
<td>Kendrick Lamar</td>
<td>N95</td>
</tr>
</table>
</body>
Part 4: Complete the implementation of the function top_nth, which takes in a positive integer n and returns the name of the n-th ranked song in the HTML document. For instance, top_nth(2) should evaluate to "First Class" (n=1 corresponds to the top song).
Note: Your implementation should work in the case that the page contains more than 3 songs.
def top_nth(n):
    return soup.find("tr", attrs=__(a)__).find_all("td")__(b)__
def top_nth(n):
    return soup.find("tr", attrs={"class": n}).find_all("td")[-1].text
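To check the solution, the document can be parsed directly. The sketch below condenses the HTML above onto fewer lines and introduces the names `spotify_soup` and `top_nth_checked` for illustration. It passes str(n) instead of n so that the match does not depend on how Beautiful Soup handles non-string attribute values; that change is a precaution assumed here, not part of the original solution.

```python
import bs4

# The same document as above, with whitespace condensed.
spotify_html = '''
<head><title>3*Canada-2022-06-04</title></head>
<body>
<h1>Spotify Top 3 - Canada</h1>
<table>
<tr class='heading'><th>Rank</th><th>Artist(s)</th><th>Song</th></tr>
<tr class=1><td>1</td><td>Harry Styles</td><td>As It Was</td></tr>
<tr class=2><td>2</td><td>Jack Harlow</td><td>First Class</td></tr>
<tr class=3><td>3</td><td>Kendrick Lamar</td><td>N95</td></tr>
</table>
</body>
'''
spotify_soup = bs4.BeautifulSoup(spotify_html, 'html.parser')

def top_nth_checked(n):
    # Find the row whose class is the rank, then take its last cell (the song name).
    row = spotify_soup.find('tr', attrs={'class': str(n)})
    return row.find_all('td')[-1].text

print(top_nth_checked(2))
```

Because the row is looked up by its class rather than by position, the same function keeps working if more ranked rows are added to the table.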
The spread of true and false news online by Vosoughi et al. compared how true and false news spreads via Twitter:
There is worldwide concern over false news and the possibility that it can influence political, economic, and social well-being. To understand how false news spreads, Vosoughi et al. used a data set of rumor cascades on Twitter from 2006 to 2017. About 126,000 rumors were spread by ∼3 million people. False news reached more people than the truth; the top 1% of false news cascades diffused to between 1000 and 100,000 people, whereas the truth rarely diffused to more than 1000 people. Falsehood also diffused faster than the truth. The degree of novelty and the emotional reactions of recipients may be responsible for the differences observed.
To conduct this study, the authors used the Twitter API for accessing tweets and web-scraped fact-checking websites to verify whether news was false or not.
soup.find and soup.find_all are the functions you will use most often.
Regular expressions!