from dsc80_utils import *
📣 Announcements 📣¶
- Project 3 due tomorrow!
- Lab 7 out, due Monday, Nov 20.
📆 Agenda¶
- TF-IDF Example: State of the Union addresses 🎤.
- Modeling.
- Case study: Restaurant tips 🧑🍳.
- Regression in
sklearn
.
Example: State of the Union addresses 🎤¶
State of the Union addresses¶
The 2023 State of the Union address was on February 7th, 2023.
from IPython.display import YouTubeVideo
YouTubeVideo('gzcBTUvVp7M')
The data¶
from pathlib import Path
import re
sotu_txt = Path('data') / 'stateoftheunion1790-2023.txt'
sotu = sotu_txt.read_text()
speeches = sotu.split('\n***\n')[1:]
def extract_struct(speech):
L = speech.strip().split('\n', maxsplit=3)
L[3] = re.sub(r"[^A-Za-z' ]", ' ', L[3]).lower()
return dict(zip(['speech', 'president', 'date', 'contents'], L))
speeches_df = pd.DataFrame(list(map(extract_struct, speeches)))
speeches_df
speech | president | date | contents | |
---|---|---|---|---|
0 | State of the Union Address | George Washington | January 8, 1790 | fellow citizens of the senate and house of re... |
1 | State of the Union Address | George Washington | December 8, 1790 | fellow citizens of the senate and house of re... |
2 | State of the Union Address | George Washington | October 25, 1791 | fellow citizens of the senate and house of re... |
... | ... | ... | ... | ... |
230 | State of the Union Address | Joseph R. Biden Jr. | April 28, 2021 | thank you thank you thank you good to be b... |
231 | State of the Union Address | Joseph R. Biden Jr. | March 1, 2022 | madam speaker madam vice president and our ... |
232 | State of the Union Address | Joseph R. Biden Jr. | February 7, 2023 | mr speaker madam vice president our firs... |
233 rows × 4 columns
Finding the most important words in each speech¶
Here, a "document" is a speech. We have 233 documents.
speeches_df
speech | president | date | contents | |
---|---|---|---|---|
0 | State of the Union Address | George Washington | January 8, 1790 | fellow citizens of the senate and house of re... |
1 | State of the Union Address | George Washington | December 8, 1790 | fellow citizens of the senate and house of re... |
2 | State of the Union Address | George Washington | October 25, 1791 | fellow citizens of the senate and house of re... |
... | ... | ... | ... | ... |
230 | State of the Union Address | Joseph R. Biden Jr. | April 28, 2021 | thank you thank you thank you good to be b... |
231 | State of the Union Address | Joseph R. Biden Jr. | March 1, 2022 | madam speaker madam vice president and our ... |
232 | State of the Union Address | Joseph R. Biden Jr. | February 7, 2023 | mr speaker madam vice president our firs... |
233 rows × 4 columns
A rough sketch of what we'll compute:
for each word t:
for each speech d:
compute tfidf(t, d)
unique_words = speeches_df['contents'].str.split().explode().value_counts()
# Take the top 500 most common words for speed
unique_words = unique_words.iloc[:500].index
unique_words
Index(['the', 'of', 'to', 'and', 'in', 'a', 'that', 'for', 'be', 'our', ... 'desire', 'call', 'submitted', 'increasing', 'months', 'point', 'trust', 'throughout', 'set', 'object'], dtype='object', length=500)
💡 Pro-Tip: Using tqdm
¶
This code takes a while to run, so we'll use the tdqm
package to track its progress. (Install with pip install tqdm
if needed).
from tqdm.notebook import tqdm
tfidf_dict = {}
tf_denom = speeches_df['contents'].str.split().str.len()
# Wrap the sequence with `tqdm()` to display a progress bar
for word in tqdm(unique_words):
re_pat = fr' {word} ' # Imperfect pattern for speed.
tf = speeches_df['contents'].str.count(re_pat) / tf_denom
idf = np.log(len(speeches_df) / speeches_df['contents'].str.contains(re_pat).sum())
tfidf_dict[word] = tf * idf
0%| | 0/500 [00:00<?, ?it/s]
tfidf = pd.DataFrame(tfidf_dict)
tfidf.head()
the | of | to | and | ... | trust | throughout | set | object | |
---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 4.29e-04 | 0.00e+00 | 0.00e+00 | 2.04e-03 |
1 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.00e+00 | 0.00e+00 | 0.00e+00 | 1.06e-03 |
2 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 4.06e-04 | 0.00e+00 | 3.48e-04 | 6.44e-04 |
3 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 6.70e-04 | 2.17e-04 | 0.00e+00 | 7.09e-04 |
4 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 2.38e-04 | 4.62e-04 | 0.00e+00 | 3.77e-04 |
5 rows × 500 columns
Note that the TF-IDFs of many common words are all 0!
Summarizing speeches¶
By using idxmax
, we can find the word with the highest TF-IDF in each speech.
summaries = tfidf.idxmax(axis=1)
summaries
0 object 1 convention 2 provision ... 230 it's 231 tonight 232 it's Length: 233, dtype: object
What if we want to see the 5 words with the highest TF-IDFs, for each speech?
def five_largest(row):
return list(row.index[row.argsort()][-5:])
keywords = tfidf.apply(five_largest, axis=1)
keywords_df = pd.concat([
speeches_df['president'],
speeches_df['date'],
keywords
], axis=1)
Run the cell below to see every single row of keywords_df
.
display_df(keywords_df, rows=233)
president | date | 0 | |
---|---|---|---|
0 | George Washington | January 8, 1790 | [your, proper, regard, ought, object] |
1 | George Washington | December 8, 1790 | [case, established, object, commerce, convention] |
2 | George Washington | October 25, 1791 | [community, upon, lands, proper, provision] |
3 | George Washington | November 6, 1792 | [subject, upon, information, proper, provision] |
4 | George Washington | December 3, 1793 | [having, vessels, executive, shall, ought] |
5 | George Washington | November 19, 1794 | [too, army, let, ought, constitution] |
6 | George Washington | December 8, 1795 | [army, prevent, object, provision, treaty] |
7 | George Washington | December 7, 1796 | [republic, treaty, britain, ought, object] |
8 | John Adams | November 22, 1797 | [spain, british, claims, treaty, vessels] |
9 | John Adams | December 8, 1798 | [st, minister, treaty, spain, commerce] |
10 | John Adams | December 3, 1799 | [civil, period, british, minister, treaty] |
11 | John Adams | November 11, 1800 | [experience, protection, navy, commerce, ought] |
12 | Thomas Jefferson | December 8, 1801 | [consideration, shall, object, vessels, subject] |
13 | Thomas Jefferson | December 15, 1802 | [shall, debt, naval, duties, vessels] |
14 | Thomas Jefferson | October 17, 1803 | [debt, vessels, sum, millions, friendly] |
15 | Thomas Jefferson | November 8, 1804 | [received, convention, having, due, friendly] |
16 | Thomas Jefferson | December 3, 1805 | [families, convention, sum, millions, vessels] |
17 | Thomas Jefferson | December 2, 1806 | [due, consideration, millions, shall, spain] |
18 | Thomas Jefferson | October 27, 1807 | [whether, army, british, vessels, shall] |
19 | Thomas Jefferson | November 8, 1808 | [thus, british, millions, commerce, her] |
20 | James Madison | November 29, 1809 | [cases, having, due, british, minister] |
21 | James Madison | December 5, 1810 | [provisions, view, minister, commerce, british] |
22 | James Madison | November 5, 1811 | [britain, provisions, commerce, minister, brit... |
23 | James Madison | November 4, 1812 | [nor, subject, provisions, britain, british] |
24 | James Madison | December 7, 1813 | [number, having, naval, britain, british] |
25 | James Madison | September 20, 1814 | [naval, vessels, britain, his, british] |
26 | James Madison | December 5, 1815 | [debt, treasury, millions, establishment, sum] |
27 | James Madison | December 3, 1816 | [constitution, annual, sum, treasury, british] |
28 | James Monroe | December 12, 1817 | [improvement, territory, indian, millions, lands] |
29 | James Monroe | November 16, 1818 | [minister, object, territory, her, spain] |
30 | James Monroe | December 7, 1819 | [parties, friendly, minister, treaty, spain] |
31 | James Monroe | November 14, 1820 | [amount, minister, extent, vessels, spain] |
32 | James Monroe | December 3, 1821 | [powers, duties, revenue, spain, vessels] |
33 | James Monroe | December 3, 1822 | [object, proper, vessels, spain, convention] |
34 | James Monroe | December 2, 1823 | [th, department, object, minister, spain] |
35 | James Monroe | December 7, 1824 | [spain, governments, convention, parties, object] |
36 | John Quincy Adams | December 6, 1825 | [officers, commerce, condition, upon, improvem... |
37 | John Quincy Adams | December 5, 1826 | [commercial, upon, vessels, british, duties] |
38 | John Quincy Adams | December 4, 1827 | [lands, british, receipts, upon, th] |
39 | John Quincy Adams | December 2, 1828 | [duties, revenue, upon, commercial, britain] |
40 | Andrew Jackson | December 8, 1829 | [attention, subject, her, upon, duties] |
41 | Andrew Jackson | December 6, 1830 | [general, subject, character, vessels, upon] |
42 | Andrew Jackson | December 6, 1831 | [indian, commerce, claims, treaty, minister] |
43 | Andrew Jackson | December 4, 1832 | [general, subject, duties, lands, commerce] |
44 | Andrew Jackson | December 3, 1833 | [treasury, convention, minister, spain, duties] |
45 | Andrew Jackson | December 1, 1834 | [bill, treaty, minister, claims, upon] |
46 | Andrew Jackson | December 7, 1835 | [treaty, upon, claims, subject, minister] |
47 | Andrew Jackson | December 5, 1836 | [upon, treasury, duties, revenue, banks] |
48 | Martin van Buren | December 5, 1837 | [price, subject, upon, banks, lands] |
49 | Martin van Buren | December 3, 1838 | [subject, upon, indian, banks, court] |
50 | Martin van Buren | December 2, 1839 | [duties, treasury, extent, institutions, banks] |
51 | Martin van Buren | December 5, 1840 | [general, revenue, upon, extent, having] |
52 | John Tyler | December 7, 1841 | [banks, britain, amount, duties, treasury] |
53 | John Tyler | December 6, 1842 | [claims, minister, thus, amount, treasury] |
54 | John Tyler | December 6, 1843 | [treasury, british, her, minister, mexico] |
55 | John Tyler | December 3, 1844 | [minister, upon, treaty, her, mexico] |
56 | James Polk | December 2, 1845 | [british, convention, territory, duties, mexico] |
57 | James Polk | December 8, 1846 | [army, territory, minister, her, mexico] |
58 | James Polk | December 7, 1847 | [amount, treaty, her, army, mexico] |
59 | James Polk | December 5, 1848 | [tariff, upon, bill, constitution, mexico] |
60 | Zachary Taylor | December 4, 1849 | [territory, treaty, recommend, minister, mexico] |
61 | Millard Fillmore | December 2, 1850 | [recommend, claims, upon, mexico, duties] |
62 | Millard Fillmore | December 2, 1851 | [department, annual, fiscal, subject, mexico] |
63 | Millard Fillmore | December 6, 1852 | [duties, navy, mexico, subject, her] |
64 | Franklin Pierce | December 5, 1853 | [commercial, regard, upon, construction, subject] |
65 | Franklin Pierce | December 4, 1854 | [character, duties, naval, minister, property] |
66 | Franklin Pierce | December 31, 1855 | [constitution, british, territory, convention,... |
67 | Franklin Pierce | December 2, 1856 | [institutions, property, condition, thus, terr... |
68 | James Buchanan | December 8, 1857 | [treaty, constitution, territory, convention, ... |
69 | James Buchanan | December 6, 1858 | [june, mexico, minister, constitution, territory] |
70 | James Buchanan | December 19, 1859 | [minister, th, fiscal, mexico, june] |
71 | James Buchanan | December 3, 1860 | [minister, duties, claims, convention, constit... |
72 | Abraham Lincoln | December 3, 1861 | [army, claims, labor, capital, court] |
73 | Abraham Lincoln | December 1, 1862 | [upon, population, shall, per, sum] |
74 | Abraham Lincoln | December 8, 1863 | [upon, receipts, subject, navy, naval] |
75 | Abraham Lincoln | December 6, 1864 | [condition, secretary, naval, treasury, navy] |
76 | Andrew Johnson | December 4, 1865 | [form, commerce, powers, general, constitution] |
77 | Andrew Johnson | December 3, 1866 | [thus, june, constitution, mexico, condition] |
78 | Andrew Johnson | December 3, 1867 | [june, value, department, upon, constitution] |
79 | Andrew Johnson | December 9, 1868 | [millions, amount, expenditures, june, per] |
80 | Ulysses S. Grant | December 6, 1869 | [subject, upon, receipts, per, spain] |
81 | Ulysses S. Grant | December 5, 1870 | [her, convention, vessels, spain, british] |
82 | Ulysses S. Grant | December 4, 1871 | [object, powers, treaty, desire, recommend] |
83 | Ulysses S. Grant | December 2, 1872 | [territory, line, her, britain, treaty] |
84 | Ulysses S. Grant | December 1, 1873 | [consideration, banks, subject, amount, claims] |
85 | Ulysses S. Grant | December 7, 1874 | [duties, upon, attention, claims, convention] |
86 | Ulysses S. Grant | December 7, 1875 | [parties, territory, court, spain, claims] |
87 | Ulysses S. Grant | December 5, 1876 | [subject, court, per, commission, claims] |
88 | Rutherford B. Hayes | December 3, 1877 | [upon, sum, fiscal, commercial, value] |
89 | Rutherford B. Hayes | December 2, 1878 | [per, secretary, fiscal, june, indian] |
90 | Rutherford B. Hayes | December 1, 1879 | [subject, territory, june, commission, indian] |
91 | Rutherford B. Hayes | December 6, 1880 | [subject, office, relations, attention, commer... |
92 | Chester A. Arthur | December 6, 1881 | [spain, international, british, relations, fri... |
93 | Chester A. Arthur | December 4, 1882 | [territory, establishment, mexico, internation... |
94 | Chester A. Arthur | December 4, 1883 | [total, convention, mexico, commission, treaty] |
95 | Chester A. Arthur | December 1, 1884 | [treaty, territory, commercial, secretary, ves... |
96 | Grover Cleveland | December 8, 1885 | [duties, vessels, treaty, condition, upon] |
97 | Grover Cleveland | December 6, 1886 | [mexico, claims, subject, convention, fiscal] |
98 | Grover Cleveland | December 6, 1887 | [condition, sum, thus, price, tariff] |
99 | Grover Cleveland | December 3, 1888 | [secretary, treaty, upon, per, june] |
100 | Benjamin Harrison | December 3, 1889 | [general, commission, indian, upon, lands] |
101 | Benjamin Harrison | December 1, 1890 | [receipts, subject, upon, per, tariff] |
102 | Benjamin Harrison | December 9, 1891 | [court, tariff, indian, upon, per] |
103 | Benjamin Harrison | December 6, 1892 | [tariff, secretary, upon, value, per] |
104 | William McKinley | December 6, 1897 | [conditions, upon, international, territory, s... |
105 | William McKinley | December 5, 1898 | [navy, commission, naval, june, spain] |
106 | William McKinley | December 5, 1899 | [treaty, officers, commission, international, ... |
107 | William McKinley | December 3, 1900 | [settlement, civil, shall, convention, commiss... |
108 | Theodore Roosevelt | December 3, 1901 | [army, commercial, conditions, navy, man] |
109 | Theodore Roosevelt | December 2, 1902 | [upon, man, navy, conditions, tariff] |
110 | Theodore Roosevelt | December 7, 1903 | [june, lands, territory, property, treaty] |
111 | Theodore Roosevelt | December 6, 1904 | [cases, conditions, indian, labor, man] |
112 | Theodore Roosevelt | December 5, 1905 | [upon, conditions, commission, cannot, man] |
113 | Theodore Roosevelt | December 3, 1906 | [upon, navy, tax, court, man] |
114 | Theodore Roosevelt | December 3, 1907 | [conditions, navy, upon, army, man] |
115 | Theodore Roosevelt | December 8, 1908 | [man, officers, labor, control, banks] |
116 | William H. Taft | December 7, 1909 | [convention, banks, court, department, tariff] |
117 | William H. Taft | December 6, 1910 | [department, court, commercial, international,... |
118 | William H. Taft | December 5, 1911 | [mexico, department, per, tariff, court] |
119 | William H. Taft | December 3, 1912 | [tariff, upon, army, per, department] |
120 | Woodrow Wilson | December 2, 1913 | [how, shall, upon, mexico, ought] |
121 | Woodrow Wilson | December 8, 1914 | [shall, convention, ought, matter, upon] |
122 | Woodrow Wilson | December 7, 1915 | [her, navy, millions, economic, cannot] |
123 | Woodrow Wilson | December 5, 1916 | [commerce, shall, upon, commission, bill] |
124 | Woodrow Wilson | December 4, 1917 | [purpose, her, know, settlement, shall] |
125 | Woodrow Wilson | December 2, 1918 | [shall, go, men, upon, back] |
126 | Woodrow Wilson | December 2, 1919 | [economic, her, budget, labor, conditions] |
127 | Woodrow Wilson | December 7, 1920 | [expenditures, receipts, treasury, budget, upon] |
128 | Warren Harding | December 6, 1921 | [capital, ought, problems, conditions, tariff] |
129 | Warren Harding | December 8, 1922 | [responsibility, republic, problems, ought, per] |
130 | Calvin Coolidge | December 6, 1923 | [conditions, production, commission, ought, co... |
131 | Calvin Coolidge | December 3, 1924 | [navy, international, desire, economic, court] |
132 | Calvin Coolidge | December 8, 1925 | [international, budget, economic, ought, court] |
133 | Calvin Coolidge | December 7, 1926 | [tax, federal, reduction, tariff, ought] |
134 | Calvin Coolidge | December 6, 1927 | [construction, banks, per, program, property] |
135 | Calvin Coolidge | December 4, 1928 | [federal, department, production, program, per] |
136 | Herbert Hoover | December 3, 1929 | [commission, federal, construction, tariff, per] |
137 | Herbert Hoover | December 2, 1930 | [about, budget, economic, per, construction] |
138 | Herbert Hoover | December 8, 1931 | [upon, construction, federal, economic, banks] |
139 | Herbert Hoover | December 6, 1932 | [health, june, value, economic, banks] |
140 | Franklin D. Roosevelt | January 3, 1934 | [labor, permanent, problems, cannot, banks] |
141 | Franklin D. Roosevelt | January 4, 1935 | [private, work, local, program, cannot] |
142 | Franklin D. Roosevelt | January 3, 1936 | [income, shall, let, say, today] |
143 | Franklin D. Roosevelt | January 6, 1937 | [powers, convention, needs, help, problems] |
144 | Franklin D. Roosevelt | January 3, 1938 | [budget, business, economic, today, income] |
145 | Franklin D. Roosevelt | January 4, 1939 | [labor, cannot, capital, income, billion] |
146 | Franklin D. Roosevelt | January 3, 1940 | [world, domestic, cannot, economic, today] |
147 | Franklin D. Roosevelt | January 6, 1941 | [freedom, problems, cannot, program, today] |
148 | Franklin D. Roosevelt | January 6, 1942 | [him, today, know, forces, production] |
149 | Franklin D. Roosevelt | January 7, 1943 | [pacific, get, cannot, americans, production] |
150 | Franklin D. Roosevelt | January 11, 1944 | [individual, total, know, economic, cannot] |
151 | Franklin D. Roosevelt | January 6, 1945 | [cannot, production, army, forces, jobs] |
152 | Harry S. Truman | January 21, 1946 | [fiscal, program, billion, million, dollars] |
153 | Harry S. Truman | January 6, 1947 | [commission, budget, economic, labor, program] |
154 | Harry S. Truman | January 7, 1948 | [tax, billion, today, program, economic] |
155 | Harry S. Truman | January 5, 1949 | [economic, price, program, cannot, production] |
156 | Harry S. Truman | January 4, 1950 | [income, today, program, programs, economic] |
157 | Harry S. Truman | January 8, 1951 | [help, program, production, strength, economic] |
158 | Harry S. Truman | January 9, 1952 | [defense, working, program, help, production] |
159 | Harry S. Truman | January 7, 1953 | [republic, free, cannot, world, economic] |
160 | Dwight D. Eisenhower | February 2, 1953 | [federal, labor, budget, economic, programs] |
161 | Dwight D. Eisenhower | January 7, 1954 | [federal, programs, economic, budget, program] |
162 | Dwight D. Eisenhower | January 6, 1955 | [problems, federal, economic, programs, program] |
163 | Dwight D. Eisenhower | January 5, 1956 | [billion, federal, problems, economic, program] |
164 | Dwight D. Eisenhower | January 10, 1957 | [cannot, programs, human, program, economic] |
165 | Dwight D. Eisenhower | January 9, 1958 | [program, strength, today, programs, economic] |
166 | Dwight D. Eisenhower | January 9, 1959 | [growth, help, billion, programs, economic] |
167 | Dwight D. Eisenhower | January 7, 1960 | [freedom, cannot, today, economic, help] |
168 | Dwight D. Eisenhower | January 12, 1961 | [million, percent, billion, program, programs] |
169 | John F. Kennedy | January 30, 1961 | [budget, programs, problems, economic, program] |
170 | John F. Kennedy | January 11, 1962 | [billion, help, program, jobs, cannot] |
171 | John F. Kennedy | January 14, 1963 | [help, cannot, tax, percent, billion] |
172 | Lyndon B. Johnson | January 8, 1964 | [help, billion, americans, budget, million] |
173 | Lyndon B. Johnson | January 4, 1965 | [americans, man, programs, tonight, help] |
174 | Lyndon B. Johnson | January 12, 1966 | [program, percent, help, billion, tonight] |
175 | Lyndon B. Johnson | January 10, 1967 | [programs, americans, billion, tonight, percent] |
176 | Lyndon B. Johnson | January 17, 1968 | [programs, million, budget, tonight, billion] |
177 | Lyndon B. Johnson | January 14, 1969 | [americans, program, billion, budget, tonight] |
178 | Richard Nixon | January 22, 1970 | [billion, percent, america, today, programs] |
179 | Richard Nixon | January 22, 1971 | [federal, americans, budget, tonight, let] |
180 | Richard Nixon | January 20, 1972 | [america, program, programs, today, help] |
181 | Richard Nixon | February 2, 1973 | [economic, help, americans, working, programs] |
182 | Richard Nixon | January 30, 1974 | [program, americans, today, energy, tonight] |
183 | Gerald R. Ford | January 15, 1975 | [program, percent, billion, programs, energy] |
184 | Gerald R. Ford | January 19, 1976 | [federal, americans, budget, jobs, programs] |
185 | Gerald R. Ford | January 12, 1977 | [programs, today, percent, jobs, energy] |
186 | Jimmy Carter | January 19, 1978 | [cannot, economic, tonight, jobs, it's] |
187 | Jimmy Carter | January 25, 1979 | [cannot, budget, tonight, americans, it's] |
188 | Jimmy Carter | January 21, 1980 | [help, america, energy, tonight, it's] |
189 | Jimmy Carter | January 16, 1981 | [percent, economic, energy, program, programs] |
190 | Ronald Reagan | January 26, 1982 | [jobs, help, program, billion, programs] |
191 | Ronald Reagan | January 25, 1983 | [problems, programs, americans, economic, perc... |
192 | Ronald Reagan | January 25, 1984 | [budget, help, americans, tonight, it's] |
193 | Ronald Reagan | February 6, 1985 | [help, tax, jobs, tonight, it's] |
194 | Ronald Reagan | February 4, 1986 | [america, cannot, it's, budget, tonight] |
195 | Ronald Reagan | January 27, 1987 | [percent, let, budget, tonight, it's] |
196 | Ronald Reagan | January 25, 1988 | [let, americans, it's, budget, tonight] |
197 | George H.W. Bush | February 9, 1989 | [help, ask, it's, budget, tonight] |
198 | George H.W. Bush | January 31, 1990 | [percent, budget, today, tonight, it's] |
199 | George H.W. Bush | January 29, 1991 | [jobs, budget, americans, know, tonight] |
200 | George H.W. Bush | January 28, 1992 | [know, get, tonight, help, it's] |
201 | William J. Clinton | February 17, 1993 | [tax, budget, percent, tonight, jobs] |
202 | William J. Clinton | January 25, 1994 | [americans, it's, health, get, jobs] |
203 | William J. Clinton | January 24, 1995 | [jobs, americans, get, tonight, it's] |
204 | William J. Clinton | January 23, 1996 | [tonight, families, working, americans, children] |
205 | William J. Clinton | February 4, 1997 | [america, children, budget, americans, tonight] |
206 | William J. Clinton | January 27, 1998 | [ask, americans, children, help, tonight] |
207 | William J. Clinton | January 19, 1999 | [children, budget, help, americans, tonight] |
208 | William J. Clinton | January 27, 2000 | [families, help, children, americans, tonight] |
209 | George W. Bush | February 27, 2001 | [help, tax, percent, tonight, budget] |
210 | George W. Bush | September 20, 2001 | [freedom, america, ask, americans, tonight] |
211 | George W. Bush | January 29, 2002 | [americans, budget, tonight, america, jobs] |
212 | George W. Bush | January 28, 2003 | [america, help, million, americans, tonight] |
213 | George W. Bush | January 20, 2004 | [children, america, americans, help, tonight] |
214 | George W. Bush | February 2, 2005 | [freedom, tonight, help, social, americans] |
215 | George W. Bush | January 31, 2006 | [reform, jobs, americans, america, tonight] |
216 | George W. Bush | January 23, 2007 | [children, health, americans, tonight, help] |
217 | George W. Bush | January 29, 2008 | [america, americans, trust, tonight, help] |
218 | Barack Obama | February 24, 2009 | [know, budget, jobs, tonight, it's] |
219 | Barack Obama | January 27, 2010 | [get, tonight, americans, jobs, it's] |
220 | Barack Obama | January 25, 2011 | [percent, get, tonight, jobs, it's] |
221 | Barack Obama | January 24, 2012 | [americans, tonight, get, it's, jobs] |
222 | Barack Obama | February 12, 2013 | [families, it's, get, tonight, jobs] |
223 | Barack Obama | January 28, 2014 | [get, tonight, help, it's, jobs] |
224 | Barack Obama | January 20, 2015 | [families, americans, tonight, jobs, it's] |
225 | Barack Obama | January 12, 2016 | [tonight, jobs, americans, get, it's] |
226 | Donald J. Trump | February 27, 2017 | [america, jobs, americans, it's, tonight] |
227 | Donald J. Trump | January 30, 2018 | [tax, get, it's, americans, tonight] |
228 | Donald J. Trump | February 5, 2019 | [get, jobs, americans, it's, tonight] |
229 | Donald J. Trump | February 4, 2020 | [jobs, it's, americans, percent, tonight] |
230 | Joseph R. Biden Jr. | April 28, 2021 | [get, americans, percent, jobs, it's] |
231 | Joseph R. Biden Jr. | March 1, 2022 | [let, jobs, americans, get, tonight] |
232 | Joseph R. Biden Jr. | February 7, 2023 | [down, percent, jobs, tonight, it's] |
Aside: What if we remove the $\log$ from $\text{idf}(t)$?¶
Let's try it and see what happens.
tfidf_nl_dict = {}
tf_denom = speeches_df['contents'].str.split().str.len()
for word in tqdm(unique_words):
re_pat = fr' {word} ' # Imperfect pattern for speed.
tf = speeches_df['contents'].str.count(re_pat) / tf_denom
idf_nl = len(speeches_df) / speeches_df['contents'].str.contains(re_pat).sum()
tfidf_nl_dict[word] = tf * idf_nl
0%| | 0/500 [00:00<?, ?it/s]
tfidf_nl = pd.DataFrame(tfidf_nl_dict)
tfidf_nl.head()
the | of | to | and | ... | trust | throughout | set | object | |
---|---|---|---|---|---|---|---|---|---|
0 | 0.09 | 0.06 | 0.05 | 0.04 | ... | 1.47e-03 | 0.00e+00 | 0.00e+00 | 5.78e-03 |
1 | 0.09 | 0.06 | 0.03 | 0.03 | ... | 0.00e+00 | 0.00e+00 | 0.00e+00 | 2.99e-03 |
2 | 0.11 | 0.07 | 0.04 | 0.03 | ... | 1.39e-03 | 0.00e+00 | 1.30e-03 | 1.82e-03 |
3 | 0.09 | 0.07 | 0.04 | 0.03 | ... | 2.29e-03 | 7.53e-04 | 0.00e+00 | 2.01e-03 |
4 | 0.09 | 0.07 | 0.04 | 0.02 | ... | 8.12e-04 | 1.60e-03 | 0.00e+00 | 1.07e-03 |
5 rows × 500 columns
keywords_nl = tfidf_nl.apply(five_largest, axis=1)
keywords_nl_df = pd.concat([
speeches_df['president'],
speeches_df['date'],
keywords_nl
], axis=1)
keywords_nl_df
president | date | 0 | |
---|---|---|---|
0 | George Washington | January 8, 1790 | [a, and, to, of, the] |
1 | George Washington | December 8, 1790 | [in, and, to, of, the] |
2 | George Washington | October 25, 1791 | [a, and, to, of, the] |
... | ... | ... | ... |
230 | Joseph R. Biden Jr. | April 28, 2021 | [of, it's, and, to, the] |
231 | Joseph R. Biden Jr. | March 1, 2022 | [we, of, to, and, the] |
232 | Joseph R. Biden Jr. | February 7, 2023 | [a, of, and, to, the] |
233 rows × 3 columns
The role of $\log$ in $\text{idf}(t)$¶
$$ \begin{align*}\text{tfidf}(t, d) &= \text{tf}(t, d) \cdot \text{idf}(t) \\\ &= \frac{\text{# of occurrences of $t$ in $d$}}{\text{total # of words in $d$}} \cdot \log \left(\frac{\text{total # of documents}}{\text{# of documents in which $t$ appears}} \right) \end{align*} $$- Remember, for any positive input $x$, $\log(x)$ is (much) smaller than $x$.
- In $\text{idf}(t)$, the $\log$ "dampens" the impact of the ratio $\frac{\text{# documents}}{\text{# documents with $t$}}$.
- If a word is very common, the ratio will be close to 1. The log of the ratio will be close to 0.
(1000 / 999)
1.001001001001001
np.log(1000 / 999)
0.001000500333583622
- If a word is very common (e.g. 'the'), removing the log multiplies the statistic by a large factor.
- If a word is very rare, the ratio will be very large. However, for instance, a word being seen in 2 out of 50 documents is not very different than being seen in 2 out of 500 documents (it is very rare in both cases), and so $\text{idf}(t)$ should be similar in both cases.
(50 / 2)
25.0
(500 / 2)
250.0
np.log(50 / 2)
3.2188758248682006
np.log(500 / 2)
5.521460917862246
Modeling¶
Reflection¶
So far this quarter, we've learned how to:
- Extract information from tabular data using
pandas
and regular expressions. - Clean data so that it best represents a data generating process.
- Missingness analyses and imputation.
- Collect data from the internet through scraping and APIs, and parse it using BeautifulSoup.
- Perform exploratory data analysis through aggregation, visualization, and the computation of summary statistics like TF-IDF.
- Infer about the relationships between samples and populations through hypothesis and permutation testing.
- Now, let's make predictions.
Modeling¶
Data generating process: A real-world phenomena that we are interested in studying.
- Example: Every year, city employees are hired and fired, earn salaries and benefits, etc.
- Unless we work for the city, we can't observe this process directly.
Model: A theory about the data generating process.
- Example: If an employee is $X$ years older than average, then they will make $100,000 in salary.
Fit Model: A model that is learned from a particular set of observations, i.e. training data.
- Example: If an employee is 5 years older than average, they will make $100,000 in salary.
- How is this estimate determined? What makes it "good"?
Goals of modeling¶
To make accurate predictions regarding unseen data drawn from the data generating process.
- Given this dataset of past UCSD data science students' salaries, can we predict your future salary? (regression)
- Given this dataset of images, can we predict if this new image is of a dog, cat, or zebra? (classification)
To make inferences about the structure of the data generating process, i.e. to understand complex phenomena.
- Is there a linear relationship between the heights of children and the heights of their biological mothers?
- The weights of smoking and non-smoking mothers' babies babies in my sample are different – how confident am I that this difference exists in the population?
Of the two focuses of models, we will focus on prediction.
In the above taxonomy, we will focus on supervised learning.
Features¶
A feature is a measurable property of a phenomenon being observed.
- Other terms for "feature" include "(explanatory) variable" and "attribute".
- Typically, features are the inputs to models.
In DataFrames, features typically correspond to columns, while rows typically correspond to different individuals.
- There are two types of features:
- Features that come as part of a dataset, e.g. weight and height.
- Features that we create, e.g. $\text{BMI} = \frac{\text{weight (kg)}}{\text{[height (m)]}^2}$.
- Example: TF-IDF creates features that summarize documents!
Example: Restaurant tips 🧑🍳¶
About the data¶
What features does the dataset contain?
# The dataset is built into plotly (and seaborn)!
tips = px.data.tips()
tips
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
... | ... | ... | ... | ... | ... | ... | ... |
241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |
243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
244 rows × 7 columns
Predicting tips¶
Goal: Given various information about a table at a restaurant, we want to predict the tip that a server will earn.
Why might a server be interested in doing this?
- To determine which tables are likely to tip the most (inference).
- To predict earnings over the next month (prediction).
Exploratory data analysis (EDA)¶
The most natural feature to look at first is
'total_bill'
.As such, we should explore the relationship between
'total_bill'
and'tip'
, as well as the distributions of both columns individually.As we do so, try to describe each distribution in words.
Visualizing distributions¶
tips.plot(kind='scatter',
x='total_bill', y='tip',
title='Tip vs. Total Bill')