import pandas as pd
import numpy as np
import re
pd.options.plotting.backend = 'plotly'
import util
operation | example | matches ✅ | does not match ❌
---|---|---|---
escape character | `ucsd\.edu` | `'ucsd.edu'` | `'ucsd!edu'`
beginning of line | `^ark` | `'ark two'`, `'ark o ark'` | `'dark'`
end of line | `ark$` | `'dark'`, `'ark o ark'` | `'ark two'`
zero or one | `cat?` | `'ca'`, `'cat'` | `'cart'` (matches `'ca'` only)
built-in character classes* | `\w+`, `\d+` | `'billy'`, `'231231'` | `'this person'`, `'858 people'`
character class negation | `[^a-z]+` | `'KINGTRITON551'`, `'1721$$'` | `'porch'`, `'billy.edu'`
*Note: in Python's implementation of regex:

- `\d` refers to digits.
- `\w` refers to alphanumeric characters (`[A-Za-z0-9_]`).
- `\s` refers to whitespace.
- `\b` is a word boundary.

What kinds of strings does `\d{3} \d{3}-\d{4}` match? What does `\bcat\b` match? Does it find a match in `'my cat is hungry'`? What about `'concatenate'`?

Write a regular expression that matches any string that:

- is between 5 and 10 characters long, and
- only contains vowels (including `'Y'` and `'y'`), periods, and spaces.

Examples include `'yoo.ee.IOU'` and `'AI.I oey'`.
^[aeiouyAEIOUY. ]{5,10}$
Note that within a character class (i.e. inside `[...]`), special characters generally do not need to be escaped, which is why the `.` above matches a literal period.
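A quick sanity check of the pattern above, as a minimal sketch (it uses the `re` module, introduced in the next section; `'billy'` is a made-up negative example):

```python
import re

pattern = r'^[aeiouyAEIOUY. ]{5,10}$'
# The two examples from above should match; 'billy' should not.
[bool(re.fullmatch(pattern, t)) for t in ['yoo.ee.IOU', 'AI.I oey', 'billy']]
# [True, True, False]
```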
### `re` in Python

The `re` package is built into Python. It allows us to use regular expressions to find, extract, and replace strings.
import re
`re.search` takes in a string `regex` and a string `text`, and returns the location and substring corresponding to the first match of `regex` in `text`.
re.search('AB*A', 'here is a string for you: ABBBA. here is another: ABBBBBBBA')
<re.Match object; span=(26, 31), match='ABBBA'>
`re.findall` takes in a string `regex` and a string `text`, and returns a list of all matches of `regex` in `text`. You'll use this most often.
re.findall('AB*A', 'here is a string for you: ABBBA. here is another: ABBBBBBBA')
['ABBBA', 'ABBBBBBBA']
`re.sub` takes in a string `regex`, a string `repl`, and a string `text`, and replaces all matches of `regex` in `text` with `repl`.
re.sub('AB*A', 'billy', 'here is a string for you: ABBBA. here is another: ABBBBBBBA')
'here is a string for you: billy. here is another: billy'
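The replacement string can also refer back to capture groups with `\1`, `\2`, and so on. This is standard `re` behavior; the example below is an added illustration, not from the original notes:

```python
# Swap the two numbers around the slash using backreferences to the groups.
re.sub(r'(\d+)/(\d+)', r'\2/\1', '24/02')
# '02/24'
```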
When using regular expressions in Python, it's a good idea to use raw strings, denoted by an `r` before the quotes, e.g. `r'exp'`.
re.findall('\bcat\b', 'my cat is hungry')
[]
re.findall(r'\bcat\b', 'my cat is hungry')
['cat']
# Huh?
print('\bcat\b')
cat
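What's happening: in a regular string, `'\b'` is the backspace character (ASCII 8), not a word boundary. A small added check with `repr` makes this visible:

```python
print(repr('\bcat\b'))   # '\x08cat\x08' -- backspace characters, not word boundaries.
print(repr(r'\bcat\b'))  # '\\bcat\\b'   -- the raw string keeps literal backslashes.
```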
Use `(` and `)` to define a capture group within a pattern.

re.findall(r'\w+@(\w+)\.edu', 'my old email was billy@notucsd.edu, my new email is notbilly@ucsd.edu')
['notucsd', 'ucsd']
Note what happens if we remove the `(` and `)`!

re.findall(r'\w+@\w+\.edu', 'my old email was billy@notucsd.edu, my new email is notbilly@ucsd.edu')
['billy@notucsd.edu', 'notbilly@ucsd.edu']
In `re.findall`, all groups are treated as capturing groups.

# A regex that matches strings with two of the same vowel followed by 3 digits.
# We only want to capture the digits, but...
re.findall(r'(aa|ee|ii|oo|uu)(\d{3})', 'eeoo124')
[('oo', '124')]
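If we only want the digits, one option is a non-capturing group, written `(?:...)`, which groups without capturing. (This is standard `re` syntax, shown here as an added fix for the issue above.)

```python
# The vowel pair is grouped for the alternation, but only the digits are captured.
re.findall(r'(?:aa|ee|ii|oo|uu)(\d{3})', 'eeoo124')
# ['124']
```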
Web servers typically record every request made of them in the "logs".
s = '''132.249.20.188 - - [24/Feb/2023:12:26:15 -0800] "GET /my/home/ HTTP/1.1" 200 2585'''
Let's use our new regex syntax (including capturing groups) to extract the day, month, year, and time from the log string `s`.

exp = r'\[(.+)\/(.+)\/(.+):(.+):(.+):(.+) .+\]'
re.findall(exp, s)
[('24', 'Feb', '2023', '12', '26', '15')]
While the above regex works, it isn't very specific: it also matches incorrectly formatted log strings.
other_s = '[adr/jduy/wffsdffs:r4s4:4wsgdfd:asdf 7]'
re.findall(exp, other_s)
[('adr', 'jduy', 'wffsdffs', 'r4s4', '4wsgdfd', 'asdf')]
`.*` matches every possible string, but we don't use it very often. A more specific pattern for our log string is:

`\[(\d{2})\/([A-Z]{1}[a-z]{2})\/(\d{4}):(\d{2}):(\d{2}):(\d{2}) -\d{4}\]`

In this pattern:

- `\d{2}` matches any 2-digit number.
- `[A-Z]{1}` matches any single occurrence of any uppercase letter.
- `[a-z]{2}` matches any 2 consecutive occurrences of lowercase letters.
- Special characters (`[`, `]`, `/`) need to be escaped with `\`.

s
'132.249.20.188 - - [24/Feb/2023:12:26:15 -0800] "GET /my/home/ HTTP/1.1" 200 2585'
new_exp = r'\[(\d{2})\/([A-Z]{1}[a-z]{2})\/(\d{4}):(\d{2}):(\d{2}):(\d{2}) -\d{4}\]'
re.findall(new_exp, s)
[('24', 'Feb', '2023', '12', '26', '15')]
A benefit of `new_exp` over `exp` is that it doesn't capture anything when the string doesn't follow the format we specified.
other_s
'[adr/jduy/wffsdffs:r4s4:4wsgdfd:asdf 7]'
re.findall(new_exp, other_s)
[]
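As a further refinement the notes don't cover, named capture groups, written `(?P<name>...)`, label each extracted field. A sketch of the same pattern with names:

```python
# Same structure as new_exp, but each group is named; groupdict() labels the fields.
named_exp = (r'\[(?P<day>\d{2})/(?P<month>[A-Z][a-z]{2})/(?P<year>\d{4})'
             r':(?P<hour>\d{2}):(?P<minute>\d{2}):(?P<second>\d{2}) -\d{4}\]')
re.search(named_exp, s).groupdict()
# {'day': '24', 'month': 'Feb', 'year': '2023',
#  'hour': '12', 'minute': '26', 'second': '15'}
```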
Writing a regular expression is like writing a program.
Regular expressions are terrible at certain types of problems. Examples:

- The full regular expression that validates email addresses in Perl is enormous; it isn't reproduced here. See this article for more details.
- StackOverflow once crashed because of a regular expression! See this article for the details.

Suppose we'd like to predict the sentiment of a piece of text on a scale from 1 to 10. Tasks like that require turning text into numbers, so let's work through a concrete example: a dataset of San Diego employee salaries.
salaries = pd.read_csv('https://transcal.s3.amazonaws.com/public/export/san-diego-2021.csv')
util.anonymize_names(salaries)
salaries.head()
 | Employee Name | Job Title | Base Pay | Overtime Pay | Other Pay | Benefits | Total Pay | Pension Debt | Total Pay & Benefits | Year | Notes | Agency | Status |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Mara Xxxx | City Attorney | 218759.0 | 0.0 | -2560.00 | 108652.0 | 216199.0 | 427749.18 | 752600.18 | 2021 | NaN | San Diego | FT |
1 | Todd Xxxx | Mayor | 218759.0 | 0.0 | -81.00 | 95549.0 | 218678.0 | 427749.18 | 741976.18 | 2021 | NaN | San Diego | FT |
2 | Elizabeth Xxxx | Investment Officer | 259732.0 | 0.0 | -870.00 | 71438.0 | 258862.0 | 221041.09 | 551341.09 | 2021 | NaN | San Diego | FT |
3 | Terence Xxxx | Police Officer | 212837.0 | 0.0 | 39683.00 | 56569.0 | 252520.0 | 222375.06 | 531464.06 | 2021 | NaN | San Diego | FT |
4 | Andrea Xxxx | Independent Budget Analyst | 224312.0 | 0.0 | 59819.00 | 54213.0 | 284131.0 | 192126.79 | 530470.79 | 2021 | NaN | San Diego | FT |
We'd like to measure how similar two job titles are. Intuitively, `'Deputy Fire Chief'` and `'Fire Battalion Chief'` are more similar than `'Deputy Fire Chief'` and `'City Attorney'`.

jobtitles = salaries['Job Title']
jobtitles.head()
0 City Attorney 1 Mayor 2 Investment Officer 3 Police Officer 4 Independent Budget Analyst Name: Job Title, dtype: object
How many employees are in the dataset? How many unique job titles are there?
jobtitles.shape[0], jobtitles.nunique()
(12305, 588)
What are the most common job titles?
jobtitles.value_counts().iloc[:100]
Police Officer 2123 Fire Fighter Ii 331 Assistant Engineer - Civil 284 Grounds Maintenance Worker Ii 250 Fire Captain 248 ... Grounds Maintenance Manager 27 Electrician 27 Executive Assistant 26 Paralegal 26 Librarian Iv 25 Name: Job Title, Length: 100, dtype: int64
jobtitles.value_counts().iloc[:25].sort_values().plot(kind='barh')
Are there any missing job titles?
jobtitles.isna().sum()
2
There aren't many. To avoid having to deal with missing values later on, let's just drop the two missing job titles now.
jobtitles = jobtitles[jobtitles.notna()]
Remember, our goal is ultimately to count the number of shared words between job titles. But before we start counting shared words, we need to consider the following:

- Some job titles contain punctuation, like `'-'` and `'&'`, which may count as words when they shouldn't. `'Assistant - Manager'` and `'Assistant Manager'` should count as the same job title.
- Some job titles contain "glue" words, like `'to'` and `'the'`, which (we can argue) also shouldn't count as words. `'Assistant To The Manager'` and `'Assistant Manager'` should count as the same job title.

Let's address the above issues. The process of converting job titles so that they are always represented the same way is called canonicalization.
Are there job titles with unnecessary punctuation that we can remove?
To find out, we can write a regular expression that looks for characters other than letters, numbers, and spaces.
We can use regular expressions with the `.str` methods we learned earlier in the quarter just by setting `regex=True`.
# Uses character class negation
jobtitles.str.contains(r'[^A-Za-z0-9 ]', regex=True).sum()
845
jobtitles[jobtitles.str.contains(r'[^A-Za-z0-9 ]', regex=True)].head()
281 Park & Recreation Director 539 Associate Engineer - Mechanical 1023 Associate Engineer - Civil 1376 Associate Engineer - Traffic 1460 Budget/Legislative Analyst I Name: Job Title, dtype: object
It seems like we should replace these pieces of punctuation with a single space.
Are there job titles with "glue" words in the middle, such as `'Assistant to the Manager'`?

To figure out if any titles contain the word `'to'`, we can't just do the following, because it will evaluate to `True` for job titles that have `'to'` anywhere in them, even if not as a standalone word.
# Why are we converting to lowercase?
jobtitles.str.lower().str.contains('to').sum()
1541
jobtitles[jobtitles.str.lower().str.contains('to')]
0 City Attorney 10 Assistant Retirement Administrator 25 Department Director 26 Assistant City Attorney 27 Fire Prevention Inspector Ii ... 12162 Test Monitor Ii 12185 Word Processing Operator 12190 Deputy Director 12210 City Attorney Investigator 12267 Test Monitor Ii Name: Job Title, Length: 1541, dtype: object
Instead, we need to look for `'to'` surrounded by word boundaries.
jobtitles.str.lower().str.contains(r'\bto\b', regex=True).sum()
11
jobtitles[jobtitles.str.lower().str.contains(r'\bto\b', regex=True)]
664 Assistant To The Fire Chief 1403 Principal Assistant To City Attorney 2358 Assistant To The Director 4336 Confidential Secretary To Police Chief 4459 Assistant To The Director 5196 Confidential Secretary To Chief Operating Officer 5563 Confidential Secretary To City Attorney 5685 Assistant To The Director 7544 Confidential Secretary To Mayor 9627 Principal Assistant To City Attorney 12061 Assistant To The Director Name: Job Title, dtype: object
We can look for other filler words too, like `'the'` and `'for'`.
jobtitles[jobtitles.str.lower().str.contains(r'\bthe\b', regex=True)]
664 Assistant To The Fire Chief 2358 Assistant To The Director 4459 Assistant To The Director 5685 Assistant To The Director 12061 Assistant To The Director Name: Job Title, dtype: object
jobtitles[jobtitles.str.lower().str.contains(r'\bfor\b', regex=True)]
3676 Assistant For Community Outreach 4451 Assistant For Community Outreach 11010 Assistant For Community Outreach Name: Job Title, dtype: object
We should probably remove these "glue" words.
Let's put the above together and canonicalize job titles by:

- converting to lowercase,
- removing the "glue" words `'to'`, `'the'`, and `'for'`,
- replacing all non-alphanumeric characters with spaces, and
- collapsing runs of multiple spaces into one.

jobtitles = (
    jobtitles
    .str.lower()
    .str.replace(r'\bto\b|\bthe\b|\bfor\b', '', regex=True)
    .str.replace(r'[^A-Za-z0-9 ]', ' ', regex=True)
    .str.replace(r' +', ' ', regex=True)  # ' +' matches 1 or more occurrences of a space.
    .str.strip()                          # Removes leading/trailing spaces if present.
)
jobtitles.sample(10)
7755 paralegal 3775 police officer 11323 clerical assistant ii 9372 greenskeeper 7221 pesticide applicator 10655 assistant center director 7010 recycling specialist iii 11046 lifeguard i 8452 library assistant iii 10363 library assistant ii Name: Job Title, dtype: object
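For a single string, the same canonicalization can be written as a standalone function. This is a sketch mirroring the pipeline above; the helper name `canonicalize` is an addition here:

```python
def canonicalize(title):
    title = title.lower()                                 # Lowercase.
    title = re.sub(r'\bto\b|\bthe\b|\bfor\b', '', title)  # Drop glue words.
    title = re.sub(r'[^a-z0-9 ]', ' ', title)             # Punctuation to spaces.
    title = re.sub(r' +', ' ', title)                     # Collapse repeated spaces.
    return title.strip()

canonicalize('Assistant To The Manager')
# 'assistant manager'
```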
Another possible issue is that some job titles may have inconsistent representations of the same word (e.g. `'Asst.'` vs. `'Assistant'`).
jobtitles[jobtitles.str.contains('asst')].value_counts()
Series([], Name: Job Title, dtype: int64)
jobtitles[jobtitles.str.contains('assistant')].value_counts().head()
assistant engineer civil 284 library assistant i 127 library assistant ii 116 library assistant iii 107 clerical assistant ii 100 Name: Job Title, dtype: int64
The 2020 salaries dataset had several of these issues, but fortunately they appear to be fixed for us in the 2021 dataset (thanks, Transparent California).
Recall, our idea is to measure the similarity of two job titles by counting the number of shared words between the job titles. How do we actually do that, for all of the job titles we have?
Let's create a "counts" matrix, such that the entry in row `title` and column `word` is the number of occurrences of `word` in `title`. Such a matrix might look like:
 | senior | lecturer | teaching | professor | assistant | associate |
---|---|---|---|---|---|---|
senior lecturer | 1 | 1 | 0 | 0 | 0 | 0 |
assistant teaching professor | 0 | 0 | 1 | 1 | 1 | 0 |
associate professor | 0 | 0 | 0 | 1 | 0 | 1 |
senior assistant to the assistant professor | 1 | 0 | 0 | 1 | 2 | 0 |
First, we need to determine all words that are used across all job titles.
jobtitles.str.split()
0 [city, attorney] 1 [mayor] 2 [investment, officer] 3 [police, officer] 4 [independent, budget, analyst] ... 12300 [recreation, leader, i] 12301 [fire, fighter, ii] 12302 [fire, captain] 12303 [fleet, repair, supervisor] 12304 [fire, engineer] Name: Job Title, Length: 12303, dtype: object
all_words = jobtitles.str.split().sum()
all_words[:10]
['city', 'attorney', 'mayor', 'investment', 'officer', 'police', 'officer', 'independent', 'budget', 'analyst']
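An aside: calling `.sum()` on a Series of lists concatenates them pairwise, which can be slow on large data. An equivalent, typically faster approach (an added suggestion, not from the original notes) uses `.explode()`:

```python
# Flatten the Series of word lists into one long Series, then into a list.
all_words_alt = jobtitles.str.split().explode().tolist()
all_words_alt[:10]
# Same result as all_words[:10] above.
```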
Next, to determine the columns of our matrix, we need to find a list of all unique words used in titles. We can do this with `np.unique`, but `value_counts` shows us the distribution, which is interesting.
unique_words = pd.Series(all_words).value_counts()
unique_words.head(10)
officer 2343 ii 2305 police 2294 i 1449 assistant 1193 fire 1158 engineer 1032 civil 667 iii 625 technician 616 dtype: int64
len(unique_words)
327
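As a sanity check, the standard library's `collections.Counter` produces the same counts (this alternative is an addition, not part of the original notes):

```python
from collections import Counter

# Matches the top of unique_words above.
Counter(all_words).most_common(5)
# [('officer', 2343), ('ii', 2305), ('police', 2294), ('i', 1449), ('assistant', 1193)]
```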
Note that in `unique_words.index`, words are sorted by number of occurrences!
For each of the 327 unique words that are used in job titles, we can count the number of occurrences of the word in each job title.
For example:

- `'deputy fire chief'` contains the word `'deputy'` once, the word `'fire'` once, and the word `'chief'` once.
- `'assistant managers assistant'` contains the word `'assistant'` twice and the word `'managers'` once.

# Created using a dictionary to avoid a "DataFrame is highly fragmented" warning.
counts_dict = {}
for word in unique_words.index:
re_pat = fr'\b{word}\b'
counts_dict[word] = jobtitles.str.count(re_pat).astype(int).tolist()
counts_df = pd.DataFrame(counts_dict)
counts_df.head()
 | officer | ii | police | i | assistant | fire | engineer | civil | iii | technician | ... | estate | stores | assets | treasurer | risk | security | geologist | utilities | gardener | principle |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 327 columns
`counts_df` has one row for each of the 12303 employees with a non-missing job title, and one column for each unique word that is used in a job title.
counts_df.shape
(12303, 327)
To put into context what the numbers in `counts_df` mean, we can show the actual job title for each row.
counts_df = counts_df.set_index(jobtitles)
counts_df
Job Title | officer | ii | police | i | assistant | fire | engineer | civil | iii | technician | ... | estate | stores | assets | treasurer | risk | security | geologist | utilities | gardener | principle |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
city attorney | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
mayor | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
investment officer | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
police officer | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
independent budget analyst | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
recreation leader i | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
fire fighter ii | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
fire captain | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
fleet repair supervisor | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
fire engineer | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
12303 rows × 327 columns
The fourth row tells us that the fourth job title contains `'police'` once and `'officer'` once.
counts_df.head()
Job Title | officer | ii | police | i | assistant | fire | engineer | civil | iii | technician | ... | estate | stores | assets | treasurer | risk | security | geologist | utilities | gardener | principle |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
city attorney | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
mayor | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
investment officer | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
police officer | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
independent budget analyst | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 327 columns
The Series below describes the 20 most common words used in job titles, along with the number of times they appeared in all job titles (including repeats). We will call these words "top 20" words.
# Remember, the columns of counts_df are ordered by number of occurrences.
counts_df.iloc[:, :20].sum()
officer 2343 ii 2305 police 2294 i 1449 assistant 1193 fire 1158 engineer 1032 civil 667 iii 625 technician 616 senior 567 associate 558 analyst 527 worker 496 fighter 470 management 409 manager 393 operator 380 recreation 371 supervisor 362 dtype: int64
The Series below describes the number of top 20 words used in each job title.
counts_df.iloc[:, :20].sum(axis=1)
Job Title city attorney 0 mayor 0 investment officer 1 police officer 2 independent budget analyst 1 .. recreation leader i 2 fire fighter ii 3 fire captain 1 fleet repair supervisor 1 fire engineer 2 Length: 12303, dtype: int64
### Which job titles are most similar to `'deputy fire chief'`?

To answer this, we can use `counts_df`, which contains a row vector for each job title. To start, let's compare the row vectors for `'deputy fire chief'` and `'fire battalion chief'`.
dfc = counts_df.loc['deputy fire chief'].iloc[0]
dfc
officer 0 ii 0 police 0 i 0 assistant 0 .. security 0 geologist 0 utilities 0 gardener 0 principle 0 Name: deputy fire chief, Length: 327, dtype: int64
fbc = counts_df.loc['fire battalion chief'].iloc[0]
fbc
officer 0 ii 0 police 0 i 0 assistant 0 .. security 0 geologist 0 utilities 0 gardener 0 principle 0 Name: fire battalion chief, Length: 327, dtype: int64
We can stack these two vectors horizontally.
pair_counts = (
pd.concat([dfc, fbc], axis=1)
.sort_values(by=['deputy fire chief', 'fire battalion chief'], ascending=False)
.head(10)
.T
)
pair_counts
 | fire | chief | deputy | battalion | officer | ii | police | i | assistant | engineer |
---|---|---|---|---|---|---|---|---|---|---|
deputy fire chief | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
fire battalion chief | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
One way to measure how similar the above two vectors are is through their dot product.
np.sum(pair_counts.iloc[0] * pair_counts.iloc[1])
2
Here, since both vectors consist only of 1s and 0s, the dot product is equal to the number of shared words between the two job titles.
To find the job title that is most similar to `'deputy fire chief'`, we can compute the dot product of the `'deputy fire chief'` word vector with all other titles' word vectors, and find the title with the highest dot product.
counts_df.head()
Job Title | officer | ii | police | i | assistant | fire | engineer | civil | iii | technician | ... | estate | stores | assets | treasurer | risk | security | geologist | utilities | gardener | principle |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
city attorney | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
mayor | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
investment officer | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
police officer | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
independent budget analyst | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 327 columns
dfc
officer 0 ii 0 police 0 i 0 assistant 0 .. security 0 geologist 0 utilities 0 gardener 0 principle 0 Name: deputy fire chief, Length: 327, dtype: int64
To do so, we can apply `np.dot` to each row that doesn't correspond to `'deputy fire chief'`.
dots = (
counts_df[counts_df.index != 'deputy fire chief']
.apply(lambda s: np.dot(s, dfc), axis=1)
.sort_values(ascending=False)
)
dots
Job Title fire battalion chief 2 fire battalion chief 2 assistant fire chief 2 fire battalion chief 2 fire battalion chief 2 .. finance analyst iii 0 associate engineer traffic 0 supervising procurement contracting officer 0 sanitation driver ii 0 city attorney 0 Length: 12292, dtype: int64
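An equivalent, typically much faster approach (an added sketch, not how the notes compute it) is a single matrix-vector product, which computes every dot product at once:

```python
# counts_df @ dfc takes the dot product of every row of counts_df with dfc.
dots_fast = counts_df @ dfc
dots_fast = dots_fast[dots_fast.index != 'deputy fire chief'].sort_values(ascending=False)
dots_fast.head()
# Same values as dots above.
```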
The unique job titles that are most similar to `'deputy fire chief'` are given below.
np.unique(dots.index[dots == dots.max()])
array(['assistant deputy chief operating officer', 'assistant fire chief', 'deputy chief operating officer', 'fire battalion chief', 'fire chief'], dtype=object)
Note that they all share two words in common with `'deputy fire chief'`.
Note: To truly use the dot product as a measure of similarity, we should normalize by the lengths of the word vectors. More on this next time.
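As a preview of that normalization, here is a minimal sketch of cosine similarity, which divides the dot product by the product of the vectors' lengths (the helper `cosine_sim` is an addition here):

```python
def cosine_sim(u, v):
    # Dot product divided by the product of the vectors' Euclidean norms.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

cosine_sim(dfc, fbc)
# 2 / (sqrt(3) * sqrt(3)) = 0.666...
```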
To summarize:

- `pandas` `.str` methods can use regular expressions; just set `regex=True`.
- One way to turn documents, like `'deputy fire chief'`, into feature vectors is to count the number of occurrences of each word in the text, ignoring order. This is done using the bag of words model.