Andrew Harrison Way

Data Scientist


Project 4 - Will Power

March 09, 2020 by Andrew Way in Projects
“Shakespeare - the nearest thing in incarnation to the eye of God”
— Laurence Olivier

On Shakespeare:

William Shakespeare is deserving of our praise, methinks. A pioneer of prose and verse, his writing has endured the trials of time. The themes in his works have spoken to audiences for more than four centuries, and his influence is still felt today in literature, drama, and poetry.

What better writer could one pick to test the limits of Natural Language Processing?

I turned to the sonnets, a group of 154 structured poems authored beginning in 1599 and released in 1609, along with the long form poem A Lover’s Complaint. Using BeautifulSoup, I scraped all 154 poems, along with 154 contemporary translations just in case Shakespeare’s portfolio proved to be too dense or too subtle.
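For the curious, the scraping step looked roughly like this. The snippet below is only a sketch: it parses an inline HTML stand-in rather than the live site, whose actual URL, tags, and class names differ (in practice, requests.get(url).text would supply the markup).

```python
from bs4 import BeautifulSoup

# Inline stand-in for the page markup; the real site's structure differs.
html = """
<div class="sonnet">
  <h2>Sonnet 18</h2>
  <p>Shall I compare thee to a summer's day?<br/>
     Thou art more lovely and more temperate:</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
poems = {}
for block in soup.find_all("div", class_="sonnet"):
    title = block.find("h2").get_text(strip=True)
    # separator="\n" preserves the line breaks marked by <br/>
    text = block.find("p").get_text(separator="\n", strip=True)
    poems[title] = text

print(poems["Sonnet 18"].splitlines()[0])
```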

I was thrilled to find I wouldn’t need them.


On Sonnets:

The sonnet was not a new construction, even in Shakespeare's time. It had emerged nearly 400 years prior, in 13th-century Italy, where Sicilian poets, inspired by pastoral Provençal love poems, created their own poetic structures with intricate rhyme schemes.

In the following centuries, the sonnet would continue to evolve. By the time Shakespeare embarked on his poetic pilgrimage, a new English sonnet tradition had been in place for more than half a century, pioneered by Sir Thomas Wyatt and Henry Howard, Earl of Surrey. These English sonnets were extremely rigorous in rhyme, meter, and verse.

Rather than wilt within these poetic boundaries, Shakespeare's sonnets burst forth like blooms. Shakespeare leverages the inherent structure of the form rather than fighting against it. His body of sonnets is a triumphant expression of cunning composition.


On Structure:

Elizabethan sonnets have three strict conventions to uphold:

Verse

Sonnets are arranged in 4 stanzas or sections. Stanzas 1-3 are quatrains (four lines), and stanza 4 is a couplet (two lines).

Rhyme

The rhyme scheme is regular throughout the stanzas. In each quatrain, the rhyme follows an ABAB pattern, repeating two more times with different rhymes. The concluding couplet has a matching rhyme (AA).

Meter

The sonnets, along with a large portion of William Shakespeare's total corpus, are written in poetic meter. Will's meter is particularly rigorous, mostly constructed in iambic pentameter: 10 syllables per line, grouped into 5 two-syllable feet (iambs), each an unstressed syllable followed by a stressed one. If you can think of the musical theme to The Pink Panther, that is an iambic foot.

We can see all three parameters at work in one of Shakespeare’s most iconic sonnets:


Sonnet 18

Stanza 1 - Quatrain
  A  Shall I compare thee to a summer's day?
  B  Thou art more lovely and more temperate:
  A  Rough winds do shake the darling buds of May,
  B  And summer's lease hath all too short a date:

Stanza 2 - Quatrain
  C  Sometime too hot the eye of heaven shines,
  D  And often is his gold complexion dimm'd;
  C  And every fair from fair sometime declines,
  D  By chance or nature's changing course untrimm'd;

Stanza 3 - Quatrain
  E  But thy eternal summer shall not fade
  F  Nor lose possession of that fair thou owest;
  E  Nor shall Death brag thou wander'st in his shade,
  F  When in eternal lines to time thou growest:

Stanza 4 - Couplet
  G  So long as men can breathe or eyes can see,
  G  So long lives this and this gives life to thee.

Using NLP:

While our algorithms can group our terms into like categories, it is our job as readers and data scientists to interpret their meaning. To process the poems, I used a term frequency-inverse document frequency (TF-IDF) vectorizer and fed the resulting vectors into a non-negative matrix factorization (NMF) model. After iterating through a range of category counts, the best fit for the poems was a partition into four categories. Below, you will find the key words for each category, labeled with my interpretation of their shared subject.

Category 1: Love's Thrall

love, loves, true, hate, new, prove, dear, sweet, soul, sake

Category 2: Illusion and Perception

heart, eyes, eye, hearts, sight, face, looks, picture, thoughts, right

Category 3: Objective Beauty

beauty, praise, fair, muse, sweet, beautys, old, days, truth, worth

Category 4: The Passage of Time

time, world, life, times, make, earth, night, day, happy, away


Applying Categories:

With these categories from our model, we can go back and assess each poem on how well it falls into a given category. From this, we can begin to see more than just a collection of love poems, but an exploration of the deeper and subtler features of love.
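Scoring an individual poem is just a call to the fitted model's transform method. Again a sketch on a toy corpus; the labels are stand-ins for the hand-assigned category names above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Tiny stand-in corpus; the real model was fit on all 154 sonnets.
corpus = [
    "love sweet heart dear love true",
    "time day night life time away",
    "love soul sake prove love",
    "world earth time make night",
]
labels = ["Love", "Time"]

vec = TfidfVectorizer()
X = vec.fit_transform(corpus)
nmf = NMF(n_components=2, random_state=0)
nmf.fit(X)

# transform() returns each poem's weight on every category.
scores = nmf.transform(vec.transform(["love true and sweet"]))[0]
for label, score in zip(labels, scores):
    print(f"{label}: {score:.5f}")
```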

Sonnet 18:

Shall I compare thee to a summer’s day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer’s lease hath all too short a date;
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimm'd;
And every fair from fair sometime declines,
By chance or nature’s changing course untrimm'd;
But thy eternal summer shall not fade,
Nor lose possession of that fair thou ow’st;
Nor shall death brag thou wander’st in his shade,
When in eternal lines to time thou grow’st:
   So long as men can breathe or eyes can see,
   So long lives this, and this gives life to thee.

Category score:

  • Love 0.0

  • Illusion 0.02992

  • Beauty 0.18229

  • Time 0.07058

Though the scores look a bit odd at first glance, we can begin to see these categories at play when we return to Sonnet 18. Beauty is prominently featured in the mentions of loveliness and fairness, time is expressed in the passage of seasons from summer, and illusion makes an appearance at the end, in eyes that see and a love made perpetual.


Conclusion:

As a contemporary analyst, it is surprising to see how perceptibly clear-eyed Shakespeare was in his writing. For the common reader, the writing can be almost overwhelmingly dense. However, when we analyze the work programmatically and as a whole, we begin to see exactly how each piece is like a diamond in a crown: rigidly structured, intricately multifaceted, and joyously resplendent. And though he may not be around to wear it anymore, Bill still deserves that crown for his contributions to literature and to the English language as a whole.

March 09, 2020 /Andrew Way
NLP, Clustering, tf-idf, vectorizer, shakespeare
Projects

Project 3 - Rules of Attrition

February 18, 2020 by Andrew Way in Projects

“Nothing is really work unless you would rather be doing something else.”
— J. M. Barrie

The contents of this project are classified…

…by algorithm.

Project 3 at Metis saw our first foray into the world of classification algorithms. We were tasked with finding a dataset containing somewhere between 1,000 and 100,000 entries, 10 or more features, and a question we could answer using binary or multiclass partitioning algorithms.

A large part of my work history is in Human Resources and Recruiting, so I set out to find a dataset that could help me answer two questions: what is it that makes people quit their jobs, and what can we do to change that?


A Word on Attrition:

I think many business leaders underestimate the true cost of turnover. The reality is that at this moment in time, it is truly an employee’s market. Skilled workers have never been in such high demand, and that can be seen in trends in attrition.

In their annual study, The Work Institute found that in 2018, more than 27% of employees nationwide voluntarily left a job, accounting for more than 41 million job vacancies. Their survey data also concludes that 77% of these attritions are preventable.

The Society for Human Resource Management (SHRM) estimates that replacing an attrited employee can cost between 50% and 250% of their annual salary. They estimate that in 2018 alone, attrition costs for US employers topped $600 billion.

We’ll use some of these numbers later.


The Data:

For this project, I used a dataset called IBM HR Analytics Employee Attrition & Performance from Kaggle. This is a synthetic dataset from IBM’s data science team. There are 1470 individual employee entries and 35 features. On some light inspection, we find the following:

237 of 1470 entries are attrited (16%). As expected, there are trends that we see widely in HR on the whole. Attrited employees generally had:

  • fewer total working years, lower income, and lower job level

  • lower ratings for job satisfaction and job involvement

  • longer commutes and more frequent travel demands

None of these findings are exactly groundbreaking, but with this many features, I was sure there was something to uncover in this data.
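That light inspection takes only a few lines of pandas. The frame below is a tiny synthetic stand-in for the real 1470-row file; the column names mirror the dataset's style:

```python
import pandas as pd

# Synthetic stand-in for the IBM dataset (real file: 1470 rows, 35 columns).
df = pd.DataFrame({
    "Attrition":         ["Yes", "No", "No", "Yes", "No", "No"],
    "MonthlyIncome":     [2600, 5100, 4900, 2100, 6800, 5900],
    "TotalWorkingYears": [2, 8, 10, 1, 12, 9],
    "JobSatisfaction":   [1, 3, 4, 2, 4, 3],
})

rate = (df["Attrition"] == "Yes").mean()
print(f"attrition rate: {rate:.0%}")  # 33% in this toy frame; 16% in the real data

# Compare group means to spot the trends listed above.
print(df.groupby("Attrition")["MonthlyIncome"].mean())
```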


The Model:

It is important to note that this dataset has an implicit “survivor bias”: once employees quit, they don’t tend to un-quit. We have to recognize that our currently un-attrited population still contains employees who are likely to quit.

My primary goal with this investigation was to come out of it with an interpretable and actionable model. Being able to forecast with accuracy whether an employee is likely to quit is a novel party trick, but it becomes a pointless exercise if we can’t determine which features are most impactful for retention.

Given this, I used oversampling to balance out my attrition classes and scaled the data for interpretability. I then ran a slew of boilerplate models and compared them to see which ones performed best.
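The balancing-and-scaling step can be sketched like this, on synthetic data with a similar class imbalance; scikit-learn's resample stands in here for whichever oversampler was actually used:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Synthetic imbalanced stand-in (~16% positive, like the attrition data).
X, y = make_classification(n_samples=600, n_features=10, weights=[0.84], random_state=0)

# Oversample the minority class up to the majority count.
X_min, X_maj = X[y == 1], X[y == 0]
X_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)
X_bal = np.vstack([X_maj, X_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_up))

# Scaling keeps logistic-regression coefficients comparable across features.
X_bal = StandardScaler().fit_transform(X_bal)

for model in (LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)):
    print(type(model).__name__, model.fit(X_bal, y_bal).score(X_bal, y_bal))
```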

Model Comparison

As illustrated by this chart, the logistic regression and Random Forest models generally outperform the rest at almost every confidence level.

This wound up working in my favor in terms of interpretability; both model types expose feature importance directly, via coefficients for the logistic regression and impurity-based importances for the Random Forest.

At a threshold of 35% confidence, our logistic regression has a recall score of 76%, echoing the proportion of preventable attritions.
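Moving the threshold is just a matter of cutting the predicted probabilities yourself instead of using the default 0.5 cutoff, sketched here on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

X, y = make_classification(n_samples=400, weights=[0.8], random_state=1)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Lowering the cutoff from 0.5 to 0.35 trades precision for recall,
# flagging more potentially at-risk employees.
proba = clf.predict_proba(X)[:, 1]
for threshold in (0.5, 0.35):
    preds = (proba >= threshold).astype(int)
    print(f"threshold {threshold}: recall {recall_score(y, preds):.2f}")
```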


Findings, Part 1:

Our model’s feature coefficients tell an interesting, but seemingly contradictory, story:

Strongest predictors of retention:

  • Years in current role

  • Stock option level

  • Years with current manager

  • Job Involvement

Strongest predictors of attrition:

  • Overtime obligations

  • Years at company

  • Years since last promotion

  • Frequent business travel

It’s not surprising that the longer an employee stays at a company, the more likely they are to leave; that’s just how time works.

In direct opposition to this, the longer an employee stays in a particular role and the longer they stay with their current manager, the more likely they are to stay in their job. I thought these two narratives were mutually exclusive until I visualized attrition against years with manager.

Attrition by Years with Current Manager

There seems to be a wall in an employee’s first year with their manager. More than one third of employees in this group wound up attriting. This is significantly reduced after one year and slowly continues to degrade as time moves on.


Visualizing the Cost of Attrition:

The estimated cost of attrition for this group of employees is a staggering $15.7 million.

This is based on SHRM’s assumption that the cost of backfilling a role runs between 50% and 250% of the employee’s annual salary.
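The arithmetic behind that range, with hypothetical salaries standing in for the real ones (the actual figure comes from summing every attrited employee's annualized income):

```python
# Back-of-envelope cost of attrition under the 50%-250% backfill range.
# These salaries are made up for illustration.
attrited_salaries = [45_000, 62_000, 110_000]

low = sum(s * 0.5 for s in attrited_salaries)
high = sum(s * 2.5 for s in attrited_salaries)
mid = (low + high) / 2

print(f"estimated cost: ${low:,.0f} - ${high:,.0f} (midpoint ${mid:,.0f})")
```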

This Tableau Dashboard breaks down the cost of attrition for each job level (1-5).

In this breakdown, we can see something that wasn’t clear in our initial findings…


Findings, Part 2:

Level 3 employees are not our largest population, nor are they the most expensive to replace, but the intersection of cost and population makes them indisputably the most expensive group overall.

Though they only make up 13.5% of our company’s attritions, they account for 34.3% of the total estimated losses.

Grouping the data by level, level 3 employees ranked:

  • 4th of 5 in job involvement

  • 5th of 5 in workplace satisfaction and
    job satisfaction


Conclusion:

As a general recommendation for this company, I would push for a more robust training program. This would kill two birds with one stone.

Firstly, robust management training could have a huge impact on entry-level retention. It would also significantly reduce instances of employees quitting during their first year with their manager.

Secondly, it would boost engagement and morale with level 3 employees. It’s no secret that middle management can be extremely tough. One of the hardest transitions an employee can make is from an individual contributor to a people manager. Enabling these employees with the tools they need to thrive can spare them from some of the challenges of this transition.

This would allow us to not work so hard on the workplace and focus more on the work.

February 18, 2020 /Andrew Way
Classification, Partitioning, KNN, KNearestNeighbors, Logistic Regression, SVM
Projects

Project 2 - Regression Progression, or How I Learned to Stop Worrying and Love Machine Learning

January 27, 2020 by Andrew Way in Projects

“All progress takes place outside the comfort zone.”
— Michael John Bobak, Artist

A Confession:

Until my admission to Metis in late 2019, I hadn’t thought about statistics since Dr. Letarte’s class in 11th grade.  In that era, I was only peripherally interested in mathematics and significantly more interested in becoming a Broadway star.

Statistics was unlike any class that preceded it.  Algebra, Geometry, and Pre-calculus were all rooted in some sort of mathematical reality.  There were formulas and variables, the simple building blocks of math. Each question had a specific answer.  Stat was different. I was wary of things like distributions. If you couldn’t sort your hypothetical groups of people into neat, fractional buckets, what was the point?

Stat passed with barely-there marks, and I went through the rest of my education without having to reconsider Greek symbols ever again.  Ten years later, I would face my own reckoning.


Arriving at a question:

We were prompted to find a question that could possibly be solved with linear regression.  It would be our first time utilizing machine learning techniques, our first model (or at least mine), plenty of firsts.

I’ve been obsessed with pop music for around the last 4-5 years.  Coming from a background of musical theatre, I used to think it was cool to dismiss pop music for being made for the masses.  In reality, pop music is an incredible force for bringing people together. And 2019 was a great year for it. Billie Eilish pioneered intimate bedroom pop, Lil Nas X set a new record for longest run at no. 1 on the Hot 100 (19 weeks!), and Mariah Carey finally got the recognition she deserved for making a contemporary Christmas classic.

Using newly minted skills (aka BeautifulSoup), I scraped the Billboard Hot 100 for every week in 2019 to try and answer the following:

What is it about this music that we all love?  And if we can find it, can we use it to forecast how far a song will go?


Diving In:

This may seem shocking, but there’s really nothing predictive about a song’s features that can indicate chart success, at least not in the models I was working with.  Having begun with 12 features, I wound up using LASSO to remove features one by one. And after all that LASSO-ing, I was left with only 2 features in a model performing nearly identically to my full-featured first-draft model.

OLS, 12 features


OLS + LASSO, 2 features
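The LASSO step can be sketched like this; the feature names and data below are hypothetical stand-ins for the real chart features:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300

# Hypothetical features; only the first two actually drive the target here.
names = ["peak_position", "spotify_popularity", "danceability", "energy",
         "valence", "tempo", "loudness", "speechiness", "acousticness",
         "liveness", "instrumentalness", "key"]
X = rng.normal(size=(n, len(names)))
weeks_on_chart = 5 * X[:, 0] + 3 * X[:, 1] + rng.normal(scale=0.5, size=n)

# The L1 penalty shrinks uninformative coefficients all the way to zero.
lasso = Lasso(alpha=0.5).fit(StandardScaler().fit_transform(X), weeks_on_chart)
kept = [name for name, c in zip(names, lasso.coef_) if abs(c) > 1e-6]
print("surviving features:", kept)
```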

Unsurprisingly, the two strongest indicators of how long a song will spend on the chart are:

  1. Peak position

  2. A metric from Spotify called “Popularity”.  Popularity is calculated by play-count, giving weight to more recent plays.  For example, a song played 1000 times today would have a higher Popularity score than a song played 1000 times yesterday.

This finding is even less impressive when you consider that the Hot 100 began incorporating streaming counts into its rankings in 2012, so the Popularity metric and the Hot 100 are basically counting the same thing.


Drawing a conclusion:

If anyone could truly answer the question “what makes a song popular,” it would probably not be a fledgling data scientist/former theatre kid with a dubious stats background.  And the answer would probably have been monetized by now.

Instead, I can only extrapolate what this lack of a pattern means.  I found some pertinent theory from far outside the realm of data science, and also outside the realm of music.

Raymond Loewy was an acclaimed mid-century French-American industrial designer.  He is responsible for the logos of many staple companies, including Exxon, Nabisco, and the US Postal Service, as well as for a design philosophy known as MAYA: Most Advanced, Yet Acceptable.

Human brains crave two things that are seemingly at odds: the thrill of the unknown and new, and the comfort of the familiar.  What Loewy aspired to do in his design was to push the boundaries of products into new territories, but never to the point where they felt foreign or unfamiliar.

I think this theory can be neatly applied to pop music.  The music that moves us tends to have that dualistic quality of being both warmly familiar and excitingly groundbreaking.  Given the opportunity, I would probably reevaluate these tracks to try and find what they don’t have in common rather than what they do.

Perhaps that will bring us one step closer to understanding what puts the pop in pop music.

Andrew Way, Jan 2020

January 27, 2020 /Andrew Way
projects, data, regression, lasso, ridge
Projects

Project 1 - Enter Through the Turnstile Data

January 13, 2020 by Andrew Way in Projects

“It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”
— Sir Arthur Conan Doyle, Sherlock Holmes

THE BEGINNING

Week 1 at Metis began with a prompt, an email from a fictional non-profit organization.

“We are interested in harnessing the power of data and analytics to optimize the effectiveness of our street team work.”

“Where we’d like to solicit your engagement is to use MTA subway data […] to help us optimize the placement of our street teams, such that we can gather the most signatures, ideally from those who will attend [our] gala and contribute to our cause.”

As a brand new resident of Chicago and longtime (now former) resident of New York, I felt as though I had lucked out.  Perhaps I was new to data science, but I was certainly not new to the New York subway system.


THE TASK

Based on the email we were given, our task was to:

  • ‘Optimize effectiveness of street teams’

  • at ‘entrances to subway stations’

  • to target individuals who are ‘passionate about technology’

  • to attend a gala ‘at the beginning of the summer’


THE TEAM

Myself, Andrew Way, former actor and recruiter

Ake Paramadilok, physical therapist

And Tony Ghabour, consultant


THE APPROACH

I didn’t know until this project that the MTA publishes all their weekly ridership data, containing information about every turnstile at every entrance of all ~470 stations in the system.  While this dataset is rather cryptic at first glance, looking at it from the right direction can illuminate trends about the subway and the people who utilize it.

We decided that since this fictional gala is taking place in June, we should look at the 4 weeks ending in May of 2019.  Once we had imported our data and cleaned it with various techniques, we began to pick apart the intricacies of the data.

The turnstile data contains cumulative counts of people both entering and exiting the subway stations.  Given that we are merely looking for people, regardless of their trajectory, we combined these metrics into a total of target individuals.
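In pandas terms, that means diff-ing the cumulative counters per turnstile and summing entries with exits. A sketch on a synthetic three-row excerpt (the real files also key on C/A and UNIT columns):

```python
import pandas as pd

# Synthetic cumulative counts for one turnstile, mimicking the MTA schema.
df = pd.DataFrame({
    "STATION": ["FULTON ST"] * 3,
    "SCP":     ["02-00-01"] * 3,
    "TIME":    ["04:00:00", "08:00:00", "12:00:00"],
    "ENTRIES": [1000, 1050, 1210],
    "EXITS":   [500, 530, 620],
})

# Cumulative counters -> per-interval counts via a groupwise diff.
grouped = df.groupby(["STATION", "SCP"])
df["entries"] = grouped["ENTRIES"].diff()
df["exits"] = grouped["EXITS"].diff()

# Entering and exiting riders are both potential signature targets.
df["traffic"] = df["entries"] + df["exits"]
print(df["traffic"].sum())
```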

From our initial visualizations, we were able to see distinct peaks of activity in the morning and evening, roughly correlating with a sort of daily commute.  However, when operating optimally, the turnstiles only send their records at 4-hour intervals. While this is not nearly as specific as I think any of us would have preferred, we could only accept the circumstance and attempt to optimize for it.

We opted to break the data roughly into two sections, morning and evening, partitioning at noon.  We decided not to consider any data associated with late night or early morning (8:00pm to 4:00am).

We then began categorizing on the basis of weekday vs. weekend to target the professional market as opposed to the tourist market.  When we categorized by day type and ranked by mean target volume, 9 of our top 10 entries were weekday means.

With this observation, we decided to plot all of our stations with their weekday average on one axis, weekend average on the other, just to see the general shape of how these two variables interact.

We can observe a general trend of weekday use outpacing weekend use.  Given the very linear nature of this data, we went back to the drawing board to try to distinguish some feature that may guide us to the demographic of technological professionals without having to wade into the very deep (and usually not terribly current) pool of census data.


THE MAP

In my own qualitative research online, I stumbled across BuiltInNYC, an organization specifically for New York based startups and tech companies.  From their website, I manually scraped data on ~115 of NYC’s largest tech employers. I also used one of their recent articles to gather information on the largest tech companies in New York based on footprint.

Using two very handy packages, I was able to turn these lists of names and addresses into a powerful visualization comparing our top stations’ locations to this array of companies.  Geopy is a package that can accept addresses as strings and return extensive location data on them, including geographical coordinates. I then took those coordinates and ran them through Folium, a package for creating leaflet maps.  The resulting map compares tech companies (blue) to subway stations (red), indicating potential reach.


THE CONCLUSION

We recommended the following stations based on our research:

Fulton Street - A/C, J/Z, 2/3, 4/5

  • connects to PATH train for access to New Jersey

  • tech presence with Spotify and Conde Nast

Columbus Circle - A/C, B/D, 1

  • closest junction station to New York Institute of Technology

  • northernmost station in top stations, dividing residential uptown from commercial downtown

Union Square - L, N/Q/R/W, 4/5/6

  • close proximity to Facebook, Oath, New York University

  • serves the L train, the most utilized train in the MTA


THE AFTERMATH

After our presentation (and a few celebratory high fives), we got our feedback and got back to work refactoring our code, taking out all the missteps and dead ends and such.  Once we’d made a notebook we were all pleased with, we gave it one last run and shipped it.

Given my new dexterity with Python, I decided to go back to my exercises from the beginning of the week and give them another shot.  Pandas had become significantly more familiar, so I was quite certain I would be able to speed through.

Instead of looking at any of the most utilized turnstiles, I turned my attention to my old stomping grounds.  Unlike many New Yorkers I’ve met, I only ever lived in one place in NYC - Ditmas Park, a quiet area south of Prospect Park.  It’s full of trees and lawns and old Victorian homes. I zoomed in on my own station, Newkirk Plaza.


THE REALIZATION

And then it dawned on me.  As I compared the station to the date, it suddenly occurred to me that I was looking at myself through the data, my own motion through turnstiles, my own commute.  And I felt both big and small, anonymous and recognized.

The point of data science is to make sense of the world around us.  To bring order to the chaos. And in this moment, a field of study that felt so nebulous and hypothetical suddenly rooted itself into the ground.

Andrew Way, Jan 2020

January 13, 2020 /Andrew Way
data visualization, python, geopy, folium, projects
Projects