“All progress takes place outside the comfort zone.”

— Michael John Bobak, Artist

A Confession:

Until my admission to Metis in late 2019, I hadn’t thought about statistics since Dr. Letarte’s class in 11th grade. In that era, I was only peripherally interested in mathematics and significantly more interested in becoming a Broadway star.

Statistics was unlike any class that preceded it. Algebra, Geometry, and Pre-calculus were all rooted in some sort of mathematical reality. There were formulas and variables, the simple building blocks of math. Each question had a specific answer. Stat was different. I was wary of things like distributions. If you couldn’t sort your hypothetical groups of people into neat, fractional buckets, what was the point?

I passed Stat with barely-there marks and went through the rest of my education without having to reconsider Greek symbols. Ten years later, I would face my own reckoning.

Arriving at a question:

We were prompted to find a question that could possibly be solved with linear regression. It would be our first time utilizing machine learning techniques, our first model (or at least mine), plenty of firsts.

I’ve been obsessed with pop music for the last four or five years. Coming from a background in musical theatre, I used to think it was cool to dismiss pop music for being made for the masses. In reality, pop music is an incredible force for bringing people together. And 2019 was a great year for it. Billie Eilish pioneered intimate bedroom pop, Lil Nas X set a new record for longest run at no. 1 on the Hot 100 (19 weeks!), and Mariah Carey finally got the recognition she deserved for making a contemporary Christmas classic.

Using newly minted skills (aka BeautifulSoup), I scraped the Billboard Hot 100 for every week in 2019 to try and answer the following:

What is it about this music that we all love? And if we can find it, can we use it to forecast how far a song will go?

Diving In:

This may seem shocking, but there’s really nothing predictive about a song’s features that can indicate chart success, at least not in the models I was working with. Having begun with 12 features, I wound up using LASSO regularization to prune them, letting the weakest features drop out one by one. And after all that LASSO-ing, I was left with only 2 features in a model performing nearly identically to my full-featured first-draft model.
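To illustrate how LASSO prunes features, here is a minimal sketch assuming scikit-learn. The data is synthetic (12 made-up features, of which only two matter), and the `alpha` value is an illustrative choice, not the one used in this project.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 12 hypothetical features, only two of which
# actually drive the target. None of this is real Billboard data.
rng = np.random.default_rng(0)
n_songs, n_features = 300, 12
X = rng.normal(size=(n_songs, n_features))
weeks_on_chart = 5.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(size=n_songs)

# LASSO penalizes coefficient size, so features should share a scale.
X_scaled = StandardScaler().fit_transform(X)

# alpha sets the penalty strength (illustrative value); a larger alpha
# shrinks more coefficients to exactly zero.
model = Lasso(alpha=0.5).fit(X_scaled, weeks_on_chart)

# Features with a non-zero coefficient are the ones the model kept.
kept_features = np.flatnonzero(model.coef_)
print("features kept:", kept_features)
print("coefficients:", np.round(model.coef_[kept_features], 2))
```

Raising `alpha` step by step and watching which coefficients hit zero is one way to see features fall away in order of weakness.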

Unsurprisingly, the two strongest indicators of how long a song will spend on the chart are:

Peak position
A metric from Spotify called “Popularity”. Popularity is calculated by play-count, giving weight to more recent plays. For example, a song played 1000 times today would have a higher Popularity score than a song played 1000 times yesterday.

This finding is even less impressive when you consider that the Hot 100 began incorporating streaming counts into its rankings in 2012, so the Popularity metric and the Hot 100 are basically counting the same thing.

Drawing a conclusion:

If anyone were able to answer the question “what makes a song popular,” it would probably not be a fledgling data scientist/former theatre kid with a dubious stats background. And the answer would probably have been monetized by now.

Instead, I can only extrapolate what this lack of a pattern means. I found some pertinent theory from far outside the realm of data science, and also outside the realm of music.

Raymond Loewy was an acclaimed mid-century French-American industrial designer. He is responsible for logos for many staple companies including Exxon, Nabisco, and the US Postal Service, as well as for a design philosophy known as MAYA: Most Acceptable, Yet Advanced.

Human brains crave two things that are seemingly at odds: the thrill of the unknown and new, and the comfort of the familiar. What Loewy aspired to do in his design was to push the boundaries of products into new territories, but never to the point where they felt foreign or unfamiliar.

I think this theory can be neatly applied to pop music. The music that moves us tends to have that dualistic quality of being both warmly familiar and excitingly groundbreaking. Given the opportunity, I would probably reevaluate these tracks to try and find what they don’t have in common rather than what they do.

Perhaps that will bring us one step closer to understanding what puts the pop in pop music.

Andrew Way, Jan 2020

“It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”

— Sir Arthur Conan Doyle, Sherlock Holmes

THE BEGINNING

Week 1 at Metis began with a prompt, an email from a fictional non-profit organization.

“We are interested in harnessing the power of data and analytics to optimize the effectiveness of our street team work.”

“Where we’d like to solicit your engagement is to use MTA subway data […] to help us optimize the placement of our street teams, such that we can gather the most signatures, ideally from those who will attend [our] gala and contribute to our cause.”

As a brand new resident of Chicago and longtime (now former) resident of New York, I felt as though I had lucked out. Perhaps I was new to data science, but I was certainly not new to the New York subway system.

THE TASK

Based on the email we were given, our task was to:

‘Optimize effectiveness of street teams’
at ‘entrances to subway stations’
to target individuals who are ‘passionate about technology’
to attend a gala ‘at the beginning of the summer’

THE TEAM

Myself, Andrew Way, former actor and recruiter

Ake Paramadilok, physical therapist

And Tony Ghabour, consultant

THE APPROACH

I didn’t know until this project that the MTA publishes all their weekly ridership data, containing information about every turnstile at every entrance of all ~470 stations in the system. While this dataset is rather cryptic at first glance, looking at it from the right direction can illuminate trends about the subway and the people who utilize it.

We decided that since this fictional gala is taking place in June, we should look at the 4 weeks ending in May of 2019. Once we had imported our data and cleaned it with various techniques, we began to pick apart the intricacies of the data.

The turnstile data contains cumulative counts of people both entering and exiting the subway stations. Given that we are merely looking for people, regardless of their trajectory, we combined these metrics into a total of target individuals.
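Because the counters are cumulative, each reading has to be differenced against the previous one before entries and exits can be added together. A minimal pandas sketch, using toy numbers and simplified, hypothetical column names (`turnstile`, `timestamp`, `entries`, `exits`) rather than the real MTA schema:

```python
import pandas as pd

# Toy readings for two hypothetical turnstiles; counters are cumulative.
readings = pd.DataFrame({
    "turnstile": ["A", "A", "A", "B", "B", "B"],
    "timestamp": pd.to_datetime([
        "2019-05-01 04:00", "2019-05-01 08:00", "2019-05-01 12:00",
        "2019-05-01 04:00", "2019-05-01 08:00", "2019-05-01 12:00",
    ]),
    "entries": [1000, 1120, 1300, 500, 530, 590],
    "exits": [800, 850, 900, 200, 260, 300],
})

readings = readings.sort_values(["turnstile", "timestamp"])

# Differencing within each turnstile turns running totals into
# per-interval counts.
deltas = readings.groupby("turnstile")[["entries", "exits"]].diff()

# Direction does not matter here, so entries and exits are summed.
readings["traffic"] = deltas["entries"] + deltas["exits"]

# The first reading of each turnstile has no predecessor and is dropped.
interval_traffic = readings.dropna(subset=["traffic"])
print(interval_traffic[["turnstile", "timestamp", "traffic"]])
```

Grouping by turnstile before differencing matters: without it, the first reading of one turnstile would be subtracted from the last reading of another.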

From our initial visualizations, we were able to see distinct peaks of activity in the morning and evening, roughly corresponding to the daily commute. However, even when operating optimally, the turnstiles only send their records at 4-hour intervals. While this is not nearly as specific as I think any of us would have preferred, we could only accept the circumstance and attempt to optimize for it.

We opted to break the data roughly into two sections, morning and evening, partitioning at noon. We decided not to consider any data associated with late night or early morning (8:00pm to 4:00am).
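The split described above might look something like the sketch below. It assumes each row is labeled by the time its 4-hour interval ends (so a reading at 8:00am covers 4:00am to 8:00am); that convention, the column names, and the numbers are illustrative assumptions.

```python
import pandas as pd

def label_period(interval_end_hour):
    """Label a 4-hour interval by the hour at which it ends (assumed convention)."""
    if 4 < interval_end_hour <= 12:
        return "morning"   # covers 4:00am to noon
    if 12 < interval_end_hour <= 20:
        return "evening"   # covers noon to 8:00pm
    return None            # 8:00pm to 4:00am is excluded

# Toy per-interval traffic for one day at one hypothetical station.
traffic = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-05-01 00:00", "2019-05-01 04:00", "2019-05-01 08:00",
        "2019-05-01 12:00", "2019-05-01 16:00", "2019-05-01 20:00",
    ]),
    "traffic": [40, 25, 300, 180, 220, 350],
})

traffic["period"] = traffic["timestamp"].dt.hour.map(label_period)

# Rows with no period fall in the late-night window and are dropped.
daytime = traffic.dropna(subset=["period"])
period_totals = daytime.groupby("period")["traffic"].sum()
print(period_totals)
```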

We then categorized the data by weekday versus weekend, aiming at the professional market rather than the tourist market. When we ranked station averages by day type, 9 of the top 10 were weekday averages.

With this observation, we decided to plot all of our stations with their weekday average on one axis, weekend average on the other, just to see the general shape of how these two variables interact.
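One way to get those two averages per station is a pivot table keyed on day type. This is a sketch with invented station names and traffic figures, not our actual notebook code:

```python
import pandas as pd

# Toy daily totals for two hypothetical stations.
# 2019-05-03 was a Friday, so the dates run Fri, Sat, Sun, Mon.
daily = pd.DataFrame({
    "station": ["Station X"] * 4 + ["Station Y"] * 4,
    "date": pd.to_datetime(
        ["2019-05-03", "2019-05-04", "2019-05-05", "2019-05-06"] * 2
    ),
    "traffic": [1000, 400, 300, 1200, 500, 450, 350, 700],
})

# dayofweek runs Monday=0 through Sunday=6, so 5 and 6 are the weekend.
daily["day_type"] = daily["date"].dt.dayofweek.map(
    lambda day: "weekend" if day >= 5 else "weekday"
)

# One row per station, one column per day type, mean traffic in each cell.
averages = daily.pivot_table(
    index="station", columns="day_type", values="traffic", aggfunc="mean"
)
print(averages)
```

With matplotlib installed, `averages.plot.scatter(x="weekday", y="weekend")` would produce the kind of plot described above.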

We can observe a general trend of weekday use outpacing weekend use. Given the very linear nature of this data, we went back to the drawing board to try to distinguish some feature that may guide us to the demographic of technological professionals without having to wade into the very deep (and usually not terribly current) pool of census data.

THE MAP

In my own qualitative research online, I stumbled across BuiltInNYC, an organization specifically for New York based startups and tech companies. From their website, I manually scraped data on ~115 of NYC’s largest tech employers. I also used one of their recent articles to gather information on the largest tech companies in New York based on footprint.

Using two very handy packages, I was able to turn these lists of names and addresses into a powerful visualization comparing our top stations’ locations to this array of companies. Geopy is a package that can accept addresses as strings and return extensive location data on them, including geographical coordinates. I then took those coordinates and ran them through Folium, a package for creating Leaflet maps. The resulting map compares tech companies (blue) to subway stations (red), indicating potential reach.

THE CONCLUSION

We recommended the following stations based on our research:

Fulton Street - A/C, J/Z, 2/3, 4/5

connects to PATH train for access to New Jersey
tech presence with Spotify and Condé Nast

Columbus Circle - A/C, B/D, 1

closest junction station to New York Institute of Technology
northernmost station in top stations, dividing residential uptown from commercial downtown

Union Square - L, N/Q/R/W, 4/5/6

close proximity to Facebook, Oath, New York University
serves L train, the most utilized train in the MTA

THE AFTERMATH

After our presentation (and a few celebratory high fives), we got our feedback and got back to work refactoring our code, taking out all the missteps and dead ends and such. Once we’d made a notebook we were all pleased with, we gave it one last run and shipped it.

Given my new dexterity with Python, I decided to go back to my exercises from the beginning of the week and give them another shot. Pandas had become significantly more familiar, so I was quite certain I would be able to speed through.

Instead of looking at any of the most utilized turnstiles, I turned my attention to my old stomping grounds. Unlike many New Yorkers I’ve met, I only ever lived in one place in NYC - Ditmas Park, a quiet area south of Prospect Park. It’s full of trees and lawns and old Victorian homes. I zoomed in on my own station, Newkirk Plaza.

THE REALIZATION

And then it dawned on me. As I compared the station to the date, it suddenly occurred to me that I was looking at myself through the data, my own motion through turnstiles, my own commute. And I felt both big and small, anonymous and recognized.

The point of data science is to make sense of the world around us. To bring order to the chaos. And in this moment, a field of study that felt so nebulous and hypothetical suddenly rooted itself into the ground.