It’s 2020 already? Oh my word. Does anyone remember that 90s show called “Vision 2020” where they would fantasize about the future of the world? I remember watching that show back in the ghetto in Zimbabwe. It definitely opened my eyes to the world, but where are our flying cars? I thought we would be a multiplanetary species by now. Alas, we are still using our hands to type on our computers, virtual reality is still nauseating and don’t get me started on why we are still debating the need for electric vehicles.
Nevertheless, a new year and a new decade bring new hope of things to come. I wish you all the best!
I would like to start this year by emphasizing something I haven’t talked about in previous newsletters but that is a critical component of whatever you do in your career. As I am in the Machine Learning field and this newsletter is aimed at sharing knowledge in that discipline, I will give data science-y examples, but the concept is universal.
What makes doctors special? Why can’t I just walk into the theatre and poke around a body on an operating table? Actuaries. Chartered Accountants. Teachers. They are all professionals who possess domain knowledge of their field. I can look at a mountain and walk past, but a geologist will tell me how many millions of dollars in gold I just ignored. Domain knowledge is critical for success in any field, data science included.
I am sure you know a machine learning pipeline is composed of different stages: data collection, data cleanup, model building, model testing, model deployment and so on. Quite often I see people concentrating on the model building stage. That’s the sexy part. I admit, I am guilty of this too. But this article is here to emphasise the need to gain domain knowledge before spinning up a Jupyter notebook, importing Python libraries and coding some machine learning model. Indulge me a little:
- Imagine UberEats wants you to create a model to optimize delivery routes. What are you going to look for in the data? What features are important for you to build your dataset? Is it the distance between the store and the customer? How do you measure that distance? Can you instinctively know whether it is Manhattan or Euclidean with no exposure to this field? I’m guessing traffic congestion needs to be taken into account, right?
- South Africa’s economy has been stagnant for many consecutive quarters now. The government wants you to build a model to better understand how we can solve this issue. What is going to be our GDP growth in 2020? How do you build this model? Which datasets are relevant, and what features should they contain?
- I am a huge fan of American Football. My favourite team, the San Francisco 49ers, has been terrible for a few years now and we have been comforting ourselves by calling the period our “Rebuilding phase”. This year we have done surprisingly well. We finished 13 – 3, winning the NFC West and taking the top seed in the NFC! We had a total of 336 first downs (110 coming from rushing plays, 195 from passing and 31 from penalties). Our third down conversions are currently at 45% and fourth downs are at 58%. 331 completed passes out of 478 attempts with 13 interceptions. Our quarterback, Jimmy Garoppolo, struggles with throwing away the ball, so if he is rushed and no receiver is open, chances are high he will be sacked. This has happened 36 times in 16 games this season. On Saturday, we are facing the Minnesota Vikings, who are 11 – 6. I need to put my bets down. Can you write a Machine Learning model to predict how much I should put?
If you are not a fan of the NFL, the third example probably did not make sense. And that is the point of this quick-read article. Domain knowledge gives you the ability to anticipate which features are probably important for the problem at hand and which ones are not. This applies not only to the explicit features that come with your dataset, for example the delivery address; you are also better placed to derive implicit ones. Given horsepower and revs per minute, someone with domain knowledge of the automotive industry can quickly work out torque, which is usually a good measure of a car’s performance (TORQUE = HP × 5252 ÷ RPM, with torque in pound-feet).
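To make the torque example concrete, here is a minimal sketch of deriving that implicit feature in Python. The function name `torque_lb_ft`, the record layout and the sample cars are all made up for illustration; only the formula itself comes from the paragraph above.

```python
def torque_lb_ft(horsepower: float, rpm: float) -> float:
    """Derive torque (pound-feet) from horsepower and engine speed.

    Uses the relation quoted above: TORQUE = HP x 5252 / RPM.
    """
    if rpm <= 0:
        raise ValueError("rpm must be positive")
    return horsepower * 5252 / rpm


# Hypothetical records: the raw dataset only has horsepower and rpm.
cars = [
    {"name": "car_a", "horsepower": 300.0, "rpm": 6000.0},
    {"name": "car_b", "horsepower": 150.0, "rpm": 5252.0},
]

# Add the derived (implicit) feature next to the explicit ones.
for car in cars:
    car["torque_lb_ft"] = torque_lb_ft(car["horsepower"], car["rpm"])

for car in cars:
    print(car["name"], round(car["torque_lb_ft"], 1))
```

Nothing in the raw columns says torque matters. Someone who knows cars adds it before any model is trained, and that is the kind of head start domain knowledge gives you.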
The same holds in financial services. How do we measure risk? One department’s definition of risk is not the same as another’s. Bringing it closer to home, when Banking talks about a client they mean an individual. What does it mean to create a Machine Learning model for Private Capital clients? How does marketing interact with our clients? What does it mean to personalize our app?
Domain knowledge affects more than feature engineering. It affects how you define success, what model you can use, how often you have to retrain it, how it can be deployed, and how you make sense of the results (correlation does not necessarily mean causation).
Knowledge is indeed power, and it’s something you have to gain to take your data science skills, or any skills, to the next level.
I could go on, but this is a quick-read article. For now, all the best for the year!