
Four Data Science Bad Habits That are Hard to Shake Off – DataDrivenInvestor

You probably do at least one of them

Photo by Manan Chhabra on Unsplash

I’ve done a few hobby data science projects over my career so far: a house value prediction model built on xgboost, a model that automated administration work using naive Bayes, and a company ranking model built with Keras. So I’ve built up a fair amount of data science experience.

Through completing these projects and a whole lot more, I seem to have picked up some bad data science habits along the way. Frankly, these habits are so persistent that trying to rectify them would be a whole lot more work. I’ve come to live with these bad data science habits instead, mostly because I’m lazy.

If you want to read a how-to article about avoiding and rectifying bad habits in data science, Terence Shin wrote a great article called ‘Five Bad Habits Every Data Scientist Should Avoid’.

This article takes a different spin on bad habits: I write about my own bad habits, which I think at least a few data scientists can relate to, and which they are sometimes just too lazy to resolve as well.

1. Using R instead of Python and vice versa

R does some things better than Python, and Python does some things better than R. For example, dplyr (from the tidyverse) is explicit in its syntax while Pandas is more compact.

Here’s an example of filtering data in R versus Python.

R - dplyr
df %>% filter(Field == 'Hello World')
Python - Pandas
df[df.Field == 'Hello World']

Frankly, if you showed that Pandas code to someone unfamiliar with the library, they couldn’t even guess that something is being filtered.

When I need to do data cleaning, I habitually use R since that was the first data science language I learned. However, usually around the halfway mark of data cleaning, I figure out that I could use an API to make the process faster, except that many of the great data verification libraries out there only offer Python APIs. Consequently, I need to painstakingly recreate the whole data cleaning process in Python to use these Python-only APIs.

Conversely, sometimes I start in Python because loading a Google Colab notebook is just easier than opening an IDE on an old laptop, and since it’s Google, you know you’ll get a good quality IDE. But again, about halfway through a project, I realize I can run certain machine learning algorithms better in R than in Python, simply because I’m more experienced in R. For instance, I can use xgboost better in R than in Python because I understand how to manipulate factors in R to meet xgboost’s data requirements; it takes me longer to clean data in Python so that it can be loaded into xgboost. Consequently, once again, I need to recreate the data cleaning process in a different language.
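For context, here’s a minimal Python sketch of the kind of preparation xgboost expects, one-hot encoding a categorical column with pandas (the rough equivalent of converting factors into a model matrix in R). The column names and values are made up purely for illustration.

import pandas as pd
import xgboost as xgb

# Hypothetical housing data with one categorical column ('suburb' is invented for illustration)
df = pd.DataFrame({
    "suburb": ["North", "South", "North", "East"],
    "bedrooms": [3, 2, 4, 3],
    "price": [450_000, 380_000, 520_000, 410_000],
})

# xgboost wants numeric inputs, so one-hot encode the categorical column first,
# much like turning factors into dummy columns before modelling in R
X = pd.get_dummies(df.drop(columns="price"), columns=["suburb"], dtype=float)
y = df["price"]

model = xgb.XGBRegressor(n_estimators=50)
model.fit(X, y)
print(model.predict(X.head(2)))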

Nowadays, I don’t fall into this bad habit as much as I used to because of experience, but it still happens occasionally.

2. Always using Xgboost or Deep Learning when something simpler would suffice

Out of habit, when I see a structured data set, I just use xgboost without giving it much thought. Likewise, when I see data sets containing only numbers or pictures, I’ll automatically reach for deep learning.

This is a problem because xgboost and deep learning algorithms are not always the best algorithms to use (kind of, anyway). This is even supported by the ‘No Free Lunch’ theorem, which tells us that all machine learning algorithms perform equally well when performance is averaged across all possible problems.

The reason this habit matters is that 80% of the time is spent on data cleaning and only 20% is spent fine-tuning your model, if it even needs fine-tuning. In other words, to speed up machine learning, I’ll clean the data to a minimal extent and throw it into an algorithm I know can take almost any sort of data type without throwing errors.

For example, I created a 40-column-wide house value prediction model (most of the columns were one-hot encoded variables) and ran it through xgboost. It performed relatively well compared to actual sold prices.

Then, I built a simple model using six numeric-only columns and ran it through k-means. It turned out that the simpler data set and model performed just as well as the xgboost model.

So, it turns out I could’ve saved myself a lot of data cleaning time if I had explored the data a bit more and started with a simpler model, rather than using all of the data and going straight to the black-box model, which was probably overfitting to all of the noise anyway.
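If I were being more disciplined, a quick cross-validated comparison between the kitchen-sink xgboost model and a simple baseline would have told me this up front. The sketch below uses scikit-learn’s California housing data as a stand-in for my own data and a plain linear regression as the simple baseline; it isn’t the exact data or models I used, just an illustration of the check.

import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Public housing data as a stand-in for my own data set
X, y = fetch_california_housing(return_X_y=True, as_frame=True)

# Simple baseline: a linear model on a handful of numeric columns
simple_cols = ["MedInc", "HouseAge", "AveRooms"]
baseline_r2 = cross_val_score(LinearRegression(), X[simple_cols], y, cv=5, scoring="r2").mean()

# The habit: throw every column at xgboost
boosted_r2 = cross_val_score(xgb.XGBRegressor(n_estimators=200), X, y, cv=5, scoring="r2").mean()

print(f"simple baseline R^2: {baseline_r2:.3f}")
print(f"xgboost R^2: {boosted_r2:.3f}")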

3. Skipping out on EDA

I’ve got a fairly poor EDA process. It has two steps: figure out what data types I’m working with and count the number of rows.
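In pandas terms, my entire process is something like the sketch below (the file name is hypothetical); the last two lines are the slightly less lazy version I usually skip.

import pandas as pd

df = pd.read_csv("houses.csv")  # hypothetical file

# Step one and step two of my EDA 'process'
print(df.dtypes)
print(len(df))

# The slightly less lazy version I tend to skip
print(df.describe(include="all"))
print(df.isna().sum())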

I guess I have this bad habit because a lot of data we get nowadays comes from RESTful APIs, where we cherry-pick the data that goes into our model rather than exploring all possible attributes.

My bad habit is to swing to one of two extremes: either throw all the variables into a model or pick only the few variables I think will work. Usually, in the end, both models tend to overfit the training set anyway.

Most likely, if I performed better EDA, I wouldn’t build such poorly performing models in the first place. The reason for avoiding EDA is quite straightforward: EDA is time-consuming. To put it bluntly, cleaning data to visualize something can take as much time as cleaning data for machine learning.

For instance, if you have a horrendous free-text field and you want to make a categorical graph, you have to go through the effort of feature engineering the field to get your desired variable, and even then you’re not sure whether it will be included in your machine learning model. An example that comes to mind is the ‘Name’ field in the famous Titanic data set.
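For what it’s worth, pulling a usable category out of that field isn’t a huge amount of code. Here’s a quick sketch using a few Titanic-style names; the regex is one way to do it, not the only way.

import pandas as pd

# A few strings in the format of the Titanic 'Name' field
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
])

# Extract the title between the comma and the first full stop,
# turning free text into a categorical feature you can plot or encode
titles = names.str.extract(r",\s*([^.]+)\.", expand=False)
print(titles.value_counts())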

4. Guessing with Stack Overflow rather than thinking through a problem

To me, Stack Overflow is like a reliable genie in a lamp. It seldom disappoints when you need something from it. There’s an answer for almost any coding problem out there on the website. But that’s also the problem. I tend to copy and paste Stack Overflow answers instead of thinking about how the person who answered arrived at them, and other times I copy and paste verbose answers rather than figuring out how to resolve the issue myself.

I guess this problem stems from the fact that I’m so exhausted from data cleaning that I want it done as soon as possible, so that I can get to the machine learning, which is a bit more fun. Unfortunately, when you code that much, you want to take shortcuts and rely on heuristics rather than write the most efficient and readable code.

Conclusion

Those are my bad habits in doing data science, and you might share some of them too. I guess the answer to resolving them is quite self-evident: think strategically before doing data science. But given there’s so much to do in a data science project, who really has the time to think nowadays?

Source: https://medium.datadriveninvestor.com/four-data-science-bad-habits-that-are-hard-to-shake-off-13a75fa08ae0
