Data Scientists Make These 3 Common Mistakes
The practice of training and putting machine learning systems into production involves a lot of trial and error. However, there are some recurring mistakes many data scientists make that can be avoided with the proper awareness and preparation. Here are a few that we’ve seen relatively frequently, and some tips on how to prevent these mistakes from happening within your organization.
#1: Data scientists incorrectly format the data set.
The problem: Data quality is the foundation of any successful machine learning implementation (garbage in, garbage out). Unfortunately, if your data is formatted incorrectly or inconsistently, you won’t get very far with your training, and you may get inaccurate results.
Most commonly, data scientists run into problems when they’re aggregating multiple sources into a single dataset, or when records within a dataset are inconsistent with each other. For example, some ages or durations may be stored in years and others in months, or units may be mixed (such as inches vs. centimeters). These are just a few examples; there are many other ways the same data could be represented inconsistently within a database.
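The unit-consistency point can be sketched in a few lines. This is a toy example with made-up records and a hypothetical `to_cm` helper, not a prescription for any particular pipeline:

```python
# Toy records with mixed length units (invented for illustration).
records = [
    {"height": 70, "unit": "in"},
    {"height": 180, "unit": "cm"},
]

IN_TO_CM = 2.54  # exact inch-to-centimeter conversion factor

def to_cm(record):
    """Return the height in centimeters regardless of the stored unit."""
    if record["unit"] == "in":
        return record["height"] * IN_TO_CM
    return record["height"]

# Canonicalize everything to one unit before training.
heights_cm = [round(to_cm(r), 2) for r in records]
print(heights_cm)  # [177.8, 180]
```

The point is simply to pick one canonical representation per field and convert everything to it at ingestion time, rather than letting mixed units leak into training.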
In many cases, data scientists also fail to normalize their data. For example, if one input is the month a house was sold and another is its square footage, the square footage will dominate the model’s predictions of house price simply because its values are orders of magnitude larger, and the month will barely register.
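The month vs. square-footage scenario can be illustrated with a small z-score normalization sketch (all feature values below are invented for the example):

```python
def z_score(values):
    """Standardize a list of numbers to mean 0 and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = var ** 0.5
    return [(v - mean) / std for v in values]

months = [1, 6, 12, 3]          # month each house was sold
sqft = [850, 2400, 3100, 1200]  # square footage of each house

# After scaling, both features span a similar numeric range,
# so neither dominates the other purely because of its units.
months_scaled = z_score(months)
sqft_scaled = z_score(sqft)
```

In practice you’d typically reach for a library scaler (such as scikit-learn’s `StandardScaler`) rather than hand-rolling this, but the arithmetic is the same.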
How to avoid it: Take care to prepare your dataset in advance, ensuring that all formatting is consistent and clean across the board. While this process can be painstaking, it’ll be worth it in the end.
#2: Working with overly imbalanced classes.
The problem: This issue is also data-related, and arises when the classes in your dataset appear in very unequal proportions. Some degree of imbalance is to be expected, but highly imbalanced datasets can hurt performance and make accuracy misleading. For example, if you train a defect detector on a heavily imbalanced dataset, the model can score high accuracy by simply predicting that nothing is defective, while missing many real defects in the field.
How to avoid it: Probably the simplest way to avoid imbalanced classes is to try to balance them, by undersampling instances of the majority class, or oversampling instances of the minority class. This detailed post from Devin Soni shows several other tactical ways to deal with imbalanced classes, many of which may work for you.
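Both tactics can be sketched with Python’s standard library. The labels below are made up (95 “ok” items vs. 5 “defect” items), and this is a minimal illustration rather than a production resampling strategy:

```python
import random

random.seed(0)  # make the sketch reproducible

# Hypothetical imbalanced dataset: 95 "ok" items vs. 5 "defect" items.
majority = [("ok", i) for i in range(95)]
minority = [("defect", i) for i in range(5)]

# Undersampling: draw a random subset of the majority class
# so both classes end up the same size.
under = random.sample(majority, len(minority)) + minority

# Oversampling: draw from the minority class with replacement
# until it matches the majority class.
over = majority + random.choices(minority, k=len(majority))

print(len(under), len(over))  # 10 190
```

Undersampling throws away majority-class information, while naive oversampling duplicates minority examples, so each has trade-offs; more sophisticated techniques (such as synthetic oversampling) build on the same idea.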
#3: Implementing your own architecture
The problem: Data scientists often opt to implement their own architecture when there are many existing models that serve a similar purpose (and take far less time to implement). It can be tempting to build an architecture from scratch, especially if your use case seems unique or there’s extra time built into your project for experimentation. However, implementing your own architecture is risky: unexpected problems can arise from the lack of testing, prior research, and benchmarking behind it.
How to avoid it: Simply put, when you can, build on existing research rather than starting from scratch. Luckily, the machine learning community is built on open source, and frequently shares research, code, and best practices. Well-established models are heavily tested, and other teams have often documented their results. Chances are you can find a model that fits your specific use case based on someone else’s research, or adjust an existing model slightly to meet your needs.
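As a small illustration, here is what reaching for a battle-tested, off-the-shelf model looks like in practice, using scikit-learn’s `LogisticRegression`. The data is a toy, invented, linearly separable set, and the example assumes scikit-learn is installed:

```python
# Reuse a well-tested implementation instead of hand-rolling one.
from sklearn.linear_model import LogisticRegression

X = [[0.1], [0.3], [0.7], [0.9]]  # one feature per example (toy data)
y = [0, 0, 1, 1]                  # binary labels

clf = LogisticRegression()  # documented, benchmarked, sensible defaults
clf.fit(X, y)

preds = clf.predict([[0.05], [0.95]])
print(preds.tolist())  # [0, 1]
```

A hand-written equivalent would need its own gradient updates, regularization, and edge-case handling, all of which the library has already tested for you.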
Ideally, steering clear of these three common mistakes will make your machine learning training and implementation a lot easier and faster.
What other mistakes have you seen in the field and how have you fixed them? Share them with us on Twitter @NeuralMagic.