Getting hot and heavy with your data preparation

Annie Condon
2 min read · May 14, 2021

They’re not kidding when they say that data preparation is 89% of a data scientist’s job. Preparing a set of data for analysis and modeling can be one of the most time-consuming parts of my job, especially if the data comes from multiple sources (a database, a CSV, a weirdly formatted file made by some man named Scott, etc.). I will admit that even though I’ve been a data scientist for more than five years, I still have to extract data from PDF files when the situation is dire 😱

Sometimes, preparing a data set feels like falling into a deep pit of sludge and then trying to swim back out 💩

Here are a few tips to help you work through some data preparation woes (and this advice comes from someone who just spent 4 days on data preparation that I said would take 2 days 😅):

  1. Schedule more time than you think you’ll need for data preparation when planning a data science project. Stakeholders who don’t work hands-on with data may not understand this part, but make sure you advocate for your time.
  2. Let go of perfectionism! THERE IS NO PERFECTLY PREPARED DATA SET. Get through the first draft of your data set. You can always go back and make changes later. (For example, if just a handful of my IDs in a large data set aren’t matching up after a merge, I note it to look at later, but I move forward anyway.)
  3. Leave comments in your code as you go. Pretend you are explaining your work to your mom. Even if they’re messy comments, they will help you re-run your process later.
  4. Write scripts that can be replicated for other data sets. This will help you save time and standardize your data preparation steps.
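
To make tips 2 through 4 a little more concrete, here is a minimal sketch of a reusable merge helper that notes unmatched IDs for later instead of stopping on them. It is plain Python so it runs anywhere; the function name `left_join_with_report`, the logger name, and the sample `customers` and `orders` data are all made up for illustration, not taken from a real project.

```python
import logging

logger = logging.getLogger("data_prep")  # hypothetical logger name


def left_join_with_report(left_rows, right_rows, key):
    """Left-join two lists of dicts on `key` and report ids with no match.

    Unmatched rows are kept, not dropped, so the first draft of the data set
    can move forward. Their ids are returned so you can look at them later.
    """
    # Index the right-hand rows by key for quick lookups.
    # Simplification: if a key repeats on the right, the last row wins.
    right_by_key = {row[key]: row for row in right_rows}

    merged, unmatched_ids = [], []
    for row in left_rows:
        match = right_by_key.get(row[key])
        if match is None:
            # Note it to look at later, but keep going.
            unmatched_ids.append(row[key])
            merged.append(dict(row))
        else:
            # Combine the columns from both sides into one row.
            merged.append({**row, **match})

    if unmatched_ids:
        logger.warning(
            "%d of %d ids had no match: %s",
            len(unmatched_ids), len(left_rows), unmatched_ids,
        )
    return merged, unmatched_ids


# Hypothetical example data, just to show the call.
customers = [
    {"id": 1, "name": "Ana"},
    {"id": 2, "name": "Ben"},
    {"id": 3, "name": "Cy"},
]
orders = [{"id": 1, "total": 20.0}, {"id": 3, "total": 5.5}]

merged, unmatched = left_join_with_report(customers, orders, key="id")
print(unmatched)  # [2]
```

Because the helper only needs two lists of rows and a key name, the same function can be pointed at the next data set without rewriting it. If you work in pandas, `merge` with `indicator=True` gives you a similar matched/unmatched flag.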

Annie Condon

data scientist 👩‍💻| writer 📓| down with what you’re going through ❣️