3 Tips for Data Analysis

TL;DR:

The increasing importance of data analysis and engineering is evident across every industry, discipline, and project. The need to leverage data for insights that help design, deploy, and measure impact is now part of almost any initiative, and so is the temptation to dive right in and immerse oneself in the data and its analysis.

But smart data practitioners know better. There is tremendous value in a thoughtful, pragmatic approach. Here are three tips that you may want to consider before diving into the cold waters of data in your next project.

————————————————————————————

Every data scientist, engineer, or analyst will be familiar with this scenario: 

A project idea is born. The goal is to pull deep insights from a database and gain clarity on a topic of interest. Nearly everyone’s first inclination is to dive right in: jumping to conclusions about what can be pulled from the dataset and what relationships exist between features, and building mock visualizations of the insights that will soon be extracted, all without a pause to think. The gap between these presumptuous starts and the delivery of ready-to-analyze data is where exceptional data engineers find their niche. 

Data engineers play an integral role in the data lifecycle. Their main function is to help those around them understand and evaluate data that is being considered for further analytical use: where it comes from, what it contains, and where it may be able to provide valuable insights. Answering these questions builds a critical understanding of the data at hand and gives the direction needed to derive value and insight from it. 

In my experience, there are three critical things to consider when approaching a dataset for the first time, before any of the actual analysis work begins: 


Tip 1: Consider the nature of your sources.

Two important facets of your data to dig into are (1) the producers or sources of the data, i.e. what is generating the dataset itself, and (2) the ways in which that data makes its way into the systems used to perform further analyses. 

Understanding what is generating your data source(s) is crucial to gaining deep knowledge of the data itself. For example, data manually entered into a system by a person may look vastly different from data written by a bot or generated from a stream of website traffic. Insight into the producers of data can shed light on important factors such as likely sources of error and potential biases running through the dataset.

Conversely, it is essential to be cognizant of the methods by which the data is ingested into the system(s) used for processing. Data automatically streamed in via an API or loaded onto a network drive can be more predictably available and ready for processing, whereas systems that depend on manual uploads may need to operate on a more ad-hoc basis. The means by which data is introduced into the system can therefore impact processing cadence as well as necessitate checks (on schemas, file formats, etc.) to ensure data is properly loaded before proceeding with any further analysis.
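As a concrete illustration, here is a minimal Python/pandas sketch of that kind of pre-load check, assuming CSV files arrive in a hypothetical landing directory and that the expected schema is known in advance. The directory path and column names are placeholders, not a real pipeline.

    from pathlib import Path
    import pandas as pd

    # Assumed schema for incoming files; adjust to the dataset at hand.
    EXPECTED_COLUMNS = {"order_id", "customer_id", "order_date", "amount"}

    def validate_and_load(path):
        """Load a CSV only if its file format and schema match expectations."""
        if path.suffix.lower() != ".csv":
            print(f"Skipping {path.name}: unexpected file format")
            return None
        df = pd.read_csv(path)
        missing = EXPECTED_COLUMNS - set(df.columns)
        if missing:
            print(f"Skipping {path.name}: missing columns {sorted(missing)}")
            return None
        return df

    # Validate everything in the landing area before any analysis begins.
    frames = [f for p in Path("landing/").glob("*")
              if (f := validate_and_load(p)) is not None]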


Tip 2: Beware of the unknown unknowns.

It is important to be wary of making assumptions about the values contained in a dataset; oftentimes data will contain several values that hold the same meaning, or include information from past processes that is no longer relevant. Some common inconsistencies or ‘unknown unknowns’ to look out for in datasets are: 

  • Placeholder values: A certain column may be documented as holding binary (0 or 1) values; however, the producer of the data may have stipulated that the column can also contain a placeholder value of 2 to indicate that the row is incomplete. Some features may also contain multiple values with the same meaning, such as ‘True’ and ‘Yes.’ Make sure these values are aligned before proceeding with any analysis.

  • Unique keys that are not unique: Just because a column is said to contain unique values does not mean it does. Oftentimes data producers will have exceptions to rules that allow duplicate ‘unique’ key values. For example, a shoe company may assign a unique identifier to each model of shoe, yet the dataset may contain two rows with the same ‘unique’ identifier representing the same model in different colors. Non-unique keys can cause accidental many-to-many (Cartesian-style) matches in pipelines that involve joins, multiplying the number of rows in the output.

  • Legacy data: It is common to take over datasets transitioned from old sources or databases, and these frequently bring along legacy values, inactive or outdated records, or features that are no longer maintained. Ensure these values are removed or properly translated before lumping them in with new information. A sketch of checks for all three pitfalls follows below.
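Here is a minimal Python/pandas sketch of such checks, assuming a DataFrame with hypothetical columns: a ‘flag’ column documented as binary, a ‘model_id’ column documented as unique, and a ‘status’ column that may still carry legacy markers. The column names and status values are illustrative only.

    import pandas as pd

    def audit_dataset(df):
        # Placeholder values: surface anything outside the documented 0/1 domain.
        unexpected = set(df["flag"].unique()) - {0, 1}
        if unexpected:
            print(f"'flag' contains undocumented values: {unexpected}")

        # Align synonymous values (e.g. 'Yes'/'True') before any analysis.
        df["flag"] = df["flag"].replace({"Yes": 1, "True": 1, "No": 0, "False": 0})

        # Unique keys that are not unique: count duplicates before any joins.
        duplicate_keys = df["model_id"].duplicated().sum()
        if duplicate_keys:
            print(f"'model_id' has {duplicate_keys} duplicates; joins will fan out")

        # Legacy data: drop records carried over with outdated status markers.
        return df[~df["status"].isin(["LEGACY", "INACTIVE"])]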

Tip 3: Let the data tell you what answers it can provide, not vice versa.

Be hesitant to assume what insights your dataset will provide before gaining a real understanding of the data. Taking the time to do this pre-work before diving into outcome-focused analysis pays off when the time comes to pull results from the data.

Beginning analysis with preconceived notions of how features relate or what certain outputs will look like can lead to forced results and skewed interpretations. Verify that the features in your dataset contain all the information needed to reach the desired outcomes. Adding features or introducing additional datasets to your system are two ways to ensure the data is robust enough for analysis before continuing on.

  • Adding features: For example, if no unique key exists in your dataset but two columns can be combined to create one, consider adding that composite key as a new feature before setting out to analyze the data.

  • Supplementing incomplete data: Consider the scenario where the source system from which your data is pulled does not contain a column that is crucial to downstream analysis. Performing due diligence to find a reliable source for this column and adding that source as an additional feeder is preferable to dealing with missing data further along in the process. A sketch of both options follows below.
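A minimal Python/pandas sketch of both options, under hypothetical assumptions: an ‘orders’ DataFrame lacks a single unique key but ‘store_id’ and ‘receipt_no’ together identify a row, and a vetted supplementary source ‘regions’ supplies a ‘region’ column missing from the main data.

    import pandas as pd

    def prepare_for_analysis(orders, regions):
        # Adding features: combine two columns into a composite unique key.
        orders["order_key"] = (orders["store_id"].astype(str) + "-"
                               + orders["receipt_no"].astype(str))
        assert orders["order_key"].is_unique, "composite key is still not unique"

        # Supplementing incomplete data: bring the missing column in from a
        # second, reliable source rather than handling gaps downstream.
        return orders.merge(regions[["store_id", "region"]],
                            on="store_id", how="left")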

Final Thought:

Be patient with datasets. It takes time to understand their nuances and what answers they may be able to give you. While the growth of plug-and-play models and data science tools is excellent for bringing trust and awareness to these capabilities, it has also lowered the barrier to entry when it comes to data. Just because a dataset can be used to obtain an output does not mean it should be. The goal is to obtain the insights and answers you set out to find, and to be presented with that information in an understandable and empowering manner. The first step towards achieving that is to dive into your data.
