Why are you still using Excel? - Hypergiant

Why are you still using Excel?




Written by Dr. Drew Lipman, Lead Data Scientist at Hypergiant

In a meeting with a potential client the question about how they are currently implementing the data analytics for the nation spanning company was met with: Microsoft Excel. Upon further inquiry it was revealed that this was not just for small, one-shot, or experimental analytics but for company spanning problems. Now, what is wrong with this image? They are using the wrong tool. Do not get me wrong, for what it is designed to do Excel is the best tool, it offers unmatched flexibility of application mixed with ease of use and visualization. However, what it is designed to do is not this. Excel is designed for prototyping business applications, or domestic usage i.e. small applications. You want to know what is going on with something: try it in Excel! But if you want to know something, at scale, regularly, spread over the company? Excel is not going to cut it. Once you start manipulating large datasets, at the point where looking at the data no longer offers insight, why are you using a spreadsheet? This is the realm where specialized applications come into play. Excel is not intended to be a database, it is not designed to do database manipulations, it is not capable of handling huge datasets at speed. Similarly for specialized problems: it is not designed to handle them, it does not have the tools to handle them.

So, what is Excel used for in business? Searching for Excel’s uses in the office gives some insight into the uses that the the cult-of-Excel highlights. Let us talk about several of these in particular: Storing Data, Research, Scheduling, Finance and Accounting. We will discuss each of these, in turn, highlighting how it begins to fail once you are attempting to implement things at scale.

Storing Data. I admit that this one is a pet peeve of mine. Yes, CSV files are often the file type used to exchange large amounts of data. Yes, these are natural things to come out of Excel. Yes, Excel is a good way to visually manipulate data in CSV files. That does not mean that they have to come out of Excel, or should be stored in Excel prior to being saved as a CSV. OK, now that we have that out of the way. While these are all true, it not a question of what software you can store the data in, it is a question of what is the right software. When a new problem or data source is presented, Excel is indeed a perfect starting point: you don’t know what you need, the format, all the data sources, how they are related and being able to visibly see the data, arrange it, and manipulate it to determine what you need, or where you need it can be priceless. This is exactly what Excel excels at. But then what? Once the dataset gets to be of a large size Excel, with its general framework, begins to struggle with doing the tasks assigned to it. Consider a dataset that includes all parent/child relationships since the founding of a small city. This is handled easily enough in Excel, each person gets two cells which indicate the row their parent is on. But now, we want to know who is a cousin to whom. Well, we determine who has a common Grandparent. Finding a person’s Grandparent is easy: determining all the grandchildren however is harder. You find yourself searching through the data set several times, or building an auxiliary data set. This is not so bad when the dataset is small, once it gets larger it is painfully slow. To counter this: a query for cousins in a graph database takes milliseconds. The right database for the right dataset and queries. Now consider the Oil and Gas industry. In this industry it is not unusual for all the chemicals to be stored in Excel, with the chemical mixtures a new mixture is made out of listed as the “parent”. In this way if you want to find a chemical combination that has similar properties one way but not another you might be looking for a “cousin” style relationship. With the number of chemicals that are used in Oil and Gas Exploration you can imagine how daunting extracting meaningful relationships can be.

Research. If there is one bane to my existence, as someone who programs, it is receiving large spreadsheets which are the amalgamation of several generations of employees, none of whom bothered with documentation. Consider a simple scenario: Alan is hired to perform some task, he builds a spreadsheet in order to do so. The company has grown, and so Barbra gets hired to help Alan, and so she adds her variation to the spreadsheet. Now, the company hires Charles, who takes the spreadsheet written by Alan, modified by Barbra and applies it to a similar, but different problem adding another layer of complexity on top of the already poorly documented spreadsheet. At this point the company decides to open a new branch and hires Desdemona at the new branch. Along with the company comes the legacy code which includes the Alan-Barbara-Charles spreadsheet. Now what does poor Desdemona do? You guessed it! She tries to unravel the labyrinthine documentation, but not knowing Alan’s original intent, does not properly understand his logic and hence kludges along adding another, disjoint layer of complexity. And so it goes. Until we find ourselves twenty years later in 2017 with this hydra of a spreadsheet. So, what happens? Suppose a new edition of Excel is announced. The noble spreadsheet which has served its purpose long beyond what Alan intended includes a function call that got removed. It no longer works, and worse: no one understands what is going on with it well enough to fix the problem. Think I am exaggerating? In many established, major, international, corporations there is legacy code, persisting twenty, or even thirty years: this is not an exaggeration. This code is often used to supply information fundamental to simulations, and hence the on-going research of the corporation. And they do not understand it. Sure, they could find out if needed, but that is the problem. They don’t until it is critical. They do not update, they layer on top. With spreadsheets and their nonlinear nature this is particularly painful to unravel enough to update.

Scheduling. When I was a graduate student we had a single person who did all the scheduling, and she did it all in Excel. So once again, it can be done, but should it? General scheduling falls into a class of problems known as integer programs which is an amazingly difficult class of problems to solve. This means that the best, or sometimes only, approach is probabilistic or heuristic. While these are not hard to implement, experiment with, adjust or automate (what else is Machine Learning after all?) none of these are easy to do in Excel, and if you do manage? Then we find ourselves back up in the Alan-Barbara-Charles-Desdemona tangle where Alan and Barbara tweaked values in the spreadsheet until they worked, but when the situation changes, no one knows where those values are or how to manipulate them in a meaningful way, basically, you will always find yourself dropping back into VBA, for many functions, in which case why not just program a solution in the first place?

Finance and Accounting. Here is where Excel truly shines! I am firmly of the opinion that Finance and Accounting is what the original designers of spreadsheets were thinking of when they first designed them. However, why are there so many Financing and Accounting software packages? Some of them are clearly there as frontends to web applications, or similar, but others are clearly specialized for Finance and Accounting. That means, that once again, while Excel can do it, it might lose to specialized software in this area. So yes you can do it, in fact I do my personal budgeting in Excel, but once you are at scale? Where you have hundreds of employees? Hundreds of expense accounts, projects, costs, and estimations? Well. See ABCD above and you get the idea. That being said, most of these specialized software packages lack flexibility, or the ability to manipulate the data, and are generally prohibitively expensive. That means that, in my personal opinion, you should be augmenting Excel with other software, as needed: build an ecosystem of customized software to balance out Excel’s weaknesses.

So, what is the answer? Do not be afraid to use custom software or abstract databases if that is what your problem needs. If you are intent on using Excel be open to augmenting it with a software ecosystem that buffers the shortcomings displayed. Get the right software for the right problem. If you need to handle data with abstract relations get a graph database, if you need to do a computationally hard problem get a Machine Learning program. You need both? Get a Machine Intelligence.