Click to learn more about co-author Rosaria Silipo.
Using Excel to Transform/Analyze Your Data?
I know you are still using Excel sheets
to transform and/or analyze your data! I know, because most of us still use it
to some extent. There is nothing wrong with using Excel. Excel spreadsheets are
a great tool to collect and transform small amounts of data. However, when the
game becomes harder and requires larger amounts of data, Excel starts showing
You do not believe me? Then let’s start with the list of most common issues when working with an Excel spreadsheet to transform data. For this post I used answers provided by fellow data scientists in this thread on LinkedIn. Thank you to everyone for contributing!
1. No Error Control
One main issue that came out of many conversations with fellow data scientists: Excel spreadsheets have no error control and are therefore error prone.
According to Meta Brown and Karen Hardie:
“It’s easy to inadvertently change a cell or make mistakes – I’ve seen people suddenly realize that a macro was wrong by one cell after using the process for a long time and then have to go back and figure out when that happened.”
There is no debugging tool and no
testing frame to inspect whether all cells keep working as expected, for
example after a change.
John Peck also commented that:
“Excel is great for simple, ad hoc calculations, but its lack of structure and difficulty in automating and documenting the contents make its use error prone. Analyses built in Excel tend to grow and sprawl making them difficult to validate and to use on repetitive tasks.”
This last hint on the difficulty of
using Excel spreadsheets for repetitive tasks takes us to issue #2.
2. Little Reusability
This one comes from the pool of my own
personal mistakes when using Excel spreadsheets for professional Data Management.
It had to do with the data input. Usually, data are stored in one or more
source columns in an Excel spreadsheet, while the other columns contain the
macros and formulas for the processing. Well, often, when reusing the
spreadsheet for the current month’s analysis, the new data were copied and
pasted manually into the dedicated source column(s). However, since the data
rows for the current month were usually more than the data rows for the
previous month, the pure copy/paste of the data would cover regions of the
sheet where macros had not yet been defined, producing wrong unverified sums
and macros results.
The lack of a verified, reliable,
repetitive way to collect data from multiple sources makes reusability limited
to very simple processes.
And if you’re thinking of using Excel as a data source: Roger Fried warns against it!
3. Problematic Scalability
In professional data wrangling
projects, we usually deal with very large amounts of data. Therefore,
scalability is often a concern when moving forward with the project. Excel
spreadsheets show their shortcomings when large amounts of data are involved.
David Langer lists “speed of iteration of analyses” as one of the main problems of using an Excel spreadsheet for professional data transformations:
“My experience has been that current Excel row limitations (I’m ignoring PowerPivot here) aren’t a concern in the vast majority of cases. What kicks me out of Excel most of the time is speed of iteration. For example, in linear regression modeling.”
For Giovanni Marano:
“Performance degradation and crashes, when running operations on big datasets” are a big limitation for serious professional usage of Excel spreadsheets, while Anna Chaney confirms that “Excel doesn’t have enough memory to load larger datasets.”
David Montfort points to the limit in number of processable rows:
“Excel has a row limit which can be an issue with very large datasets. Also, other programs offer better statistical and data visualization tools.”
So, either lack of memory, limit in
number of rows, general slow speed in execution, and performance degradation
represent a serious issue in scalability when implementing professional data
wrangling and data management projects.
4. Low Coverage of Data Operations
Again, Excel spreadsheets do well for
small datasets and for a reduced pool of data operations. However, when the
projects become bigger and require more sophisticated data operations, some are
not available in Excel.
Alessio Nicolai and his colleague Giovanni Marano focus on “ad-hoc” analyses (which don’t require a scalable process). They identified the following limitations in data operations available to an Excel spreadsheet:
- Operations on a filtered dataset
are limited (filtered-out data are only “hidden”)
- No availability of intermediate
steps in data preparation (e.g. when filtering)
- Formulas limitations (e.g. no
MAXIFS/MINIFS without using computationally expensive array formulas)
- Distinct count in pivot tables is
- The equivalent of Joiner (Vlookup)
is clunky and does not allow the Full Outer join
- Multi-key joiners / full outer
joiners not possible without work-arounds
- Analysis tools (like regressions,
correlations) are way too basic
- Number of rows in the spreadsheet
Amit Kulkarni adds the difficulty in referencing filtered sets for say vlookup functions and Sayed Bagher Nashemi Natanzi (Milad) would like to have more options for sorting and filtering.
5. Lack of Automation
Deeply connected with the lack of reusability is the lack of automation, as pointed out by Tyler Garrett below.
Copy and Paste operations are common when using Excel spreadsheets, to introduce new data, new cells, and new functions. Those are all operations that cannot be automated, because they require the start of the tool GUI and a certain degree of expertise. Every time, in order to calculate new values, you need to reopen Excel, perform such manual operations, and recalculate:
“It is great for prototyping, documenting, entry level input to get a ETL, analytics, or Data Science process started, but truly the value starts to disappear when the computer is offline. The “availability” being dependent on computers being ON, the “validity” being relevant only if users are experts (but even we make mistakes), and lack rules keeping it from being acid compliant.”
6. Not Open
We have referred to a Copy & Paste
action often so far. Of course, this is not the only way to get data into
Excel. You can connect to databases and some other external tools. However,
there is a plethora of data sources, data types, and data formats that are
usually needed within the scope of a data wrangling project. The openness of a
tool allows you to connect, import, and process a number of different data
sources and types, and to integrate scripts and workflows from other popular
Transparency is another sign of the
openness of the tool. The possibility to understand the formulas and operations
quickly in the blink of an eye is an important feature to pass your work to
someone else or to interpret your colleague’s work.
Alberto Marocchino has indicated this as another fault in the usage of Excel spreadsheets in data analysis. In particular he pointed out that:
- You do not know if a cell contains
a formula or a value (data and analysis are merged together)
- Formulas are hidden in cells
- There is no direct pipeline for
- It pushes data correction back to
can be a wonderful tool, it depends on the use. It is general purpose and since
most of the computer users stick with windows it is a native way to visually
interact with CSV. But probably ‘general tool’ is not necessarily a synonym for
quality when it comes to hardcore data analysis.”
This difficulty in documenting and
communicating what happens in the Excel spreadsheet takes us directly to the
7. Difficult Collaboration
Nowadays no data scientist or data engineer works alone anymore. We are all part of bigger or smaller labs and we all need to communicate around the applications we build. Team debugging, feature discussions, best practices, documentation are all necessary tasks in the daily work. Excel is really not made for collaboration in big teams.
It is resident on your local machine,
preferably hosting a Windows OS. Already exporting the spreadsheet to a Mac
might require some extra effort.
David Springer indicates the “major issue with Excel when processing data as mostly the default, non-portable, proprietary data format.”
Documentation is a big part of collaboration. Michael Reithel observes that:
“Manual modifications to a spreadsheet are often undocumented and consequently lost over time making it hard to reproduce results.”
Those are just a few issues that make
collaboration around Excel hard to implement.
8. Time Consuming
The lack of scalability, the manual operations, the limitations in the amount of data make the whole process around an Excel spreadsheet quite time consuming, as reported by Hrvoje Gabelica and Tyler Garrett.
Both are encouraging to investigate
other solutions that allow for automation, scheduling, openness, and better
9. Not User-friendly
All in all an Excel spreadsheet is not
user-friendly. It seems easy to use at the beginning when moving the first
steps in the world of data processing. However, when more complex operations
are required, when collaboration would come in handy it turns out it is not
that user-friendly after all.
Giovanni Marano lists two main reasons for that:
- Excel’s Macros for repeated
processes are not user-friendly and hard to code/debug in VBA
- When multiple formulas/operations
are set up in a spreadsheet, you don’t have an easy overview of the
dependencies between each other, and – unless you use complex VBA coding –
you need to run the whole execution at a single time
Evert Homan says that pivoting data in Excel is cumbersome. I would add that the lack of overview and the difficulty to introduce documentation make data processing in Excel quite user- hostile, even for simple tasks.
We can conclude with Davide Imperati’s statement: “It is the perfect device to generate corrupt data”, since we do not always understand the processing functions.
10. Productionizing Is Hard
Finally, after implementation, we need
to move our application into production. Without scheduling, automatic import
of new data, from many different data sources, automatic reset of macros before
re-execution, moving into production can be quite a hard task.
This leaves Excel to be an excellent
tool for small datasets and maybe prototyping, but unsuitable for professional
data management projects.
Try Something New
These listed here are just the most common ten issues data engineers have to deal with when working with Excel spreadsheets to store, clean, and transform their data. If you are still hooked on Excel and fighting to get the data in the right format, try to investigate a few alternative solutions for data analysis. Not all Data Science tools require programming or scripting skills. Some of them are based on visual programming, where drag&drop of visual icons and their connection into a pipeline takes the place of scripting.
Using an open source and open software for data analysis, with more than thousands data operations is a great option. It can take your data from most sources and most formats to whatever shape you need them in and export your results in most available formats on most available platforms (open). It relies on a Graphical User Interface (GUI) where by drag&drop you can easily assemble a pipeline of operations (called “workflow”), which can be reused at any time. Thanks to the GUI, it is easy to combine documentation and functionality within the same project. It also allows for easy productionization, collaboration, sharing, scheduling, and automation.
Credit: Source link