Spreadsheets, SQL, or R for Data Analysis

[Image: Working in BigQuery]

tl;dr: If you're doing some data analysis, which is the best tool? As always, it depends...


Publish Date: 2022-08-10




Best tool for data analysis?




While completing Google's data analysis course, I was exposed to three different tools for data analysis: spreadsheets (Microsoft Excel and Google Sheets), SQL (using Google's BigQuery), and R (using RStudio).

So, is one better than the others? Are there certain situations where one is more useful or efficient? The answer is yes - at least to the latter question.


Spreadsheets


Spreadsheets are probably the tool most people are familiar with: the endless grid of cells you can type data into and then run calculations and functions on. Spreadsheets are incredibly useful, and personally, for data cleaning I find them far and away the easiest to use. This is because spreadsheets are the most visual of the three - it's easy to get a view of the columns and spot potential issues with the data.

Spreadsheets also let you perform calculations easily and locate the resulting data easily, since results must be placed into a cell. These days spreadsheets also have very nice linting (error support) and syntax help, which make it easy to see what data needs to go into a formula or function and what any problem may be.

The issue with spreadsheets is that they start to become unusable when the amount of data stretches into the thousands of entries. They can get incredibly laggy or just flat out stop working because they can't handle the volume. This makes them less practical for medium-to-large enterprise use.

[Image: Doing data analysis and visualizations in a spreadsheet]



So, when would I use a spreadsheet for data analysis? If the number of entries is in the hundreds, then I would probably just stick with a spreadsheet. There is little reason to use either SQL or R for that amount. The main benefit of those two, which of course we'll talk about more in a bit, is how quickly and easily they can process huge amounts of data. Spreadsheets also offer basic all-in-one tooling, so you can clean the data, do your analysis, and create visualizations all in one program. Spreadsheets also have the benefit of being more commonly used, so if you're working in a team (especially one that includes non-analysts), your teammates are more likely to have access to and familiarity with a spreadsheet.


SQL

So, what about using SQL? Well, as mentioned in the spreadsheet section, I think the decision of whether to use SQL mainly comes down to the amount of data in the set. If the data set is quite large (again, thousands of entries), then SQL becomes a better option.

For me, I personally did not like using SQL that much for data analysis - particularly for data cleaning. This is largely because of the syntax: complex queries get long and jumbled fast, and while the linter in BigQuery wasn't terrible, I didn't find it hugely useful. On a more personal note, I was also quite frustrated with the data import and table setup in BigQuery. The process is easy enough, but several of the data sets had a column whose date and time format did not match the requirements for that data type in BigQuery. As a result, BigQuery was not able to import the data at all, because when it tried to auto-detect the schema it would always return an error. This was doubly annoying because each import attempt would take five or so minutes before returning the error (admittedly, this could be because my internet speed is not great).

The end result was that I had to open each data set, identify all the columns and their data types, and then manually set the schema. Which is doable, of course, but still frustrating. It also led to some minor issues down the line because, based on some searching, I set the data type for that date and time column as a string (FYI, the issue was that the time was in 12-hour format with AM/PM). This prevented BigQuery from running datetime functions on it and required an extra step to convert the data to 24-hour time. It wasn't the end of the world, but again, it was an extra step that also made the query longer and more complex. If I have to open the data in a spreadsheet just to check the columns and data types, then I'd rather use a spreadsheet for everything (though due to the size of the data set, that wasn't possible here). Granted, I also did not try importing that data with R or another programming language to see whether R can simply overcome the issue or whether it would be a problem regardless.
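Just as a rough sketch of what that extra conversion step looks like in BigQuery's Standard SQL - the table and column names below are hypothetical, and the format string assumes the imported strings look like '08/10/2022 05:31:07 PM':

```sql
-- Sketch only: table and column names are made up, and the format string
-- assumes the imported strings look like '08/10/2022 05:31:07 PM'.
SELECT
  trip_id,
  -- %I is the 12-hour clock hour and %p is AM/PM, so this turns the STRING
  -- column into a proper DATETIME (which BigQuery stores in 24-hour time).
  PARSE_DATETIME('%m/%d/%Y %I:%M:%S %p', start_time) AS start_time_dt
FROM
  `my-project.my_dataset.trips_raw`;
```

It's a one-line fix, but every downstream query has to include (or build on) a conversion like this, which is exactly the kind of extra step I'd rather avoid.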

[Image: SQL queries can get very complex very quickly, which makes it challenging]



I find SQL to be much more useful and effective for running relatively simple, short queries after all the data has been cleaned and organized. Writing short queries to return data from large data sets was really nice in SQL. The version of SQL used in BigQuery is fairly intuitive for simple calculations and tasks, and once you get used to some of the basic syntax it's quite a pleasant experience.
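For example, a typical short query on an already-cleaned table might look something like this (again, the table and column names are invented for illustration):

```sql
-- Hypothetical cleaned table: count trips and average ride length per rider type.
SELECT
  member_casual,
  COUNT(*) AS total_trips,
  ROUND(AVG(ride_length_minutes), 1) AS avg_ride_minutes
FROM
  `my-project.my_dataset.trips_clean`
GROUP BY
  member_casual
ORDER BY
  total_trips DESC;
```

A few lines like this run comfortably over data sets that would make a spreadsheet choke, which is where SQL earns its place in the workflow.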


R (or some other programming language)

Is there a time to use a programming language like R? Yes, of course there is. Again, if the data set is large, then R can be an extremely useful tool.

I found R to be the most enjoyable tool to use for data cleaning and analysis. This is because you can write out your data-cleaning script, just hit run, and it's done. Yay! 🥳 Granted, this is also true for SQL, but I found writing longer scripts in R to be quite a bit easier and cleaner than in SQL. I also quite liked that you can run the code line by line rather than all at once - that made it easier to debug and modify as needed. The downside is that if you go back and re-run some section, you have to remember to re-run the rest of it to get what you want.

The other slightly confusing thing about R is that you can run scripts in two different places - in the local runtime environment and also in the 'global' environment. Again, this can have its benefits, since you can deal with small issues locally without affecting the data set, but sometimes you start running scripts locally and then realize you need to run all of them globally, and you have to go back and re-write or copy/paste everything into the global script.

[Image: R's syntax is quite different, but once you get the grasp of it, it's super useful]



The big thing about R (or any programming language) is, of course, that it is more complex than a spreadsheet or SQL. Both spreadsheets and SQL (BigQuery) are largely pre-set tools. It seems (emphasis on seems) that for the most part they are designed to take your data and run it through their existing tooling to get results. R can be that, but it also offers a wider array of paths to take. You have a little more freedom and ability to parse your data and arrange it into smaller sets or chunks to work with. Another way to think about it: spreadsheets and SQL give you a few paths to get your data from A to B, while R offers you many more. *Disclaimer: I could be wrong on this take given I'm still relatively new to both SQL and R, but that's the initial feeling I get from working on and reading about these tasks.*

R, when used in RStudio, also has the benefit of being a pretty true all-in-one tool. You can do data cleaning, analysis, and visualizations, as well as write your documentation in Markdown (or R Markdown). That's pretty great. Granted, it may not be the best tool for every step (particularly visualizations), but it's still nice that you don't have to keep moving the data from program to program.


Ideal workflow?

For me, like I said earlier, if the data set is small I would just do everything in a spreadsheet. I do find Markdown to be a very useful way to write documentation - as long as you're familiar with the basics. When you're just getting started, Markdown can feel annoyingly fussy, since you have to manually write in all the special characters to indicate a header, italicize or bold something, or create a list. But as you get used to it, that manual control over everything can lead to some really nice formatting.

If the data set were big, then I would ideally use both R and SQL. I would use R to do the data cleaning and organizing, the primary analysis, and the visualization. I would, of course, also use Markdown for the documentation. SQL would come in more for any follow-up checks and analysis. Like I said in the SQL section, I find SQL's biggest strength to be how easily and quickly it returns data once the set is already clean and organized.

I haven't spoken about Tableau - a program primarily for data visualization. While I've played with Tableau a little, I don't feel like I'm at a point where I can offer any significant thoughts on it. I can certainly see its power as a data visualization tool, but I feel like one would only need it in certain situations.