Exploring Hadoop & Hive with AWS Athena
Background
I have been exploring the different analytic tools in AWS so I can better understand the cloud capabilities that are out there. I do plan to get to Microsoft Azure, Google Cloud, and IBM eventually, but it made sense to start with the one that has the greatest market share (and, when I started this exploration, the only one in FedRAMP).
Today, I want to share my experience with Athena.
Athena is a serverless offering that lets you query data without setting up any servers, data warehouses, or persistent databases. It pulls data straight from S3 and presents it through Hive table definitions, giving you the Hadoop & Hive experience without standing up a cluster. Queries run in parallel automatically with no setup required, and you pay per query ($5 per terabyte scanned). The performance is phenomenal and the cleanup is easy.
Data Prep
The first thing I did was upload over 15 years’ worth of daily stock ticker data for the S&P 500. This amounted to 500 files (one for each S&P 500 stock) with 15 years of data in each file. You can find the free data I used for this experiment here.
You could just use ANSI SQL to add the column names, clean up the variables, and create new variables straight from the S3 buckets, but I chose to do my data prep with R (GitHub repo here) and unify the 500 files into one. I then loaded that file into another S3 bucket and used it for my queries. Whether you do it all in SQL or do it in R is your little red wagon, as Stephen King would say.
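If you went the pure-SQL route, the ‘create new variables’ part might look something like the sketch below. It assumes a table (here called ticker_raw) has already been defined over the raw bucket, with columns like symbol, trade_date, open, high, low, close, and volume (my assumed names, not necessarily the real headers):

    -- Hypothetical prep query: derive new variables on the fly
    -- (ticker_raw and its columns are assumed names, not the actual schema)
    SELECT
      symbol,
      trade_date,
      open, high, low, close, volume,
      high - low            AS daily_range,   -- example derived variable
      (close - open) / open AS daily_return   -- example derived variable
    FROM ticker_raw;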
The final tidy dataset was about 300 MB and looked like this:
Setting up Athena
Once I had the data in my S3 bucket, I simply went to the menu, selected Athena (located in the ‘Analytics’ section), and began walking through the wizard.
step 1
The biggest unknown in this step is the bucket name. The example they show makes it seem more complicated than it is, but really you just need to go get your S3 bucket name.
step 2
In this step it is important to format the S3 bucket link as s3://bucketname/ (for my bucket, that is s3://sp500-20yr-ticker-data/).
step 3
In most cases you will probably have CSV data like I do, but whatever your format is, you will need to select it during this step.
step 4
This is where you create your column names and data types for each column. You can do this one at a time, as shown here … or …
You can do it in bulk, as shown here. If you go the bulk route, you name each column and its data type, then follow each pair with a comma, like the example below.
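For example, a bulk entry for daily ticker data might look like the single line below (these column names and types just illustrate the format; they are not necessarily the exact ones I used):

    symbol string, trade_date date, open double, high double, low double, close double, volume bigint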
step 5
The final step is to partition the data if you desire. I left this one alone, but if you run into performance issues you will want to learn more about partitions and explore this option.
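For what it’s worth, partitioning just adds a PARTITIONED BY clause to the table definition and requires the files to be laid out accordingly in S3. A hypothetical example, assuming the data were organized by year and the table were named sp500:

    -- hypothetical: files organized as s3://bucketname/year=2013/ and so on
    PARTITIONED BY (year string)
    -- after creating the table, register the partitions with
    MSCK REPAIR TABLE sp500;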
step 6
If you’ve been successful, you should see something like this. It will also give you that nice confirmation that the ‘query [was] successful.’
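Under the hood, the wizard is just generating and running a Hive-style CREATE EXTERNAL TABLE statement, which is why you get that confirmation. A rough sketch of what the DDL looks like for a setup like mine is below (the table name sp500 and the column names are my assumptions, and the exact serde options the wizard emits may differ):

    -- Sketch of the generated DDL (names and serde details are assumptions)
    CREATE EXTERNAL TABLE IF NOT EXISTS ticker2.sp500 (
      symbol     string,
      trade_date date,
      open       double,
      high       double,
      low        double,
      close      double,
      volume     bigint
    )
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
    LOCATION 's3://sp500-20yr-ticker-data/';
    -- if your files include a header row, you may need extra handling,
    -- e.g. TBLPROPERTIES ('skip.header.line.count'='1')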
Query Results
Now that you have completed the setup, the real fun begins. BTW, it took me much longer to put this blog together than to actually go through the table setup. In fact, I’d say it took me about 5 minutes or less to set things up. The data prep (as usual) took many hours.
query editor
You should notice that the new database is on the left (ticker2). This is what we will be querying.
query 1
Count the number of rows in the dataset.
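The query itself is about as plain as ANSI SQL gets (sp500 being my assumed table name from the DDL sketch above):

    SELECT COUNT(*) FROM ticker2.sp500;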
Notice how Athena counted 1.8 million rows in about 2 seconds. That’s pretty fast.
query 2
Grab the first 10 observations to preview the table.
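Something along these lines, against the same assumed table:

    SELECT * FROM ticker2.sp500 LIMIT 10;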
This one ran in 0.43 seconds.
query 3
Now I’m going to look for the companies whose stock price was greater than 100 dollars at its high point of the day.
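A sketch of what that filter might look like, assuming a high column for the daily high and a symbol column for the ticker (the queries that follow simply raise the threshold):

    SELECT DISTINCT symbol
    FROM ticker2.sp500
    WHERE high > 100
    ORDER BY symbol;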
There were quite a few. I expanded the table so you could see more of the results.
query 4
Filter for companies with high > 300
Amazon made it onto the list (although you can’t see them here). I wonder if they will still be there at the end when I ramp up the price threshold.
query 5
Filter for companies with high > 500
The field is slimming down, with only 10 companies meeting this criterion.
query 6
Filter for companies with high > 700
It looks like Amazon dropped off but Google is hanging tough. [Note: this data runs from 1998 to 2013 … a lot has changed since 2013. For example, Jeff Bezos’ net worth in 2013 was only $25 billion and now it is $127 billion.]
query 7
Filter for companies with high > 1000
I looked up JDSU expecting to find an oil company or something like that, but it turns out they are ‘a global leader in both network & service enablement and optical security and performance products and solutions.’ They could work for the government with a description like that. 8) Talk about boring … unless it’s in your stock portfolio.
AIG is a well-known multinational financial and life insurance broker that operates in more than 80 countries and employs over 50,000 people.
query 8
So now let’s see who had more days above the 1,000-dollar threshold. And the winner is … AIG.
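A sketch of the kind of aggregation involved, under the same table and column assumptions as before:

    SELECT symbol, COUNT(*) AS days_over_1000
    FROM ticker2.sp500
    WHERE high > 1000
    GROUP BY symbol
    ORDER BY days_over_1000 DESC;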
According to these query results … AIG kicked JDSU’s tail, with 32 times more days over 1,000 bucks than JDSU. This is kinda like when the young bucks lap me multiple times on the track during my fitness test. 8)
Amazon QuickSight
I’ll end the blog with a screenshot of the financial data I explored with Athena, using QuickSight, one of the analytics products Amazon put out in 2016. It is basically a business intelligence tool that integrates easily with their other products, and it can pull data from the S3 bucket we set up earlier. This won’t make Tableau or Qlik quake in their boots any time soon, but it could end up being another competitor in an already tough market. We’ll have to stay tuned to see.
Conclusion
Hopefully this was useful for seeing how easy it is to set up a Hadoop & Hive implementation to jumpstart an analytic job using ANSI SQL. This is a very powerful tool in the toolbox, and the pay-per-query model should make it very affordable.