As I’ve been dealing with streaming data, one of the architectural decisions I’ve had to make is how to run periodic batch jobs on the data as it comes in. In the case of web traffic, it is logged into a database. My batch jobs take the data from the MariaDB (MySQL-compatible) database, convert it to Parquet format, and store it in AWS S3. Once it’s in S3, I want to update Redshift Spectrum to be aware of the new data, then run a query on Redshift Spectrum that I can feed into a Redis database, which an application uses to give near-real-time results. Whew! That is a mouthful. Perhaps a diagram would be helpful:
The blue box represents the first job that has to run. Since I did this work in Python, I thought at first I could use AWS Glue to run the job. After all, it’s a simple query-and-store operation. But as I debugged Glue, I found it was actually easier to just put this Python script into a Kubernetes CronJob. This gives me the same functionality, and since I’m already paying for EKS, it gets me more utilization out of the cluster. A lot of the other infrastructure already runs in EKS, so there’s no reason not to use it. I’m familiar with both, and this was a quick win.
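For the curious, the script itself doesn’t need to be anything fancy. Here’s a minimal sketch of its shape; the connection string, table, window, and bucket are made-up placeholders rather than my real setup:

```python
# Sketch of the "blue box" job: pull recent rows from MariaDB, convert
# them to Parquet, and upload to S3. All names here are placeholders.
from datetime import datetime, timezone

import boto3
import pandas as pd
import sqlalchemy

# MariaDB speaks the MySQL protocol, so a standard MySQL driver works.
engine = sqlalchemy.create_engine(
    "mysql+pymysql://etl_user:secret@mariadb-host/weblogs"
)

# Grab the latest batch of traffic rows (the 15-minute window is a placeholder).
df = pd.read_sql(
    "SELECT * FROM web_traffic WHERE logged_at >= NOW() - INTERVAL 15 MINUTE",
    engine,
)

# pandas writes Parquet via pyarrow.
local_path = "/tmp/web_traffic.parquet"
df.to_parquet(local_path, engine="pyarrow")

# Upload into a dated prefix so the crawler can pick up new partitions.
key = f"web_traffic/dt={datetime.now(timezone.utc):%Y-%m-%d}/web_traffic.parquet"
boto3.client("s3").upload_file(local_path, "my-data-lake", key)
```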
The Glue job is the orange box. This job crawls the S3 directories I set up, infers the schema, and creates the table definition in the Glue Data Catalog. It’s configured from the AWS Glue console with mostly default parameters. I’ll need to figure out how to automate this part soon, but for now it does the job. The part I keep having issues with is that some incoming data may not be formatted correctly, which crashes my queries. To get around this I keep having to change the job in the blue box. Kubernetes makes this pretty easy, but it’s still not my favorite.
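When I do get around to automating it, boto3 can trigger the crawler from the same pipeline. A rough sketch, assuming a crawler (here called web-traffic-crawler, a made-up name) has already been created in the console:

```python
import time

import boto3

glue = boto3.client("glue")

# Kick off the crawler that scans the S3 prefixes and updates the
# table definitions in the Glue Data Catalog.
glue.start_crawler(Name="web-traffic-crawler")

# Optionally poll until the crawler is idle again before running
# any downstream queries against Redshift Spectrum.
while glue.get_crawler(Name="web-traffic-crawler")["Crawler"]["State"] != "READY":
    time.sleep(30)
```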
The green box represents the Kubernetes CronJob that runs queries in Redshift that our data scientist Min gave me. The query results are then placed into Redis for processing. Again, I could have put this in Glue, but I don’t think Glue saves me much time over what Kubernetes already gives me with CronJobs.
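That job is also just a handful of lines. A sketch of the idea, with a placeholder aggregation standing in for Min’s real query, and made-up hosts, credentials, and keys:

```python
# Sketch of the "green box" job: run a Redshift query and cache the
# results in Redis. Connection details and the query are placeholders.
import psycopg2
import redis

# Redshift speaks the PostgreSQL wire protocol, so psycopg2 works.
conn = psycopg2.connect(
    host="my-cluster.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="etl_user",
    password="secret",
)

r = redis.Redis(host="redis-host", port=6379)

with conn.cursor() as cur:
    # Placeholder for the Spectrum query our data scientist provided.
    cur.execute("SELECT page, count(*) FROM spectrum.web_traffic GROUP BY page")
    for page, hits in cur.fetchall():
        # Store each aggregate where the application can read it.
        r.hset("page_hits", page, hits)

conn.close()
```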
Therein lies perhaps the main lesson I’ve learned about Glue: it saves time on S3 crawls, but it doesn’t save you much time on other basic ETL jobs, especially when you already have a Kubernetes cluster. I’m a big fan of keeping things serverless, and using Kubernetes this way still feels serverless to me.
One part we need to look into in the future is making sure all of our jobs process without errors, catching problems in the flows, and improving visibility for our end users. Pretty fun stuff!