{"id":3717,"date":"2019-04-24T14:02:52","date_gmt":"2019-04-24T20:02:52","guid":{"rendered":"http:\/\/benincosa.com\/?p=3717"},"modified":"2019-04-24T14:02:52","modified_gmt":"2019-04-24T20:02:52","slug":"kubernetes-cron-job-vs-aws-glue","status":"publish","type":"post","link":"https:\/\/benincosa.com\/?p=3717","title":{"rendered":"Kubernetes Cron Job vs. AWS Glue"},"content":{"rendered":"<p>As I&#8217;ve been dealing with streaming data, one of the architectural decisions I&#8217;ve had to make is how to run periodic batch jobs on the data as it comes in. In the case of web traffic, it is logged into a database. What my batch jobs do is take the data from the MariaDB (MySQL-compatible) database, convert it to Parquet format, and then store it in AWS S3. Once it&#8217;s in S3, I want to make Redshift Spectrum aware of the new data, then run a query on Redshift Spectrum that I can feed into a Redis database, which an application uses to give close to real-time results. Whew! That is a mouthful. Perhaps a diagram would be helpful:<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter wp-image-3718 size-large\" src=\"http:\/\/benincosa.com\/wp-content\/uploads\/2019\/04\/Screen-Shot-2019-04-24-at-12.49.54-PM-1024x501.png\" alt=\"Diagram of the streaming data pipeline from MariaDB through S3 and Redshift Spectrum into Redis\" width=\"640\" height=\"313\" srcset=\"https:\/\/benincosa.com\/wp-content\/uploads\/2019\/04\/Screen-Shot-2019-04-24-at-12.49.54-PM-1024x501.png 1024w, https:\/\/benincosa.com\/wp-content\/uploads\/2019\/04\/Screen-Shot-2019-04-24-at-12.49.54-PM-300x147.png 300w, https:\/\/benincosa.com\/wp-content\/uploads\/2019\/04\/Screen-Shot-2019-04-24-at-12.49.54-PM-768x376.png 768w\" sizes=\"(max-width: 640px) 100vw, 640px\" \/><\/p>\n<p>The blue box represents the first job that has to run. Since I did this work in Python, I thought at first I could use AWS Glue to run the job. After all, it&#8217;s a simple query-and-store operation. But as I&#8217;ve been 
debugging Glue, I found it was actually easier to just put this Python script into a <a href=\"https:\/\/kubernetes.io\/docs\/concepts\/workloads\/controllers\/cron-jobs\/\">Kubernetes Cron Job<\/a>. This gives me the same functionality, and since I&#8217;m already paying for EKS, it gets me more utilization out of the cluster. Since a lot of the other infrastructure runs in EKS, there&#8217;s no reason not to use it. I&#8217;m familiar with both, and this was a quick win.<\/p>\n<p>The Glue job is the orange box. This job crawls the S3 directories I set up and creates the table definitions from the data it finds. It is simply configured from the AWS Glue console with mostly default parameters. I&#8217;ll need to figure out how to automate this part soon, but for now it seems to do the job. The part I keep having issues with is that some incoming data may not be formatted correctly, which crashes my queries. To get around this, I keep having to change the job in the blue box. Kubernetes makes this pretty easy, but it&#8217;s still not my favorite.<\/p>\n<p>The green box represents the Kubernetes cron job that runs queries in Redshift that our data scientist Min gave me. The query results are then placed into Redis for processing. Again, I could have put this in Glue, but I don&#8217;t think Glue saves me much time over what Kubernetes already gives me with cron jobs.<\/p>\n<p>Therein lies perhaps the main lesson I&#8217;ve learned about Glue: it saves time on S3 crawls, but it doesn&#8217;t save you much time on other basic ETL jobs, especially when you already have a Kubernetes cluster. I&#8217;m a big fan of keeping things serverless, and using Kubernetes in this way still feels serverless to me.<\/p>\n<p>One thing we need to look into in the future is making sure all of our jobs are processing without errors, finding problems in the flows, and working on visibility for our end users. Pretty fun 
stuff!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>As I&#8217;ve been dealing with streaming data, one of the architectural decisions I&#8217;ve had to make is how to run periodic batch jobs on the data as it comes in. In the case of web traffic, it is logged into a database. What my batch jobs do is take the data from the MariaDB MySQL&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[463,797],"tags":[899,893,900,897,894,798,896,898,895],"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/benincosa.com\/index.php?rest_route=\/wp\/v2\/posts\/3717"}],"collection":[{"href":"https:\/\/benincosa.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/benincosa.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/benincosa.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/benincosa.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=3717"}],"version-history":[{"count":1,"href":"https:\/\/benincosa.com\/index.php?rest_route=\/wp\/v2\/posts\/3717\/revisions"}],"predecessor-version":[{"id":3719,"href":"https:\/\/benincosa.com\/index.php?rest_route=\/wp\/v2\/posts\/3717\/revisions\/3719"}],"wp:attachment":[{"href":"https:\/\/benincosa.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=3717"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/benincosa.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=3717"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/benincosa.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=3717"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}