If you are on a Big Data project, you may have experienced how slow Hive is when JOINing a couple of tables of a few TBs (well, even GBs, being honest). The first option always appears to be using PARQUET as your default storage format and then, when a query is too heavy, using Impala to process it.
Wait, is it that easy?
Well, it depends on many factors. First, an updated version of all these tools will help a lot. If they're not updated, or your company runs its own vendor patches (Cloudera, Hortonworks…) for cyber-security purposes, then start praying…
The first thing you need to take into account is that Impala doesn't cast automatically, as Hive does. Therefore you'll need to explicitly call CAST(stuff AS new_type) when mixing integers with doubles or strings, or when using them as input in an expression that was written for another type. Also, if you're a big fan of Hive UDFs, congrats: in Impala they don't work!
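For instance, here is a minimal sketch of the kind of explicit cast Impala expects (the table and column names are made up for illustration):

-- Hive would implicitly promote the INT column to DOUBLE here;
-- Impala will not, so the cast has to be spelled out.
SELECT CAST(t.visits AS DOUBLE) / t.days AS visits_per_day
FROM my_table t;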
You also need to bear in mind the storage and compression formats in your distributed file system.
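As a sketch of that combination in Hive (the table names and the choice of Snappy are just assumptions for illustration, check which codecs your cluster supports):

-- Rewrite an existing table as Parquet with Snappy compression
CREATE TABLE my_table_parquet
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression'='SNAPPY')
AS SELECT * FROM my_table;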
So, if Impala works fine for you, go for it! Otherwise you'll have to try smarter stuff. Give Spark a try if you have it. If you're not familiar with Python/Scala/Java or R, you can just run SQL embedded as shown here. Although sometimes this is tricky… sometimes all the fancy new Big Data software is not properly configured or, even better, only configured to deal with Small Data, presenting scalability issues. In these cases, one realizes that Big Data is a real back pain…
If this is not enough, then I have a few lines for you. However, before you shake your cluster, I need to tell you that I'm not a cluster admin, and you should probably ask yours before running any of this… This will be the way I'd fix your back pain, assuming I do…
See below a few MapReduce parameters that may improve the performance of Hive and also may prevent it from crashing.
-- If you're experiencing issues with compression or shuffling, try the settings below...
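The exact lines didn't survive here, but these are the usual suspects for compression and shuffle trouble, sketched with assumed values (the Snappy codec is a common default; check what your cluster actually ships):

-- Compress intermediate map output to reduce shuffle traffic
SET mapreduce.map.output.compress=true;
SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- Compress Hive's intermediate stages between chained MapReduce jobs
SET hive.exec.compress.intermediate=true;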
-- If you have your own queue inside the cluster, give this a try.
SET mapreduce.job.priority=VERY_HIGH;
-- Increase the memory allocated in your queries for each used node
-- (CHANGE IT TO A SIZE THAT SUITS YOUR CLUSTER)
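The concrete lines are missing above; as a sketch, the per-container memory knobs look like this (8 GB is just a placeholder value, not a recommendation):

SET mapreduce.map.memory.mb=8192;
SET mapreduce.reduce.memory.mb=8192;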
I recommend you have a look at the settings above and use only what you need. Look at your logs and tweak only what's crashing: give more memory to mappers or reducers if mappers or reducers crash, respectively. And if you get any Java errors, then use the settings below to reduce overhead.
-- Increase Java Heap Space in reducers to avoid shuffling errors
-- Tweaking default Java install of your cluster to give more memory to your job.
-- Giving more memory to mappers
-- (CHANGE THEM TO SIZES THAT SUIT YOUR CLUSTER)
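The mapper lines themselves are not included above; a hedged sketch, with placeholder sizes, would be:

-- Mapper container memory, with the JVM heap set somewhat below it
-- so the container isn't killed for exceeding its limit
SET mapreduce.map.memory.mb=10240;
SET mapreduce.map.java.opts=-Xmx8g;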
-- Giving more memory to reducers and dealing with Java Heap Space issues
SET mapreduce.reduce.java.opts=-Xmx16g -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSIncrementalPacing -XX:+CMSClassUnloadingEnabled -XX:+CMSParallelRemarkEnabled -XX:-UseGCOverheadLimit;
SET mapreduce.reduce.java.opts.max.heap=-Xmx20480m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSIncrementalPacing -XX:+CMSClassUnloadingEnabled -XX:+CMSParallelRemarkEnabled -XX:-UseGCOverheadLimit;
Hopefully you have enjoyed this article and your cluster is not broken yet. Please ask any questions or leave any comments you may have!