What is a skewed table in Hive?

How do I handle skewed data in Hive?

I'm running a join in Hive, but when the reducers reach 99%, one reducer gets stuck.

I then found that the table has skewed data. For example, Table A has 1 million rows and Table B has only 10,000. In Table A, 80% of the rows share the same join-column value and the rest are different. So the Hive reducer gets stuck on that one value.
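A quick way to confirm this kind of skew is to count rows per join key. This is only a sketch: `table_a` and `join_key` are placeholder names, not the real table or column.

```sql
-- List the ten most frequent join-key values in the large table.
-- table_a and join_key are hypothetical names used for illustration.
SELECT join_key, COUNT(*) AS row_count
FROM table_a
GROUP BY join_key
ORDER BY row_count DESC
LIMIT 10;
```

If one value accounts for most of the rows, all of those rows go to a single reducer in a regular shuffle join, which matches the "stuck at 99%" symptom.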

Here is my question:

Please suggest a possible solution. How can I handle the join operation on this type of data?

  • stackoverflow.com/questions/32370033/hive-join-optimization/…
  • Thanks @KishoreKumarSuthar for the reply. It's something cool.

Starting with Hive 0.10.0, tables can be created as skewed or altered to be skewed (in which case partitions created after the ALTER statement will be skewed). In addition, skewed tables can use list bucketing by specifying the STORED AS DIRECTORIES option. For more information, see the DDL documentation: Create Table, Skewed Tables, and Alter Table Skewed or Stored as Directories.
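As a rough illustration of that DDL, assuming a large table with a single hot join value: the table names, column names, and the value 'hot_value' below are all made up for the example.

```sql
-- Create a table that declares join_key as skewed on one known hot value.
-- STORED AS DIRECTORIES additionally turns on list bucketing, so rows with
-- the skewed value are stored in their own subdirectory.
CREATE TABLE table_a_skewed (join_key STRING, payload STRING)
SKEWED BY (join_key) ON ('hot_value')
STORED AS DIRECTORIES;

-- Or mark an existing table as skewed. Only partitions created after
-- this statement are affected.
ALTER TABLE table_a SKEWED BY (join_key) ON ('hot_value');
```

This approach requires knowing the skewed values in advance, which may not suit data whose hot keys change from day to day.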

Use this link as a reference.

  • Thanks for the reply, but I am unable to follow this approach for daily processing.

I found a solution to the above problem.

Set the following parameters before running the Hive join.

A few of the parameters need to be adjusted to your data size and cluster size.
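As a sketch only, and not necessarily the exact list used here, Hive's built-in skew-join settings are the ones most directly aimed at this problem. The values shown are Hive's documented defaults and are meant as a starting point, not a recommendation.

```sql
-- Enable Hive's runtime skew-join handling.
SET hive.optimize.skewjoin=true;
-- A join key with more rows than this is treated as skewed (default 100000).
SET hive.skewjoin.key=100000;
-- Upper bound on map tasks for the follow-up map join over skewed keys (default 10000).
SET hive.skewjoin.mapjoin.map.tasks=10000;
-- Minimum split size in bytes for that follow-up job (default 33554432, i.e. 32 MB).
SET hive.skewjoin.mapjoin.min.split=33554432;
```

With these set, rows whose key exceeds the threshold are set aside during the join and processed afterwards in a separate map-join job, so a single reducer no longer has to handle the hot key alone.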

  • It would be helpful if this answer gave some indication of why this works. For example, four of the parameters appear to enable cost-based optimization: hortonworks.com/blog/5-ways-make-hive-queries-run-faster
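For reference, the cost-based optimization settings this comment most likely has in mind are the four below. This is an assumption based on Hive's standard CBO properties, not a confirmed copy of the answer's list, and `table_a` is a placeholder name.

```sql
-- Turn on the cost-based optimizer.
SET hive.cbo.enable=true;
-- Let Hive answer simple aggregate queries from stored statistics.
SET hive.compute.query.using.stats=true;
-- Let the optimizer read column-level and partition-level statistics.
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;

-- CBO only helps if statistics exist, so gather them first.
ANALYZE TABLE table_a COMPUTE STATISTICS FOR COLUMNS;
```

These settings help the planner choose a better join strategy in general; they do not by themselves split a skewed key across reducers.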

You can try a MapJoin like the one below:
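A minimal sketch of a map join, assuming the smaller table (10,000 rows) fits in memory; `table_a`, `table_b`, `join_key`, and the selected columns are placeholder names.

```sql
-- Option 1: let Hive convert the join to a map join automatically
-- when the small table is under the size threshold (in bytes).
SET hive.auto.convert.join=true;
SET hive.mapjoin.smalltable.filesize=25000000;

SELECT a.join_key, a.payload, b.description
FROM table_a a
JOIN table_b b ON a.join_key = b.join_key;

-- Option 2: request it explicitly with a hint. Recent Hive versions
-- ignore the hint unless hive.ignore.mapjoin.hint is set to false.
SET hive.ignore.mapjoin.hint=false;

SELECT /*+ MAPJOIN(b) */ a.join_key, a.payload, b.description
FROM table_a a
JOIN table_b b ON a.join_key = b.join_key;
```

Because a map join loads the small table into each mapper and performs the join on the map side, there is no reduce phase for the skewed key to pile up in.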