объединение разделенной на разделы, объединенной таблицы с использованием только разделенной таблицы (не разделенной таблицы) в улье

ProgramBox

объединение разделенной на разделы, объединенной таблицы с использованием только разделенной таблицы (не разделенной таблицы) в улье

Post author:admin
Запись опубликована:13 ноября, 2022
Post category:Вопросы по программированию

#hive #query-optimization #hiveql #bucket #hadoop-partitioning

#hive #оптимизация запросов #hiveql #корзина #hadoop-разделение

Вопрос:

у меня есть 2 таблицы:

q6_cms_list_key1 (с разделением по cm и se) с разделением по tr_dt … 99 000 000 000 строк q6_cm_first_visit (с разделением по cm и se) 25 000 000 000 строк

создание другой таблицы с использованием следующих условий

      insert into table q6_cm_first_visit1 PARTITION (trans) 
     select distinct 
            b.se10, b.dealer_id, b.terminal_id, b.se, 
            b.comp_start_n, b.comp_end_n, b.latest_date,
            b.cm,a.first_visit_date,b.trans 
       from q6_cm_first_visit a 
            inner join q6_cms_list_key1 b 
            on b.trans BETWEEN DATE_SUB('${hiveconf:run.date}', 180) AND '${hiveconf:run.date}' 
               and a.cm = b.cm and a.se = b.se;

запрос занимает много времени.
Как сократить время и следует ли объединять не разделенную / объединенную таблицу с разделенной / разделенной таблицей?

 formatted plan is below 

1   STAGE DEPENDENCIES:
2   Stage-1 is a root stage
3   Stage-2 depends on stages: Stage-1
4   Stage-3 depends on stages: Stage-2
5   Stage-0 depends on stages: Stage-3
6   Stage-4 depends on stages: Stage-0
7   STAGE PLANS:
8   Stage: Stage-1
9   Map Reduce
10  Map Operator Tree:
11  TableScan
12  alias: a
13  Statistics: Num rows: 25638813900 Data size: 10691743838084 Basic stats: COMPLETE Column stats: NONE
14  Filter Operator
15  predicate: (cm is not null and se is not null) (type: boolean)
16  Statistics: Num rows: 25638813900 Data size: 10691743838084 Basic stats: COMPLETE Column stats: NONE
17  Reduce Output Operator
18  key expressions: cm (type: string), eff_se (type: bigint)
19  sort order:   
20  Map-reduce partition columns: cm (type: string), se (type: bigint)
21  Statistics: Num rows: 25638813900 Data size: 10691743838084 Basic stats: COMPLETE Column stats: NONE
22  value expressions: first_visit_date (type: string)
23  TableScan
24  alias: b
25  Statistics: Num rows: 91896551441 Data size: 52063175338060 Basic stats: COMPLETE Column stats: NONE
26  Filter Operator
27  predicate: (cm is not null and se is not null and trans_dt BETWEEN null AND '${hiveconf:run.date}') (type: boolean)
28  Statistics: Num rows: 1 Data size: 566 Basic stats: COMPLETE Column stats: NONE
29  Reduce Output Operator
30  key expressions: cm (type: string), se (type: bigint)
31  sort order:   
32  Map-reduce partition columns: cm (type: string), se (type: bigint)
33  Statistics: Num rows: 1 Data size: 566 Basic stats: COMPLETE Column stats: NONE
34  value expressions: se10 (type: string), dealer_id (type: string), terminal_id (type: string), comp_start_n (type: string), comp_end_n (type: string), latest_date (type: date), trans_dt (type: string)
35  Reduce Operator Tree:
36  Join Operator
37  condition map:
38  Inner Join 0 to 1
39  keys:
40  0 cm (type: string), se (type: bigint)
41  1 cm (type: string), se (type: bigint)
42  outputColumnNames: _col5, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17
43  Statistics: Num rows: 28202695901 Data size: 11760918476803 Basic stats: COMPLETE Column stats: NONE
44  Select Operator
45  expressions: _col5 (type: string), _col9 (type: string), _col10 (type: string), _col11 (type: string), _col12 (type: bigint), _col13 (type: string), _col14 (type: string), _col15 (type: date), _col16 (type: string), _col17 (type: string), _col9 (type: string), _col10 (type: string), _col11 (type: string), _col12 (type: bigint), _col13 (type: string), _col14 (type: string), _col15 (type: date), _col16 (type: string), _col5 (type: string), _col17 (type: string)
46  outputColumnNames: _col5, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col5, _col17
47  Statistics: Num rows: 28202695901 Data size: 11760918476803 Basic stats: COMPLETE Column stats: NONE
48  Group By Operator
49  keys: _col9 (type: string), _col10 (type: string), _col11 (type: string), _col12 (type: bigint), _col13 (type: string), _col14 (type: string), _col15 (type: date), _col16 (type: string), _col5 (type: string), _col17 (type: string)
50  mode: hash
51  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9
52  Statistics: Num rows: 28202695901 Data size: 11760918476803 Basic stats: COMPLETE Column stats: NONE
53  File Output Operator
54  compressed: true
55  table:
56  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
57  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
58  serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
59  Stage: Stage-2
60  Map Reduce
61  Map Operator Tree:
62  TableScan
63  Reduce Output Operator
64  key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: bigint), _col4 (type: string), _col5 (type: string), _col6 (type: date), _col7 (type: string), _col8 (type: string), _col9 (type: string)
65  sort order:           
66  Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: bigint), _col4 (type: string), _col5 (type: string), _col6 (type: date), _col7 (type: string), _col8 (type: string), _col9 (type: string)
67  Statistics: Num rows: 28202695901 Data size: 11760918476803 Basic stats: COMPLETE Column stats: NONE
68  Reduce Operator Tree:
69  Group By Operator
70  keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string), KEY._col3 (type: bigint), KEY._col4 (type: string), KEY._col5 (type: string), KEY._col6 (type: date), KEY._col7 (type: string), KEY._col8 (type: string), KEY._col9 (type: string)
71  mode: mergepartial
72  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9
73  Statistics: Num rows: 14101347950 Data size: 5880459238192 Basic stats: COMPLETE Column stats: NONE
74  File Output Operator
75  compressed: true
76  table:
77  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
78  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
79  serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
80  Stage: Stage-3
81  Map Reduce
82  Map Operator Tree:
83  TableScan
84  Reduce Output Operator
85  sort order:
86  Map-reduce partition columns: _col9 (type: string)
87  Statistics: Num rows: 14101347950 Data size: 5880459238192 Basic stats: COMPLETE Column stats: NONE
88  value expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: bigint), _col4 (type: string), _col5 (type: string), _col6 (type: date), _col7 (type: string), _col8 (type: string), _col9 (type: string)
89  Reduce Operator Tree:
90  Select Operator
91  expressions: UDFToLong(VALUE._col0) (type: bigint), VALUE._col1 (type: string), VALUE._col2 (type: string), VALUE._col3 (type: bigint), VALUE._col4 (type: string), VALUE._col5 (type: string), VALUE._col6 (type: date), VALUE._col7 (type: string), VALUE._col8 (type: string), VALUE._col9 (type: string)
92  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9
93  Statistics: Num rows: 14101347950 Data size: 5880459238192 Basic stats: COMPLETE Column stats: NONE
94  File Output Operator
95  compressed: false
96  Statistics: Num rows: 14101347950 Data size: 5880459238192 Basic stats: COMPLETE Column stats: NONE
97  table:

1. пожалуйста, предоставьте ОБЪЯСНИТЕ вывод команды, а также, пожалуйста, предоставьте более подробную информацию о выполнении: какая вершина работает медленно, если она находится на Tez, или подробности о выполнении MR

2. Вы хотите сказать, что у вас 99 000 000 000 строк? Пожалуйста, не используйте индийские слова, такие как «крор»

Ответ №1:

Некоторые предложения:

Улучшите фильтрацию. Вычислите date_sub вне скрипта и передайте уже вычисленную дату, если это возможно. Функции в предикатах могут препятствовать обрезке разделов. EXPLAIN DEPENDENCY предоставляет дополнительную информацию о том, какие разделы считываются. Убедитесь, что удаление разделов работает.
Улучшите стратегию объединения. Что стоит попробовать, так это объединение корзины сортировки слиянием или объединение карты корзины сортировки слиянием. Если обе таблицы объединены и отсортированы по одним и тем же столбцам и имеют одинаковое количество сегментов, вы можете попробовать добавить эти настройки:

     set hive.auto.convert.sortmerge.join=true;
    set hive.optimize.bucketmapjoin = true;
    set hive.optimize.bucketmapjoin.sortedmerge = true;
    set hive.auto.convert.sortmerge.join.noconditionaltask=true;
    set hive.auto.convert.join.noconditionaltask.size = 10000000; --can be increased
    set hive.mapjoin.smalltable.filesize=10000000; --can be increased

Последние 2 параметра определяют, какой размер таблицы может поместиться в памяти. Не уверен, будет ли это работать для таких таблиц, как ваша, или нет, установка слишком большого размера может привести к исключению ООМ. Но даже без преобразования Map join объединение групп сортировки слиянием должно значительно повысить производительность. Проверьте план после добавления этих настроек, он должен быть Sort Merge Bucket Join Operator вместо Join Operator .
Подробнее обо всех настройках смотрите здесь: LanguageManual JoinOptimization

Векторизация и Tez. Вы работаете на MR и без векторизации. Вы выиграете от Tez и векторизации, попробуйте эти настройки:

     set hive.execution.engine=tez;
    SET hive.optimize.ppd=true;          --works for ORC
    SET hive.vectorized.execution.enabled=true;
    SET hive.vectorized.execution.reduce.enabled=true;

Окончательное улучшение редуктора. Если выходная таблица также разделена, проверьте, улучшит ли добавление distribute by <partition key> в конце производительность или нет. Если ключ раздела распределен равномерно без перекоса, добавление распределения по повысит производительность. Попробуйте добавить DISTRIBUTE BY trans в и запроса.
Ранняя агрегация перед объединением. Например, если таблица first_visit содержит более одной строки на ключ объединения и дублирует строки после объединения, удалите дублирование в подзапросе перед объединением, используя row_number или distinct и т. Д. Это может повысить производительность, но также может предотвратить объединение сортировки-объединения-корзины.

Лучше попробуйте все эти улучшения по отдельности, чтобы проверить, как каждое из них влияет на производительность, и, наконец, объединить те, которые повышают производительность.

1. благодаря @leftjoin ниже приведен полный запрос вставки в таблицу q6_cm_first_visit1 РАЗДЕЛ (trans) выберите distinct b.se10, b.dealer_id, b.terminal_id, b.se , b.comp_start_n, b.comp_end_n, b.latest_date, b.cm ,a.first_visit_date,b.trans из q6_cm_first_visit aвнутреннее соединение q6_cms_list_key1 b на b.trans МЕЖДУ DATE_SUB(‘$ {hiveconf:run.date}’, 180) И ‘$ {hiveconf:run.date}’ и a.cm = b.cm и a.se = b.se ; q6_cms_list_key1 (привязан к cm и se)разделено по trans … 99 000 000 000 строк q6_cm_first_visit (разделено по cm и se) 25 000 000 000 строк