#sql #performance #apache-spark #spark-structured-streaming
Question:
We have a Spark SQL job that we would like to optimize. We are trying to figure out which parts of our pipeline are slower and which are faster.
In the attached SQL query graph there are 3 WholeStageCodegen boxes, all with the same duration: 2.9 s, 2.9 s, 2.9 s. See the picture below:
But if we check the Stage graph, it shows 3 seconds for the whole stage. See the picture below:
So the durations in the WholeStageCodegen boxes do not add up; it seems that each of them reports the duration of the whole stage rather than its own share. Are we missing something here? Is there a way to figure out the duration of the individual boxes?
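In case it helps to reproduce: a minimal sketch (not our actual job; names and sizes are placeholders) that produces several WholeStageCodegen boxes in the SQL tab, since a sort-merge join is typically split into separate codegen subtrees:

```scala
// Minimal stand-in for our pipeline (placeholder data, not the real job).
// A sort-merge join usually yields multiple WholeStageCodegen boxes
// in the SQL tab: one per sorted input and one for the join itself.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("wscg-durations").getOrCreate()

val left  = spark.range(0L, 10000000L).toDF("id")
val right = spark.range(0L, 10000000L).toDF("id")

val result = left.join(right, "id").groupBy().count()
result.collect() // then open the SQL tab and compare the box durations
```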
Sometimes the durations differ slightly, but by no more than 0.1 s. Examples:
- 18.3s, 18.3s, 18.4s
- 968ms, 967ms, 1.0s
The stage duration always matches one of the WholeStageCodegen durations, or is at most 0.1-0.3 s larger.
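For what it's worth, the only workaround we are aware of is disabling whole-stage codegen entirely, which gives every physical operator its own node (and its own metrics) in the SQL graph, but at the cost of changing the very execution we are trying to measure:

```scala
// Workaround sketch: with whole-stage codegen disabled, operators are no
// longer fused into generated functions, so the SQL tab reports metrics
// per operator instead of one duration per WholeStageCodegen box.
// Note: this changes execution itself, so timings are only indicative
// of what the codegen'd run would do.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
```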
How can one figure out the duration of each WholeStageCodegen part, and is that actually measured? I suspect Spark would have to time each generated function as a separate unit. Is such a measurement actually performed, or are these numbers more of a placeholder for a feature that does not exist?