How is the duration calculated in the Spark Structured Streaming UI?

#sql #performance #apache-spark #spark-structured-streaming

Question:

We have a Spark SQL job that we would like to optimize, and we are trying to figure out which parts of our pipeline are the slow ones.
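
For context, here is a minimal sketch of the kind of job we are talking about (the inputs and names are made up for illustration, not our actual pipeline; a join plus an aggregation is enough to get several WholeStageCodegen boxes in the SQL tab):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object CodegenDurationRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("codegen-duration-repro")
      .master("local[*]")
      // Simplification for the sketch: disable AQE so the SQL tab shows
      // the static plan; our real job runs with default settings.
      .config("spark.sql.adaptive.enabled", "false")
      .getOrCreate()
    import spark.implicits._

    // Two synthetic inputs standing in for our real tables.
    val facts = spark.range(0L, 5000000L)
      .select($"id", ($"id" % 1000).as("key"))
    val dims = spark.range(0L, 1000L)
      .select($"id".as("key"), ($"id" * 2).as("weight"))

    // A join followed by an aggregation splits the physical plan into
    // several WholeStageCodegen pipelines.
    val result = facts.join(dims, "key")
      .groupBy($"key")
      .agg(sum($"weight").as("total"))

    result.collect()      // trigger execution so the SQL tab is populated
    Thread.sleep(600000)  // keep the UI at http://localhost:4040 alive
    spark.stop()
  }
}
```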

In the attached SQL query graph there are three WholeStageCodegen boxes, all with the same duration: 2.9 s, 2.9 s, 2.9 s. See the picture below:

[Screenshot: SQL query graph with three WholeStageCodegen boxes, each showing a duration of 2.9 s]

But if we check the Stage graph, it shows 3 seconds for the entire stage. See the picture below:

[Screenshot: Stage graph showing a total stage duration of 3 s]

So the durations in the WholeStageCodegen boxes do not add up; rather, each of them seems to report the duration of the whole stage. Are we missing something here? Is there a way to figure out the duration of the individual boxes?
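
In principle the per-node SQL metrics can be read programmatically, along these lines (a sketch against Spark's internal plan classes, so the exact accessors may differ between versions; `df` stands for the job's final DataFrame):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.WholeStageCodegenExec

// Print the "duration" metric (internally named pipelineTime) of every
// WholeStageCodegen node. This must run after an action has executed the
// query, otherwise the accumulators are still empty. With adaptive query
// execution enabled, executedPlan is an AdaptiveSparkPlanExec and this
// simple collect will not see the codegen nodes nested inside it.
def printCodegenDurations(df: DataFrame): Unit = {
  df.queryExecution.executedPlan.collect {
    case w: WholeStageCodegenExec =>
      println(s"codegen stage ${w.codegenStageId}: " +
        s"${w.metrics("pipelineTime").value} ms")
  }
}
```

But this reads the same accumulators that the UI renders, so it presumably prints the same stage-wide value for every box.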

Sometimes there is a small difference between the durations, but never more than about 0.1 s. Examples:

  • 18.3 s, 18.3 s, 18.4 s
  • 968 ms, 967 ms, 1.0 s

The stage duration always matches one of the WholeStageCodegen durations, or is at most 0.1-0.3 s larger.

How can one figure out the duration of each individual WholeStageCodegen part, and is that actually measured? I suspect Spark would have to time each generated function as a separate unit. Is that measurement actually performed, or are these numbers more of a placeholder for a feature that does not exist?
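
For reference, mapping each box to its operators is already possible with the formatted explain output (assuming Spark 3.0 or later; `result` here is any DataFrame of the job), but that output carries no timings:

```scala
// Formatted explain annotates each operator with "[codegen id : N]" in its
// details section; N corresponds to a WholeStageCodegen box in the SQL tab.
result.explain("formatted")
```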