Parsing HTML content with PowerShell

#powershell #powershell-4.0

Question:

I have the following output from a monitoring link, which I am trying to parse so that I can reshape it.

    <html>
    <head>
    <style type="text/css"></style>
    </head>
    <body>
    <div style="float:left;margin-right:50px">
    <div>DATA CENTERS WITH GLOBAL REPLICATION TIER ENABLED/SUSPENDED:
    <div><br><br>&nbsp;&nbsp;&nbsp;&nbsp; DataCenter: DC1 NY [ENABLED] <div><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Active Zone : BW Zone 1[1], &nbsp;&nbsp;VIP = 192.168.254.10</div>
    <div><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=https://192.168.254.10/checkGlobalReplicationTier>https://192.168.254.10/checkGlobalReplicationTier</a> &nbsp;&nbsp;[ACTIVE]</div>
    <div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=https://192.168.254.10/checkReplication>https://192.168.254.10/checkReplication</a></div>
    <div><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=https://192.168.254.11/checkGlobalReplicationTier>https://192.168.254.11/checkGlobalReplicationTier</a> &nbsp;&nbsp;[STANDBY]</div>
    <div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=https://192.168.254.11/checkReplication>https://192.168.254.11/checkReplication</a></div>
    <div><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Local Zones:</div>
    <div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; LC Zone 3[3], &nbsp;&nbsp;VIP = 192.168.254.13 <div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=https://192.168.254.13/checkReplication>https://192.168.254.13/checkReplication</a> &nbsp;&nbsp;[ACTIVE]</div>
    <div><br><br>&nbsp;&nbsp;&nbsp;&nbsp; DataCenter: DC2 NJ [ENABLED] &nbsp;[DEFAULT DC]</div>
    <div><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Active Portal Zone : BW Zone 2[2], &nbsp;&nbsp;VIP = 192.168.253.10</div>
    <div><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=https://192.168.253.10/checkGlobalReplicationTier>https://192.168.253.10/checkGlobalReplicationTier</a> &nbsp;&nbsp;[ACTIVE]</div>
    <div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=https://192.168.253.10/checkReplication>https://192.168.253.10/checkReplication</a></div>
    <div><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=https://192.168.253.11/checkGlobalReplicationTier>https://192.168.253.11/checkGlobalReplicationTier</a> &nbsp;&nbsp;[STANDBY]</div>
    <div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=https://192.168.253.11/checkReplication>https://192.168.253.11/checkReplication</a></div>
    <div><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Local Zones:</div>
    <div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; LC Zone 4[4], &nbsp;&nbsp;VIP = 192.168.253.13 <div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=https://192.168.253.13/checkReplication>https://192.168.253.13/checkReplication</a> &nbsp;&nbsp;[ACTIVE]</div>
    <div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=https://192.168.253.14/checkReplication>https://192.168.253.14/checkReplication</a> &nbsp;&nbsp;[STANDBY]</div>
    -->
    </div>
    </div>
    </body>
    </html>

I would like to parse this to get:

    Data Center                    Active Zone   VIP             Local Zone    VIP
    DC1 NY [Enabled]               BW Zone 1[1]  192.168.254.10  LC Zone 3[3]  192.168.254.13
    DC2 NJ [Enabled] [DEFAULT DC]  BW Zone 2[2]  192.168.253.10  LC Zone 4[4]  192.168.253.13

My code does not seem to parse it. Is a regular expression the best way to parse this page, or should I try some other approach?

    $zone = "https://192.168.0.90/checkConfiguration"
    $html = Invoke-WebRequest -Uri $zone -ErrorAction Stop
    $DC = ($html.ParsedHtml.getElementsByTagName('div') |
        Where-Object { $_.InnerHTML -like '<div><br><br>&nbsp;&nbsp;&nbsp;&nbsp; DataCenter: *' }) |
        Foreach-Object {$_.outerText -replace '(?<!:.*):', '='} |
        %{new-object psobject -prop (ConvertFrom-StringData $_)}

Comments:

1. What poor soul uses &nbsp; instead of CSS padding for spacing?

Answer #1:

For this, you could do the following:

    $div = $html.ParsedHtml.getElementsByTagName('div') | Where-Object { $_.InnerHTML -like '<div>*DataCenter:*' }
    $DC = if ($div -and $div.outerText -match '(?s)DataCenter\s*:\s*(\w+).*Active Zone\s*:\s*([^,]+),\s+VIP\s*=\s*([\d.]+)') {
        [PsCustomObject]@{
            'DataCenter'  = $matches[1]
            'Active Zone' = $matches[2]
            'VIP'         = $matches[3]
        }
    }

    $DC | Format-Table -AutoSize

Output:

    DataCenter Active Zone VIP
    ---------- ----------- ---
    DC1        BW Zone     192.168.0.95

or as a list:

    $DC | Format-List

Output:

    DataCenter  : DC1
    Active Zone : BW Zone
    VIP         : 192.168.0.95

Here is another approach for when the HTML contains multiple data centers:

    # use outerText to get the plain text for the surrounding <div>DATA CENTERS WITH GLOBAL REPLICATION TIER ENABLED/SUSPENDED ...</div>
    $content = ($html.ParsedHtml.getElementsByTagName('div') | Where-Object { $_.innerHtml -like '<div>DATA CENTERS*' }).outerText
    $DC = $content -split 'DataCenter\s*:\s*' |
        Where-Object { $_ -match '(?s)([\w ]+(?:[ \[\w\]]*)).*Active (?:Portal )?Zone\s*:\s*([^,]+),\s+VIP\s*=\s*([\d.]+)' } |
        ForEach-Object {
            [PsCustomObject]@{
                'DataCenter'  = $matches[1]
                'Active Zone' = $matches[2]
                'VIP'         = $matches[3]
            }
        }

    $DC | Format-Table -AutoSize

Output:

    DataCenter                    Active Zone  VIP
    ----------                    -----------  ---
    DC1 NY [ENABLED]              BW Zone 1[1] 192.168.254.10
    DC2 NJ [ENABLED] [DEFAULT DC] BW Zone 2[2] 192.168.253.10
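The pattern can also be tried without a live page or `ParsedHtml`. The sketch below is only an illustration: `$sample` is a hypothetical stand-in for one chunk of text after the split on `DataCenter:` (the real `outerText` may contain non-breaking spaces and different line breaks), and the `.Trim()` calls are an addition to tidy the captured values.

```powershell
# Hypothetical sample: one chunk of plain text as it might look after
# splitting on 'DataCenter:'. Real outerText may differ in whitespace.
$sample = @'
DC2 NJ [ENABLED] [DEFAULT DC]
        Active Portal Zone : BW Zone 2[2],   VIP = 192.168.253.10
'@

# Same pattern as in the answer above.
$pattern = '(?s)([\w ]+(?:[ \[\w\]]*)).*Active (?:Portal )?Zone\s*:\s*([^,]+),\s+VIP\s*=\s*([\d.]+)'

$result = if ($sample -match $pattern) {
    # $Matches[1..3] hold the three capture groups of the pattern.
    [PsCustomObject]@{
        'DataCenter'  = $Matches[1].Trim()
        'Active Zone' = $Matches[2].Trim()
        'VIP'         = $Matches[3]
    }
}

$result | Format-List
```

If the pattern does not match, `$result` stays empty, which makes it easy to spot a chunk whose layout differs from what the regex expects.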

Regex details:

    (?s)              Match the remainder of the regex with the options: dot matches newline (s)
    (                 Match the regular expression below and capture its match into backreference number 1
       [\w ]          Match a single character present in the list below
                      A word character (letters, digits, etc.)
                      The character “ ”
          +           Between one and unlimited times, as many times as possible, giving back as needed (greedy)
       (?:            Match the regular expression below
          [ \[\w\]]   Match a single character present in the list below
                      One of the characters “ [”
                      A word character (letters, digits, etc.)
                      A ] character
             *        Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
       )
    )
    .                 Match any single character
       *              Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
    Active            Match the characters “Active ” literally
    (?:               Match the regular expression below
       Portal         Match the characters “Portal ” literally
    )?                Between zero and one times, as many times as possible, giving back as needed (greedy)
    Zone              Match the characters “Zone” literally
    \s                Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
       *              Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
    :                 Match the character “:” literally
    \s                Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
       *              Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
    (                 Match the regular expression below and capture its match into backreference number 2
       [^,]           Match any character that is NOT a “,”
          +           Between one and unlimited times, as many times as possible, giving back as needed (greedy)
    )
    ,                 Match the character “,” literally
    \s                Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
       +              Between one and unlimited times, as many times as possible, giving back as needed (greedy)
    VIP               Match the characters “VIP” literally
    \s                Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
       *              Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
    =                 Match the character “=” literally
    \s                Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
       *              Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
    (                 Match the regular expression below and capture its match into backreference number 3
       [\d.]          Match a single character present in the list below
                      A single digit 0..9
                      The character “.”
          +           Between one and unlimited times, as many times as possible, giving back as needed (greedy)
    )
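The table requested in the question also has Local Zone and local VIP columns, which the code in this answer does not capture. One possible extension is sketched below. It is untested against the real page and rests on assumptions: `$content` is a hypothetical plain-text stand-in for the `outerText` of the outer div, the data center name is assumed to sit on its own line, only the first local zone per data center is captured, and the property name `Local VIP` is invented here because an object cannot have two properties called `VIP`.

```powershell
# Hypothetical plain text standing in for the outerText of the outer <div>.
# The real text may use non-breaking spaces and different line breaks.
$content = @'
DATA CENTERS WITH GLOBAL REPLICATION TIER ENABLED/SUSPENDED:

    DataCenter: DC1 NY [ENABLED]
        Active Zone : BW Zone 1[1],   VIP = 192.168.254.10
        Local Zones:
            LC Zone 3[3],   VIP = 192.168.254.13

    DataCenter: DC2 NJ [ENABLED]  [DEFAULT DC]
        Active Portal Zone : BW Zone 2[2],   VIP = 192.168.253.10
        Local Zones:
            LC Zone 4[4],   VIP = 192.168.253.13
'@

# Group 1: first line of the chunk (assumed to be the data center name).
# Groups 2-3: active zone and its VIP. Groups 4-5: first local zone and its VIP.
$pattern = '(?s)^([^\r\n]+).*?Active (?:Portal )?Zone\s*:\s*([^,]+),\s+VIP\s*=\s*([\d.]+)' +
           '.*?Local Zones\s*:\s*([^,]+),\s+VIP\s*=\s*([\d.]+)'

$DC = $content -split 'DataCenter\s*:\s*' |
    Where-Object { $_ -match $pattern } |
    ForEach-Object {
        [PsCustomObject]@{
            # Collapse runs of whitespace inside the name, then trim the ends.
            'DataCenter'  = ($Matches[1] -replace '\s+', ' ').Trim()
            'Active Zone' = $Matches[2].Trim()
            'VIP'         = $Matches[3]
            'Local Zone'  = $Matches[4].Trim()
            'Local VIP'   = $Matches[5]   # hypothetical property name
        }
    }

$DC | Format-Table -AutoSize
```

The header chunk before the first `DataCenter:` contains no `Active ... Zone` text, so `Where-Object` drops it and only the two data center chunks produce objects.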

Comments:

1. Thanks, Theo.. how do I get `DC1 [ENABLED]` captured in DataCenter, and if the code contains more than one DataCenter: tab.. how do I capture them all?

2. @Enigma Please see my edit. I have not had time to test it, though.

3. Theo, I have updated the HTML code. Is a regular expression the best way to parse this into variables?

4. @Enigma Yes. Because the HTML is so poorly constructed, there is hardly any option other than a regular expression for extracting the properties you need.

5. Thanks, Theo. This worked like a charm, and you helped me a great deal given my limited regex skills. Much appreciated.