Apache Nutch performance tuning for whole-web crawling

#java #performance #web-crawler #nutch

Question:

I am going to use Nutch to crawl about 300 web pages. The crawl works fine for roughly the first 6 minutes. After that it gets slower and slower until throughput drops to almost zero. I checked the log, and the number of spin-waiting threads seems to increase over time. Could you please help me solve this problem?

Here is my nutch-site.xml configuration file:

<property>
  <name>plugin.folders</name>
  <value>/home/nutch/workspace/trunk/src/plugin</value>
</property>
<property>
  <name>http.agent.name</name>
  <value>nutch-test</value>
</property>

<property>
  <name>parser.skip.truncated</name>
  <value>false</value>
  <description>Boolean value for whether we should skip parsing for truncated documents. By default this 
  property is activated due to extremely high levels of CPU which parsing can sometimes take.  
  </description>
</property>
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>1.0</value>
  <description>The number of seconds the fetcher will delay between 
   successive requests to the same server.</description>
</property>
<property>
  <name>http.max.delays</name>
  <value>2</value>
  <description>The number of times a thread will delay when trying to
  fetch a page.  Each time it finds that a host is busy, it will wait
  fetcher.server.delay.  After http.max.delays attempts, it will give
  up on the page for now.</description>
</property>
<property>
  <name>fetcher.server.min.delay</name>
  <value>0.5</value>
  <description>The minimum number of seconds the fetcher will delay between 
  successive requests to the same server. This value is applicable ONLY
  if fetcher.threads.per.host is greater than 1 (i.e. the host blocking
  is turned off).</description>
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <value>3</value>
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>100</value>
  <description>The number of FetcherThreads the fetcher should use.
  This also determines the maximum number of requests that are
  made at once (each FetcherThread handles one connection). The total
  number of threads running in distributed mode will be the number of
  fetcher threads * number of nodes as fetcher has one map task per node.
  </description>
</property>
<property>
  <name>generate.max.count</name>
  <value>10000</value>
  <description>The maximum number of urls in a single
  fetchlist.  -1 if unlimited. The urls are counted according
  to the value of the parameter generator.count.mode.
  </description>
</property>
<property>
  <name>fetcher.max.crawl.delay</name>
  <value>10</value>
  <description>
  If the Crawl-Delay in robots.txt is set to greater than this value (in
  seconds) then the fetcher will skip this page, generating an error report.
  If set to -1 the fetcher will never skip such pages and will wait the
  amount of time retrieved from robots.txt Crawl-Delay, however long that
  might be.
  </description>
</property>
<property>
  <name>generate.max.per.host</name>
  <value>3</value>
</property>

Best regards.

1. Is the target site itself suffering from performance problems?

2. One thing you could try is lowering the number of fetcher threads from 100 to around 10 and seeing whether the problem persists.

3. No, the target websites are fine. I have already reduced the number of threads: throughput is lower at the start, and by the end it is the same as before (it still degrades over time).

Answer #1:

I think the value of generate.max.count is quite high. If a site is slow and the fetchlist contains, say, 10000 URLs for that one site, the fetcher threads end up waiting on that site's queue, which could slow the whole crawl down.

You should try reducing this number.
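As a rough sketch, the reduced setting might look like the fragment below in nutch-site.xml. The value 100 is an arbitrary illustration rather than a tested recommendation, and the generate.count.mode property is an assumption: it is included on the assumption that the Nutch 1.x release in use exposes it under that name, to make explicit that the limit is counted per host.

```xml
<!-- Illustrative values only; tune them against your own crawl. -->
<property>
  <name>generate.max.count</name>
  <!-- Lowered from 10000 so that one slow host cannot fill the fetchlist. -->
  <value>100</value>
</property>
<property>
  <!-- Assumption: this property exists under this name in your Nutch version. -->
  <name>generate.count.mode</name>
  <value>host</value>
</property>
```

After changing the value, a new fetchlist has to be generated for the limit to take effect. If the number of spin-waiting threads in the log stops growing on the next fetch, the large per-host queue was probably the cause.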
