Nutch indexing with Elasticsearch fails

#elasticsearch #nutch

Question:

I'm trying to get Nutch working with Elasticsearch. As far as I can tell, the crawl part succeeded, but something goes wrong on the Elasticsearch side.

I run into a problem when running

nutch index crawl/crawldb/ -linkdb crawl/linkdb/20211126092727 -filter -normalize -deleteGone

and I have no idea what is causing it.

Below is the part of the indexer output where it fails. I should add that I'm using Docker.

    nutch_1 | log4j:WARN No appenders could be found for logger (org.apache.htrace.core.Tracer).
    nutch_1 | log4j:WARN Please initialize the log4j system properly.
    nutch_1 | log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
    nutch_1 | 2021-11-26 09:29:37,993 INFO o.a.n.s.SegmentChecker [main] Segment dir is complete: crawl/segments/20211126092727.
    nutch_1 | 2021-11-26 09:29:37,998 INFO o.a.n.i.IndexingJob [main] Indexer: starting at 2021-11-26 09:29:37
    nutch_1 | 2021-11-26 09:29:38,005 INFO o.a.n.i.IndexingJob [main] Indexer: deleting gone documents: true
    nutch_1 | 2021-11-26 09:29:38,005 INFO o.a.n.i.IndexingJob [main] Indexer: URL filtering: true
    nutch_1 | 2021-11-26 09:29:38,005 INFO o.a.n.i.IndexingJob [main] Indexer: URL normalizing: true
    nutch_1 | 2021-11-26 09:29:38,007 INFO o.a.n.i.IndexerMapReduce [main] IndexerMapReduce: crawldb: crawl/crawldb
    nutch_1 | 2021-11-26 09:29:38,009 INFO o.a.n.i.IndexerMapReduce [main] IndexerMapReduces: adding segment: crawl/segments/20211126092727
    nutch_1 | 2021-11-26 09:29:38,013 INFO o.a.n.i.IndexerMapReduce [main] IndexerMapReduce: linkdb: crawl/linkdb
    nutch_1 | 2021-11-26 09:29:38,983 INFO o.a.n.p.PluginRepository [LocalJobRunner Map Task Executor #0] Plugins: looking in: /root/nutch_source/runtime/local/plugins
    nutch_1 | 2021-11-26 09:29:39,184 INFO o.a.n.p.PluginRepository [LocalJobRunner Map Task Executor #0] Plugin Auto-activation mode: [true]
    nutch_1 | 2021-11-26 09:29:39,185 INFO o.a.n.p.PluginRepository [LocalJobRunner Map Task Executor #0] Registered Plugins:
    nutch_1 | 2021-11-26 09:29:39,187 INFO o.a.n.p.PluginRepository [LocalJobRunner Map Task Executor #0] Regex URL Filter (urlfilter-regex)
    nutch_1 | 2021-11-26 09:29:39,187 INFO o.a.n.p.PluginRepository [LocalJobRunner Map Task Executor #0] Html Parse Plug-in (parse-html)
    nutch_1 | 2021-11-26 09:29:39,187 INFO o.a.n.p.PluginRepository [LocalJobRunner Map Task Executor #0] HTTP Framework (lib-http)
    nutch_1 | 2021-11-26 09:29:39,187 INFO o.a.n.p.PluginRepository [LocalJobRunner Map Task Executor #0] the nutch core extension points (nutch-extensionpoints)
    nutch_1 | 2021-11-26 09:29:39,187 INFO o.a.n.p.PluginRepository [LocalJobRunner Map Task Executor #0] Basic Indexing Filter (index-basic)
    nutch_1 | 2021-11-26 09:29:39,187 INFO o.a.n.p.PluginRepository [LocalJobRunner Map Task Executor #0] Anchor Indexing Filter (index-anchor)
    nutch_1 | 2021-11-26 09:29:39,187 INFO o.a.n.p.PluginRepository [LocalJobRunner Map Task Executor #0] Tika Parser Plug-in (parse-tika)
    nutch_1 | 2021-11-26 09:29:39,187 INFO o.a.n.p.PluginRepository [LocalJobRunner Map Task Executor #0] Basic URL Normalizer (urlnormalizer-basic)
    nutch_1 | 2021-11-26 09:29:39,188 INFO o.a.n.p.PluginRepository [LocalJobRunner Map Task Executor #0] Regex URL Filter Framework (lib-regex-filter)
    nutch_1 | 2021-11-26 09:29:39,188 INFO o.a.n.p.PluginRepository [LocalJobRunner Map Task Executor #0] Regex URL Normalizer (urlnormalizer-regex)
    nutch_1 | 2021-11-26 09:29:39,188 INFO o.a.n.p.PluginRepository [LocalJobRunner Map Task Executor #0] CyberNeko HTML Parser (lib-nekohtml)
    nutch_1 | 2021-11-26 09:29:39,188 INFO o.a.n.p.PluginRepository [LocalJobRunner Map Task Executor #0] OPIC Scoring Plug-in (scoring-opic)
    nutch_1 | 2021-11-26 09:29:39,188 INFO o.a.n.p.PluginRepository [LocalJobRunner Map Task Executor #0] Pass-through URL Normalizer (urlnormalizer-pass)
    nutch_1 | 2021-11-26 09:29:39,188 INFO o.a.n.p.PluginRepository [LocalJobRunner Map Task Executor #0] Http Protocol Plug-in (protocol-http)
    nutch_1 | 2021-11-26 09:29:39,188 INFO o.a.n.p.PluginRepository [LocalJobRunner Map Task Executor #0] ElasticIndexWriter (indexer-elastic)
    nutch_1 | 2021-11-26 09:29:39,188 INFO o.a.n.p.PluginRepository [LocalJobRunner Map Task Executor #0] Registered Extension-Points:
    nutch_1 | 2021-11-26 09:29:39,189 INFO o.a.n.p.PluginRepository [LocalJobRunner Map Task Executor #0] Nutch Content Parser (org.apache.nutch.parse.Parser)
    nutch_1 | 2021-11-26 09:29:39,189 INFO o.a.n.p.PluginRepository [LocalJobRunner Map Task Executor #0] Nutch URL Filter (org.apache.nutch.net.URLFilter)
    nutch_1 | 2021-11-26 09:29:39,189 INFO o.a.n.p.PluginRepository [LocalJobRunner Map Task Executor #0] HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
    nutch_1 | 2021-11-26 09:29:39,189 INFO o.a.n.p.PluginRepository [LocalJobRunner Map Task Executor #0] Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
    nutch_1 | 2021-11-26 09:29:39,189 INFO o.a.n.p.PluginRepository [LocalJobRunner Map Task Executor #0] Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
    nutch_1 | 2021-11-26 09:29:39,189 INFO o.a.n.p.PluginRepository [LocalJobRunner Map Task Executor #0] Nutch Publisher (org.apache.nutch.publisher.NutchPublisher)
    nutch_1 | 2021-11-26 09:29:39,189 INFO o.a.n.p.PluginRepository [LocalJobRunner Map Task Executor #0] Nutch Exchange (org.apache.nutch.exchange.Exchange)
    nutch_1 | 2021-11-26 09:29:39,190 INFO o.a.n.p.PluginRepository [LocalJobRunner Map Task Executor #0] Nutch Protocol (org.apache.nutch.protocol.Protocol)
    nutch_1 | 2021-11-26 09:29:39,190 INFO o.a.n.p.PluginRepository [LocalJobRunner Map Task Executor #0] Nutch URL Ignore Exemption Filter (org.apache.nutch.net.URLExemptionFilter)
    nutch_1 | 2021-11-26 09:29:39,190 INFO o.a.n.p.PluginRepository [LocalJobRunner Map Task Executor #0] Nutch Index Writer (org.apache.nutch.indexer.IndexWriter)
    nutch_1 | 2021-11-26 09:29:39,190 INFO o.a.n.p.PluginRepository [LocalJobRunner Map Task Executor #0] Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
    nutch_1 | 2021-11-26 09:29:39,190 INFO o.a.n.p.PluginRepository [LocalJobRunner Map Task Executor #0] Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
    nutch_1 | 2021-11-26 09:29:39,215 INFO o.a.n.n.u.r.RegexURLNormalizer [LocalJobRunner Map Task Executor #0] can't find rules for scope 'indexer', using default
    nutch_1 | 2021-11-26 09:29:39,342 INFO o.a.n.n.u.r.RegexURLNormalizer [LocalJobRunner Map Task Executor #0] can't find rules for scope 'indexer', using default
    nutch_1 | 2021-11-26 09:29:39,435 INFO o.a.n.n.u.r.RegexURLNormalizer [LocalJobRunner Map Task Executor #0] can't find rules for scope 'indexer', using default
    nutch_1 | 2021-11-26 09:29:39,539 INFO o.a.n.n.u.r.RegexURLNormalizer [LocalJobRunner Map Task Executor #0] can't find rules for scope 'indexer', using default
    nutch_1 | 2021-11-26 09:29:39,629 INFO o.a.n.n.u.r.RegexURLNormalizer [LocalJobRunner Map Task Executor #0] can't find rules for scope 'indexer', using default
    nutch_1 | 2021-11-26 09:29:39,771 INFO o.a.n.i.IndexWriters [pool-5-thread-1] Index writer org.apache.nutch.indexwriter.elastic.ElasticIndexWriter identified.
    nutch_1 | 2021-11-26 09:29:39,800 WARN o.a.n.e.Exchanges [pool-5-thread-1] No exchange was configured. The documents will be routed to all index writers.
    nutch_1 | 2021-11-26 09:29:40,703 ERROR o.a.n.i.IndexingJob [main] Indexing job did not succeed, job status:FAILED, reason: NA
    nutch_1 | 2021-11-26 09:29:40,704 ERROR o.a.n.i.IndexingJob [main] Indexer: java.lang.RuntimeException: Indexing job did not succeed, job status:FAILED, reason: NA
    nutch_1 |     at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:152)
    nutch_1 |     at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:293)
    nutch_1 |     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
    nutch_1 |     at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:302)
    nutch_1 |
    indexer_nutch_1 exited with code 255

I had some problems with Tika during parsing, but parsing did complete, so I assume this is not the main problem.

    nutch_1 | 2021-11-26 09:27:16,721 WARN o.a.n.p.ParserFactory [LocalJobRunner Map Task Executor #0] ParserFactory: Plugin: org.apache.nutch.parse.feed.FeedParser mapped to contentType application/rss+xml via parse-plugins.xml, but not enabled via plugin.includes in nutch-default.xml
    nutch_1 | 2021-11-26 09:27:16,977 ERROR o.a.n.p.t.TikaParser [LocalJobRunner Map Task Executor #0] Problem loading custom Tika configuration from tika-config.xml
    nutch_1 | java.lang.NumberFormatException: For input string: ""
    nutch_1 |     at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) ~[?:?]
    nutch_1 |     at java.lang.Integer.parseInt(Integer.java:662) ~[?:?]
    nutch_1 |     at java.lang.Integer.parseInt(Integer.java:770) ~[?:?]
    nutch_1 |     at org.apache.tika.config.TikaConfig.updateXMLReaderUtils(TikaConfig.java:303) ~[tika-core-1.25.jar:1.25]
    nutch_1 |     at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:192) ~[tika-core-1.25.jar:1.25]
    nutch_1 |     at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:182) ~[tika-core-1.25.jar:1.25]
    nutch_1 |     at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:157) ~[tika-core-1.25.jar:1.25]
    nutch_1 |     at org.apache.nutch.parse.tika.TikaParser.setConf(TikaParser.java:276) [parse-tika.jar:?]
    nutch_1 |     at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:175) [apache-nutch-1.19-SNAPSHOT.jar:?]
    nutch_1 |     at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:136) [apache-nutch-1.19-SNAPSHOT.jar:?]
    nutch_1 |     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:75) [apache-nutch-1.19-SNAPSHOT.jar:?]
    nutch_1 |     at org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.map(ParseSegment.java:122) [apache-nutch-1.19-SNAPSHOT.jar:?]
    nutch_1 |     at org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.map(ParseSegment.java:77) [apache-nutch-1.19-SNAPSHOT.jar:?]
    nutch_1 |     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146) [hadoop-mapreduce-client-core-3.1.3.jar:?]
    nutch_1 |     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799) [hadoop-mapreduce-client-core-3.1.3.jar:?]
    nutch_1 |     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347) [hadoop-mapreduce-client-core-3.1.3.jar:?]
    nutch_1 |     at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:271) [hadoop-mapreduce-client-common-3.1.3.jar:?]
    nutch_1 |     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
    nutch_1 |     at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
    nutch_1 |     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    nutch_1 |     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    nutch_1 |     at java.lang.Thread.run(Thread.java:834) [?:?]
    nutch_1 | Nov 26, 2021 9:27:17 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
    nutch_1 | WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
    nutch_1 | See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
    nutch_1 | for optional dependencies.

I'm also including the contents of nutch-site.xml:

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <!-- Put site-specific property overrides in this file. -->

    <configuration>
      <property>
        <name>http.agent.name</name>
        <value>SICrawler</value>
        <description>HTTP 'User-Agent' request header. MUST NOT be empty -
        please set this to a single word uniquely related to your organization.

        NOTE: You should also check other related properties:

          http.robots.agents
          http.agent.description
          http.agent.url
          http.agent.email
          http.agent.version

        and set their values appropriately.
        </description>
      </property>
      <property>
        <name>plugin.includes</name>
        <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-elastic</value>
      </property>
      <property>
        <name>db.ignore.external.links</name>
        <value>false</value>
        <description>If true, outlinks leading from a page to external hosts or domain
        will be ignored. This is an effective way to limit the crawl to include
        only initially injected hosts or domains, without creating complex URLFilters.
        See 'db.ignore.external.links.mode'.
        </description>
      </property>
      <property>
        <name>elastic.host</name>
        <value>elasticsearch</value>
        <description>The hostname to send documents to using TransportClient.
        Either host and port must be defined or cluster.
        </description>
      </property>
      <property>
        <name>elastic.port</name>
        <value>9300</value>
        <description>
        The port to connect to using TransportClient.
        </description>
      </property>
      <property>
        <name>elastic.cluster</name>
        <value>elasticsearch</value>
        <description>The cluster name to discover. Either host and port must
        be defined.
        </description>
      </property>
      <property>
        <name>elastic.index</name>
        <value>nutch</value>
        <description>
        The name of the elasticsearch index. Will normally be autocreated if it
        doesn't exist.
        </description>
      </property>
    </configuration>
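
To double-check which overrides a file like this actually defines (for instance after it has been copied into the container), it can be read with a few lines of Python. This is only a sketch, not part of Nutch: `read_properties` is a hypothetical helper name, and it only looks at the one file it is given, not at nutch-default.xml or any other configuration file Nutch may consult.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Map each <property> name to its value in a Hadoop-style configuration file.

    Hypothetical helper for illustration; it ignores <description> elements.
    """
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            # A missing or empty <value> is reported as an empty string.
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props
```

Filtering the result with `{k: v for k, v in props.items() if k.startswith("elastic.")}` would list only the Elasticsearch-related settings, which for the file above are `elastic.host`, `elastic.port`, `elastic.cluster` and `elastic.index`.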

Comments:

1. Could you check the hadoop.log file? It should contain the full stack trace of the exception that causes the indexing job to fail. If the logging configuration has been changed, you will need to find out where the log messages are written. Regarding "using Docker": is the Elasticsearch index reachable from the container in which Nutch is running?
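
Both checks from the comment can be scripted. The following is a minimal Python sketch under stated assumptions, not something Nutch provides: `error_blocks` and `can_connect` are hypothetical helper names, the log is assumed to use the timestamp-first line format visible in the output above, and the host and port passed to `can_connect` should be whatever the index writer is configured with (the question uses `elasticsearch` and `9300`). A successful TCP connection only shows that the port is reachable, not that indexing will work.

```python
import socket


def error_blocks(log_text):
    """Return each ERROR record together with the stack-trace lines that follow it.

    Assumes every log record starts with a timestamp (a digit) and that
    continuation lines such as 'at ...' or 'Caused by: ...' do not.
    """
    blocks = []
    current = None
    for line in log_text.splitlines():
        if line[:1].isdigit():
            # A new record begins: close the ERROR block that was open, if any.
            if current is not None:
                blocks.append("\n".join(current))
                current = None
            if " ERROR " in line:
                current = [line]
        elif current is not None and line.strip():
            current.append(line)
    if current is not None:
        blocks.append("\n".join(current))
    return blocks


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False
```

Run inside the Nutch container, `error_blocks(open("logs/hadoop.log").read())` would list the failing records (the path is an assumption about the default log location), and `can_connect("elasticsearch", 9300)` would show whether the Elasticsearch service name resolves and accepts connections from there.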