Solr import fails while Tika is processing a document

#solr #apache-tika

Question:

I'm having trouble running a Solr import with Tika: the import keeps failing on some of my documents while indexing web pages.

As a workaround, I delete the content of the documents that Tika fails on and restart the import, but this is very tedious, and I obviously lose the content of those documents.

Here is the failure log:

 org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read content Processing Document # 927
    at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
    at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:130)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@b623d7
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
    at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:128)
    ... 8 more
Caused by: java.lang.NullPointerException

Nov 10, 2011 10:51:29 AM org.apache.solr.common.SolrException log
SEVERE: Full Import failed:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read content Processing Document # 927
    at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
    at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:130)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@b623d7
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
    at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:128)
    ... 8 more
Caused by: java.lang.NullPointerException
  

Sample of the data that fails:

 pageText=pageText(1.0)={<table width="100%" height="100%" border="0" cellpadding="0" cellspacing="0" nodeIndex="3" class="ril_layoutTable">
    <tr nodeIndex="2">
        <td width="50%" rowspan="3" nodeIndex="1">amp;nbsp;</td>           
        <td width="1" rowspan="3" nodeIndex="4"></td>           
        <td nodeIndex="5">          
            <!-- ImageReady Slices (headergraphics.psd) -->
            <table width="780" border="0" cellpadding="0" cellspacing="0" nodeIndex="8" class="ril_layoutTable">
                <tr nodeIndex="7">
                    <td colspan="9" nodeIndex="6">                      
                        <table width="780" height="40" border="0" cellpadding="0" cellspacing="0" nodeIndex="11" class="ril_layoutTable">
                            <tr nodeIndex="10">
                                <td width="500" nodeIndex="9">amp;nbsp;</td>                                   
                                <td width="135" nodeIndex="12">                                     
                                    <a href="/login.html" nodeIndex="80"></a>
                                    <a href="/login.html" nodeIndex="81"></a>                           
                                </td>               
                                <td width="135" nodeIndex="13">amp;nbsp;</td>          
                                <td nodeIndex="14">amp;nbsp;</td>              
                            </tr>
                        </table>
                    </td>           
                </tr>
                <tr nodeIndex="16">
                    <td nodeIndex="15"></td>        
                    <td nodeIndex="17" childIsOnlyALink="1">
                        <a href="/index.html" nodeIndex="84"></a>
                    </td>       
                    <td nodeIndex="18" childIsOnlyALink="1">
                        <a href="/history.html" nodeIndex="86"></a>
                    </td>       
                    <td nodeIndex="19" childIsOnlyALink="1">
                        <a href="/faq.html" nodeIndex="88"></a>
                    </td>       
                    <td nodeIndex="20" childIsOnlyALink="1">
                        <a href="/prep.html" nodeIndex="90"></a>
                    </td>       
                    <td nodeIndex="21"></td>        
                    <td nodeIndex="22" childIsOnlyALink="1">
                        <a href="/exercises.html" nodeIndex="93"></a>
                    </td>       
                    <td nodeIndex="23" childIsOnlyALink="1">
                        <a href="/faq.html?contact=true" nodeIndex="95"></a>
                    </td>       
                    <td nodeIndex="24"></td>        
                </tr>
                <tr nodeIndex="26">
                    <td colspan="9" nodeIndex="25"></td>
                </tr>
            </table><!-- End ImageReady Slices -->
        </td>   
        <td width="1" rowspan="3" nodeIndex="27"></td>  
        <td width="50%" rowspan="3" nodeIndex="28">amp;nbsp;</td>      
    </tr>
    <tr nodeIndex="30">
        <td height="100%" valign="top" nodeIndex="29">  
            <table width="780" border="0" cellpadding="0" cellspacing="0" nodeIndex="33" class="ril_layoutTable">
                <tr nodeIndex="32">
                    <td width="534" valign="top" nodeIndex="31">        
                        <table width="534" border="0" cellpadding="0" cellspacing="0" nodeIndex="36" class="ril_layoutTable">
                            <tr nodeIndex="35">
                                <td width="534" valign="top" class="bgdown" nodeIndex="34">
                                    <table cellspacing="0" cellpadding="0" nodeIndex="39" class="ril_layoutTable">
                                        <tr nodeIndex="38">
                                            <td valign="top" width="508" nodeIndex="37">                                                                    
                                                <!--Begin Content-->
                                                <h2 nodeIndex="40">Welcome to IQTest.com, home of the original  online IQ test.</h2>
                                                <p nodeIndex="41" childIsOnlyALink="1">
                                                    <a href="/prep.html" nodeIndex="100">Click here</a> to take our free, private, and fun IQ test.</p>
                                                <p nodeIndex="42">
                                                    Our original IQ test  is the most scientifically valid IQ test available on 
                                                    the web today. Previously offered only to corporations, schools, and in certified professional applications, it is now available to you. In addition to measuring your general IQ, our exclusive  test  assesses your performance in 13 different areas of intelligence, revealing your key cognizant 
                                                    strengths and weaknesses.</p>
                                                <p nodeIndex="43">
                                                    Developed by PhDs and statistically sound, our  test  reflects the best research available.<br nodeIndex="101">
                                                        <a href="/prep.html" nodeIndex="102">Click here to begin</a>
                                                        <br nodeIndex="103">
                                                            <br nodeIndex="104">
                                                </p>
                                                <h2 nodeIndex="44">
                                                    <a href="/prep.html" nodeIndex="105">IQTest.com<br nodeIndex="106">
                                                            Take the Test</a>
                                                </h2>
                                                <br nodeIndex="107">
                                                    <h2 nodeIndex="45">
                                                        <strong nodeIndex="108">What is an IQ?
                                                        </strong>
                                                    </h2>
                                                    <p nodeIndex="46">An Intelligence Quotient  indicates a person's mental abilities relative to others of approximately the same age. Everyone has hundreds of specific mental 
                                                        abilities--some  can be measured accurately and are reliable predictors of  academic and financial success.</p>
                                                    <p nodeIndex="47">Read more about <a href="whatisaniqscore.html" nodeIndex="109">Intelligence Testing</a></p>
                                                    <!-- End of StatCounter Code -->
                                                    <!--End Content-->
                                                    <br nodeIndex="113">
                                                        <p nodeIndex="48"></p>               
                                            </td>
                                        </tr>
                                    </table><!-- </div> -->
                                </td>
                            </tr>
                            <tr nodeIndex="50">
                                <td nodeIndex="49"></td>
                            </tr>
                        </table>
                    </td>   
                    <!--Begin Sidebar-->
                    <td height="100%" nodeIndex="51">amp;nbsp;</td>
                    <td width="225" valign="top" nodeIndex="52">
                        <table class="ril_layoutTable" width="225" border="0" cellpadding="0" cellspacing="0" nodeIndex="55">
                            <tr nodeIndex="54">
                                <td nodeIndex="53"></td>
                            </tr>
                            <tr nodeIndex="57">
                                <td width="225" valign="top" nodeIndex="56">            
                                    <h4 nodeIndex="118">What does my score mean?</h4>               
                                    <p nodeIndex="58">Please <a href="whatisaniqscore.html" nodeIndex="119">click here</a> for an explanation of IQ testing and standard deviation.<br nodeIndex="120">
                                            Please <a href="faq.html#chart" nodeIndex="121">click here</a> for a test score comparison chart.<br nodeIndex="122">
                                                Please <a href="history.html" nodeIndex="123">click here</a> for a history of intelligence testing.</p>
                                    <div align="center" margin="0" nodeIndex="59">
                                    </div>
                                </td>               
                            </tr>
                            <tr nodeIndex="61">
                                <td nodeIndex="60"></td>
                            </tr>
                            <tr nodeIndex="63">
                                <td width="225" valign="top" nodeIndex="62">            
                                    <h4 nodeIndex="127">What is the Complete Personal Intelligence Profile?</h4>                
                                    <p nodeIndex="64">Your Complete Personal Intelligence Profile will give you much greater detail about the range and variety of your mental abilities. <a href="profileexplain.html" nodeIndex="128">Read More...</a></p>                    
                                </td>               
                            </tr>
                            <tr nodeIndex="66">
                                <td nodeIndex="65"></td>
                            </tr>
                            <tr nodeIndex="68">
                                <td width="225" valign="top" nodeIndex="67">    
                                    <h4 nodeIndex="130">Consciousness Exercises</h4>    
                                    <p nodeIndex="69">The Consciousness Exercises are a set of entertaining psycho-spiritual games, puzzles, dialogs, and more, which can expand your awareness. <a href="exercises.html" nodeIndex="131">Read More...</a></p>                      
                                </td>
                            </tr>
                            <tr nodeIndex="71">
                                <td nodeIndex="70"></td>
                            </tr>
                        </table>
                    </td>       
                    <!--End Sidebar-->
                </tr>
            </table>
        </td>   
    </tr>
    <tr nodeIndex="73">
        <td nodeIndex="72">
            <table width="780" border="0" cellpadding="0" cellspacing="0" nodeIndex="76" class="ril_layoutTable">
                <tr nodeIndex="75">
                    <td width="780" height="33" align="center" nodeIndex="74">
                        <a href="/index.html" nodeIndex="133">Home</a>amp;nbsp;amp;nbsp;amp;nbsp;amp;nbsp;amp;nbsp;
                        <a href="/history.html" nodeIndex="134">History</a>amp;nbsp;amp;nbsp;amp;nbsp;amp;nbsp;amp;nbsp;
                        <a href="/faq.html" nodeIndex="135">FAQ</a>amp;nbsp;amp;nbsp;amp;nbsp;amp;nbsp;amp;nbsp;
                        <a href="/prep.html" nodeIndex="136">Test</a>amp;nbsp;amp;nbsp;amp;nbsp;amp;nbsp;amp;nbsp;
                        <a href="/exercises.html" nodeIndex="137">Consciousness Exercises</a>amp;nbsp;amp;nbsp;amp;nbsp;amp;nbsp;amp;nbsp;
                        <a href="/faq.html?contact=true" nodeIndex="138">Contact Us</a>amp;nbsp;amp;nbsp;amp;nbsp;amp;nbsp;amp;nbsp;
                        <a href="/privacy.html" nodeIndex="139">Privacy Policy</a>amp;nbsp;amp;nbsp;amp;nbsp;amp;nbsp;amp;nbsp;
                        <a href="/remove.html" nodeIndex="140">Unsubscribe</a>
                    </td>
                </tr>
                <tr nodeIndex="78">
                    <td width="780" height="34" align="center" nodeIndex="77">amp;copy; 2003 -2011 Autumn Group. All rights reserved</td>
                </tr>
            </table>
        </td>   
    </tr>
  

Comments:

1. Which version of Solr are you using? Does the latest version of Apache Tika parse your document when run standalone?

2. I'm using Solr 3.4.0. I'm not sure whether standalone Tika parses it…
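
A note on the delete-and-restart workaround described in the question: DataImportHandler entities accept an `onError` attribute (`abort`, `skip`, or `continue`; the default is `abort`), which may let the import carry on past a document that Tika cannot parse instead of failing the whole run. The fragment below is only a sketch. The question does not show the actual data-config, so the data source type, the outer `files` entity, the `baseDir` path, and the `pageText` field mapping are all assumptions made for illustration, and whether `skip` suppresses this particular failure in Solr 3.4.0 should be verified on a test core.

```xml
<dataConfig>
  <!-- Assumption: documents are read as files; the real data source is not shown in the question -->
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <!-- Hypothetical outer entity that lists the files to import -->
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="/path/to/docs" fileName=".*"
            rootEntity="false" dataSource="null">
      <!-- onError="skip": skip the current document on a parse error instead of aborting the import -->
      <entity name="tika" processor="TikaEntityProcessor" dataSource="bin"
              url="${files.fileAbsolutePath}" format="text" onError="skip">
        <!-- Hypothetical mapping of the extracted body text to the pageText field -->
        <field column="text" name="pageText"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

With a setting like this, the skipped documents would still need to be identified from the log and examined separately, for example by running them through a standalone Tika build as the first comment suggests.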