Google Cloud Data Fusion XML parsing — 'parse-xml-to-json': Mismatched close tag note at 6

#google-cloud-platform #google-cloud-data-fusion #parsexml

Question:

I am new to Google Cloud Data Fusion. I was able to successfully process a CSV file and load it into BigQuery. My requirement is to process an XML file and load it into BigQuery. To try this out, I started with a very simple XML file.

XML file:

 <?xml version="1.0" encoding="UTF-8"?>
 <note>
   <to>Tove</to
   <from>Jani</from>
   <heading>Reminder</heading>
   <body>Don't forget me this weekend!</body>
 </note>
  

Error message 1:

 java.lang.Exception: Stage:Wrangler - Reached error threshold 1, terminating processing due to error : Error encountered while executing 'parse-xml-to-json' : Mismatched close tag note at 6 [character 7 line 1]
at io.cdap.wrangler.Wrangler.transform(Wrangler.java:404) ~[1601903767453-0/:na]
at io.cdap.wrangler.Wrangler.transform(Wrangler.java:83) ~[1601903767453-0/:na]
at io.cdap.cdap.etl.common.plugin.WrappedTransform.lambda$transform$5(WrappedTransform.java:90) ~[cdap-etl-core-6.2.0.jar:na]
at io.cdap.cdap.etl.common.plugin.Caller$1.call(Caller.java:30) ~[cdap-etl-core-6.2.0.jar:na]
at io.cdap.cdap.etl.common.plugin.StageLoggingCaller.call(StageLoggingCaller.java:40) ~[cdap-etl-core-6.2.0.jar:na]
at io.cdap.cdap.etl.common.plugin.WrappedTransform.transform(WrappedTransform.java:89) ~[cdap-etl-core-6.2.0.jar:na]
at io.cdap.cdap.etl.common.TrackedTransform.transform(TrackedTransform.java:74) ~[cdap-etl-core-6.2.0.jar:na]
at io.cdap.cdap.etl.spark.function.TransformFunction.call(TransformFunction.java:50) ~[hydrator-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.etl.spark.Compat$FlatMapAdapter.call(Compat.java:126) ~[hydrator-spark-core2_2.11-6.2.0.jar:na]
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:125) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:125) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) ~[scala-library-2.11.8.jar:na]
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) ~[scala-library-2.11.8.jar:na]
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) ~[scala-library-2.11.8.jar:na]
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) ~[scala-library-2.11.8.jar:na]
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:128) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:127) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1415) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:139) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.Task.run(Task.scala:109) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) [spark-core_2.11-2.3.3.jar:2.3.3]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_252]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_252]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_252]
  

Caused by: io.cdap.wrangler.api.RecipeException: Error encountered while executing 'parse-xml-to-json' : Mismatched close tag note at 6 [character 7 line 1]
at io.cdap.wrangler.executor.RecipePipelineExecutor.execute(RecipePipelineExecutor.java:149) ~[wrangler-core-4.2.0.jar:na]
at io.cdap.wrangler.executor.RecipePipelineExecutor.execute(RecipePipelineExecutor.java:97) ~[wrangler-core-4.2.0.jar:na]
at io.cdap.wrangler.Wrangler.transform(Wrangler.java:376) ~[1601903767453-0/:na]
... 26 common frames omitted
Caused by: io.cdap.wrangler.api.DirectiveExecutionException: Error encountered while executing 'parse-xml-to-json' : Mismatched close tag note at 6 [character 7 line 1]
at io.cdap.directives.xml.XmlToJson.execute(XmlToJson.java:106) ~[na:na]
at io.cdap.directives.xml.XmlToJson.execute(XmlToJson.java:49) ~[na:na]
at io.cdap.wrangler.executor.RecipePipelineExecutor.execute(RecipePipelineExecutor.java:129) ~[wrangler-core-4.2.0.jar:na]
... 28 common frames omitted
Caused by: org.json.JSONException: Mismatched close tag note at 6 [character 7 line 1]
at org.json.JSONTokener.syntaxError(JSONTokener.java:505) ~[org.json.json-20090211.jar:na]
at org.json.XML.parse(XML.java:311) ~[org.json.json-20090211.jar:na]
at org.json.XML.toJSONObject(XML.java:520) ~[org.json.json-20090211.jar:na]
at org.json.XML.toJSONObject(XML.java:548) ~[org.json.json-20090211.jar:na]
at org.json.XML.toJSONObject(XML.java:472) ~[org.json.json-20090211.jar:na]
at io.cdap.directives.xml.XmlToJson.execute(XmlToJson.java:96) ~[na:na]
... 30 common frames omitted

Error message 2:

 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): UnknownReason
  

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1661) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1649) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1648) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) ~[scala-library-2.11.8.jar:na]
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) ~[scala-library-2.11.8.jar:na]
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1648) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at scala.Option.foreach(Option.scala:257) ~[scala-library-2.11.8.jar:na]
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1882) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1831) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1820) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034) ~[na:2.3.3]
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055) ~[na:2.3.3]
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087) ~[na:2.3.3]
at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:78) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1083) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1081) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1081) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1081) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.api.java.JavaPairRDD.saveAsNewAPIHadoopDataset(JavaPairRDD.scala:831) [spark-core_2.11-2.3.3.jar:2.3.3]
at io.cdap.cdap.etl.spark.batch.SparkBatchSinkFactory.writeFromRDD(SparkBatchSinkFactory.java:98) [hydrator-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.etl.spark.batch.RDDCollection$1.run(RDDCollection.java:179) [hydrator-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.etl.spark.SparkPipelineRunner.runPipeline(SparkPipelineRunner.java:350) [hydrator-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.etl.spark.batch.BatchSparkPipelineDriver.run(BatchSparkPipelineDriver.java:148) [hydrator-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.app.runtime.spark.SparkTransactional$2.run(SparkTransactional.java:236) [io.cdap.cdap.cdap-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.app.runtime.spark.SparkTransactional.execute(SparkTransactional.java:208) [io.cdap.cdap.cdap-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.app.runtime.spark.SparkTransactional.execute(SparkTransactional.java:138) [io.cdap.cdap.cdap-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.app.runtime.spark.AbstractSparkExecutionContext.execute(AbstractSparkExecutionContext.scala:228) [io.cdap.cdap.cdap-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.app.runtime.spark.SerializableSparkExecutionContext.execute(SerializableSparkExecutionContext.scala:61) [io.cdap.cdap.cdap-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.app.runtime.spark.DefaultJavaSparkExecutionContext.execute(DefaultJavaSparkExecutionContext.scala:89) [io.cdap.cdap.cdap-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.api.Transactionals.execute(Transactionals.java:63) [na:na]
at io.cdap.cdap.etl.spark.batch.BatchSparkPipelineDriver.run(BatchSparkPipelineDriver.java:116) [hydrator-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.app.runtime.spark.SparkMainWrapper$.main(SparkMainWrapper.scala:86) [io.cdap.cdap.cdap-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.app.runtime.spark.SparkMainWrapper.main(SparkMainWrapper.scala) [io.cdap.cdap.cdap-spark-core2_2.11-6.2.0.jar:na]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_252]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_252]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_252]
at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_252]
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:56) [io.cdap.cdap.cdap-spark-core2_2.11-6.2.0.jar:2.3.3]
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894) [na:2.3.3]
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198) [na:2.3.3]
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228) [na:2.3.3]
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137) [na:2.3.3]
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) [spark-core_2.11-2.3.3.jar:2.3.3]
at io.cdap.cdap.app.runtime.spark.submit.AbstractSparkSubmitter.submit(AbstractSparkSubmitter.java:172) [io.cdap.cdap.cdap-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.app.runtime.spark.submit.AbstractSparkSubmitter.access$000(AbstractSparkSubmitter.java:54) [io.cdap.cdap.cdap-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.app.runtime.spark.submit.AbstractSparkSubmitter$5.run(AbstractSparkSubmitter.java:111) [io.cdap.cdap.cdap-spark-core2_2.11-6.2.0.jar:na]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_252]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_252]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_252]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_252]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_252]

Comments:

1. Your XML is not valid. Try using this: <?xml version="1.0" encoding="UTF-8"?> <note> <to>Tove</to> <from>Jani</from> <heading>Reminder</heading> <body>Don't forget me this weekend!</body> </note>

Answer #1:

It looks like your XML is malformed: the closing tag for `<to>` is missing its `>` (it reads `</to` instead of `</to>`), which is exactly what the "Mismatched close tag note at 6" error points at. Try the XML below:

 <?xml version="1.0" encoding="UTF-8"?> <note> <to>Tove</to> <from>Jani</from> <heading>Reminder</heading> <body>Don't forget me this weekend!</body> </note>
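Before feeding a file into the pipeline, it can save a debugging round-trip to check well-formedness locally. A minimal sketch (outside Data Fusion, purely for validation) using Python's standard-library parser, which rejects the same malformed markup the Wrangler directive choked on:

```python
# Check XML well-formedness with Python's built-in parser before
# handing the file to the Data Fusion Wrangler.
import xml.etree.ElementTree as ET

# The original XML from the question, with the broken '</to' close tag.
broken = """<?xml version="1.0" encoding="UTF-8"?>
<note> <to>Tove</to <from>Jani</from> <heading>Reminder</heading>
<body>Don't forget me this weekend!</body> </note>"""

# The corrected XML from the answer.
fixed = """<?xml version="1.0" encoding="UTF-8"?>
<note> <to>Tove</to> <from>Jani</from> <heading>Reminder</heading>
<body>Don't forget me this weekend!</body> </note>"""

def is_well_formed(xml_text):
    """Return True if the string parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed(broken))  # False: '</to' is missing its closing '>'
print(is_well_formed(fixed))   # True
```

Any well-formedness checker (`xmllint --noout file.xml` on the command line works too) will flag the stray `</to` immediately, which is much faster than re-running the pipeline.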