Hadoop, ошибка тайм-аута сокета

#hadoop

#hadoop

Вопрос:

Я пытаюсь запустить terasort на Hadoop. Я получаю ошибку выполнения тайм-аута, как показано ниже.

 [hadoop@master mapreduce]$ hadoop jar $(ls hadoop-mapreduce-examples-2*.jar) teragen 100000000 /terasort/in
16/10/08 21:30:17 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/10/08 21:30:17 INFO client.RMProxy: Connecting to ResourceManager at master/10.90.110.160:8032
16/10/08 21:30:33 INFO terasort.TeraSort: Generating 100000000 using 2
16/10/08 21:30:33 INFO mapreduce.JobSubmitter: number of splits:2
16/10/08 21:30:34 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1475979237007_0002
16/10/08 21:30:34 INFO impl.YarnClientImpl: Submitted application application_1475979237007_0002
16/10/08 21:30:34 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1475979237007_0002/
16/10/08 21:30:34 INFO mapreduce.Job: Running job: job_1475979237007_0002
16/10/08 21:38:25 INFO mapreduce.Job: Job job_1475979237007_0002 running in uber mode : false
16/10/08 21:38:25 INFO mapreduce.Job:  map 0% reduce 0%
16/10/08 21:38:25 INFO mapreduce.Job: Job job_1475979237007_0002 failed with state FAILED due to: Application application_1475979237007_0002 failed 2 times due to Error launching appattempt_1475979237007_0002_000002. Got exception: org.apache.hadoop.net.ConnectTimeoutException: Call From master.someplace.net/69.172.201.153 to 69.172.201.153:35751 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=69.172.201.153/69.172.201.153:35751]; For more details see:  http://wiki.apache.org/hadoop/SocketTimeout
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:751)
    at org.apache.hadoop.ipc.Client.call(Client.java:1480)
    at org.apache.hadoop.ipc.Client.call(Client.java:1407)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
    at com.sun.proxy.$Proxy32.startContainers(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96)
    at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:119)
    at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:254)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=69.172.201.153/69.172.201.153:35751]
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:609)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707)
    at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1529)
    at org.apache.hadoop.ipc.Client.call(Client.java:1446)
    ... 9 more
. Failing the application.
16/10/08 21:38:25 INFO mapreduce.Job: Counters: 0
  

Я проверил свои три узла, и они работают нормально.

 Live datanodes (3):

Name: 10.90.110.160:50010 (master.hadoop.mids.lulz.bz)
Hostname: 69.172.201.153
Decommission Status : Normal
Configured Capacity: 105554829312 (98.31 GB)
DFS Used: 831488 (812 KB)
Non DFS Used: 5449568256 (5.08 GB)
DFS Remaining: 100104429568 (93.23 GB)
DFS Used%: 0.00%
DFS Remaining%: 94.84%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Sat Oct 08 21:47:42 CDT 2016


Name: 10.90.110.169:50010 (slave2.hadoop.mids.lulz.bz)
Hostname: 69.172.201.153
Decommission Status : Normal
Configured Capacity: 105554829312 (98.31 GB)
DFS Used: 831488 (812 KB)
Non DFS Used: 5448441856 (5.07 GB)
DFS Remaining: 100105555968 (93.23 GB)
DFS Used%: 0.00%
DFS Remaining%: 94.84%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Sat Oct 08 21:47:42 CDT 2016


Name: 10.90.110.165:50010 (slave1.hadoop.mids.lulz.bz)
Hostname: 69.172.201.153
Decommission Status : Normal
Configured Capacity: 105554829312 (98.31 GB)
DFS Used: 831488 (812 KB)
Non DFS Used: 5448441856 (5.07 GB)
DFS Remaining: 100105555968 (93.23 GB)
DFS Used%: 0.00%
DFS Remaining%: 94.84%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Sat Oct 08 21:47:42 CDT 2016
  

Пожалуйста, помогите мне, где еще искать решение. Я здесь полностью потерялся… Заранее спасибо!

Ответ №1:

Я думаю, что система использует период ожидания по умолчанию, пока DFSClient связывается с узлом данных. Следующая конфигурация может быть полезна для увеличения dfs.datanode.socket.write.timeout и dfs.socket.timeout.

Измените или добавьте приведенную ниже конфигурацию, чтобы увеличить время ожидания,

 <property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>2000000</value>
</property>

<property>
  <name>dfs.socket.timeout</name>
  <value>2000000</value>
</property>
  

Далее, система пытается подключиться к 69.172.201.153 в журналах. Это правильный IP ?.