Recently, while running the TPC-DS 10TB benchmark with Hive on Spark, I ran into the following problem:

Caused by: org.apache.spark.SparkException: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "nullscan"
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3239)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3259)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:121)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3310)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3278)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:475)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
at org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2311)
at org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2307)
at org.apache.hadoop.hive.ql.exec.Utilities$GetInputPathsCallable.call(Utilities.java:3087)
at org.apache.hadoop.hive.ql.exec.Utilities.getInputPaths(Utilities.java:3051)
at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.cloneJobConf(SparkPlanGenerator.java:302)
at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:226)
at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:109)
at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:346)
at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:358)
at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:323)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

The build was compiled from this branch: https://github.com/apache/hive/tree/release-2.3.9-rc0

Digging into the cause: the error appears when a WITH clause defines a temporary table (a CTE), as in TPC-DS query q9:

with year_total as (
select c_customer_id customer_id
,c_first_name customer_first_name
,c_last_name customer_last_name
,c_preferred_cust_flag customer_preferred_cust_flag
,c_birth_country customer_birth_country
,c_login customer_login
,c_email_address customer_email_address
,d_year dyear
,sum(ss_ext_list_price-ss_ext_discount_amt) year_total
,'s' sale_type
from customer
,store_sales
-- rest of the query omitted ...

Here year_total is a temporary table with no backing path. When the SQL parser resolves the table, it gets a null path, which ends up being represented by the nullscan scheme, and the exception above is thrown because no FileSystem is registered for that scheme.
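For context, Hadoop's FileSystem resolves a scheme by looking up the configuration key `fs.<scheme>.impl` (with a ServiceLoader fallback for bundled implementations); if neither yields a class, getFileSystemClass throws the "No FileSystem for scheme" error seen in the stack trace. A minimal self-contained sketch of that lookup (my own illustration, not actual Hadoop code):

```java
import java.util.HashMap;
import java.util.Map;

public class SchemeLookupSketch {
    // Mimics Hadoop Configuration entries of the form fs.<scheme>.impl
    static final Map<String, String> conf = new HashMap<>();

    // Simplified analogue of FileSystem.getFileSystemClass
    static String getFileSystemClass(String scheme) throws Exception {
        String impl = conf.get("fs." + scheme + ".impl");
        if (impl == null) {
            // Mirrors the message in the stack trace above
            throw new Exception("No FileSystem for scheme \"" + scheme + "\"");
        }
        return impl;
    }

    public static void main(String[] args) throws Exception {
        conf.put("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
        System.out.println(getFileSystemClass("hdfs")); // registered, resolves fine
        try {
            getFileSystemClass("nullscan"); // never registered in the cloned JobConf
        } catch (Exception e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In the real bug, the JobConf that SparkPlanGenerator clones for the remote driver simply lacks the `fs.nullscan.impl` entry, so the lookup fails exactly like the unregistered case above.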

Some digging showed this is a Hive bug that was fixed in 3.0; the JIRA issue is:

https://issues.apache.org/jira/browse/HIVE-18442?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=16346916#comment-16346916

The fix is quite simple:
https://issues.apache.org/jira/secure/attachment/12906094/HIVE-18442.1.patch

It just registers the filesystem in the JobConf at Spark submit time:

```java
+ // make sure NullScanFileSystem can be loaded - HIVE-18442
+ jobConf.set("fs." + NullScanFileSystem.getBaseScheme() + ".impl",
+     NullScanFileSystem.class.getCanonicalName());
```

NullScanFileSystem is Hive's own FileSystem implementation for handling tables whose path is empty.
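On an unpatched 2.3.x build, a plausible workaround (my own suggestion, not part of the official patch) is to register the scheme manually before running the query, since the fix ultimately just sets this same property; the fully qualified class name below is taken from the Hive source tree and should be double-checked against your build:

```sql
-- in the Hive session, before submitting the query
set fs.nullscan.impl=org.apache.hadoop.hive.ql.io.NullScanFileSystem;
```

The same property could equally be set in hive-site.xml so it is picked up by every session.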

What puzzles me is that the community never re-cut the release-2.3.9-rc0 tag with this fix, even though it was merged into branch-2.3.
