Dahua Cunchu Houzhuan (Big Talk on Storage: The Sequel): Next-Generation Data Storage Thinking and Technology
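The book is divided into the following major chapters: Flexible Data Layout; Application-Aware and Visualized Storage Intelligence; Storage Chips; Gleanings from the Sea of Storage; Clusters and Multiple Controllers; Traditional Storage Systems; Emerging Storage Systems; Optical Storage Systems; Architecture; I/O Protocol Stack and Performance Analysis; Storage Software; and Solid-State Storage. Each chapter contains several sections, and each section is a self-contained topic.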
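Donggua Ge's pursuit of technology has reached the point of obsession. Compared with 10 years ago, his writing explains things more clearly and his technical understanding is more precise. Every article on his official account is a bellwether for the storage industry.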
Donggua Ge (Zhang Dong) is currently a system architect at a semiconductor company and the author of the Dahua Cunchu book series. He is a technical expert and evangelist in the storage field.
Chapter 1 Flexible Data Layout
1.1 Raid1.0 and Raid1.5
1.2 Raid5EE and Raid2.0
1.3 Lun2.0/SmartMotion

Chapter 2 Application-Aware and Visualized Storage Intelligence
2.1 Application-aware fine-grained automatic storage tiering
2.2 Application-aware fine-grained SmartMotion
2.3 Application-aware fine-grained QoS
2.4 Productization and visual presentation
2.5 Packaging the concept and making the PPT
2.6 A review of Inspur's "active" storage concept

Chapter 3 Storage Chips
3.1 Channel and Raid controller architecture
3.2 SAS Expander architecture

Chapter 4 Gleanings from the Sea of Storage
4.1 Two classy kinds of memory you would never have guessed
4.2 What is inside a JBOD
4.3 The tragedy of the Raid4 parity disk
4.4 Why a Raid card is said to be a small computer
4.5 Why Raid card batteries were replaced by supercapacitors
4.6 What exactly is the difference between firmware and microcode
4.7 Is the inside of an FC loop hub really a loop
4.8 Why SAS and FC are said to cost less CPU than TCPIP+Ethernet
4.9 What traffic runs over the heartbeat link between dual storage controllers

Chapter 5 Clusters and Multiple Controllers
5.1 A brief talk on active-active and multipathing
5.2 A "brief" talk on disaster recovery and active-active data centers (Part 1)
5.3 A "brief" talk on disaster recovery and active-active data centers (Part 2)
5.4 An in-depth illustrated walkthrough of how cluster file system architectures evolved
5.5 From multi-controller cache management to cluster locks
5.6 Shared versus distributed, each in turn
5.7 "Donggua Ge draws PPT": active-active is a pit

Chapter 6 Traditional Storage Systems
6.1 Some basic topics related to storage systems
6.2 Chronicles of the high-end storage world!
6.3 Shocking! So this is how high-end storage architecture evolved!
6.4 Traditional high-end storage systems centralize and externalize the data cache: three birds with one stone
6.5 Traditional external storage is nearing its twilight
6.6 Storage veterans versus the young newcomers
6.7 Traditional storage is old; can emerging storage take on the job?

Chapter 7 Next-Generation Storage Systems
7.1 An old gun that still plays the next-generation storage game
7.2 The next-generation storage system with the most traditional-storage flavor
7.3 The next-generation storage system best suited to large-scale data centers
7.4 The highest-performance next-generation storage system
7.5 The most application-aware next-generation storage system
7.6 The next-generation storage system with the most flexible data management

Chapter 8 Optical Storage Systems
8.1 Basic principles of optical storage
8.2 The mysterious laser head and Blu-ray technology
8.3 Dissecting Blu-ray storage systems
8.4 The optical storage ecosystem
8.5 Looking at the present from the future

Chapter 9 Architecture
9.1 On many-core processor architecture
9.2 A salute to Loongson! Donggua Ge hand-designed a CPU decoder!
9.3 NUNA architecture lands for the first time in the InCloudRack cabinet
9.4 A review of MacroSAN's CloudSAN architecture
9.5 Memory can be used like this?!
9.6 PCIe switching, what on earth?
9.7 A chat about FPGA/GPCPU/PCIe/Cache-Coherency
9.8 [Primer] How does a supercomputer actually compute?

Chapter 10 I/O Protocol Stack and Performance Analysis
10.1 The most complete summary of storage system interfaces/protocols/connection methods
10.2 Research trends in cutting-edge I/O protocol stack technology
10.3 What Stripe Size should a Raid group be set to?
10.4 Concurrent I/O: the root of system performance!
10.5 How long have you been fooled about I/O latency?
10.6 How to measure concurrency along the entire I/O path?
10.7 What exactly is the relationship among queue depth, latency, concurrency, and throughput
10.8 Why does Raid give no speedup at all in some scenarios?
10.9 Why is performance excellent in testing but dismal in production?
10.10 What happens when the queue depth is too shallow?
10.11 What queue depth is ideal?
10.12 Why does the average random I/O latency of a mechanical disk show a transient drop?
10.13 How exactly does data layout affect performance?
10.14 Misconceptions about synchronous I/O and blocking I/O
10.15 Atomic writes, what on earth?!
10.16 Why not build a USB Target?
10.17 A new storage technology patent of Donggua Ge's has been officially granted
10.18 A quick walkthrough of the iSCSI lower layers
10.19 A brief analysis of FC's 4 Login processes

Chapter 11 Storage Software
11.1 Thin is a pit; whoever uses it is asking for trouble!
11.2 The evolution of storage system OSes

Chapter 12 Solid-State Storage
12.1 A brief analysis of how solid-state media are used in storage systems
12.2 Misconceptions about SSD metadata and power-loss protection
12.3 Misconceptions about Host Based and Device Based flash FTL
12.4 About SSD HMB and CMB
12.5 Toyou Feiji returns with wings spread
12.6 Crosstalk with Lao Tang on SSD performance testing: the "jade"
12.7 How should Raid actually be done on solid-state drives?
12.8 When Raid2.0 meets all-solid-state storage
12.9 Upper/lower pages, fast/slow pages, MSB/LSB: what are they all?
12.10 On the steps for writing 0 to MSB/LSB
1.1 Raid1.0 and Raid1.5
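In the era of mechanical disks, final I/O performance comes down to two fundamental factors. One is the source at the top: how the application issues I/O and what attributes that I/O has. The other is the source at the bottom: in what form and state the data ends up stored, and across how many mechanical disks. How an application issues I/O is entirely outside the storage system's control, so there is nothing the storage system can do to solve performance problems from that end. How data is organized and laid out, however, is absolutely the storage system's most important job.

This has been evolving continuously ever since Raid was born. Take the simplest example, the progression from Raid3 to Raid4 to Raid5. Raid3 was designed to maximize the throughput of single-threaded, large-block I/O to contiguous addresses. To achieve this, its stripe is very narrow, so narrow that the target addresses of nearly every I/O issued from the upper layer span all the disks. Almost every I/O therefore makes multiple disks read or write in parallel to serve it, while the other I/Os have to wait. That is why we say that in a Raid3 array, upper-layer I/Os cannot run concurrently with one another, but a single I/O can be served by multiple disks in parallel. So if the system has only one thread (or user, program, or workload), and that thread does large-block contiguous I/O and cares about throughput, Raid3 is a very good fit. Most workloads are not like that, though. They need upper-layer I/Os to execute in parallel as fully as possible, for example so that I/Os issued by multiple threads or multiple users are answered concurrently. In that case the stripe has to be enlarged to a suitable value so that the address range of one I/O does not pull every disk in the Raid group into serving it. There is then some probability that the group of disks serves several I/Os at once, and the more disks there are, the higher that probability.

Raid4 is essentially Raid3 with an adjustable stripe. But its dedicated parity disk not only becomes a hot-spot disk with a high failure rate, it also constrains I/Os that could otherwise run concurrently. Every write I/O has to update the parity block of the corresponding stripe on the parity disk, and since all parity blocks live on that one disk, upper-layer write I/Os can only be executed one after another, not concurrently. Raid5 spreads the parity blocks across all disks in the Raid group and thereby achieves concurrent I/O.

Most storage vendors let you set the stripe width, for example from 32KB to 128KB. Suppose an I/O request reads 16KB from a Raid5 group built from 8 disks. If the stripe is 32KB, the segment (Segment) on each disk is 4KB, so this I/O occupies at least 4 disks. Assuming a 100% concurrency probability, the Raid group can serve two 16KB I/Os concurrently, or eight 4KB I/Os. If the stripe width is set to 128KB, then under the same 100% concurrency assumption it can serve eight concurrent I/Os of 16KB or less.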
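The stripe-width arithmetic in this example can be sketched in a few lines of Python. This is only an illustration under the same simplifications the text uses: the stripe is divided evenly across all 8 disks, every I/O is aligned to a segment boundary, and the concurrency probability is 100%. The function names `disks_touched` and `max_concurrent_ios` are made up for this sketch.

```python
import math

def disks_touched(io_kb, stripe_kb, disks):
    """Number of disks one segment-aligned I/O of io_kb spans."""
    segment_kb = stripe_kb / disks  # per-disk segment, as in the text
    return min(disks, math.ceil(io_kb / segment_kb))

def max_concurrent_ios(io_kb, stripe_kb, disks):
    """Upper bound on same-sized I/Os served at once (100% concurrency assumed)."""
    return disks // disks_touched(io_kb, stripe_kb, disks)

# The three cases from the text, all on an 8-disk group.
for stripe_kb, io_kb in [(32, 16), (32, 4), (128, 16)]:
    print(f"stripe {stripe_kb}KB, I/O {io_kb}KB ->",
          max_concurrent_ios(io_kb, stripe_kb, 8), "concurrent I/Os")
```

A real array would fall short of these upper bounds, since I/Os are rarely aligned and rarely land on disjoint sets of disks.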
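At this point we can see that merely adjusting the stripe width and optimizing the layout of parity blocks already yields very different performance. But no matter how you tune, I/O performance is always capped by the pitifully few disks in a Raid group, a handful or a dozen or so. Why only a handful or a dozen? Couldn't you build one big Raid5 group out of 100 disks and create all logical volumes on it to raise the performance of each volume? You would not choose to do this. The moment one disk fails and the system has to rebuild, you will regret the decision, because the performance of the entire system drops sharply and no logical volume is spared: the 99 remaining disks are all reading data at full speed while the system computes the XOR and writes the reconstructed blocks to the hot spare. You can of course throttle the rebuild to ease the impact on online I/O, but the price is a longer rebuild time, and if another disk fails during the rebuild window, all the data is gone. So the failure domain has to be kept small, which is why a Raid group is best kept to a handful or a dozen or so disks. That is awkward, so people came up with a workaround: stitch several small Raid5/6 groups into one big Raid0, that is, Raid50/60, and distribute logical volumes across it. Of course, today's storage vendors have run out of tricks and cannot come up with anything new, so they have taken to calling this big Raid50/60 a "Pool", which fools some people into believing that storage is innovating again and is still full of vitality.

So let Donggua Ge go with the flow and do a little hyping of his own: call the traditional Raid group Raid1.0, and call Raid50/60 Raid1.5. We can see a kind of cyclical upward pattern here. In the early days there were few disks, and performance for different scenarios was tuned mainly through the stripe width. Later people figured it out: why not use Raid50?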
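Spreading data directly across hundreds of disks, wouldn't that be great? Concurrent I/O from upper-layer threads can then run massively in parallel at the bottom layer and reach very high throughput. At this point people were giddy with success, and nobody went on to consider another frightening problem. As these words are being written, still nobody is considering it, at least not as far as vendors' product directions show. The reason may be another round of evolution at the bottom layer: solid-state media. The wheels at the bottom keep speeding up, while the forms at the top go round in cycles. Sometimes, though, the upper layer leaps ahead and skips a form that should have existed along the way. That form may be fleeting, or may never appear at all, but someone will always strike a spark, however faint. This frightening problem was in fact overshadowed by an even more frightening one: rebuild times that are far too long. Take a 4TB SATA disk. Even when it is written at full speed during a rebuild, its rotational speed caps its throughput at roughly 80MB/s, so a quick calculation gives about 14 hours. In practice, to protect the performance of online workloads, the rebuild is usually throttled to a medium speed of around 40MB/s, which takes about 28 hours, more than a full day and night. I would bet that no system administrator sleeps well until it finishes.

1.2 Raid5EE and Raid2.0

Twenty years ago someone invented a technique called Raid5EE with two goals: first, to put the normally idle hot spare disk to work, and second, to speed up rebuilds. Clearly, if the space of the hot spare, marked "H (hot spare)" in the figure below, is scattered across all disks just like the parity, you get the layout shown on the right of the figure, where every P block is followed by an H block. The Raid group then has one more disk doing useful work than before. And since the H space is scattered too, a rebuild after a disk failure ought to be faster, because the writes can now go to many disks concurrently. In reality it is not. The rebuild speed of the whole system was never limited by that single hot spare disk; it is limited by all the disks together, because the hot spare can only be written with reconstructed data at full rate if all the other disks are read at full rate while the system XORs what they return. Scattering the hot spare, or even replacing it with SSD or memory, has no effect on the outcome. So how can the rebuild actually be sped up? The only way is, as shown in the figure below, to take the stripes that were originally squeezed into 5 disks and scatter them horizontally. Note that the scattering is at stripe granularity; scattering just a single disk is useless. Only then can the rebuild speed be multiplied.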
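The bottleneck argument can be made concrete with a back-of-the-envelope model. The sketch below is not from the book; it rests on assumptions made here for illustration: every disk has one throughput budget shared by reads and writes, load spreads perfectly evenly, and XOR computation is free. Capacity and rates are normalized to 1.0, so the results are relative rebuild times, not hours. The function names are hypothetical.

```python
def rebuild_time_dedicated_spare(capacity, disk_rate, spare_rate):
    """Traditional group: each survivor is read end to end while the
    spare absorbs one disk's worth of reconstructed data."""
    read_time = capacity / disk_rate    # bounded by the surviving disks
    write_time = capacity / spare_rate  # bounded by the spare
    return max(read_time, write_time)

def rebuild_time_scattered_stripes(capacity, disk_rate, group_width, pool_disks):
    """Stripes of group_width segments scattered over pool_disks disks.
    Each lost segment costs (group_width - 1) reads plus 1 write, and all
    of that I/O is shared by the pool_disks - 1 surviving disks."""
    total_io = capacity * (group_width - 1) + capacity
    per_disk_io = total_io / (pool_disks - 1)
    return per_disk_io / disk_rate

baseline = rebuild_time_dedicated_spare(1.0, 1.0, 1.0)
fast_spare = rebuild_time_dedicated_spare(1.0, 1.0, 100.0)  # e.g. an SSD spare
same_group = rebuild_time_scattered_stripes(1.0, 1.0, 5, 5)  # no extra disks
wide_pool = rebuild_time_scattered_stripes(1.0, 1.0, 5, 25)  # 5-wide stripes, 25 disks

print("dedicated spare:", baseline)
print("100x faster spare:", fast_spare)
print("scattered, pool == group:", same_group)
print("scattered over 25 disks:", round(wide_pool, 3))
```

In this model a faster spare changes nothing, because the surviving disks still have to be read in full, and scattering helps only once the pool is wider than the stripe, which is the sense in which the text says the scattering has to be done at stripe granularity.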