Wednesday, November 14, 2007

Creating sparse block devices

I haven't found much material online about creating sparse filesystems, and it's a useful thing if you're doing filesystem testing, or if you just want to pretend that you have more storage than you do to impress your friends (that's fun). The idea is to make it appear as though you have a larger block device than you actually do. On Linux, you accomplish this with a creative use of device-mapper.

First, a little introduction to device-mapper is in order. Device-mapper is a way to concatenate and extend block devices in ways not previously possible. It is the basis of how LVM works (using a linear map by default). You can see the tables that LVM has created for you by running 'dmsetup table'. On my workstation, for example, it outputs the following:


[root@rugrat ~]# dmsetup table
bootvg-usr: 0 20971520 linear 8:2 4194688
bootvg-var: 0 4194304 linear 8:2 25166208
bootvg-swap: 0 8388608 linear 8:2 147521920
bootvg-root: 0 2097152 linear 8:2 384
bootvg-data: 0 118161408 linear 8:2 29360512
bootvg-data: 118161408 65536 linear 8:2 155910528
bootvg-tmp: 0 2097152 linear 8:2 2097536


A little explanation of how these tables are formatted is in order. Device-mapper tables are expressed in terms of 512-byte sectors. The first number is the starting position within the mapped device, and the second number is the length of that segment of the device. The third argument is the name of the target (linear in this case). All of the remaining arguments on the line are target-specific options.
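To make the field layout concrete, here is a minimal sketch that splits a table line into labeled fields and converts the segment length from sectors to MiB. The function name describe_segment is made up for illustration; it only parses text and never calls dmsetup itself.

```shell
# describe_segment is a hypothetical helper: it reads "dmsetup table" style
# lines on stdin and prints the first four fields with labels.
describe_segment() {
    awk '{
        name = $1
        sub(/:$/, "", name)                  # strip the trailing colon
        # length is in 512-byte sectors; 1048576 bytes = 1 MiB
        printf "device=%s start=%s length=%s target=%s size_mib=%d\n",
               name, $2, $3, $4, $3 * 512 / 1048576
    }'
}

# One of the lines from the output above.
echo "bootvg-usr: 0 20971520 linear 8:2 4194688" | describe_segment
```

For the bootvg-usr line this reports size_mib=10240, that is, a 10GiB volume.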

There are various targets that you can use with device-mapper; these are actually kernel modules that get loaded at runtime. They include:

linear - specifies linear regions on disks
striped - specifies striped devices
mirror - sets up mirrored devices
zero - creates a device of zeros (similar to /dev/zero but as a block device)
multipath - used for setting up multiple paths to one device (for example on a Fibre Channel SAN)

There are other targets, but that should get you started for now.

The linear target has two target-specific options - the device that you're referring to, and the starting sector within that device. Note that while it's not done here, you can actually 'stack' dm devices - this is where it becomes quite powerful. In all of these examples, you see that we're referring to 8:2, which you can decode by looking at the /dev directory and finding the device with major 8, minor 2 (/dev/sda2 in this case).
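The major:minor lookup can also be scripted. The sketch below assumes the usual /proc/partitions layout (major, minor, block count, name); dev_for_number is a hypothetical helper name, and the optional third argument exists only so the function can be pointed at a saved copy of that file.

```shell
# dev_for_number is a hypothetical helper.
# Usage: dev_for_number MAJOR MINOR [PARTITIONS_FILE]
dev_for_number() {
    awk -v maj="$1" -v min="$2" '
        # Data rows look like: "   8        2  156085248 sda2"
        $1 == maj && $2 == min { print "/dev/" $4; found = 1 }
        END { exit (found ? 0 : 1) }
    ' "${3:-/proc/partitions}"
}

# On the workstation above this would be expected to print /dev/sda2.
dev_for_number 8 2 || echo "no block device numbered 8:2 on this machine"
```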

Where this gets interesting is that there are two entries for bootvg-data above: one covering sectors 0 through 118161407 of bootvg-data, and the other starting at sector 118161408 and continuing for a length of 65536 sectors. This indicates that the volume has been extended in the past. Likewise, if the volume were spread out over multiple disks, there would be multiple entries for it. Now that we've got basic device-mapper theory down, let's get into snapshots.
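Before moving on, the two bootvg-data segments can be added up to get the full size of the volume. This is a rough sketch: total_sectors is a made-up helper name, and it assumes the segments are listed in order, refusing to print a total if a segment does not start where the previous one ended.

```shell
# total_sectors is a hypothetical helper.
# Usage: dmsetup-table-text | total_sectors DEVICE_NAME
total_sectors() {
    awk -v dev="$1:" '
        BEGIN { total = 0; gap = 0 }
        $1 == dev {
            if ($2 + 0 != total) gap = 1     # segment must start at the running total
            total += $3
        }
        END {
            if (gap) exit 1
            printf "%.0f\n", total
        }'
}

# The two bootvg-data lines from the output above.
printf '%s\n' \
    "bootvg-data: 0 118161408 linear 8:2 29360512" \
    "bootvg-data: 118161408 65536 linear 8:2 155910528" | total_sectors bootvg-data
```

The result is 118226944 sectors: the original 118161408 plus the 65536 sectors (32MiB) that were added later.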

Again, device-mapper is the basis for LVM snapshots. A snapshot is a point-in-time copy of a volume. The interesting thing about LVM snapshots, compared to some hardware array vendors' snapshots, is that BOTH the source and snapshot volumes are read/write. We'll see how this can happen here.

When an LVM snapshot is created, LVM creates four device-mapper tables. They are, in this order (for a volume called base in the volume group vol0, with a snapshot volume called snap):

vol0-base-real: the original mapping of the volume
vol0-snap-cow: the copy-on-write device, this is a table that specifies physical storage
vol0-snap: the user-visible snapshot volume, dm target is snapshot, with the origin being the 'real' volume above, and the COW device being the COW device above.
vol0-base: the user-visible base volume. Its target is snapshot-origin, with the 'real' volume created above as the backing device. For the snapshot itself, reads of chunks that have not changed come from the origin, and all writes go to the COW device.
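As a sketch of what the two snapshot-related table lines look like, the helpers below just print them. The device paths, the 1GiB length and the chunk size of 16 sectors are assumptions chosen for illustration, and the function names are made up; the tables LVM actually loads would normally refer to devices by major:minor number.

```shell
# Both helpers only print text; nothing is loaded into the kernel.
# origin_table LENGTH_SECTORS REAL_DEVICE
origin_table() {
    printf '0 %s snapshot-origin %s\n' "$1" "$2"
}

# snapshot_table LENGTH_SECTORS REAL_DEVICE COW_DEVICE P|N CHUNK_SECTORS
# P = persistent snapshot, N = not persistent
snapshot_table() {
    printf '0 %s snapshot %s %s %s %s\n' "$1" "$2" "$3" "$4" "$5"
}

# Hypothetical 1GiB volume (2097152 sectors), named as in the list above.
origin_table   2097152 /dev/mapper/vol0-base-real
snapshot_table 2097152 /dev/mapper/vol0-base-real /dev/mapper/vol0-snap-cow P 16
```

The first line would be the table for vol0-base and the second the table for vol0-snap.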

Now that we have a firm understanding of the basic facilities of device-mapper, let's explore how to create a sparse block device. The concept here is to combine two types of targets - zero and snapshot. First, we create a huge device of just zeros, then we take a snapshot of it backed by a small amount of real storage. So let's create a 15TiB GFS filesystem with only 2GB of backing storage. (The reason for using GFS rather than ext3 is an architectural advantage of GFS: it does not allocate inode tables at filesystem creation time, so creating a large filesystem doesn't take hours or need more than 2GB.)

First, we need to get the number of 512-byte sectors that 15TiB makes up:


[root@dhcp-144 ~]# echo $((15 * (2**40) / 512))
32212254720


With that number, we can create a 15TiB zero device and then take a snapshot of it, using a real 2GB logical volume that I created earlier as the COW device. In the snapshot table, P makes the snapshot persistent and 16 is the chunk size in 512-byte sectors:


[root@dhcp-144 ~]# echo "0 32212254720 zero" | dmsetup create zero
[root@dhcp-144 ~]# echo "0 32212254720 snapshot /dev/mapper/zero /dev/vg0/backing P 16" | dmsetup create gfs-huge


Note that the above commands took a fraction of a second, and now I have a 15TiB block device that I can write just 2GB of data to. Next, let's create a GFS filesystem on this device:


[root@dhcp-144 ~]# gfs_mkfs -p lock_nolock -j 2 -t cluster2:temp /dev/mapper/gfs-huge
This will destroy any data on /dev/mapper/gfs-huge.

Are you sure you want to proceed? [y/n] y

Device: /dev/mapper/gfs-huge
Blocksize: 4096
Filesystem Size: 4026174448
Journals: 2
Resource Groups: 15360
Locking Protocol: lock_nolock
Lock Table: cluster2:temp

Syncing...
All Done


[root@dhcp-144 ~]# mount -t gfs /dev/mapper/gfs-huge /mnt


Now that the GFS filesystem is created and mounted, let's run df on it and see that we really have a 15TiB filesystem:


[root@dhcp-144 ~]# df -h /mnt
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/gfs-huge   15T  1.5M   15T   1% /mnt


As you can see, the system believes that I've got a 15TiB filesystem mounted up!
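Since only about 2GB of writes fit on the COW device, and a snapshot whose COW device fills up is marked invalid, it's worth keeping an eye on how full it is. The sketch below parses a 'dmsetup status' line for a snapshot target. The sample numbers are made up, cow_usage is a hypothetical helper name, and it assumes the field after the target name is either allocated/total sectors or a state word such as Invalid.

```shell
# cow_usage is a hypothetical helper: it reads "dmsetup status" output on
# stdin and reports COW usage for any snapshot target line it finds.
cow_usage() {
    awk '{
        for (i = 1; i < NF; i++) {
            if ($i != "snapshot") continue
            if (split($(i + 1), f, "/") == 2)
                printf "used_sectors=%s total_sectors=%s used_pct=%.1f\n",
                       f[1], f[2], 100 * f[1] / f[2]
            else
                print "state=" $(i + 1)      # e.g. Invalid
            next
        }
    }'
}

# Sample line with made-up numbers: 512MiB used out of a 2GiB COW device.
echo "0 32212254720 snapshot 1048576/4194304 16" | cow_usage

# On a live system (as root) this would be: dmsetup status gfs-huge | cow_usage
```

The sample line reports used_pct=25.0.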

Comments:

Henrik Nordstrom said...

Another way of doing pretty much the same thing is to use a sparse loopback device. The difference is that the sparse loopback device does not have a strict allocation limit; allocations can grow as needed until the backing filesystem runs out of free space.

Kal McFate said...

It seems that if you run out of space on your backing block device, it corrupts the entire thing. Any suggestions?
