
Linux 2.6.37: Scalability Improvements Abound

While 2.6.37 might be considered a quiet release, there are some very nice scalability improvements for file systems and one cool new feature that warrant a review.

This year’s holiday kernel was 2.6.37, which was actually released on 4 January 2011 (so perhaps it’s a New Year’s kernel). At first glance it looks like a quiet release: there were no flaming articles on the web and no seriously flawed benchmarks being posted. However, some great things happened in 2.6.37 around file systems, and there is one really cool feature that I’ll talk about at the end of the article.

Improvements for ext4

Ext4 is the proverbial little engine that could. The file system has proven to have remarkably great performance and it solves many (most) of the issues with ext3. However, it is still really limited to 16 TB because the user space tools have not been updated yet (good project if anyone is interested). In 2.6.37, several really cool features were added to ext4, primarily around scalability.

Systems are getting more cores faster than we may realize. A four-socket AMD system with 12 cores per socket, for a total of 48 cores in a single system, is a fairly affordable server. In the 2.6.37 kernel, scalability improvement patches were added to ext4. In particular, ext4 now uses the “bio” layer directly rather than going through the intermediate “buffer” layer, because the buffer layer has a number of performance and SMP scalability issues. The bio layer (bio = Block I/O) is the part of the kernel that sends requests to the I/O scheduler, so submitting I/O to it directly improves both performance and scalability.

As an example of the scalability improvement, an ffsb benchmark on a 48-core AMD system using a 24-disk hardware RAID array with 192 simultaneous ffsb threads showed a 300% performance improvement (400% if journaling was disabled) compared to performance before this patch was applied. Moreover, CPU usage in the benchmark was reduced by a factor of 3-4.

In addition to the scalability patches for ext4, 2.6.37 added a couple of cool new features. The first is that mke2fs, the command that creates an ext-based file system, can now leave the inode table uninitialized. Creating an ext4 file system is therefore very quick, whereas before the whole inode table had to be written out first, which took some time. The inode table still has to be initialized as quickly as possible for the file system to be useful, so on the first mount of the file system the kernel starts a kernel thread that initializes the table.
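To get a feel for why skipping this step makes file system creation so much faster, here is a rough back-of-the-envelope sketch. The 16 KiB bytes-per-inode ratio and the 256-byte inode size are assumptions on my part (they are common mke2fs defaults, but check your own mke2fs.conf), and the function name is purely illustrative.

```python
TIB = 1024 ** 4
GIB = 1024 ** 3


def inode_table_bytes(fs_bytes, bytes_per_inode=16 * 1024, inode_size=256):
    """Estimate how many bytes of inode tables would have to be zeroed.

    Both defaults are assumptions, not values taken from the article.
    """
    inode_count = fs_bytes // bytes_per_inode
    return inode_count * inode_size


if __name__ == "__main__":
    for size_tib in (1, 4, 16):
        table = inode_table_bytes(size_tib * TIB)
        print(f"{size_tib:>2} TiB file system: ~{table / GIB:.0f} GiB of inode tables")
```

Under those assumptions a 1 TiB file system carries about 16 GiB of inode tables, which is the amount of writing that gets deferred to the kernel thread.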

The second patch added batched discard support to ext4. This may sound uneventful, but it brings a big improvement in one area: SSDs. Recall that the TRIM command can give SSDs much better overall performance because unused blocks are marked for erasing, and the erase is then done only when needed. In Linux, the basic concept of TRIM is called discard. This patch adds the ability to do batched discards (multiple blocks at once), allowing the entire file system to be “trimmed” if needed. So far, ext4 is the first file system in Linux to support batched discards.
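From user space, a batched discard is requested through the FITRIM ioctl, which is what the fstrim utility uses. The article does not go into that interface, so treat the following as a minimal sketch rather than a reference implementation. It assumes the generic ioctl number encoding used on x86 and ARM (some architectures encode the direction bits differently), and the function name is hypothetical. Actually calling it requires root and a device that supports discard.

```python
import fcntl
import os
import struct

# struct fstrim_range from <linux/fs.h>: three unsigned 64-bit fields
# (start, len, minlen).
FSTRIM_RANGE = struct.Struct("=QQQ")

# FITRIM is _IOWR('X', 121, struct fstrim_range). This assumes the generic
# ioctl encoding (direction << 30 | size << 16 | type << 8 | nr).
FITRIM = (3 << 30) | (FSTRIM_RANGE.size << 16) | (ord("X") << 8) | 121


def trim_filesystem(mountpoint, minlen=0):
    """Ask the kernel to discard all free space below `mountpoint`.

    Returns the number of bytes the kernel reports as trimmed.
    """
    # start=0 and len=2**64-1 cover the whole file system.
    request = bytearray(FSTRIM_RANGE.pack(0, 2**64 - 1, minlen))
    fd = os.open(mountpoint, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, FITRIM, request)  # the kernel updates `request` in place
    finally:
        os.close(fd)
    _start, trimmed, _minlen = FSTRIM_RANGE.unpack(request)
    return trimmed
```

The point of the batched approach is visible in the request itself: one call describes a whole range, and the file system walks its free space and issues discards for it, instead of sending a discard every time a block is freed.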

In addition to these two major new features in ext4, there was a somewhat minor one that is useful nonetheless. In 2.6.37 a patch was added that allows ext4 to list or “advertise” the features of the particular version of ext4 in sysfs. More specifically, there is a “features” directory in /sys/fs/ext4 that advertises what features are available in ext4 in the running kernel. That can be very useful for people wanting to know, or needing to know, exactly what features are available in the particular version of ext4.
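Reading that directory is about as simple as it gets. Here is a small sketch; the function name is my own, and the exact set of entries you see depends on the kernel you are running.

```python
import os


def ext4_features(features_dir="/sys/fs/ext4/features"):
    """Return the sorted feature names advertised by the running kernel.

    Returns an empty list if the directory does not exist (for example on
    an older kernel, or when ext4 is not loaded).
    """
    try:
        return sorted(os.listdir(features_dir))
    except FileNotFoundError:
        return []


if __name__ == "__main__":
    for name in ext4_features():
        print(name)
```

A script can use this to decide, for instance, whether it is worth asking mke2fs to leave the inode table uninitialized on the current kernel.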

Scalability improvements in xfs

Xfs is one of the highest performing file systems in Linux for certain workloads. It is very popular in the HPC (High Performance Computing) crowd because of the excellent file performance, particularly for larger files. However, it has the reputation of not having very good metadata performance. It is still under heavy development and many of the recent patches have been targeting metadata performance.

In the 2.6.37 kernel release, xfs gained some scalability improvements, in particular for metadata workloads. For example, on an 8-way system, an fs_mark benchmark run with 50 million files improved by over 15%, and the removal of those files improved by over 100%.

Of course, other improvements and features were added to xfs in 2.6.37. In a previous article I mentioned that a new logging option (delayed logging) was added in the 2.6.35 kernel that can improve I/O bandwidth for the log by several orders of magnitude. This can greatly improve metadata performance for really heavy metadata workloads. In 2.6.37, a patch was added that removed the “experimental” label from delayed logging, making it production ready.

Other improvements/changes added to xfs in 2.6.37 are:


  • Project quotas were extended to support 32-bit project IDs
  • XFS_IOC_ZERO_RANGE was introduced, an ioctl that quickly zeroes ranges of a file without changing the layout of the file in any way
  • The buffer cache hash was converted to use rbtrees because it was showing scalability problems. Switching to rbtrees should greatly improve performance and scalability, particularly for systems doing a great deal of I/O.

Btrfs improvements

Everyone’s favorite file-system-in-development, btrfs, had some interesting patches added in 2.6.37. Overall, if you watch the btrfs mailing list, you will see lots of active testing of btrfs. This has resulted in a number of good patches even if they aren’t adding significant new “features”. Several of the patches can be considered “major” while there are also some very good “minor” patches as well.

Probably the most significant feature added to btrfs is caching the free space information on disk. It sounds kind of confusing, so let me explain. Before this patch, if btrfs had to allocate from a block group that was not previously cached, it had to scan the entire extent tree, which took a great deal of time and resources. After this patch, every time a transaction is committed that dirtied a block group, the free space is dumped to the on-disk free space cache. Finding available space in a block group then becomes a simple lookup, greatly improving performance for this situation.

This patch results in a disk format change for btrfs. Recall that btrfs is still in development, so don’t be surprised by disk format changes. However, you can still mount existing btrfs file systems so that this option is not used. In fact, currently, you have to enable the new option explicitly with the “-o space_cache” mount option.
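If you want to check which of your btrfs file systems are actually mounted with the option, one way is to look at /proc/mounts. The sketch below takes the file’s text as an argument so it is easy to try on sample data; the function name is hypothetical, and I am assuming the option shows up in the options column the way it was given at mount time.

```python
def btrfs_mounts_with_option(mounts_text, option="space_cache"):
    """Return mount points of btrfs file systems mounted with `option`.

    `mounts_text` is the content of /proc/mounts, one mount per line:
    "device mountpoint fstype options dump pass".
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] != "btrfs":
            continue
        for opt in fields[3].split(","):
            # Accept both the bare option and an "option=value" form.
            if opt == option or opt.startswith(option + "="):
                found.append(fields[1])
                break
    return found
```

On a live system you would pass it the result of open("/proc/mounts").read().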

Another major feature that was added to btrfs in 2.6.37 was asynchronous snapshot creation. The benefit of this feature is that you don’t have to wait for a new snapshot to be committed to disk. You can use this feature by adding “async” to the “btrfs subvolume snapshot” command.

Believe it or not, the asynchronous snapshot creation capability was added primarily with Ceph in mind. Remember that Ceph was added a few kernel versions ago and is a distributed parallel file system that is still under heavy development. Ceph uses btrfs as the underlying file system (Ceph can arguably be called a meta file system since it is a file system on top of a file system). There is more on Ceph itself later in this article.

A somewhat minor feature that was added to btrfs in 2.6.37 is the ability for unprivileged users to delete sub-volumes. However, a user can only delete a sub-volume if they have “write” and “execute” permission on the sub-volume root inode. The feature is off by default; the “-o user_subvol_rm_allowed” mount option enables it.
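The permission rule is easy to mirror from user space before attempting a delete. The sketch below is only an approximation: the kernel makes the real decision, the mount option still has to be set, and the function name is my own invention.

```python
import os


def may_delete_subvolume(subvol_root):
    """Check for write and execute permission on the sub-volume root.

    This only mirrors the permission rule described above for the calling
    user; it does not check that the file system was mounted with
    user_subvol_rm_allowed.
    """
    return os.access(subvol_root, os.W_OK | os.X_OK)
```

A wrapper script could call this first and print a friendlier message than the error returned by a failed delete.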

Another minor change switched the extent buffer lookup from rbtrees to a radix tree. This switch should reduce the CPU time spent in the extent buffer search and improve performance for some operations (see the commit link for more details).

The last feature for btrfs that I want to mention concerns chunk allocation tuning. This particular patch allows data and metadata block groups to be mixed. According to the Kernel Newbies article on 2.6.37, this should be useful for small storage devices.
