From a0feff3c2cc00c665d2497dd1c4080469ec75df9 Mon Sep 17 00:00:00 2001 From: Waldemar Brodkorb Date: Thu, 22 Oct 2015 22:29:23 +0200 Subject: add aufs patch for 4.1.x --- target/linux/patches/4.1.10/aufs.patch | 35215 +++++++++++++++++++++++++++++++ 1 file changed, 35215 insertions(+) create mode 100644 target/linux/patches/4.1.10/aufs.patch (limited to 'target/linux/patches') diff --git a/target/linux/patches/4.1.10/aufs.patch b/target/linux/patches/4.1.10/aufs.patch new file mode 100644 index 000000000..749c90989 --- /dev/null +++ b/target/linux/patches/4.1.10/aufs.patch @@ -0,0 +1,35215 @@ +diff -Nur linux-4.1.10.orig/Documentation/ABI/testing/debugfs-aufs linux-4.1.10/Documentation/ABI/testing/debugfs-aufs +--- linux-4.1.10.orig/Documentation/ABI/testing/debugfs-aufs 1970-01-01 01:00:00.000000000 +0100 ++++ linux-4.1.10/Documentation/ABI/testing/debugfs-aufs 2015-10-22 21:35:53.000000000 +0200 +@@ -0,0 +1,50 @@ ++What: /debug/aufs/si_/ ++Date: March 2009 ++Contact: J. R. Okajima ++Description: ++ Under /debug/aufs, a directory named si_ is created ++ per aufs mount, where is a unique id generated ++ internally. ++ ++What: /debug/aufs/si_/plink ++Date: Apr 2013 ++Contact: J. R. Okajima ++Description: ++ It has three lines and shows the information about the ++ pseudo-link. The first line is a single number ++ representing a number of buckets. The second line is a ++ number of pseudo-links per buckets (separated by a ++ blank). The last line is a single number representing a ++ total number of psedo-links. ++ When the aufs mount option 'noplink' is specified, it ++ will show "1\n0\n0\n". ++ ++What: /debug/aufs/si_/xib ++Date: March 2009 ++Contact: J. R. Okajima ++Description: ++ It shows the consumed blocks by xib (External Inode Number ++ Bitmap), its block size and file size. ++ When the aufs mount option 'noxino' is specified, it ++ will be empty. About XINO files, see the aufs manual. ++ ++What: /debug/aufs/si_/xino0, xino1 ... xinoN ++Date: March 2009 ++Contact: J. R. Okajima ++Description: ++ It shows the consumed blocks by xino (External Inode Number ++ Translation Table), its link count, block size and file ++ size. ++ When the aufs mount option 'noxino' is specified, it ++ will be empty. About XINO files, see the aufs manual. ++ ++What: /debug/aufs/si_/xigen ++Date: March 2009 ++Contact: J. R. Okajima ++Description: ++ It shows the consumed blocks by xigen (External Inode ++ Generation Table), its block size and file size. ++ If CONFIG_AUFS_EXPORT is disabled, this entry will not ++ be created. ++ When the aufs mount option 'noxino' is specified, it ++ will be empty. About XINO files, see the aufs manual. +diff -Nur linux-4.1.10.orig/Documentation/ABI/testing/sysfs-aufs linux-4.1.10/Documentation/ABI/testing/sysfs-aufs +--- linux-4.1.10.orig/Documentation/ABI/testing/sysfs-aufs 1970-01-01 01:00:00.000000000 +0100 ++++ linux-4.1.10/Documentation/ABI/testing/sysfs-aufs 2015-10-22 21:35:53.000000000 +0200 +@@ -0,0 +1,31 @@ ++What: /sys/fs/aufs/si_/ ++Date: March 2009 ++Contact: J. R. Okajima ++Description: ++ Under /sys/fs/aufs, a directory named si_ is created ++ per aufs mount, where is a unique id generated ++ internally. ++ ++What: /sys/fs/aufs/si_/br0, br1 ... brN ++Date: March 2009 ++Contact: J. R. Okajima ++Description: ++ It shows the abolute path of a member directory (which ++ is called branch) in aufs, and its permission. ++ ++What: /sys/fs/aufs/si_/brid0, brid1 ... bridN ++Date: July 2013 ++Contact: J. R. Okajima ++Description: ++ It shows the id of a member directory (which is called ++ branch) in aufs. ++ ++What: /sys/fs/aufs/si_/xi_path ++Date: March 2009 ++Contact: J. R. Okajima ++Description: ++ It shows the abolute path of XINO (External Inode Number ++ Bitmap, Translation Table and Generation Table) file ++ even if it is the default path. ++ When the aufs mount option 'noxino' is specified, it ++ will be empty. About XINO files, see the aufs manual. +diff -Nur linux-4.1.10.orig/Documentation/filesystems/aufs/design/01intro.txt linux-4.1.10/Documentation/filesystems/aufs/design/01intro.txt +--- linux-4.1.10.orig/Documentation/filesystems/aufs/design/01intro.txt 1970-01-01 01:00:00.000000000 +0100 ++++ linux-4.1.10/Documentation/filesystems/aufs/design/01intro.txt 2015-10-22 21:35:53.000000000 +0200 +@@ -0,0 +1,170 @@ ++ ++# Copyright (C) 2005-2015 Junjiro R. Okajima ++# ++# This program is free software; you can redistribute it and/or modify ++# it under the terms of the GNU General Public License as published by ++# the Free Software Foundation; either version 2 of the License, or ++# (at your option) any later version. ++# ++# This program is distributed in the hope that it will be useful, ++# but WITHOUT ANY WARRANTY; without even the implied warranty of ++# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the ++# GNU General Public License for more details. ++# ++# You should have received a copy of the GNU General Public License ++# along with this program. If not, see . ++ ++Introduction ++---------------------------------------- ++ ++aufs [ei ju: ef es] | [a u f s] ++1. abbrev. for "advanced multi-layered unification filesystem". ++2. abbrev. for "another unionfs". ++3. abbrev. for "auf das" in German which means "on the" in English. ++ Ex. "Butter aufs Brot"(G) means "butter onto bread"(E). ++ But "Filesystem aufs Filesystem" is hard to understand. ++ ++AUFS is a filesystem with features: ++- multi layered stackable unification filesystem, the member directory ++ is called as a branch. ++- branch permission and attribute, 'readonly', 'real-readonly', ++ 'readwrite', 'whiteout-able', 'link-able whiteout', etc. and their ++ combination. ++- internal "file copy-on-write". ++- logical deletion, whiteout. ++- dynamic branch manipulation, adding, deleting and changing permission. ++- allow bypassing aufs, user's direct branch access. ++- external inode number translation table and bitmap which maintains the ++ persistent aufs inode number. ++- seekable directory, including NFS readdir. ++- file mapping, mmap and sharing pages. ++- pseudo-link, hardlink over branches. ++- loopback mounted filesystem as a branch. ++- several policies to select one among multiple writable branches. ++- revert a single systemcall when an error occurs in aufs. ++- and more... ++ ++ ++Multi Layered Stackable Unification Filesystem ++---------------------------------------------------------------------- ++Most people already knows what it is. ++It is a filesystem which unifies several directories and provides a ++merged single directory. When users access a file, the access will be ++passed/re-directed/converted (sorry, I am not sure which English word is ++correct) to the real file on the member filesystem. The member ++filesystem is called 'lower filesystem' or 'branch' and has a mode ++'readonly' and 'readwrite.' And the deletion for a file on the lower ++readonly branch is handled by creating 'whiteout' on the upper writable ++branch. ++ ++On LKML, there have been discussions about UnionMount (Jan Blunck, ++Bharata B Rao and Valerie Aurora) and Unionfs (Erez Zadok). They took ++different approaches to implement the merged-view. ++The former tries putting it into VFS, and the latter implements as a ++separate filesystem. ++(If I misunderstand about these implementations, please let me know and ++I shall correct it. Because it is a long time ago when I read their ++source files last time). ++ ++UnionMount's approach will be able to small, but may be hard to share ++branches between several UnionMount since the whiteout in it is ++implemented in the inode on branch filesystem and always ++shared. According to Bharata's post, readdir does not seems to be ++finished yet. ++There are several missing features known in this implementations such as ++- for users, the inode number may change silently. eg. copy-up. ++- link(2) may break by copy-up. ++- read(2) may get an obsoleted filedata (fstat(2) too). ++- fcntl(F_SETLK) may be broken by copy-up. ++- unnecessary copy-up may happen, for example mmap(MAP_PRIVATE) after ++ open(O_RDWR). ++ ++In linux-3.18, "overlay" filesystem (formerly known as "overlayfs") was ++merged into mainline. This is another implementation of UnionMount as a ++separated filesystem. All the limitations and known problems which ++UnionMount are equally inherited to "overlay" filesystem. ++ ++Unionfs has a longer history. When I started implementing a stackable ++filesystem (Aug 2005), it already existed. It has virtual super_block, ++inode, dentry and file objects and they have an array pointing lower ++same kind objects. After contributing many patches for Unionfs, I ++re-started my project AUFS (Jun 2006). ++ ++In AUFS, the structure of filesystem resembles to Unionfs, but I ++implemented my own ideas, approaches and enhancements and it became ++totally different one. ++ ++Comparing DM snapshot and fs based implementation ++- the number of bytes to be copied between devices is much smaller. ++- the type of filesystem must be one and only. ++- the fs must be writable, no readonly fs, even for the lower original ++ device. so the compression fs will not be usable. but if we use ++ loopback mount, we may address this issue. ++ for instance, ++ mount /cdrom/squashfs.img /sq ++ losetup /sq/ext2.img ++ losetup /somewhere/cow ++ dmsetup "snapshot /dev/loop0 /dev/loop1 ..." ++- it will be difficult (or needs more operations) to extract the ++ difference between the original device and COW. ++- DM snapshot-merge may help a lot when users try merging. in the ++ fs-layer union, users will use rsync(1). ++ ++You may want to read my old paper "Filesystems in LiveCD" ++(http://aufs.sourceforge.net/aufs2/report/sq/sq.pdf). ++ ++ ++Several characters/aspects/persona of aufs ++---------------------------------------------------------------------- ++ ++Aufs has several characters, aspects or persona. ++1. a filesystem, callee of VFS helper ++2. sub-VFS, caller of VFS helper for branches ++3. a virtual filesystem which maintains persistent inode number ++4. reader/writer of files on branches such like an application ++ ++1. Callee of VFS Helper ++As an ordinary linux filesystem, aufs is a callee of VFS. For instance, ++unlink(2) from an application reaches sys_unlink() kernel function and ++then vfs_unlink() is called. vfs_unlink() is one of VFS helper and it ++calls filesystem specific unlink operation. Actually aufs implements the ++unlink operation but it behaves like a redirector. ++ ++2. Caller of VFS Helper for Branches ++aufs_unlink() passes the unlink request to the branch filesystem as if ++it were called from VFS. So the called unlink operation of the branch ++filesystem acts as usual. As a caller of VFS helper, aufs should handle ++every necessary pre/post operation for the branch filesystem. ++- acquire the lock for the parent dir on a branch ++- lookup in a branch ++- revalidate dentry on a branch ++- mnt_want_write() for a branch ++- vfs_unlink() for a branch ++- mnt_drop_write() for a branch ++- release the lock on a branch ++ ++3. Persistent Inode Number ++One of the most important issue for a filesystem is to maintain inode ++numbers. This is particularly important to support exporting a ++filesystem via NFS. Aufs is a virtual filesystem which doesn't have a ++backend block device for its own. But some storage is necessary to ++keep and maintain the inode numbers. It may be a large space and may not ++suit to keep in memory. Aufs rents some space from its first writable ++branch filesystem (by default) and creates file(s) on it. These files ++are created by aufs internally and removed soon (currently) keeping ++opened. ++Note: Because these files are removed, they are totally gone after ++ unmounting aufs. It means the inode numbers are not persistent ++ across unmount or reboot. I have a plan to make them really ++ persistent which will be important for aufs on NFS server. ++ ++4. Read/Write Files Internally (copy-on-write) ++Because a branch can be readonly, when you write a file on it, aufs will ++"copy-up" it to the upper writable branch internally. And then write the ++originally requested thing to the file. Generally kernel doesn't ++open/read/write file actively. In aufs, even a single write may cause a ++internal "file copy". This behaviour is very similar to cp(1) command. ++ ++Some people may think it is better to pass such work to user space ++helper, instead of doing in kernel space. Actually I am still thinking ++about it. But currently I have implemented it in kernel space. +diff -Nur linux-4.1.10.orig/Documentation/filesystems/aufs/design/02struct.txt linux-4.1.10/Documentation/filesystems/aufs/design/02struct.txt +--- linux-4.1.10.orig/Documentation/filesystems/aufs/design/02struct.txt 1970-01-01 01:00:00.000000000 +0100 ++++ linux-4.1.10/Documentation/filesystems/aufs/design/02struct.txt 2015-10-22 21:35:53.000000000 +0200 +@@ -0,0 +1,258 @@ ++ ++# Copyright (C) 2005-2015 Junjiro R. Okajima ++# ++# This program is free software; you can redistribute it and/or modify ++# it under the terms of the GNU General Public License as published by ++# the Free Software Foundation; either version 2 of the License, or ++# (at your option) any later version. ++# ++# This program is distributed in the hope that it will be useful, ++# but WITHOUT ANY WARRANTY; without even the implied warranty of ++# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the ++# GNU General Public License for more details. ++# ++# You should have received a copy of the GNU General Public License ++# along with this program. If not, see . ++ ++Basic Aufs Internal Structure ++ ++Superblock/Inode/Dentry/File Objects ++---------------------------------------------------------------------- ++As like an ordinary filesystem, aufs has its own ++superblock/inode/dentry/file objects. All these objects have a ++dynamically allocated array and store the same kind of pointers to the ++lower filesystem, branch. ++For example, when you build a union with one readwrite branch and one ++readonly, mounted /au, /rw and /ro respectively. ++- /au = /rw + /ro ++- /ro/fileA exists but /rw/fileA ++ ++Aufs lookup operation finds /ro/fileA and gets dentry for that. These ++pointers are stored in a aufs dentry. The array in aufs dentry will be, ++- [0] = NULL (because /rw/fileA doesn't exist) ++- [1] = /ro/fileA ++ ++This style of an array is essentially same to the aufs ++superblock/inode/dentry/file objects. ++ ++Because aufs supports manipulating branches, ie. add/delete/change ++branches dynamically, these objects has its own generation. When ++branches are changed, the generation in aufs superblock is ++incremented. And a generation in other object are compared when it is ++accessed. When a generation in other objects are obsoleted, aufs ++refreshes the internal array. ++ ++ ++Superblock ++---------------------------------------------------------------------- ++Additionally aufs superblock has some data for policies to select one ++among multiple writable branches, XIB files, pseudo-links and kobject. ++See below in detail. ++About the policies which supports copy-down a directory, see ++wbr_policy.txt too. ++ ++ ++Branch and XINO(External Inode Number Translation Table) ++---------------------------------------------------------------------- ++Every branch has its own xino (external inode number translation table) ++file. The xino file is created and unlinked by aufs internally. When two ++members of a union exist on the same filesystem, they share the single ++xino file. ++The struct of a xino file is simple, just a sequence of aufs inode ++numbers which is indexed by the lower inode number. ++In the above sample, assume the inode number of /ro/fileA is i111 and ++aufs assigns the inode number i999 for fileA. Then aufs writes 999 as ++4(8) bytes at 111 * 4(8) bytes offset in the xino file. ++ ++When the inode numbers are not contiguous, the xino file will be sparse ++which has a hole in it and doesn't consume as much disk space as it ++might appear. If your branch filesystem consumes disk space for such ++holes, then you should specify 'xino=' option at mounting aufs. ++ ++Aufs has a mount option to free the disk blocks for such holes in XINO ++files on tmpfs or ramdisk. But it is not so effective actually. If you ++meet a problem of disk shortage due to XINO files, then you should try ++"tmpfs-ino.patch" (and "vfs-ino.patch" too) in aufs4-standalone.git. ++The patch localizes the assignment inumbers per tmpfs-mount and avoid ++the holes in XINO files. ++ ++Also a writable branch has three kinds of "whiteout bases". All these ++are existed when the branch is joined to aufs, and their names are ++whiteout-ed doubly, so that users will never see their names in aufs ++hierarchy. ++1. a regular file which will be hardlinked to all whiteouts. ++2. a directory to store a pseudo-link. ++3. a directory to store an "orphan"-ed file temporary. ++ ++1. Whiteout Base ++ When you remove a file on a readonly branch, aufs handles it as a ++ logical deletion and creates a whiteout on the upper writable branch ++ as a hardlink of this file in order not to consume inode on the ++ writable branch. ++2. Pseudo-link Dir ++ See below, Pseudo-link. ++3. Step-Parent Dir ++ When "fileC" exists on the lower readonly branch only and it is ++ opened and removed with its parent dir, and then user writes ++ something into it, then aufs copies-up fileC to this ++ directory. Because there is no other dir to store fileC. After ++ creating a file under this dir, the file is unlinked. ++ ++Because aufs supports manipulating branches, ie. add/delete/change ++dynamically, a branch has its own id. When the branch order changes, ++aufs finds the new index by searching the branch id. ++ ++ ++Pseudo-link ++---------------------------------------------------------------------- ++Assume "fileA" exists on the lower readonly branch only and it is ++hardlinked to "fileB" on the branch. When you write something to fileA, ++aufs copies-up it to the upper writable branch. Additionally aufs ++creates a hardlink under the Pseudo-link Directory of the writable ++branch. The inode of a pseudo-link is kept in aufs super_block as a ++simple list. If fileB is read after unlinking fileA, aufs returns ++filedata from the pseudo-link instead of the lower readonly ++branch. Because the pseudo-link is based upon the inode, to keep the ++inode number by xino (see above) is essentially necessary. ++ ++All the hardlinks under the Pseudo-link Directory of the writable branch ++should be restored in a proper location later. Aufs provides a utility ++to do this. The userspace helpers executed at remounting and unmounting ++aufs by default. ++During this utility is running, it puts aufs into the pseudo-link ++maintenance mode. In this mode, only the process which began the ++maintenance mode (and its child processes) is allowed to operate in ++aufs. Some other processes which are not related to the pseudo-link will ++be allowed to run too, but the rest have to return an error or wait ++until the maintenance mode ends. If a process already acquires an inode ++mutex (in VFS), it has to return an error. ++ ++ ++XIB(external inode number bitmap) ++---------------------------------------------------------------------- ++Addition to the xino file per a branch, aufs has an external inode number ++bitmap in a superblock object. It is also an internal file such like a ++xino file. ++It is a simple bitmap to mark whether the aufs inode number is in-use or ++not. ++To reduce the file I/O, aufs prepares a single memory page to cache xib. ++ ++As well as XINO files, aufs has a feature to truncate/refresh XIB to ++reduce the number of consumed disk blocks for these files. ++ ++ ++Virtual or Vertical Dir, and Readdir in Userspace ++---------------------------------------------------------------------- ++In order to support multiple layers (branches), aufs readdir operation ++constructs a virtual dir block on memory. For readdir, aufs calls ++vfs_readdir() internally for each dir on branches, merges their entries ++with eliminating the whiteout-ed ones, and sets it to file (dir) ++object. So the file object has its entry list until it is closed. The ++entry list will be updated when the file position is zero and becomes ++obsoleted. This decision is made in aufs automatically. ++ ++The dynamically allocated memory block for the name of entries has a ++unit of 512 bytes (by default) and stores the names contiguously (no ++padding). Another block for each entry is handled by kmem_cache too. ++During building dir blocks, aufs creates hash list and judging whether ++the entry is whiteouted by its upper branch or already listed. ++The merged result is cached in the corresponding inode object and ++maintained by a customizable life-time option. ++ ++Some people may call it can be a security hole or invite DoS attack ++since the opened and once readdir-ed dir (file object) holds its entry ++list and becomes a pressure for system memory. But I'd say it is similar ++to files under /proc or /sys. The virtual files in them also holds a ++memory page (generally) while they are opened. When an idea to reduce ++memory for them is introduced, it will be applied to aufs too. ++For those who really hate this situation, I've developed readdir(3) ++library which operates this merging in userspace. You just need to set ++LD_PRELOAD environment variable, and aufs will not consume no memory in ++kernel space for readdir(3). ++ ++ ++Workqueue ++---------------------------------------------------------------------- ++Aufs sometimes requires privilege access to a branch. For instance, ++in copy-up/down operation. When a user process is going to make changes ++to a file which exists in the lower readonly branch only, and the mode ++of one of ancestor directories may not be writable by a user ++process. Here aufs copy-up the file with its ancestors and they may ++require privilege to set its owner/group/mode/etc. ++This is a typical case of a application character of aufs (see ++Introduction). ++ ++Aufs uses workqueue synchronously for this case. It creates its own ++workqueue. The workqueue is a kernel thread and has privilege. Aufs ++passes the request to call mkdir or write (for example), and wait for ++its completion. This approach solves a problem of a signal handler ++simply. ++If aufs didn't adopt the workqueue and changed the privilege of the ++process, then the process may receive the unexpected SIGXFSZ or other ++signals. ++ ++Also aufs uses the system global workqueue ("events" kernel thread) too ++for asynchronous tasks, such like handling inotify/fsnotify, re-creating a ++whiteout base and etc. This is unrelated to a privilege. ++Most of aufs operation tries acquiring a rw_semaphore for aufs ++superblock at the beginning, at the same time waits for the completion ++of all queued asynchronous tasks. ++ ++ ++Whiteout ++---------------------------------------------------------------------- ++The whiteout in aufs is very similar to Unionfs's. That is represented ++by its filename. UnionMount takes an approach of a file mode, but I am ++afraid several utilities (find(1) or something) will have to support it. ++ ++Basically the whiteout represents "logical deletion" which stops aufs to ++lookup further, but also it represents "dir is opaque" which also stop ++further lookup. ++ ++In aufs, rmdir(2) and rename(2) for dir uses whiteout alternatively. ++In order to make several functions in a single systemcall to be ++revertible, aufs adopts an approach to rename a directory to a temporary ++unique whiteouted name. ++For example, in rename(2) dir where the target dir already existed, aufs ++renames the target dir to a temporary unique whiteouted name before the ++actual rename on a branch, and then handles other actions (make it opaque, ++update the attributes, etc). If an error happens in these actions, aufs ++simply renames the whiteouted name back and returns an error. If all are ++succeeded, aufs registers a function to remove the whiteouted unique ++temporary name completely and asynchronously to the system global ++workqueue. ++ ++ ++Copy-up ++---------------------------------------------------------------------- ++It is a well-known feature or concept. ++When user modifies a file on a readonly branch, aufs operate "copy-up" ++internally and makes change to the new file on the upper writable branch. ++When the trigger systemcall does not update the timestamps of the parent ++dir, aufs reverts it after copy-up. ++ ++ ++Move-down (aufs3.9 and later) ++---------------------------------------------------------------------- ++"Copy-up" is one of the essential feature in aufs. It copies a file from ++the lower readonly branch to the upper writable branch when a user ++changes something about the file. ++"Move-down" is an opposite action of copy-up. Basically this action is ++ran manually instead of automatically and internally. ++For desgin and implementation, aufs has to consider these issues. ++- whiteout for the file may exist on the lower branch. ++- ancestor directories may not exist on the lower branch. ++- diropq for the ancestor directories may exist on the upper branch. ++- free space on the lower branch will reduce. ++- another access to the file may happen during moving-down, including ++ UDBA (see "Revalidate Dentry and UDBA"). ++- the file should not be hard-linked nor pseudo-linked. they should be ++ handled by auplink utility later. ++ ++Sometimes users want to move-down a file from the upper writable branch ++to the lower readonly or writable branch. For instance, ++- the free space of the upper writable branch is going to run out. ++- create a new intermediate branch between the upper and lower branch. ++- etc. ++ ++For this purpose, use "aumvdown" command in aufs-util.git. +diff -Nur linux-4.1.10.orig/Documentation/filesystems/aufs/design/03atomic_open.txt linux-4.1.10/Documentation/filesystems/aufs/design/03atomic_open.txt +--- linux-4.1.10.orig/Documentation/filesystems/aufs/design/03atomic_open.txt 1970-01-01 01:00:00.000000000 +0100 ++++ linux-4.1.10/Documentation/filesystems/aufs/design/03atomic_open.txt 2015-10-22 21:35:53.000000000 +0200 +@@ -0,0 +1,85 @@ ++ ++# Copyright (C) 2015 Junjiro R. Okajima ++# ++# This program is free software; you can redistribute it and/or modify ++# it under the terms of the GNU General Public License as published by ++# the Free Software Foundation; either version 2 of the License, or ++# (at your option) any later version. ++# ++# This program is distributed in the hope that it will be useful, ++# but WITHOUT ANY WARRANTY; without even the implied warranty of ++# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the ++# GNU General Public License for more details. ++# ++# You should have received a copy of the GNU General Public License ++# along with this program. If not, see . ++ ++Support for a branch who has its ->atomic_open() ++---------------------------------------------------------------------- ++The filesystems who implement its ->atomic_open() are not majority. For ++example NFSv4 does, and aufs should call NFSv4 ->atomic_open, ++particularly for open(O_CREAT|O_EXCL, 0400) case. Other than ++->atomic_open(), NFSv4 returns an error for this open(2). While I am not ++sure whether all filesystems who have ->atomic_open() behave like this, ++but NFSv4 surely returns the error. ++ ++In order to support ->atomic_open() for aufs, there are a few ++approaches. ++ ++A. Introduce aufs_atomic_open() ++ - calls one of VFS:do_last(), lookup_open() or atomic_open() for ++ branch fs. ++B. Introduce aufs_atomic_open() calling create, open and chmod. this is ++ an aufs user Pip Cet's approach ++ - calls aufs_create(), VFS finish_open() and notify_change(). ++ - pass fake-mode to finish_open(), and then correct the mode by ++ notify_change(). ++C. Extend aufs_open() to call branch fs's ->atomic_open() ++ - no aufs_atomic_open(). ++ - aufs_lookup() registers the TID to an aufs internal object. ++ - aufs_create() does nothing when the matching TID is registered, but ++ registers the mode. ++ - aufs_open() calls branch fs's ->atomic_open() when the matching ++ TID is registered. ++D. Extend aufs_open() to re-try branch fs's ->open() with superuser's ++ credential ++ - no aufs_atomic_open(). ++ - aufs_create() registers the TID to an internal object. this info ++ represents "this process created this file just now." ++ - when aufs gets EACCES from branch fs's ->open(), then confirm the ++ registered TID and re-try open() with superuser's credential. ++ ++Pros and cons for each approach. ++ ++A. ++ - straightforward but highly depends upon VFS internal. ++ - the atomic behavaiour is kept. ++ - some of parameters such as nameidata are hard to reproduce for ++ branch fs. ++ - large overhead. ++B. ++ - easy to implement. ++ - the atomic behavaiour is lost. ++C. ++ - the atomic behavaiour is kept. ++ - dirty and tricky. ++ - VFS checks whether the file is created correctly after calling ++ ->create(), which means this approach doesn't work. ++D. ++ - easy to implement. ++ - the atomic behavaiour is lost. ++ - to open a file with superuser's credential and give it to a user ++ process is a bad idea, since the file object keeps the credential ++ in it. It may affect LSM or something. This approach doesn't work ++ either. ++ ++The approach A is ideal, but it hard to implement. So here is a ++variation of A, which is to be implemented. ++ ++A-1. Introduce aufs_atomic_open() ++ - calls branch fs ->atomic_open() if exists. otherwise calls ++ vfs_create() and finish_open(). ++ - the demerit is that the several checks after branch fs ++ ->atomic_open() are lost. in the ordinary case, the checks are ++ done by VFS:do_last(), lookup_open() and atomic_open(). some can ++ be implemented in aufs, but not all I am afraid. +diff -Nur linux-4.1.10.orig/Documentation/filesystems/aufs/design/03lookup.txt linux-4.1.10/Documentation/filesystems/aufs/design/03lookup.txt +--- linux-4.1.10.orig/Documentation/filesystems/aufs/design/03lookup.txt 1970-01-01 01:00:00.000000000 +0100 ++++ linux-4.1.10/Documentation/filesystems/aufs/design/03lookup.txt 2015-10-22 21:35:53.000000000 +0200 +@@ -0,0 +1,113 @@ ++ ++# Copyright (C) 2005-2015 Junjiro R. Okajima ++# ++# This program is free software; you can redistribute it and/or modify ++# it under the terms of the GNU General Public License as published by ++# the Free Software Foundation; either version 2 of the License, or ++# (at your option) any later version. ++# ++# This program is distributed in the hope that it will be useful, ++# but WITHOUT ANY WARRANTY; without even the implied warranty of ++# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the ++# GNU General Public License for more details. ++# ++# You should have received a copy of the GNU General Public License ++# along with this program. If not, see . ++ ++Lookup in a Branch ++---------------------------------------------------------------------- ++Since aufs has a character of sub-VFS (see Introduction), it operates ++lookup for branches as VFS does. It may be a heavy work. But almost all ++lookup operation in aufs is the simplest case, ie. lookup only an entry ++directly connected to its parent. Digging down the directory hierarchy ++is unnecessary. VFS has a function lookup_one_len() for that use, and ++aufs calls it. ++ ++When a branch is a remote filesystem, aufs basically relies upon its ++->d_revalidate(), also aufs forces the hardest revalidate tests for ++them. ++For d_revalidate, aufs implements three levels of revalidate tests. See ++"Revalidate Dentry and UDBA" in detail. ++ ++ ++Test Only the Highest One for the Directory Permission (dirperm1 option) ++---------------------------------------------------------------------- ++Let's try case study. ++- aufs has two branches, upper readwrite and lower readonly. ++ /au = /rw + /ro ++- "dirA" exists under /ro, but /rw. and its mode is 0700. ++- user invoked "chmod a+rx /au/dirA" ++- the internal copy-up is activated and "/rw/dirA" is created and its ++ permission bits are set to world readable. ++- then "/au/dirA" becomes world readable? ++ ++In this case, /ro/dirA is still 0700 since it exists in readonly branch, ++or it may be a natively readonly filesystem. If aufs respects the lower ++branch, it should not respond readdir request from other users. But user ++allowed it by chmod. Should really aufs rejects showing the entries ++under /ro/dirA? ++ ++To be honest, I don't have a good solution for this case. So aufs ++implements 'dirperm1' and 'nodirperm1' mount options, and leave it to ++users. ++When dirperm1 is specified, aufs checks only the highest one for the ++directory permission, and shows the entries. Otherwise, as usual, checks ++every dir existing on all branches and rejects the request. ++ ++As a side effect, dirperm1 option improves the performance of aufs ++because the number of permission check is reduced when the number of ++branch is many. ++ ++ ++Revalidate Dentry and UDBA (User's Direct Branch Access) ++---------------------------------------------------------------------- ++Generally VFS helpers re-validate a dentry as a part of lookup. ++0. digging down the directory hierarchy. ++1. lock the parent dir by its i_mutex. ++2. lookup the final (child) entry. ++3. revalidate it. ++4. call the actual operation (create, unlink, etc.) ++5. unlock the parent dir ++ ++If the filesystem implements its ->d_revalidate() (step 3), then it is ++called. Actually aufs implements it and checks the dentry on a branch is ++still valid. ++But it is not enough. Because aufs has to release the lock for the ++parent dir on a branch at the end of ->lookup() (step 2) and ++->d_revalidate() (step 3) while the i_mutex of the aufs dir is still ++held by VFS. ++If the file on a branch is changed directly, eg. bypassing aufs, after ++aufs released the lock, then the subsequent operation may cause ++something unpleasant result. ++ ++This situation is a result of VFS architecture, ->lookup() and ++->d_revalidate() is separated. But I never say it is wrong. It is a good ++design from VFS's point of view. It is just not suitable for sub-VFS ++character in aufs. ++ ++Aufs supports such case by three level of revalidation which is ++selectable by user. ++1. Simple Revalidate ++ Addition to the native flow in VFS's, confirm the child-parent ++ relationship on the branch just after locking the parent dir on the ++ branch in the "actual operation" (step 4). When this validation ++ fails, aufs returns EBUSY. ->d_revalidate() (step 3) in aufs still ++ checks the validation of the dentry on branches. ++2. Monitor Changes Internally by Inotify/Fsnotify ++ Addition to above, in the "actual operation" (step 4) aufs re-lookup ++ the dentry on the branch, and returns EBUSY if it finds different ++ dentry. ++ Additionally, aufs sets the inotify/fsnotify watch for every dir on branches ++ during it is in cache. When the event is notified, aufs registers a ++ function to kernel 'events' thread by schedule_work(). And the ++ function sets some special status to the cached aufs dentry and inode ++ private data. If they are not cached, then aufs has nothing to ++ do. When the same file is accessed through aufs (step 0-3) later, ++ aufs will detect the status and refresh all necessary data. ++ In this mode, aufs has to ignore the event which is fired by aufs ++ itself. ++3. No Extra Validation ++ This is the simplest test and doesn't add any additional revalidation ++ test, and skip the revalidation in step 4. It is useful and improves ++ aufs performance when system surely hide the aufs branches from user, ++ by over-mounting something (or another method). +diff -Nur linux-4.1.10.orig/Documentation/filesystems/aufs/design/04branch.txt linux-4.1.10/Documentation/filesystems/aufs/design/04branch.txt +--- linux-4.1.10.orig/Documentation/filesystems/aufs/design/04branch.txt 1970-01-01 01:00:00.000000000 +0100 ++++ linux-4.1.10/Documentation/filesystems/aufs/design/04branch.txt 2015-10-22 21:35:53.000000000 +0200 +@@ -0,0 +1,74 @@ ++ ++# Copyright (C) 2005-2015 Junjiro R. Okajima ++# ++# This program is free software; you can redistribute it and/or modify ++# it under the terms of the GNU General Public License as published by ++# the Free Software Foundation; either version 2 of the License, or ++# (at your option) any later version. ++# ++# This program is distributed in the hope that it will be useful, ++# but WITHOUT ANY WARRANTY; without even the implied warranty of ++# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the ++# GNU General Public License for more details. ++# ++# You should have received a copy of the GNU General Public License ++# along with this program. If not, see . ++ ++Branch Manipulation ++ ++Since aufs supports dynamic branch manipulation, ie. add/remove a branch ++and changing its permission/attribute, there are a lot of works to do. ++ ++ ++Add a Branch ++---------------------------------------------------------------------- ++o Confirm the adding dir exists outside of aufs, including loopback ++ mount, and its various attributes. ++o Initialize the xino file and whiteout bases if necessary. ++ See struct.txt. ++ ++o Check the owner/group/mode of the directory ++ When the owner/group/mode of the adding directory differs from the ++ existing branch, aufs issues a warning because it may impose a ++ security risk. ++ For example, when a upper writable branch has a world writable empty ++ top directory, a malicious user can create any files on the writable ++ branch directly, like copy-up and modify manually. If something like ++ /etc/{passwd,shadow} exists on the lower readonly branch but the upper ++ writable branch, and the writable branch is world-writable, then a ++ malicious guy may create /etc/passwd on the writable branch directly ++ and the infected file will be valid in aufs. ++ I am afraid it can be a security issue, but aufs can do nothing except ++ producing a warning. ++ ++ ++Delete a Branch ++---------------------------------------------------------------------- ++o Confirm the deleting branch is not busy ++ To be general, there is one merit to adopt "remount" interface to ++ manipulate branches. It is to discard caches. At deleting a branch, ++ aufs checks the still cached (and connected) dentries and inodes. If ++ there are any, then they are all in-use. An inode without its ++ corresponding dentry can be alive alone (for example, inotify/fsnotify case). ++ ++ For the cached one, aufs checks whether the same named entry exists on ++ other branches. ++ If the cached one is a directory, because aufs provides a merged view ++ to users, as long as one dir is left on any branch aufs can show the ++ dir to users. In this case, the branch can be removed from aufs. ++ Otherwise aufs rejects deleting the branch. ++ ++ If any file on the deleting branch is opened by aufs, then aufs ++ rejects deleting. ++ ++ ++Modify the Permission of a Branch ++---------------------------------------------------------------------- ++o Re-initialize or remove the xino file and whiteout bases if necessary. ++ See struct.txt. ++ ++o rw --> ro: Confirm the modifying branch is not busy ++ Aufs rejects the request if any of these conditions are true. ++ - a file on the branch is mmap-ed. ++ - a regular file on the branch is opened for write and there is no ++ same named entry on the upper branch. +diff -Nur linux-4.1.10.orig/Documentation/filesystems/aufs/design/05wbr_policy.txt linux-4.1.10/Documentation/filesystems/aufs/design/05wbr_policy.txt +--- linux-4.1.10.orig/Documentation/filesystems/aufs/design/05wbr_policy.txt 1970-01-01 01:00:00.000000000 +0100 ++++ linux-4.1.10/Documentation/filesystems/aufs/design/05wbr_policy.txt 2015-10-22 21:35:53.000000000 +0200 +@@ -0,0 +1,64 @@ ++ ++# Copyright (C) 2005-2015 Junjiro R. Okajima ++# ++# This program is free software; you can redistribute it and/or modify ++# it under the terms of the GNU General Public License as published by ++# the Free Software Foundation; either version 2 of the License, or ++# (at your option) any later version. ++# ++# This program is distributed in the hope that it will be useful, ++# but WITHOUT ANY WARRANTY; without even the implied warranty of ++# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the ++# GNU General Public License for more details. ++# ++# You should have received a copy of the GNU General Public License ++# along with this program. If not, see . ++ ++Policies to Select One among Multiple Writable Branches ++---------------------------------------------------------------------- ++When the number of writable branch is more than one, aufs has to decide ++the target branch for file creation or copy-up. By default, the highest ++writable branch which has the parent (or ancestor) dir of the target ++file is chosen (top-down-parent policy). ++By user's request, aufs implements some other policies to select the ++writable branch, for file creation several policies, round-robin, ++most-free-space, and other policies. For copy-up, top-down-parent, ++bottom-up-parent, bottom-up and others. ++ ++As expected, the round-robin policy selects the branch in circular. When ++you have two writable branches and creates 10 new files, 5 files will be ++created for each branch. mkdir(2) systemcall is an exception. When you ++create 10 new directories, all will be created on the same branch. ++And the most-free-space policy selects the one which has most free ++space among the writable branches. The amount of free space will be ++checked by aufs internally, and users can specify its time interval. ++ ++The policies for copy-up is more simple, ++top-down-parent is equivalent to the same named on in create policy, ++bottom-up-parent selects the writable branch where the parent dir ++exists and the nearest upper one from the copyup-source, ++bottom-up selects the nearest upper writable branch from the ++copyup-source, regardless the existence of the parent dir. ++ ++There are some rules or exceptions to apply these policies. ++- If there is a readonly branch above the policy-selected branch and ++ the parent dir is marked as opaque (a variation of whiteout), or the ++ target (creating) file is whiteout-ed on the upper readonly branch, ++ then the result of the policy is ignored and the target file will be ++ created on the nearest upper writable branch than the readonly branch. ++- If there is a writable branch above the policy-selected branch and ++ the parent dir is marked as opaque or the target file is whiteouted ++ on the branch, then the result of the policy is ignored and the target ++ file will be created on the highest one among the upper writable ++ branches who has diropq or whiteout. In case of whiteout, aufs removes ++ it as usual. ++- link(2) and rename(2) systemcalls are exceptions in every policy. ++ They try selecting the branch where the source exists as possible ++ since copyup a large file will take long time. If it can't be, ++ ie. the branch where the source exists is readonly, then they will ++ follow the copyup policy. ++- There is an exception for rename(2) when the target exists. ++ If the rename target exists, aufs compares the index of the branches ++ where the source and the target exists and selects the higher ++ one. If the selected branch is readonly, then aufs follows the ++ copyup policy. +diff -Nur linux-4.1.10.orig/Documentation/filesystems/aufs/design/06fhsm.txt linux-4.1.10/Documentation/filesystems/aufs/design/06fhsm.txt +--- linux-4.1.10.orig/Documentation/filesystems/aufs/design/06fhsm.txt 1970-01-01 01:00:00.000000000 +0100 ++++ linux-4.1.10/Documentation/filesystems/aufs/design/06fhsm.txt 2015-10-22 21:35:53.000000000 +0200 +@@ -0,0 +1,120 @@ ++ ++# Copyright (C) 2011-2015 Junjiro R. Okajima ++# ++# This program is free software; you can redistribute it and/or modify ++# it under the terms of the GNU General Public License as published by ++# the Free Software Foundation; either version 2 of the License, or ++# (at your option) any later version. ++# ++# This program is distributed in the hope that it will be useful, ++# but WITHOUT ANY WARRANTY; without even the implied warranty of ++# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the ++# GNU General Public License for more details. ++# ++# You should have received a copy of the GNU General Public License ++# along with this program; if not, write to the Free Software ++# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA ++ ++ ++File-based Hierarchical Storage Management (FHSM) ++---------------------------------------------------------------------- ++Hierarchical Storage Management (or HSM) is a well-known feature in the ++storage world. Aufs provides this feature as file-based with multiple ++writable branches, based upon the principle of "Colder, the Lower". ++Here the word "colder" means that the less used files, and "lower" means ++that the position in the order of the stacked branches vertically. ++These multiple writable branches are prioritized, ie. the topmost one ++should be the fastest drive and be used heavily. ++ ++o Characters in aufs FHSM story ++- aufs itself and a new branch attribute. ++- a new ioctl interface to move-down and to establish a connection with ++ the daemon ("move-down" is a converse of "copy-up"). ++- userspace tool and daemon. ++ ++The userspace daemon establishes a connection with aufs and waits for ++the notification. The notified information is very similar to struct ++statfs containing the number of consumed blocks and inodes. ++When the consumed blocks/inodes of a branch exceeds the user-specified ++upper watermark, the daemon activates its move-down process until the ++consumed blocks/inodes reaches the user-specified lower watermark. ++ ++The actual move-down is done by aufs based upon the request from ++user-space since we need to maintain the inode number and the internal ++pointer arrays in aufs. ++ ++Currently aufs FHSM handles the regular files only. Additionally they ++must not be hard-linked nor pseudo-linked. ++ ++ ++o Cowork of aufs and the user-space daemon ++ During the userspace daemon established the connection, aufs sends a ++ small notification to it whenever aufs writes something into the ++ writable branch. But it may cost high since aufs issues statfs(2) ++ internally. So user can specify a new option to cache the ++ info. Actually the notification is controlled by these factors. ++ + the specified cache time. ++ + classified as "force" by aufs internally. ++ Until the specified time expires, aufs doesn't send the info ++ except the forced cases. When aufs decide forcing, the info is always ++ notified to userspace. ++ For example, the number of free inodes is generally large enough and ++ the shortage of it happens rarely. So aufs doesn't force the ++ notification when creating a new file, directory and others. This is ++ the typical case which aufs doesn't force. ++ When aufs writes the actual filedata and the files consumes any of new ++ blocks, the aufs forces notifying. ++ ++ ++o Interfaces in aufs ++- New branch attribute. ++ + fhsm ++ Specifies that the branch is managed by FHSM feature. In other word, ++ participant in the FHSM. ++ When nofhsm is set to the branch, it will not be the source/target ++ branch of the move-down operation. This attribute is set ++ independently from coo and moo attributes, and if you want full ++ FHSM, you should specify them as well. ++- New mount option. ++ + fhsm_sec ++ Specifies a second to suppress many less important info to be ++ notified. ++- New ioctl. ++ + AUFS_CTL_FHSM_FD ++ create a new file descriptor which userspace can read the notification ++ (a subset of struct statfs) from aufs. ++- Module parameter 'brs' ++ It has to be set to 1. Otherwise the new mount option 'fhsm' will not ++ be set. ++- mount helpers /sbin/mount.aufs and /sbin/umount.aufs ++ When there are two or more branches with fhsm attributes, ++ /sbin/mount.aufs invokes the user-space daemon and /sbin/umount.aufs ++ terminates it. As a result of remounting and branch-manipulation, the ++ number of branches with fhsm attribute can be one. In this case, ++ /sbin/mount.aufs will terminate the user-space daemon. ++ ++ ++Finally the operation is done as these steps in kernel-space. ++- make sure that, ++ + no one else is using the file. ++ + the file is not hard-linked. ++ + the file is not pseudo-linked. ++ + the file is a regular file. ++ + the parent dir is not opaqued. ++- find the target writable branch. ++- make sure the file is not whiteout-ed by the upper (than the target) ++ branch. ++- make the parent dir on the target branch. ++- mutex lock the inode on the branch. ++- unlink the whiteout on the target branch (if exists). ++- lookup and create the whiteout-ed temporary name on the target branch. ++- copy the file as the whiteout-ed temporary name on the target branch. ++- rename the whiteout-ed temporary name to the original name. ++- unlink the file on the source branch. ++- maintain the internal pointer array and the external inode number ++ table (XINO). ++- maintain the timestamps and other attributes of the parent dir and the ++ file. ++ ++And of course, in every step, an error may happen. So the operation ++should restore the original file state after an error happens. +diff -Nur linux-4.1.10.orig/Documentation/filesystems/aufs/design/06mmap.txt linux-4.1.10/Documentation/filesystems/aufs/design/06mmap.txt +--- linux-4.1.10.orig/Documentation/filesystems/aufs/design/06mmap.txt 1970-01-01 01:00:00.000000000 +0100 ++++ linux-4.1.10/Documentation/filesystems/aufs/design/06mmap.txt 2015-10-22 21:35:53.000000000 +0200 +@@ -0,0 +1,72 @@ ++ ++# Copyright (C) 2005-2015 Junjiro R. Okajima ++# ++# This program is free software; you can redistribute it and/or modify ++# it under the terms of the GNU General Public License as published by ++# the Free Software Foundation; either version 2 of the License, or ++# (at your option) any later version. ++# ++# This program is distributed in the hope that it will be useful, ++# but WITHOUT ANY WARRANTY; without even the implied warranty of ++# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the ++# GNU General Public License for more details. ++# ++# You should have received a copy of the GNU General Public License ++# along with this program. If not, see . ++ ++mmap(2) -- File Memory Mapping ++---------------------------------------------------------------------- ++In aufs, the file-mapped pages are handled by a branch fs directly, no ++interaction with aufs. It means aufs_mmap() calls the branch fs's ++->mmap(). ++This approach is simple and good, but there is one problem. ++Under /proc, several entries show the mmapped files by its path (with ++device and inode number), and the printed path will be the path on the ++branch fs's instead of virtual aufs's. ++This is not a problem in most cases, but some utilities lsof(1) (and its ++user) may expect the path on aufs. ++ ++To address this issue, aufs adds a new member called vm_prfile in struct ++vm_area_struct (and struct vm_region). The original vm_file points to ++the file on the branch fs in order to handle everything correctly as ++usual. The new vm_prfile points to a virtual file in aufs, and the ++show-functions in procfs refers to vm_prfile if it is set. ++Also we need to maintain several other places where touching vm_file ++such like ++- fork()/clone() copies vma and the reference count of vm_file is ++ incremented. ++- merging vma maintains the ref count too. ++ ++This is not a good approach. It just fakes the printed path. But it ++leaves all behaviour around f_mapping unchanged. This is surely an ++advantage. ++Actually aufs had adopted another complicated approach which calls ++generic_file_mmap() and handles struct vm_operations_struct. In this ++approach, aufs met a hard problem and I could not solve it without ++switching the approach. ++ ++There may be one more another approach which is ++- bind-mount the branch-root onto the aufs-root internally ++- grab the new vfsmount (ie. struct mount) ++- lazy-umount the branch-root internally ++- in open(2) the aufs-file, open the branch-file with the hidden ++ vfsmount (instead of the original branch's vfsmount) ++- ideally this "bind-mount and lazy-umount" should be done atomically, ++ but it may be possible from userspace by the mount helper. ++ ++Adding the internal hidden vfsmount and using it in opening a file, the ++file path under /proc will be printed correctly. This approach looks ++smarter, but is not possible I am afraid. ++- aufs-root may be bind-mount later. when it happens, another hidden ++ vfsmount will be required. ++- it is hard to get the chance to bind-mount and lazy-umount ++ + in kernel-space, FS can have vfsmount in open(2) via ++ file->f_path, and aufs can know its vfsmount. But several locks are ++ already acquired, and if aufs tries to bind-mount and lazy-umount ++ here, then it may cause a deadlock. ++ + in user-space, bind-mount doesn't invoke the mount helper. ++- since /proc shows dev and ino, aufs has to give vma these info. it ++ means a new member vm_prinode will be necessary. this is essentially ++ equivalent to vm_prfile described above. ++ ++I have to give up this "looks-smater" approach. +diff -Nur linux-4.1.10.orig/Documentation/filesystems/aufs/design/06xattr.txt linux-4.1.10/Documentation/filesystems/aufs/design/06xattr.txt +--- linux-4.1.10.orig/Documentation/filesystems/aufs/design/06xattr.txt 1970-01-01 01:00:00.000000000 +0100 ++++ linux-4.1.10/Documentation/filesystems/aufs/design/06xattr.txt 2015-10-22 21:35:53.000000000 +0200 +@@ -0,0 +1,96 @@ ++ ++# Copyright (C) 2014-2015 Junjiro R. Okajima ++# ++# This program is free software; you can redistribute it and/or modify ++# it under the terms of the GNU General Public License as published by ++# the Free Software Foundation; either version 2 of the License, or ++# (at your option) any later version. ++# ++# This program is distributed in the hope that it will be useful, ++# but WITHOUT ANY WARRANTY; without even the implied warranty of ++# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the ++# GNU General Public License for more details. ++# ++# You should have received a copy of the GNU General Public License ++# along with this program; if not, write to the Free Software ++# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA ++ ++ ++Listing XATTR/EA and getting the value ++---------------------------------------------------------------------- ++For the inode standard attributes (owner, group, timestamps, etc.), aufs ++shows the values from the topmost existing file. This behaviour is good ++for the non-dir entries since the bahaviour exactly matches the shown ++information. But for the directories, aufs considers all the same named ++entries on the lower branches. Which means, if one of the lower entry ++rejects readdir call, then aufs returns an error even if the topmost ++entry allows it. This behaviour is necessary to respect the branch fs's ++security, but can make users confused since the user-visible standard ++attributes don't match the behaviour. ++To address this issue, aufs has a mount option called dirperm1 which ++checks the permission for the topmost entry only, and ignores the lower ++entry's permission. ++ ++A similar issue can happen around XATTR. ++getxattr(2) and listxattr(2) families behave as if dirperm1 option is ++always set. Otherwise these very unpleasant situation would happen. ++- listxattr(2) may return the duplicated entries. ++- users may not be able to remove or reset the XATTR forever, ++ ++ ++XATTR/EA support in the internal (copy,move)-(up,down) ++---------------------------------------------------------------------- ++Generally the extended attributes of inode are categorized as these. ++- "security" for LSM and capability. ++- "system" for posix ACL, 'acl' mount option is required for the branch ++ fs generally. ++- "trusted" for userspace, CAP_SYS_ADMIN is required. ++- "user" for userspace, 'user_xattr' mount option is required for the ++ branch fs generally. ++ ++Moreover there are some other categories. Aufs handles these rather ++unpopular categories as the ordinary ones, ie. there is no special ++condition nor exception. ++ ++In copy-up, the support for XATTR on the dst branch may differ from the ++src branch. In this case, the copy-up operation will get an error and ++the original user operation which triggered the copy-up will fail. It ++can happen that even all copy-up will fail. ++When both of src and dst branches support XATTR and if an error occurs ++during copying XATTR, then the copy-up should fail obviously. That is a ++good reason and aufs should return an error to userspace. But when only ++the src branch support that XATTR, aufs should not return an error. ++For example, the src branch supports ACL but the dst branch doesn't ++because the dst branch may natively un-support it or temporary ++un-support it due to "noacl" mount option. Of course, the dst branch fs ++may NOT return an error even if the XATTR is not supported. It is ++totally up to the branch fs. ++ ++Anyway when the aufs internal copy-up gets an error from the dst branch ++fs, then aufs tries removing the just copied entry and returns the error ++to the userspace. The worst case of this situation will be all copy-up ++will fail. ++ ++For the copy-up operation, there two basic approaches. ++- copy the specified XATTR only (by category above), and return the ++ error unconditionally if it happens. ++- copy all XATTR, and ignore the error on the specified category only. ++ ++In order to support XATTR and to implement the correct behaviour, aufs ++chooses the latter approach and introduces some new branch attributes, ++"icexsec", "icexsys", "icextr", "icexusr", and "icexoth". ++They correspond to the XATTR namespaces (see above). Additionally, to be ++convenient, "icex" is also provided which means all "icex*" attributes ++are set (here the word "icex" stands for "ignore copy-error on XATTR"). ++ ++The meaning of these attributes is to ignore the error from setting ++XATTR on that branch. ++Note that aufs tries copying all XATTR unconditionally, and ignores the ++error from the dst branch according to the specified attributes. ++ ++Some XATTR may have its default value. The default value may come from ++the parent dir or the environment. If the default value is set at the ++file creating-time, it will be overwritten by copy-up. ++Some contradiction may happen I am afraid. ++Do we need another attribute to stop copying XATTR? I am unsure. For ++now, aufs implements the branch attributes to ignore the error. +diff -Nur linux-4.1.10.orig/Documentation/filesystems/aufs/design/07export.txt linux-4.1.10/Documentation/filesystems/aufs/design/07export.txt +--- linux-4.1.10.orig/Documentation/filesystems/aufs/design/07export.txt 1970-01-01 01:00:00.000000000 +0100 ++++ linux-4.1.10/Documentation/filesystems/aufs/design/07export.txt 2015-10-22 21:35:53.000000000 +0200 +@@ -0,0 +1,58 @@ ++ ++# Copyright (C) 2005-2015 Junjiro R. Okajima ++# ++# This program is free software; you can redistribute it and/or modify ++# it under the terms of the GNU General Public License as published by ++# the Free Software Foundation; either version 2 of the License, or ++# (at your option) any later version. ++# ++# This program is distributed in the hope that it will be useful, ++# but WITHOUT ANY WARRANTY; without even the implied warranty of ++# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the ++# GNU General Public License for more details. ++# ++# You should have received a copy of the GNU General Public License ++# along with this program. If not, see . ++ ++Export Aufs via NFS ++---------------------------------------------------------------------- ++Here is an approach. ++- like xino/xib, add a new file 'xigen' which stores aufs inode ++ generation. ++- iget_locked(): initialize aufs inode generation for a new inode, and ++ store it in xigen file. ++- destroy_inode(): increment aufs inode generation and store it in xigen ++ file. it is necessary even if it is not unlinked, because any data of ++ inode may be changed by UDBA. ++- encode_fh(): for a root dir, simply return FILEID_ROOT. otherwise ++ build file handle by ++ + branch id (4 bytes) ++ + superblock generation (4 bytes) ++ + inode number (4 or 8 bytes) ++ + parent dir inode number (4 or 8 bytes) ++ + inode generation (4 bytes)) ++ + return value of exportfs_encode_fh() for the parent on a branch (4 ++ bytes) ++ + file handle for a branch (by exportfs_encode_fh()) ++- fh_to_dentry(): ++ + find the index of a branch from its id in handle, and check it is ++ still exist in aufs. ++ + 1st level: get the inode number from handle and search it in cache. ++ + 2nd level: if not found in cache, get the parent inode number from ++ the handle and search it in cache. and then open the found parent ++ dir, find the matching inode number by vfs_readdir() and get its ++ name, and call lookup_one_len() for the target dentry. ++ + 3rd level: if the parent dir is not cached, call ++ exportfs_decode_fh() for a branch and get the parent on a branch, ++ build a pathname of it, convert it a pathname in aufs, call ++ path_lookup(). now aufs gets a parent dir dentry, then handle it as ++ the 2nd level. ++ + to open the dir, aufs needs struct vfsmount. aufs keeps vfsmount ++ for every branch, but not itself. to get this, (currently) aufs ++ searches in current->nsproxy->mnt_ns list. it may not be a good ++ idea, but I didn't get other approach. ++ + test the generation of the gotten inode. ++- every inode operation: they may get EBUSY due to UDBA. in this case, ++ convert it into ESTALE for NFSD. ++- readdir(): call lockdep_on/off() because filldir in NFSD calls ++ lookup_one_len(), vfs_getattr(), encode_fh() and others. +diff -Nur linux-4.1.10.orig/Documentation/filesystems/aufs/design/08shwh.txt linux-4.1.10/Documentation/filesystems/aufs/design/08shwh.txt +--- linux-4.1.10.orig/Documentation/filesystems/aufs/design/08shwh.txt 1970-01-01 01:00:00.000000000 +0100 ++++ linux-4.1.10/Documentation/filesystems/aufs/design/08shwh.txt 2015-10-22 21:35:53.000000000 +0200 +@@ -0,0 +1,52 @@ ++ ++# Copyright (C) 2005-2015 Junjiro R. Okajima ++# ++# This program is free software; you can redistribute it and/or modify ++# it under the terms of the GNU General Public License as published by ++# the Free Software Foundation; either version 2 of the License, or ++# (at your option) any later version. ++# ++# This program is distributed in the hope that it will be useful, ++# but WITHOUT ANY WARRANTY; without even the implied warranty of ++# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the ++# GNU General Public License for more details. ++# ++# You should have received a copy of the GNU General Public License ++# along with this program. If not, see . ++ ++Show Whiteout Mode (shwh) ++---------------------------------------------------------------------- ++Generally aufs hides the name of whiteouts. But in some cases, to show ++them is very useful for users. For instance, creating a new middle layer ++(branch) by merging existing layers. ++ ++(borrowing aufs1 HOW-TO from a user, Michael Towers) ++When you have three branches, ++- Bottom: 'system', squashfs (underlying base system), read-only ++- Middle: 'mods', squashfs, read-only ++- Top: 'overlay', ram (tmpfs), read-write ++ ++The top layer is loaded at boot time and saved at shutdown, to preserve ++the changes made to the system during the session. ++When larger changes have been made, or smaller changes have accumulated, ++the size of the saved top layer data grows. At this point, it would be ++nice to be able to merge the two overlay branches ('mods' and 'overlay') ++and rewrite the 'mods' squashfs, clearing the top layer and thus ++restoring save