Linux_4.0 - Linux Kernel Newbies

KernelNewbies:

Last updated at 2017-12-30 01:30:21


Linux 4.0 has been released on Sun, 12 Apr 2015.

Summary: This release adds support for live patching the kernel code, aimed primarily at fixing security updates without rebooting; DAX, a way to avoid using the kernel cache when filesystems run on systems with persistent memory storage; kasan, a dynamic memory error detector that allows to find use-after-free and out-of-bounds bugs; lazytime, an alternative to relatime, which causes access, modified and changed time updates to only be made in the cache and written to the disk opportunistically; allow overlayfs to have multiple lower layers, support of Parallel NFS server architecture; and dm-crypt CPU scalability improvements. There are also new drivers and many other small improvements.

Contents

  1. Prominent features

    1. Arbitrary version change
    2. Live patching
    3. DAX - Direct Access, for persistent memory storage
    4. kasan, kernel address sanitizer
    5. "lazytime" option for better update of file timestamps
    6. Multiple lower layers in overlayfs
    7. Support Parallel NFS server, default to NFS v4.2
    8. dm-crypt scalability improvements
  2. Drivers and architectures
  3. File systems
  4. Block
  5. Core (various)
  6. Memory management
  7. Virtualization
  8. Cryptography
  9. Security
  10. Tracing & perf
  11. Networking
  12. List of merges
  13. Other news sites

1. Prominent features

1.1. Arbitrary version change

This release increases the version to 4.0. This switch from 3.x to 4.0 version numbers is, however, entirely meaningless and it should not be associated to any important changes in the kernel. This release could have been 3.20, but Linus Torvalds just got tired of the old number, made a poll, and changed it. Yes, it is frivolous. The less you think about it, the better.

1.2. Live patching

This release introduces "livepatch", a feature for live patching the kernel code, aimed primarily at systems who want to get security updates without needing to reboot. This feature has been born as result of merging kgraft and kpatch, two attempts by SuSE and Red Hat that where started to replace the now propietary ksplice. It's relatively simple and minimalistic, as it's making use of existing kernel infrastructure (namely ftrace) as much as possible. It's also self-contained and it doesn't hook itself in any other kernel subsystems.

In this release livepatch is not feature complete, yet it provides a basic infrastructure for function "live patching" (i.e. code redirection), including API for kernel modules containing the actual patches, and API/ABI for userspace to be able to operate on the patches (look up what patches are applied, enable/disable them, etc). Most CVEs should be safe to apply this way. Only the x86 architecture is supported in this release, others will follow.

For more details see the merge commit

Sample live patching module: commit

Code commit

1.3. DAX - Direct Access, for persistent memory storage

Before being read by programs, files are usually first copied from the disk to the kernel caches, kept in RAM. But the possible advent of persistent non-volatile memory that would be also be used as disk changes radically the way the kernel deals with this process: the kernel cache would become unnecesary overhead.

Linux has had, in fact, support for this kind of setups since 2.6.13. But the code wasn't maintaned and only supported ext2. In this release, Linux adds DAX (Direct Access, the X is for eXciting). DAX removes the extra copy incurred by the buffer by performing reads and writes directly to the persistent-memory storage device. For file mappings, the storage device is mapped directly into userspace. Support for ext4 has been added.

Recommended LWN article: Supporting filesystems in persistent memory

Code: commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit

1.4. kasan, kernel address sanitizer

Kernel Address sanitizer (KASan) is a dynamic memory error detector. It provides fast and comprehensive solution for finding use-after-free and out-of-bounds bugs. Linux already has the kmemcheck feature, but unlike kmemcheck, KASan uses compile-time instrumentation, which makes it significantly faster than kmemcheck.

The main idea of KASAN is to use shadow memory to record whether each byte of memory is safe to access or not, and use compiler's instrumentation to check the shadow memory on each memory access. Address sanitizer uses 1/8 of the memory addressable in kernel for shadow memory and uses direct mapping with a scale and offset to translate a memory address to its corresponding shadow address.

Code: commit, commit, commit, commit, commit

1.5. "lazytime" option for better update of file timestamps

Unix filesystems keep track of information about files, such as the last time a file was accessed or modified. Keeping track of this information is very expensive, specially the time when a file was accessed ("atime"), which encourages many people to disable it with the mount option "noatime". To alleviate this problem, the "relatime" mount option was added, the atime is only updated if the previous value is earlier than the modification time, or if the file was last accessed more than 24 hours ago. This behaviour, however, breaks some programs that rely on accurate access time tracking to work, and it's also against the POSIX standard.

In this release, Linux adds another alternative: "lazytime". Lazytime causes access, modified and changed time updates to only be made in the cache. The times will only be written to the disk if the inode needs to be updated anyway for some non-time related change, if fsync(), syncfs() or sync() are called, or just before an undeleted inode is evicted from memory. This is POSIX compliant, while at the same time improving the performance.

Recommended LWN article: Introducing lazytime

Code: commit, commit, commit

1.6. Multiple lower layers in overlayfs

In overlayfs, multiple lower layers can now be given using the the colon (":") as a separator character between the directory names. For example:

  • mount -t overlay overlay -olowerdir=/lower1:/lower2:/lower3 /merged

The specified lower directories will be stacked beginning from the rightmost one and going left. In the above example lower1 will be the top, lower2 the middle and lower3 the bottom layer. "upperdir=" and "workdir=" may be omitted, in that case the overlay will be read-only.

Code: commit, commit

1.7. Support Parallel NFS server, default to NFS v4.2

Parallel NFS (pNFS) is a part of the NFS v4.1 standard that allows compute clients to access storage devices directly and in parallel. The pNFS architecture eliminates the scalability and performance issues associated with NFS servers deployed today. This is achieved by the separation of data and metadata, and moving the metadata server out of the data path.

This release adds support for pNFS server, and drivers for the block layout with XFS support to use XFS filesystems as a block layout target, and the flexfiles layout.

Also, in this release the NFS server defaults to NFS v4.2.

Code: commit, commit, commit, commit, commit, commit

1.8. dm-crypt scalability improvements

This release significantly increases the dm-crypt CPU scalability performance thanks to changes that enable effective use of an unbound workqueue across all available CPUs. A large battery of tests were performed to validate these changes, summary of results is available here

Merge: commit

2. Drivers and architectures

3. File systems

  • XFS
    • Adds support for sys_renameat2() commit

    • Remove deprecated sysctls xfsbufd_centisecs and age_buffer_centisecs commit

  • EXT4
    • Support "readonly" filesystem flag to mark a FS image as read-only, tunable with tune2fs. It prevents the kernel and e2fsprogs from changing the image commit

  • Btrfs
    • Add code to support file creation time commit

  • NFSv4.1
    • Allow parallel LOCK/LOCKU calls commit

    • Allow parallel OPEN/OPEN_DOWNGRADE/CLOSE commit

  • UBIFS
    • Add security.* XATTR support for the UBIFS commit

    • Add xattr support for symlinks commit

  • OCFS2
    • Add a mount option journal_async_commit on ocfs2 filesystem. When this feature is opened, journal commit block can be written to disk without waiting for descriptor blocks, which can improve journal commit performance. Using the fs_mark benchmark, using journal_async_commit shows a 50% improvement commit

    • Currently in case of append O_DIRECT write (block not allocated yet), ocfs2 will fall back to buffered I/O. This has some disadvantages. In this version, the direct I/O write doesn't fallback to buffer I/O write any more because the allocate blocks are enabled in direct I/O now commit, commit, commit

  • F2FS
    • Introduce a batched trim commit

    • Support "norecovery" mount option, which is mostly same as "disable_roll_forward". The only difference is that "norecovery" should be activated with read-only mount option. This can be used when user wants to check whether f2fs is mountable or not without any recovery process commit

    • Add F2FS_IOC_GETVERSION ioctl for getting i_generation from inode, after that, users can list file's generation number by using "lsattr -v commit

4. Block

  • Ported to blk-multiqueue
  • blk-multiqueue: Add support for tag allocation policies and make libata use this blk-mq tagging, instead of rolling their own commit, commit

  • UBI: Implement UBI_METAONLY, a new open mode for UBI volumes, it indicates that only meta data is being changed commit

5. Core (various)

  • pstore: Add pmsg - user-space accessible pstore object commit

  • rcu: Optionally run grace-period kthreads at real-time priority. Recent testing has shown that under heavy load, running RCU's grace-period kthreads at real-time priority can improve performance and reduce the incidence of RCU CPU stall warnings commit

  • GDB scripts for debugging the kernel. If you load vmlinux into gdb with the option enabled, the helper scripts will be automatically imported by gdb as well, and additional functions are available to analyze a Linux kernel instance. See Documentation/gdb-kernel-debugging.txt for further details commit

  • Remove CONFIG_INIT_FALLBACK commit

6. Memory management

  • cgroups: Per memory cgroup slab shrinkers commit

  • slub: optimize memory alloc/free fastpath by removing preemption on/off commit

  • Add KPF_ZERO_PAGE flag for zero_page, so that userspace processes can detect zero_page in /proc/kpageflags, and then do memory analysis more accurately commit

  • Make /dev/mem an optional device commit

  • Add support for resetting peak RSS, which can be retrieved from the VmHWM field in /proc/pid/status, by writing "5" to /proc/pid/clear_refs commit

  • Show page size in /proc/<pid>/numa_maps as "kernelpagesize_kB" field to help identifying the size of pages that are backing memory areas mapped by a given task. This is specially useful to help differentiating between HUGE and GIGANTIC page backed VMAs commit

  • geneve: Add Geneve GRO support commit

  • zsmalloc: add statistics support commit

  • Incorporate read-only pages into transparent huge pages commit

  • memcontrol cgroup: Introduce the basic control files to account, partition, and limit memory using cgroups in default hierarchy mode. The old interface will be maintained, but a clearer model and improved workload performance should encourage existing users to switch over to the new one eventually commit

  • Replace remap_file_pages() syscall with emulation commit

7. Virtualization

  • KVM: Add generic support for page modification logging, a new feature in Intel "Broadwell" Xeon CPUs that speeds up dirty page tracking commit

  • vfio: Add device request interface indicating that the device should be released commit

  • vmxnet3: Make Rx ring 2 size configurable by adjusting rx-jumbo parameter of ethtool -G commit

  • virtio_net: add software timestamp support commit

  • virtio_pci: modern driver commit, add an options to disable legacy driver commit, commit

8. Cryptography

  • aesni: Add support for 192 & 256 bit keys to AES-NI RFC4106 commit

  • algif_rng: add random number generator support commit

  • octeon: add MD5 module commit

  • qat: add support for CBC(AES) ablkcipher commit

9. Security

  • SELinux : Add security hooks to the Android Binder that enable security modules such as SELinux to implement controls over Binder IPC. The security hooks include support for controlling what process can become the Binder context manager, invoke a binder transaction/IPC to another process, transfer a binder reference to another process , transfer an open file to another process. These hooks have been included in the Android kernel trees since Android 4.3 (commit).

  • SMACK: secmark support for netfilter (commit).

  • TPM 2.0 support (commits: 1, 2, 3).

  • Device class for TPM, sysfs files are moved from /sys/class/misc/tpmX/device/ to /sys/class/tpm/tpmX/device/ (commit).

10. Tracing & perf

  • perf mem: Enable sampling loads and stores simultaneously, it could only do one or the other before yet there was no hardware restriction preventing simultaneous collection commit

  • perf tools: Support parameterized and symbolic events. See links for documentation commit, commit

  • AMD range breakpoints support: breakpoints are extended to support address range through perf event with initial backend support for AMD extended breakpoints. For example set write breakpoint from 0x1000 to 0x1200 (0x1000 + 512): perf record -e mem:0x1000/512:w commit, commit

11. Networking

  • TCP: Add the possibility to define a per route/destination congestion control algorithm. This opens up the possibility for a machine with different links to enforce specific congestion control algorithms with optimal strategies for each of them based on their network characteristics commit

  • Mitigate TCP "ACK loop" DoS scenarios by rate-limiting outgoing duplicate ACKs sent in response to incoming "out of window" segments. For more details, see merge. Code: commit, commit, commit, commit

  • udpv6: Add lockless sendmsg() support, thus allowing multiple threads to send to a single socket more efficiently commit

  • ipv4: Automatically bring up DSA master network devices, which allows DSA slave network devices to be used as valid interfaces for e.g: NFS root booting by allowing kernel IP auto-configuration to succeed on these interfaces commit

  • ipv6: Add sysctl entry(accept_ra_mtu) to disable MTU updates from router advertisements commit

  • vxlan: Implement supports for the Group Policy VXLAN extension to provide a lightweight and simple security label mechanism across network peers based on VXLAN. It allows further mapping to a SELinux context using SECMARK, to implement ACLs directly with nftables, iptables, OVS, tc, etc commit

  • vxlan: Add support for remote checksum offload in VXLAN. It is described here. commit

  • net: openvswitch: Support masked set actions. commit

  • Infiniband: Add support for extensible query device capabilities verb to allow adding new features commit

  • Layer 2 Tunneling Protocol (l2tp): multicast notification to the registered listeners when the tunnels/sessions are created/modified/deleted commit

  • SUNRPC: Set SO_REUSEPORT socket option for TCP connections to bind multiple TCP connections to the same source address+port combination commit

  • tipc: involve namespace infrastructure commit

  • 802.15.4: introduce support for cca settings commit

  • Wireless
    • Add new GCMP, GCMP-256, CCMP-256, BIP-GMAC-128, BIP-GMAC-256, and BIP-CMAC-256 cipher suites. These new cipher suites were defined in IEEE Std 802.11ac-2013 commit, commit, commit, commit, commit

    • New NL80211_ATTR_NETNS_FD which allows to set namespace via nl80211 by fd commit

    • Support per-TID station statistics commit

    • Allow including station info in delete event commit, commit

    • Allow usermode to query wiphy specific regdom commit

  • bridge
    • offload bridge port attributes to switch ASIC if feature flag set commit

    • Support for allowing userspace to pack multiple vlans and VLAN ranges in setlink and dellink requests for improved performance commit

    • Add ability to enable TSO commit

  • Near Field Communication (NFC)
    • HCI over NCI protocol support (Some secure elements only understand HCI and thus we need to send them HCI frames) commit

    • NCI NFCEE (NFC Execution Environment, typically an embedded or external secure element) discovery and enabling/disabling support commit, commit, commit, commit, commit, commit, commit

    • NFC_EVT_TRANSACTION userspace API addition, it is sent through netlink in order for a specific application running on a secure element to notify userspace of an event commit

    • Tx timestamps are looped onto the error queue on top of an skb. This mechanism leaks packet headers to processes unless the no-payload options SOF_TIMESTAMPING_OPT_TSONLY is set. A new sysctl (tstamp_allow_data) optionally drops looped timestamp with data. This only affects processes without CAP_NET_RAW commit, commit, commit

  • Bluetooth
  • tc: add BPF-based action. This action provides a possibility to execute custom BPF code commit

  • net: sched: Introduce connmark action commit

  • Add Transparent Ethernet Bridging GRO support commit

  • netdev: introduce new NETIF_F_HW_SWITCH_OFFLOAD feature flag for switch device offloads commit

  • netfilter: nft_compat: add ebtables support commit

    • network namespace: Add rtnl cmd to add and get peer netns ids. A user can define an id for a peer netns by providing a FD or a PID. These ids are local to the netns where it is added (i.e. valid only into this netns) commit, commit

  • openvswitch: Add support for checksums on UDP tunnels. commit

  • openvswitch: Support VXLAN Group Policy extension commit

David_Li

我还没有学会写个人说明!

相关推荐

优化Linux下的内核TCP参数以提高系统性能

优化Linux下的内核TCP参数以提高系统性能
内核的优化跟服务器的优化一样,应本着稳定安全的原则。下面以Squid服务器为例来说明,待客户端与服务器端建立TCP/IP连接后就会关闭Socket,服务器端连接的端口状态也就变为TIME_WAIT了。那是不是所有执行主动关闭的Socket都会进入TIME_WAIT状态呢?有没有什么情况可使主动关闭的Socket直接进入CLOSED状态呢?答案是主动关闭的一方在发送最后一个ACK后就会进入TIME_WAIT状态,并停留2MSL(报文最大生存)时间,这是TCP/IP必不可少的,也就是说这一点是“解决”不了的。
TCP/IP设计者如此设计,主要原因有两个:

暂无评论