Linux 3.13 was released on Sun, 19 Jan 2014.
Summary: This release includes nftables, the successor of iptables, a revamp of the block layer designed for high-performance SSDs, a power capping framework to cap power consumption in Intel RAPL devices, improved squashfs performance, AMD Radeon power management enabled by default and automatic Radeon GPU switching, improved NUMA performance, improved performance with hugepage workloads, TCP Fast Open enabled by default, support for NFC payments, support for the High-availability Seamless Redundancy protocol, new drivers and many other small improvements.
For decades, traditional hard disks have shaped the design that operating systems use to connect applications with storage device drivers. With the advent of modern solid-state disks (SSDs), past assumptions are no longer valid. Linux protected the IO request queue with a single coarse lock, a design that can achieve an IO submission rate of around 800,000 IOs per second regardless of how many cores are used to submit IOs. This was more than enough for traditional magnetic hard disks, whose IO submission rate for random accesses is in the hundreds, but it is no longer enough for the most advanced SSDs, which can achieve a rate close to 1 million IOs per second and are improving fast with every new generation. It is also unfit for the modern multicore world.
This release includes a new design for the Linux block layer, based on two levels of queues: one level of per-CPU queues for submitting IO, which then funnel down into a second level of hardware submission queues. The mapping between submission queues and hardware queues might be 1:1 or N:M, depending on hardware support and configuration. Experiments have shown that this design can achieve many millions of IOs per second, leveraging the new capabilities of NVM-Express or high-end PCI-E devices and multicore CPUs, while still providing the common interface and convenience features of the block layer.
Paper: Linux Block IO: Introducing Multi-queue SSD Access on Multi-core Systems
Recommended LWN article: The multiqueue block layer
Code: commit
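The two-level queue design described above can be illustrated with a small toy model. This is only a sketch, not kernel code: the class name, the method names and the modulo rule used to map CPUs to hardware queues are assumptions made for illustration.

```python
from collections import deque


class MultiQueueSketch:
    """Toy model: per-CPU software queues that funnel into hardware queues."""

    def __init__(self, num_cpus, num_hw_queues):
        self.sw_queues = [deque() for _ in range(num_cpus)]
        self.hw_queues = [deque() for _ in range(num_hw_queues)]
        # N:M mapping (1:1 when both counts match). The modulo rule is an
        # assumption of this sketch, not the kernel's actual policy.
        self.cpu_to_hw = [cpu % num_hw_queues for cpu in range(num_cpus)]

    def submit(self, cpu, request):
        # A CPU only touches its own software queue, so submitters on
        # different CPUs never contend for a shared lock.
        self.sw_queues[cpu].append(request)

    def dispatch(self):
        # Drain every software queue into the hardware queue it maps to.
        for cpu, staged in enumerate(self.sw_queues):
            hw_queue = self.hw_queues[self.cpu_to_hw[cpu]]
            while staged:
                hw_queue.append(staged.popleft())


mq = MultiQueueSketch(num_cpus=4, num_hw_queues=2)
for cpu in range(4):
    mq.submit(cpu, f"read-from-cpu{cpu}")
mq.dispatch()
print([list(q) for q in mq.hw_queues])
```

With four CPUs and two hardware queues, CPUs 0 and 2 share the first hardware queue and CPUs 1 and 3 share the second, which is one possible N:M arrangement.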
1.2. nftables, the successor of iptables
iptables has a number of limitations at both the functional and code design level: problems with the rule update system and code duplication, which cause trouble for code maintenance and for users. nftables is a new packet filtering framework that solves these problems, while providing backwards compatibility for current iptables users.
The core of the nftables design is a simple pseudo-virtual machine inspired by BPF. A userspace utility interprets the rule-set provided by the user, compiles it to pseudo-bytecode, and transfers it to the kernel. This approach can replace thousands of lines of code, since the bytecode instruction set can express the packet selectors for all existing protocols. Because the userspace utility translates the protocols to bytecode, a specific kernel-space extension is no longer needed for each match, which means that users are unlikely to need a kernel upgrade to obtain new matches and features: userspace upgrades will provide them. There is also a new library for utilities that need to interact with the firewall.
nftables provides backwards compatibility with iptables. There are new iptables/ip6tables utilities that translate iptables rules to nftables bytecode, and it is also possible to use and add new xtable modules. As a bonus, these new utilities provide features that weren't possible with the old iptables design: notification of changes in tables/chains, better incremental rule update support, and the ability to enable/disable the chains per table.
The new nft utility has an improved syntax. A small how-to is available here (other documentation should be available soon).
Recommended LWN article: The return of nftables
Video talk about nftables: http://youtu.be/P58CCi5Hhl4 (slides)
Project page and utility source code: http://netfilter.org/projects/nftables/
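The split between a protocol-aware userspace compiler and a protocol-agnostic in-kernel interpreter can be sketched as follows. The instruction names ("load", "cmp_eq", "verdict"), the function names and the fixed 20-byte IPv4 header are assumptions of this toy example; they are not the real nftables instruction set or API.

```python
def compile_rule(protocol, dport, verdict):
    """'Userspace' side: knows the protocols and emits generic instructions."""
    proto_numbers = {"tcp": 6, "udp": 17}
    return [
        ("load", 9, 1),    # IPv4 protocol field lives at byte offset 9
        ("cmp_eq", bytes([proto_numbers[protocol]])),
        ("load", 22, 2),   # destination port, assuming a 20-byte IPv4 header
        ("cmp_eq", dport.to_bytes(2, "big")),
        ("verdict", verdict),
    ]


def run(program, packet, default="continue"):
    """'Kernel' side: a tiny interpreter that knows nothing about protocols."""
    register = b""
    for op, *args in program:
        if op == "load":
            offset, length = args
            register = packet[offset:offset + length]
        elif op == "cmp_eq":
            if register != args[0]:
                return default  # the rule does not match this packet
        elif op == "verdict":
            return args[0]
    return default


def make_packet(proto, dport):
    """Build a minimal fake IPv4 packet (header fields mostly zeroed)."""
    packet = bytearray(40)
    packet[0] = 0x45                       # IPv4, header length 20 bytes
    packet[9] = proto                      # transport protocol number
    packet[22:24] = dport.to_bytes(2, "big")
    return bytes(packet)


rule = compile_rule("tcp", 22, "drop")
print(run(rule, make_packet(6, 22)))   # TCP to port 22
print(run(rule, make_packet(17, 22)))  # UDP to port 22
```

Supporting a new match in this model only requires changing compile_rule; the interpreter stays the same, which mirrors the reason given above for why userspace upgrades can deliver new features.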
1.3. Radeon: power management enabled by default, automatic GPU switching, R9 290X Hawaii support
Power management enabled by default
Linux 3.11 added power management support for many AMD Radeon devices. The power management support provides improved power consumption, which is critical for battery-powered devices, but it is also a requirement for good high-end performance, as it provides the ability to reclock the GPU to higher power states in GPUs and APUs that default to slower clock speeds.
This support had to be enabled with a module parameter. This release enables power management by default for lots of AMD Radeon hardware: BTC asics, SI asics, SUMO/PALM APUs, evergreen asics, r7xx asics, hawaii. Code: commit, commit, commit, commit, commit, commit
Automatic GPU switching
Linux 3.12 added infrastructure support for automatic GPU switching in laptops with dual GPUs. This release adds support for this feature in AMD Radeon hardware. Code: commit
R9 290X Hawaii support
This release adds support for R9 290X "Hawaii" devices. Code: commit
1.4. Power capping framework
This release includes a framework that allows setting power consumption limits on devices that support it. It has been designed around the Intel RAPL (Running Average Power Limit) mechanism available in the latest Intel processors (Sandy Bridge and later; RAPL support will also be added to many more devices in the future). This framework provides a consistent interface between the kernel and user space that allows power capping drivers to expose their settings to user space in a uniform way. You can see the documentation here.
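As a rough illustration of what a uniform user space interface means in practice, the sketch below reads a few attributes from one power zone directory. The attribute names (name, energy_uj, constraint_0_power_limit_uw) follow the kernel's powercap sysfs documentation; the helper names and the values are made up, and the demo runs against a fake directory so that it works on machines without RAPL. On real hardware a zone would typically be found under /sys/class/powercap.

```python
import tempfile
from pathlib import Path


def read_power_zone(zone_dir):
    """Read a few attributes of one power zone directory."""
    zone = Path(zone_dir)

    def read(attr):
        return (zone / attr).read_text().strip()

    return {
        "name": read("name"),
        "energy_uj": int(read("energy_uj")),
        # Limits are reported in microwatts; convert to watts.
        "limit_watts": int(read("constraint_0_power_limit_uw")) / 1_000_000,
    }


def make_fake_zone(root):
    """Create a stand-in zone with made-up values (hypothetical helper)."""
    zone = Path(root) / "intel-rapl:0"
    zone.mkdir()
    (zone / "name").write_text("package-0\n")
    (zone / "energy_uj").write_text("123456789\n")
    (zone / "constraint_0_power_limit_uw").write_text("95000000\n")
    return zone


with tempfile.TemporaryDirectory() as root:
    print(read_power_zone(make_fake_zone(root)))
```

Because every power capping driver exposes the same attribute layout, the same reader would work for any zone, which is the point of the framework.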
1.5. Support for the Intel Many Integrated Core Architecture
This release adds support for the Intel Many Integrated Core Architecture (MIC), a multiprocessor computer architecture incorporating earlier work on the Larrabee many core architecture, the Teraflops Research Chip multicore chip research project, and the Intel Single-chip Cloud Computer multicore microprocessor. The Tianhe-2 at the National Supercomputing Center in Guangzhou, China, currently the world's fastest supercomputer, uses this architecture to achieve 33.86 PetaFLOPS.
The MIC family of PCIe form factor coprocessor devices runs a 64-bit Linux OS. The driver manages card OS state and enables communication between host and card. More information about the Intel MIC family, as well as the Linux OS and tools for MIC to use with this driver, is available here. This release currently supports Intel MIC X100 devices, and includes a sample user space daemon.
1.6. Improved performance in NUMA systems
Modern multiprocessors (for example, x86) usually have non-uniform memory access (NUMA) memory designs. In these systems, the performance of a process can differ depending on whether the memory range it accesses is attached to the local CPU or to another CPU. Because performance depends on the locality of memory accesses, it's important that the operating system schedules a process to run on the CPU whose memory controller is connected to the memory it will access.
The way Linux handled these situations was deficient; Linux 3.8 included a new NUMA foundation that allows smarter NUMA policies to be built in future releases. This release includes many such policies, which attempt to place a process near its memory and can handle cases such as pages shared between processes or transparent huge pages. New sysctls have been added to enable/disable and tune the NUMA scheduling (see documentation here).
Recommended LWN article: NUMA scheduling progress
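The idea of placing a process near its memory can be illustrated with a deliberately simplified model. The function names and the "pick the node holding most pages" rule are assumptions of this sketch; the kernel's actual policies are considerably more elaborate.

```python
def preferred_node(pages_per_node):
    """Return the node holding most of the task's pages (ties: lowest id)."""
    return max(sorted(pages_per_node), key=lambda node: pages_per_node[node])


def remote_access_fraction(pages_per_node, running_on):
    """Fraction of the task's pages that live on a node other than its own."""
    total = sum(pages_per_node.values())
    if total == 0:
        return 0.0
    return (total - pages_per_node.get(running_on, 0)) / total


# Hypothetical task: 100 pages on node 0, 900 pages on node 1.
task_pages = {0: 100, 1: 900}
print(remote_access_fraction(task_pages, running_on=0))
target = preferred_node(task_pages)
print(target, remote_access_fraction(task_pages, running_on=target))
```

In this example the task starts on node 0, where most of its accesses are remote; moving it to the preferred node makes most of them local.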
1.7. Improved page table access scalability in hugepage workloads
The Linux kernel tracks information about each memory page in a data structure called the page table. In workloads that use hugepages, the lock used to protect some parts of the table has become a point of contention. This release uses finer-grained locking for these parts, improving page table access scalability in threaded hugepage workloads. For more details, see the recommended LWN article.
Recommended LWN article: Split PMD locks
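Finer-grained locking in general can be sketched as replacing one lock for a whole structure with one lock per sub-table. The toy below is not how the kernel implements it; the class name, the sizes and the one-lock-per-table layout are assumptions for illustration.

```python
import threading


class SplitLockTable:
    """Toy table protected by one lock per sub-table instead of one global lock."""

    def __init__(self, num_tables, entries_per_table):
        self.entries_per_table = entries_per_table
        self.tables = [[0] * entries_per_table for _ in range(num_tables)]
        self.locks = [threading.Lock() for _ in range(num_tables)]

    def touch(self, index):
        table, slot = divmod(index, self.entries_per_table)
        # Only threads working on the same sub-table wait for each other.
        with self.locks[table]:
            self.tables[table][slot] += 1


def worker(page_table, table, rounds):
    base = table * page_table.entries_per_table
    for i in range(rounds):
        page_table.touch(base + i % page_table.entries_per_table)


pt = SplitLockTable(num_tables=4, entries_per_table=8)
threads = [threading.Thread(target=worker, args=(pt, t, 1000)) for t in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print([sum(table) for table in pt.tables])
```

Each thread here updates a different sub-table, so no thread ever blocks on a lock held by another, which is the scalability gain that finer-grained locking aims for.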
1.8. Squashfs performance improved
Squashfs, the read-only filesystem used by most live distributions, installers, and some embedded Linux distributions, has received important improvements that dramatically increase performance in workloads with multiple parallel reads. One of them is the direct decompression of data into the Linux page cache, which avoids a copy of the data and eliminates the single lock used to protect the intermediate buffer. The other one is multithreaded decompression.
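The multithreaded part can be sketched in user space: independently compressed blocks are decompressed by several workers, each writing into its own region of a shared destination buffer. The block size, the function names and the use of zlib are assumptions of this sketch. Note that Python's zlib cannot decompress straight into a caller-supplied buffer, so the copy-avoidance described above is not modelled here.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 128 * 1024  # assumed block size for this sketch


def compress_blocks(data, block_size=BLOCK_SIZE):
    """Compress data as a list of independent fixed-size blocks."""
    return [zlib.compress(data[i:i + block_size])
            for i in range(0, len(data), block_size)]


def read_parallel(blocks, block_size=BLOCK_SIZE, workers=4):
    """Decompress all blocks in parallel into one destination buffer."""
    out = bytearray(block_size * len(blocks))  # stands in for the page cache

    def decompress_into(i):
        chunk = zlib.decompress(blocks[i])
        start = i * block_size
        # Each block owns a disjoint region, so no shared lock is needed.
        out[start:start + len(chunk)] = chunk
        return len(chunk)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        sizes = list(pool.map(decompress_into, range(len(blocks))))
    total = (len(blocks) - 1) * block_size + sizes[-1] if blocks else 0
    return bytes(out[:total])


data = bytes(range(256)) * 2000
blocks = compress_blocks(data)
print(len(blocks), read_parallel(blocks) == data)
```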
1.9. Applications can cap the rate computed by network transport layer
This release adds a new socket option, SO_MAX_PACING_RATE, which offers applications the ability to cap the rate computed by the transport layer. It has been designed as a mechanism to mitigate bufferbloat by keeping buffers from filling up with packets, but it can also be used to limit the transmission rate in applications. To be effectively paced, a network flow must use the FQ packet scheduler. Note that a packet scheduler takes the headers into account in its computations. For more details, see:
Recommended LWN article: TSO sizing and the FQ scheduler (5th and 6th paragraph)
Code: commit
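A minimal sketch of setting the option from an application follows. The numeric value 47 is the one used by Linux's generic socket header and is hard-coded as a fallback because not every Python build exports the name; treat it as an assumption on unusual architectures. The helper names are hypothetical. The rate is given in bytes per second.

```python
import socket
import sys

# Fallback value from Linux's asm-generic/socket.h (assumption on other ABIs).
SO_MAX_PACING_RATE = getattr(socket, "SO_MAX_PACING_RATE", 47)


def mbit_to_bytes_per_sec(mbit):
    """Convert megabits per second to the bytes per second the option expects."""
    return mbit * 1_000_000 // 8


def cap_pacing_rate(sock, mbit):
    """Try to cap the socket's pacing rate; return True if the kernel accepted it."""
    if not sys.platform.startswith("linux"):
        return False  # the option is Linux-specific
    try:
        sock.setsockopt(socket.SOL_SOCKET, SO_MAX_PACING_RATE,
                        mbit_to_bytes_per_sec(mbit))
        return True
    except OSError:
        return False  # for example, a kernel older than 3.13


with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    print("pacing cap applied:", cap_pacing_rate(s, 8))
```

As noted above, the cap only results in paced traffic when the flow goes through the FQ packet scheduler.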
1.10. TCP Fast Open enabled by default
TCP Fast Open is an optimization to the process of establishing a TCP connection that eliminates one round-trip time from certain kinds of TCP conversations, which can improve the load speed of web pages. Support for this feature, which requires userspace support, was added in Linux 3.6 and Linux 3.7. This release enables TCP Fast Open by default.
Code: commit
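To check how a given system is configured, the net.ipv4.tcp_fastopen sysctl can be read and decoded. The sketch below assumes the documented meaning of the two lowest bits (0x1 enables client support, 0x2 enables server support); the function names are hypothetical and other bits are ignored.

```python
TFO_CLIENT_ENABLE = 0x1  # bit meanings per the kernel's ip-sysctl documentation
TFO_SERVER_ENABLE = 0x2


def decode_tcp_fastopen(value):
    """Decode the client/server bits of the tcp_fastopen sysctl value."""
    return {
        "client": bool(value & TFO_CLIENT_ENABLE),
        "server": bool(value & TFO_SERVER_ENABLE),
    }


def read_tcp_fastopen(path="/proc/sys/net/ipv4/tcp_fastopen"):
    """Return the current sysctl value, or None if it cannot be read."""
    try:
        with open(path) as handle:
            return int(handle.read().split()[0])
    except (OSError, ValueError, IndexError):
        return None


current = read_tcp_fastopen()
if current is None:
    print("tcp_fastopen sysctl not available on this system")
else:
    print(current, decode_tcp_fastopen(current))
```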
1.11. NFC payments support
This release implements support for the Secure Element. A netlink API is available to enable, disable and discover NFC-attached secure elements (embedded or UICC ones). With some userspace help, this makes it possible to support NFC payments, which are used to implement financial transactions. Only the pn544 driver currently supports this API.
Code: commit
1.12. Support for the High-availability Seamless Redundancy protocol
High-availability Seamless Redundancy (HSR) is a redundancy protocol for Ethernet. It provides instant failover redundancy for such networks. It requires a special network topology where all nodes are connected in a ring (each node having two physical network interfaces). It is suited for applications that demand high availability and very short reaction time.
Code: commit
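The way a ring with two interfaces per node yields instant failover can be sketched with a toy simulation: a sender emits each frame in both directions around the ring and the receiver discards the copy that arrives second. The class and function names, the sequence-number bookkeeping and the broken_link parameter are assumptions of this sketch rather than details of the kernel driver.

```python
class HsrNode:
    """Toy receiver that delivers each (source, sequence) pair only once."""

    def __init__(self):
        self.seen = set()
        self.delivered = []

    def receive(self, src, seq, payload):
        if (src, seq) in self.seen:
            return False  # duplicate that travelled the other way round the ring
        self.seen.add((src, seq))
        self.delivered.append(payload)
        return True


def path_ok(ring_size, src, dst, step, broken_link):
    """Check whether a frame can travel from src to dst in one direction."""
    pos = src
    while pos != dst:
        nxt = (pos + step) % ring_size
        if broken_link is not None and {pos, nxt} == set(broken_link):
            return False
        pos = nxt
    return True


def send(nodes, src, dst, seq, payload, broken_link=None):
    """Send one frame both ways round the ring; return how many copies arrive."""
    copies = 0
    for step in (1, -1):
        if path_ok(len(nodes), src, dst, step, broken_link):
            copies += 1
            nodes[dst].receive(src, seq, payload)
    return copies


ring = [HsrNode() for _ in range(4)]
print(send(ring, src=0, dst=2, seq=1, payload="a"))
print(send(ring, src=0, dst=2, seq=2, payload="b", broken_link=(0, 1)))
print(ring[2].delivered)
```

In the second call one direction is cut, yet the frame still arrives through the other one without any reconfiguration, which is what makes the failover seamless.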
2. Drivers and architectures
All the driver and architecture-specific changes can be found in the Linux_3.13-DriversArch page.
3. Core
epoll: once a system has more than 10 CPUs, the scalability of SPECjbb falls over due to contention on the global 'epmutex', which is taken on EPOLL_CTL_ADD and EPOLL_CTL_DEL operations. This release improves locking to improve performance: on a 16-socket run, performance went from 35k jOPS to 125k jOPS. In addition, the benchmark went from scaling well on 10 sockets to scaling well on just over 40 sockets commit, commit
Allow magic sysrq key functions to be disabled in Kconfig commit
modules: remove rmmod --wait option. commit
iommu: Add event tracing feature to iommu commit
Add option to disable kernel compression commit
gcov: add support for gcc 4.7 gcov format commit
fuse: Implement writepages callback, improving mmaped writeout commit
seqcount: Add lockdep functionality to seqcount/seqlock structures commit
Provide a per-cpu preempt_count implementation commit
/proc/pid/smaps: show VM_SOFTDIRTY flag in VmFlags line commit
Add a generic associative array implementation. commit
RCU'd vfsmounts commit
Changes have been made to the slab allocator to improve its memory usage and performance. kmem_caches consisting of objects of 128 bytes or less now fit one more object in a slab, and a change to the management of free objects improves the locality of accesses, which improves performance in some microbenchmarks commit, commit
memcg: support hierarchical memory.numa_stats commit
Introduce movable_node boot option to enable the effects of CONFIG_MOVABLE_NODE commit
thp: khugepaged: add policy for finding target node commit
bcache: Incremental garbage collection. It means that there's less of a latency hit for doing garbage collection, which means bcache can gc more frequently (and do a better job of reclaiming from the cache), and it can coalesce across more btree nodes (improving space efficiency) commit
dm cache: add passthrough mode which is intended to be used when the cache contents are not known to be coherent with the origin device commit
dm cache: add cache block invalidation support commit
dm cache: cache shrinking support commit
virtio_blk: blk-mq support commit
Multi-queue aware null block test driver commit
Add FIEMAP_EXTENT_SHARED fiemap flag: like ocfs2, btrfs also allows extents to be shared by different inodes, and some userspace tools request this kind of 'space shared information' commit
Add new btrfs mount options: "commit=", which sets the interval of periodic commit in seconds (30 by default), and "rescan_uuid_tree", which forces a check and rebuild of the UUID tree commit
XFS: For v5 filesystems scale the inode cluster size with the size of the inode so that we keep a constant 32 inodes per cluster ratio for all inode IO commit
F2FS: Introduce CONFIG_F2FS_CHECK_FS to disable BUG_ONs which check the file system consistency at runtime and cost performance commit
SMB2/SMB3 Copy offload support (refcopy, aka "cp --reflink") commit
Query File System alignment, and the preferred (for performance) sector size and whether the underlying disk has no seek penalty (like SSD), make it visible in /proc/fs/cifs/DebugData for debugging purposes commit
Query network adapter info commit
Allow setting per-file compression commit, commit
Add a lightweight Berkeley Packet Filter-based traffic classifier that can serve as a flexible alternative to ematch-based tree classification, especially now that the BPF filter engine can also be JITed in the kernel commit
ipv6: Add support for IPsec virtual tunnel interfaces, which provide a routable interface for IPsec tunnel endpoints commit
ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE. Sockets marked with IP_PMTUDISC_INTERFACE won't do path mtu discovery, their sockets won't accept and install new path mtu information and they will always use the interface mtu for outgoing packets. It is guaranteed that the packet is not fragmented locally. The purpose behind this flag is to avoid PMTU attacks, particularly on DNS servers commit
ipv4: Allow unprivileged users to use network namespaces sysctls commit
Create sysfs symlinks for neighbour devices commit
ipv6: sit: add GSO/TSO support commit
ipip: add GSO/TSO support commit
packet scheduler: htb: support of 64-bit rates commit
openvswitch: TCP flags matching support. commit
Add network namespaces commit
Support comments for ipset entries in the core. commit, in bitmap-type ipsets commit, in hash-type ipsets commit, and in the list-type ipset. commit
Enable ipset port set types to match IPv4 packet fragments for protocols that don't have ports (or whose port information isn't supported by ipset) commit
Add hash:net,net set, providing the ability to configure pairs of subnets commit
Add hash:net,port,net set, providing similar functionality to ip,port,net but permits arbitrary size subnets for both the first and last parameter commit
Add NFC digital layer implementation: Most NFC chipsets implement the NFC digital layer in firmware, but others only implement the NFC analog layer and expect the host to implement this layer
Add support for NFC-A technology at 106 kbits/s commit
Add support for NFC-F technology at 212 kbits/s and 424 kbits/s commit
Add initiator NFC-DEP support commit
Add target NFC-DEP support commit
Implement the mechanism used to send commands to the driver in initiator mode commit
Digital Protocol stack implementation commit
Introduce new HCI socket channel that allows user applications to take control over a specific HCI device. The application gains exclusive access to this device and forces the kernel to stay away and not manage it commit, commit
Add support for creating virtual AMP controllers commit
Add support for setting Device Under Test mode commit
Add a new mgmt_set_bredr command for enabling/disabling BR/EDR functionality. This can be convenient when one wants to make a dual-mode controller behave like a single-mode one. The command is only available for dual-mode controllers and requires that Bluetooth LE is enabled before using it commit
Add management command for setting static address on dual-mode BR/EDR/LE and LE only controllers where it is possible to configure a random static address commit
Add new management setting for enabling and disabling Bluetooth LE advertising commit, commit
tcp_memcontrol: Remove setting cgroup settings via sysctl, because the code is broken in both design and implementation and does not implement the functionality for which it was written commit
wifi: implement mesh channel switch userspace API commit
wifi: enable channels 52-64 and 100-144 for world roaming commit
wifi: enable DFS for IBSS mode commit, commit, add support for CSA in IBSS mode commit
B.A.T.M.A.N.: remove vis functionality (replaced by a userspace program) commit, add per VLAN interface attribute framework commit, add the T(ype) V(ersion) L(ength) V(alue) framework commit, commit, commit, commit
caam: Add platform driver for Job Ring, which are part of Freescale's Cryptographic Accelerator and Assurance Module (CAAM) commit
random: Our mixing functions were analyzed, and the analysis suggested a slight change to improve them, which has been implemented commit
random32: upgrade taus88 generator to taus113 from errata paper commit
hyperv: fb: add blanking support commit, add a PCI stub so that the hyperv framebuffer driver binds to the PCI device, letting the Linux kernel and userspace know that a proper kernel driver is active for the device commit
kvm: Add VFIO device commit
xen-netback: add support for IPv6 checksum offload to guest commit
xen-netback: enable IPv6 TCP GSO to the guest commit
xen-netfront: convert to GRO API commit
SELinux: Enable setting security contexts on rootfs (ramfs) inodes. commit
SELinux: Reduce overhead that harmed the high_systime workload of the AIM7 benchmark commit
SELinux: Add the always_check_network policy capability which, when enabled, treats SECMARK as enabled, even if there are no netfilter SECMARK rules, and treats peer labeling as enabled, even if there is no Netlabel or labeled IPsec configuration commit
Smack treats setting a file read lock as the write operation that it is. Unfortunately, many programs assume that setting a read lock is a read operation, and don't work well in the Smack environment. This release implements a new access mode (lock) to address this problem commit
Smack: When the ptrace security hooks were split the addition of a mode parameter was not taken advantage of in the Smack ptrace access check. This changes the access check from always looking for read and write access to using the passed mode commit
audit: new feature which only grants processes with CAP_AUDIT_CONTROL the ability to unset their loginuid commit
audit: feature which allows userspace to set it such that the loginuid is absolutely immutable, even if you have CAP_AUDIT_CONTROL. CONFIG_AUDIT_LOGINUID_IMMUTABLE has been removed commit, commit
keys: Expand the capacity of a keyring commit
keys: Implement a big key type that can save to tmpfs commit
keys: Add per-user namespace registers for persistent per-UID kerberos caches commit
ima: add audit log support for larger hashes commit
ima: enable support for larger default filedata hash algorithms commit
ima: new templates management mechanism commit
perf record: Add option --force-per-cpu to force per-cpu mmaps commit
perf record: Add abort_tx,no_tx,in_tx branch filter options to perf record -j commit
perf report: Add --max-stack option to limit callchain stack scan commit
perf top: Add new option --ignore-vmlinux to explicitly indicate that we do not want to use these kernels and just use what we have (kallsyms) commit
perf tools: Add possibility to specify mmap size via -m/--mmap-pages by appending unit size character (B/K/M/G) to the number commit
perf top: Add --max-stack option to limit callchain stack scan commit
perf stat: Add perf stat --transaction to print the basic transactional execution statistics commit
perf: Add support for recording transaction flags commit
perf: Support sorting by in_tx or abort branch flags commit
ftrace: Add set_graph_notrace filter, analogous to set_ftrace_notrace and can be used for eliminating uninteresting part of function graph trace output commit
perf bench sched: Add --threaded option, allow the measurement of thread versus process context switch performance commit
perf buildid-cache: Add ability to add kcore to the cache commit
German language: heise.de Kernel-Log Was 3.13 bringt (1): Dateisysteme und Storage, (2) Netzwerk, (3) Infrastruktur , (4) Treiber, (5) Grafiktreiber