libbpf

mirror of https://github.com/netdata/libbpf.git synced 2026-06-16 19:49:08 +08:00

Author	SHA1	Message	Date
Vladimir Kobal	673424c561	Add fallback to an old attaching method v0.0.9_netdata-1	2020-07-30 13:38:27 +03:00
Vladimir Kobal	d2feaff998	Skip probing for loading	2020-07-30 13:37:26 +03:00
Vladimir Kobal	0d4b75d30e	Skip kernel version check	2020-07-30 13:35:35 +03:00
Andrii Nakryiko	d7b2934cf9	sync: latest libbpf changes from kernel Syncing latest libbpf commits from kernel repository. Baseline bpf-next commit: 69119673bd50b176ded34032fadd41530fb5af21 Checkpoint bpf-next commit: 4e15507fea70c0c312d79610efa46b6853ccf8e0 Baseline bpf commit: 6903cdae9f9f08d61e49c16cbef11c293e33a615 Checkpoint bpf commit: 4e15507fea70c0c312d79610efa46b6853ccf8e0 Andrii Nakryiko (1): libbpf: Forward-declare bpf_stats_type for systems with outdated UAPI headers src/bpf.h \| 2 ++ 1 file changed, 2 insertions(+) -- 2.24.1 v0.0.9	2020-06-22 15:43:44 -07:00
Andrii Nakryiko	c83d2166e8	libbpf: Forward-declare bpf_stats_type for systems with outdated UAPI headers Systems that doesn't yet have the very latest linux/bpf.h header, enum bpf_stats_type will be undefined, causing compilation warnings. Prevents this by forward-declaring enum. Fixes: 0bee106716cf ("libbpf: Add support for command BPF_ENABLE_STATS") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20200621031159.2279101-1-andriin@fb.com	2020-06-22 15:43:44 -07:00
Andrii Nakryiko	fb27968bf1	vmtests: blacklist 5.5 test and temporary blacklist core_reloc test Permanently blacklist load_bytes_relative test on 5.5 due to missing functionality. Also temporarily blacklist core_reloc test due to failure on latest kernel. Signed-off-by: Andrii Nakryiko <andriin@fb.com>	2020-06-17 11:48:22 -07:00
Andrii Nakryiko	d6ae406429	sync: latest libbpf changes from kernel Syncing latest libbpf commits from kernel repository. Baseline bpf-next commit: cb8e59cc87201af93dfbb6c3dccc8fcad72a09c2 Checkpoint bpf-next commit: 69119673bd50b176ded34032fadd41530fb5af21 Baseline bpf commit: 47f6bc4ce1ff70d7ba0924c2f1c218c96cd585fb Checkpoint bpf commit: 6903cdae9f9f08d61e49c16cbef11c293e33a615 Andrii Nakryiko (2): libbpf: Support pre-initializing .bss global variables bpf: Fix definition of bpf_ringbuf_output() helper in UAPI comments include/uapi/linux/bpf.h \| 2 +- src/libbpf.c \| 4 ---- 2 files changed, 1 insertion(+), 5 deletions(-) -- 2.24.1	2020-06-17 11:48:22 -07:00
Andrii Nakryiko	cb174c5b8d	sync: auto-generate latest BPF helpers Latest changes to BPF helper definitions.	2020-06-17 11:48:22 -07:00
Andrii Nakryiko	17f747ed38	bpf: Fix definition of bpf_ringbuf_output() helper in UAPI comments Fix definition of bpf_ringbuf_output() in UAPI header comments, which is used to generate libbpf's bpf_helper_defs.h header. Return value is a number (error code), not a pointer. Fixes: 457f44363a88 ("bpf: Implement BPF ring buffer and verifier support for it") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200615214926.3638836-1-andriin@fb.com	2020-06-17 11:48:22 -07:00
Andrii Nakryiko	bf34234885	libbpf: Support pre-initializing .bss global variables Remove invalid assumption in libbpf that .bss map doesn't have to be updated in kernel. With addition of skeleton and memory-mapped initialization image, .bss doesn't have to be all zeroes when BPF map is created, because user-code might have initialized those variables from user-space. Fixes: eba9c5f498a1 ("libbpf: Refactor global data map initialization") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200612194504.557844-1-andriin@fb.com	2020-06-17 11:48:22 -07:00
Andrii Nakryiko	46c272f9b4	sync: don't check and warn about non-empty merges anymore Initial versions of sync script couldn't handle non-empty merges. But since then, script became smarter, more interactive and thus more powerful and can handle some complicated situations easily on its own, while falling back to human intervention for even more complicated situations. This non-empty merge check has outlived its purpose and is just an annoying bump in sync process. Drop it. Signed-off-by: Andrii Nakryiko <andriin@fb.com>	2020-06-10 13:59:07 -07:00
Andrii Nakryiko	40e69c9538	vmtests: un-blacklist ringbuf and cls_redirect selftests Both tests should be fixed now. Signed-off-by: Andrii Nakryiko <andriin@fb.com>	2020-06-10 13:58:45 -07:00
Andrii Nakryiko	a975d8ea28	sync: latest libbpf changes from kernel Syncing latest libbpf commits from kernel repository. Baseline bpf-next commit: 9bc499befeef07a4d79f4924bfca05634ad8fc97 Checkpoint bpf-next commit: cb8e59cc87201af93dfbb6c3dccc8fcad72a09c2 Baseline bpf commit: bdc48fa11e46f867ea4d75fa59ee87a7f48be144 Checkpoint bpf commit: 47f6bc4ce1ff70d7ba0924c2f1c218c96cd585fb Andrii Nakryiko (1): libbpf: Handle GCC noreturn-turned-volatile quirk Arnaldo Carvalho de Melo (1): libbpf: Define __WORDSIZE if not available Jesper Dangaard Brouer (1): bpf: Selftests and tools use struct bpf_devmap_val from uapi include/uapi/linux/bpf.h \| 13 +++++++++++++ src/btf_dump.c \| 33 ++++++++++++++++++++++++--------- src/hashmap.h \| 7 +++---- 3 files changed, 40 insertions(+), 13 deletions(-) -- 2.24.1	2020-06-10 13:58:45 -07:00
Andrii Nakryiko	45f7113925	libbpf: Handle GCC noreturn-turned-volatile quirk Handle a GCC quirk of emitting extra volatile modifier in DWARF (and subsequently preserved in BTF by pahole) for function pointers marked as __attribute__((noreturn)). This was the way to mark such functions before GCC 2.5 added noreturn attribute. Drop such func_proto modifiers, similarly to how it's done for array (also to handle GCC quirk/bug). Such volatile attribute is emitted by GCC only, so existing selftests can't express such test. Simple repro is like this (compiled with GCC + BTF generated by pahole): struct my_struct { void __attribute__((noreturn)) (fn)(int); }; struct my_struct a; Without this fix, output will be: struct my_struct { voidvolatile (fn)(int); }; With the fix: struct my_struct { void (*fn)(int); }; Fixes: 351131b51c7a ("libbpf: add btf_dump API for BTF-to-C conversion") Reported-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Link: https://lore.kernel.org/bpf/20200610052335.2862559-1-andriin@fb.com	2020-06-10 13:58:45 -07:00
Arnaldo Carvalho de Melo	6816734203	libbpf: Define __WORDSIZE if not available Some systems, such as Android, don't have a define for __WORDSIZE, do it in terms of __SIZEOF_LONG__, as done in perf since 2012: http://git.kernel.org/torvalds/c/3f34f6c0233ae055b5 For reference: https://gcc.gnu.org/onlinedocs/cpp/Common-Predefined-Macros.html I build tested it here and Andrii did some Travis CI build tests too. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200608161150.GA3073@kernel.org	2020-06-10 13:58:45 -07:00
Jesper Dangaard Brouer	11d2a59689	bpf: Selftests and tools use struct bpf_devmap_val from uapi Sync tools uapi bpf.h header file and update selftests that use struct bpf_devmap_val. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/159170951195.2102545.1833108712124273987.stgit@firesoul	2020-06-10 13:58:45 -07:00
Andrii Nakryiko	8c7527ea88	travis-ci: fix travis_terminate invocation travis_terminate expects integer argument for exit code. Add it. Signed-off-by: Andrii Nakryiko <andriin@fb.com>	2020-06-10 12:12:01 -07:00
Toke Høiland-Jørgensen	c569e03985	README: Add BTF and Clang information for Arch Linux Arch recently added BTF to their distribution kernels - see https://bugs.archlinux.org/task/66260 Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>	2020-06-08 09:33:59 -07:00
Andrii Nakryiko	1862741fb0	vmtest: disable ringbuf test on latest for now ringbuf selftest is flaky, disable it for now. Signed-off-by: Andrii Nakryiko <andriin@fb.com>	2020-06-04 10:48:08 -07:00
Andrii Nakryiko	6a269cf458	README: add OpenSUSE BTF availability info Add note about OpenSUSE Tumbleweed and BTF.	2020-06-04 10:42:40 -07:00
Andrii Nakryiko	6e15a022db	README: add BTF and CO-RE info Add list of Linux distributions with kernel BTF built-in. Give few useful links to BPF CO-RE-related material to help users get started.	2020-06-03 11:26:00 -07:00
Andrii Nakryiko	20d9816471	vmtest: temporary blacklist changes to make CI green Coarse-grained blacklisting until test_progs blacklisting w/ subtests works better. Signed-off-by: Andrii Nakryiko <andriin@fb.com>	2020-06-02 18:09:36 -07:00
Andrii Nakryiko	538b3f4ce7	sync: latest libbpf changes from kernel Syncing latest libbpf commits from kernel repository. Baseline bpf-next commit: 9a25c1df24a6fea9dc79eec950453c4e00f707fd Checkpoint bpf-next commit: 9bc499befeef07a4d79f4924bfca05634ad8fc97 Baseline bpf commit: bdc48fa11e46f867ea4d75fa59ee87a7f48be144 Checkpoint bpf commit: bdc48fa11e46f867ea4d75fa59ee87a7f48be144 Daniel Borkmann (2): bpf: Fix up bpf_skb_adjust_room helper's skb csum setting bpf: Add csum_level helper for fixing up csum levels include/uapi/linux/bpf.h \| 51 +++++++++++++++++++++++++++++++++++++++- 1 file changed, 50 insertions(+), 1 deletion(-) -- 2.24.1	2020-06-02 18:09:36 -07:00
Andrii Nakryiko	f2610ca9cf	sync: auto-generate latest BPF helpers Latest changes to BPF helper definitions.	2020-06-02 18:09:36 -07:00
Daniel Borkmann	adb5dd203c	bpf: Add csum_level helper for fixing up csum levels Add a bpf_csum_level() helper which BPF programs can use in combination with bpf_skb_adjust_room() when they pass in BPF_F_ADJ_ROOM_NO_CSUM_RESET flag to the latter to avoid falling back to CHECKSUM_NONE. The bpf_csum_level() allows to adjust CHECKSUM_UNNECESSARY skb->csum_levels via BPF_CSUM_LEVEL_{INC,DEC} which calls __skb_{incr,decr}_checksum_unnecessary() on the skb. The helper also allows a BPF_CSUM_LEVEL_RESET which sets the skb's csum to CHECKSUM_NONE as well as a BPF_CSUM_LEVEL_QUERY to just return the current level. Without this helper, there is no way to otherwise adjust the skb->csum_level. I did not add an extra dummy flags as there is plenty of free bitspace in level argument itself iff ever needed in future. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Lorenz Bauer <lmb@cloudflare.com> Link: https://lore.kernel.org/bpf/279ae3717cb3d03c0ffeb511493c93c450a01e1a.1591108731.git.daniel@iogearbox.net	2020-06-02 18:09:36 -07:00
Daniel Borkmann	3aadd91e97	bpf: Fix up bpf_skb_adjust_room helper's skb csum setting Lorenz recently reported: In our TC classifier cls_redirect [0], we use the following sequence of helper calls to decapsulate a GUE (basically IP + UDP + custom header) encapsulated packet: bpf_skb_adjust_room(skb, -encap_len, BPF_ADJ_ROOM_MAC, BPF_F_ADJ_ROOM_FIXED_GSO) bpf_redirect(skb->ifindex, BPF_F_INGRESS) It seems like some checksums of the inner headers are not validated in this case. For example, a TCP SYN packet with invalid TCP checksum is still accepted by the network stack and elicits a SYN ACK. [...] That is, we receive the following packet from the driver: \| ETH \| IP \| UDP \| GUE \| IP \| TCP \| skb->ip_summed == CHECKSUM_UNNECESSARY ip_summed is CHECKSUM_UNNECESSARY because our NICs do rx checksum offloading. On this packet we run skb_adjust_room_mac(-encap_len), and get the following: \| ETH \| IP \| TCP \| skb->ip_summed == CHECKSUM_UNNECESSARY Note that ip_summed is still CHECKSUM_UNNECESSARY. After bpf_redirect()'ing into the ingress, we end up in tcp_v4_rcv(). There, skb_checksum_init() is turned into a no-op due to CHECKSUM_UNNECESSARY. The bpf_skb_adjust_room() helper is not aware of protocol specifics. Internally, it handles the CHECKSUM_COMPLETE case via skb_postpull_rcsum(), but that does not cover CHECKSUM_UNNECESSARY. In this case skb->csum_level of the original skb prior to bpf_skb_adjust_room() call was 0, that is, covering UDP. Right now there is no way to adjust the skb->csum_level. NICs that have checksum offload disabled (CHECKSUM_NONE) or that support CHECKSUM_COMPLETE are not affected. Use a safe default for CHECKSUM_UNNECESSARY by resetting to CHECKSUM_NONE and add a flag to the helper called BPF_F_ADJ_ROOM_NO_CSUM_RESET that allows users from opting out. Opting out is useful for the case where we don't remove/add full protocol headers, or for the case where a user wants to adjust the csum level manually e.g. through bpf_csum_level() helper that is added in subsequent patch. The bpf_skb_proto_{4_to_6,6_to_4}() for NAT64/46 translation from the BPF bpf_skb_change_proto() helper uses bpf_skb_net_hdr_{push,pop}() pair internally as well but doesn't change layers, only transitions between v4 to v6 and vice versa, therefore no adoption is required there. [0] https://lore.kernel.org/bpf/20200424185556.7358-1-lmb@cloudflare.com/ Fixes: 2be7e212d541 ("bpf: add bpf_skb_adjust_room helper") Reported-by: Lorenz Bauer <lmb@cloudflare.com> Reported-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Alan Maguire <alan.maguire@oracle.com> Link: https://lore.kernel.org/bpf/CACAyw9-uU_52esMd1JjuA80fRPHJv5vsSg8GnfW3t_qDU4aVKQ@mail.gmail.com/ Link: https://lore.kernel.org/bpf/11a90472e7cce83e76ddbfce81fdfce7bfc68808.1591108731.git.daniel@iogearbox.net	2020-06-02 18:09:36 -07:00
Andrii Nakryiko	1206ab0e75	vmtest: optionally adjust selftest files depending on kernel version Some selftests can't be compiled on older kernels. This allows to fix these problems, if necessary. Signed-off-by: Andrii Nakryiko <andriin@fb.com>	2020-06-01 22:22:32 -07:00
Andrii Nakryiko	70eac9941d	Makefile: add ringbuf.o to the list of object files Add newly added ringbuf.o to the list of OBJS. Signed-off-by: Andrii Nakryiko <andriin@fb.com>	2020-06-01 22:22:32 -07:00
Andrii Nakryiko	2fdbf42f98	sync: latest libbpf changes from kernel Syncing latest libbpf commits from kernel repository. Baseline bpf-next commit: dda18a5c0b75461d1ed228f80b59c67434b8d601 Checkpoint bpf-next commit: 9a25c1df24a6fea9dc79eec950453c4e00f707fd Baseline bpf commit: f85c1598ddfe83f61d0656bd1d2025fa3b148b99 Checkpoint bpf commit: bdc48fa11e46f867ea4d75fa59ee87a7f48be144 Alexei Starovoitov (1): tools/bpf: sync bpf.h Andrii Nakryiko (3): bpf: Implement BPF ring buffer and verifier support for it libbpf: Add BPF ring buffer support libbpf: Add _GNU_SOURCE for reallocarray to ringbuf.c David Ahern (3): bpf: Add support to attach bpf program to a devmap entry xdp: Add xdp_txq_info to xdp_buff libbpf: Add SEC name for xdp programs attached to device map Eelco Chaudron (2): libbpf: Add API to consume the perf ring buffer content libbpf: Fix perf_buffer__free() API for sparse allocs Jakub Sitnicki (2): bpf: Add link-based BPF program attachment to network namespace libbpf: Add support for bpf_link-based netns attachment John Fastabend (1): bpf, sk_msg: Add get socket storage helpers include/uapi/linux/bpf.h \| 95 ++++++++++++- src/libbpf.c \| 49 ++++++- src/libbpf.h \| 24 ++++ src/libbpf.map \| 7 + src/libbpf_probes.c \| 5 + src/ringbuf.c \| 288 +++++++++++++++++++++++++++++++++++++++ 6 files changed, 461 insertions(+), 7 deletions(-) create mode 100644 src/ringbuf.c -- 2.24.1	2020-06-01 22:22:32 -07:00
Andrii Nakryiko	365e4805a1	sync: auto-generate latest BPF helpers Latest changes to BPF helper definitions.	2020-06-01 22:22:32 -07:00
Jakub Sitnicki	890f25520a	libbpf: Add support for bpf_link-based netns attachment Add bpf_program__attach_nets(), which uses LINK_CREATE subcommand to create an FD-based kernel bpf_link, for attach types tied to network namespace, that is BPF_FLOW_DISSECTOR for the moment. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200531082846.2117903-7-jakub@cloudflare.com	2020-06-01 22:22:32 -07:00
Jakub Sitnicki	fbdee96fa1	bpf: Add link-based BPF program attachment to network namespace Extend bpf() syscall subcommands that operate on bpf_link, that is LINK_CREATE, LINK_UPDATE, OBJ_GET_INFO, to accept attach types tied to network namespaces (only flow dissector at the moment). Link-based and prog-based attachment can be used interchangeably, but only one can exist at a time. Attempts to attach a link when a prog is already attached directly, and the other way around, will be met with -EEXIST. Attempts to detach a program when link exists result in -EINVAL. Attachment of multiple links of same attach type to one netns is not supported with the intention to lift the restriction when a use-case presents itself. Because of that link create returns -E2BIG when trying to create another netns link, when one already exists. Link-based attachments to netns don't keep a netns alive by holding a ref to it. Instead links get auto-detached from netns when the latter is being destroyed, using a pernet pre_exit callback. When auto-detached, link lives in defunct state as long there are open FDs for it. -ENOLINK is returned if a user tries to update a defunct link. Because bpf_link to netns doesn't hold a ref to struct net, special care is taken when releasing, updating, or filling link info. The netns might be getting torn down when any of these link operations are in progress. That is why auto-detach and update/release/fill_info are synchronized by the same mutex. Also, link ops have to always check if auto-detach has not happened yet and if netns is still alive (refcnt > 0). Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200531082846.2117903-5-jakub@cloudflare.com	2020-06-01 22:22:32 -07:00
Andrii Nakryiko	f54c56be0d	libbpf: Add _GNU_SOURCE for reallocarray to ringbuf.c On systems with recent enough glibc, reallocarray compat won't kick in, so reallocarray() itself has to come from stdlib.h include. But _GNU_SOURCE is necessary to enable it. So add it. Fixes: bf99c936f947 ("libbpf: Add BPF ring buffer support") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20200601202601.2139477-1-andriin@fb.com	2020-06-01 22:22:32 -07:00
Alexei Starovoitov	8dc4b38871	tools/bpf: sync bpf.h Sync bpf.h into tool/include/uapi/ Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2020-06-01 22:22:32 -07:00
David Ahern	ed023acd35	libbpf: Add SEC name for xdp programs attached to device map Support SEC("xdp_devmap*") as a short cut for loading the program with type BPF_PROG_TYPE_XDP and expected attach type BPF_XDP_DEVMAP. Signed-off-by: David Ahern <dsahern@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20200529220716.75383-5-dsahern@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2020-06-01 22:22:32 -07:00
David Ahern	ff3116bfcb	xdp: Add xdp_txq_info to xdp_buff Add xdp_txq_info as the Tx counterpart to xdp_rxq_info. At the moment only the device is added. Other fields (queue_index) can be added as use cases arise. >From a UAPI perspective, add egress_ifindex to xdp context for bpf programs to see the Tx device. Update the verifier to only allow accesses to egress_ifindex by XDP programs with BPF_XDP_DEVMAP expected attach type. Signed-off-by: David Ahern <dsahern@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20200529220716.75383-4-dsahern@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2020-06-01 22:22:32 -07:00
David Ahern	65f4b3ba4c	bpf: Add support to attach bpf program to a devmap entry Add BPF_XDP_DEVMAP attach type for use with programs associated with a DEVMAP entry. Allow DEVMAPs to associate a program with a device entry by adding a bpf_prog.fd to 'struct bpf_devmap_val'. Values read show the program id, so the fd and id are a union. bpf programs can get access to the struct via vmlinux.h. The program associated with the fd must have type XDP with expected attach type BPF_XDP_DEVMAP. When a program is associated with a device index, the program is run on an XDP_REDIRECT and before the buffer is added to the per-cpu queue. At this point rxq data is still valid; the next patch adds tx device information allowing the prorgam to see both ingress and egress device indices. XDP generic is skb based and XDP programs do not work with skb's. Block the use case by walking maps used by a program that is to be attached via xdpgeneric and fail if any of them are DEVMAP / DEVMAP_HASH with Block attach of BPF_XDP_DEVMAP programs to devices. Signed-off-by: David Ahern <dsahern@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20200529220716.75383-3-dsahern@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2020-06-01 22:22:32 -07:00
Andrii Nakryiko	e1bf7a787e	libbpf: Add BPF ring buffer support Declaring and instantiating BPF ring buffer doesn't require any changes to libbpf, as it's just another type of maps. So using existing BTF-defined maps syntax with __uint(type, BPF_MAP_TYPE_RINGBUF) and __uint(max_elements, <size-of-ring-buf>) is all that's necessary to create and use BPF ring buffer. This patch adds BPF ring buffer consumer to libbpf. It is very similar to perf_buffer implementation in terms of API, but also attempts to fix some minor problems and inconveniences with existing perf_buffer API. ring_buffer support both single ring buffer use case (with just using ring_buffer__new()), as well as allows to add more ring buffers, each with its own callback and context. This allows to efficiently poll and consume multiple, potentially completely independent, ring buffers, using single epoll instance. The latter is actually a problem in practice for applications that are using multiple sets of perf buffers. They have to create multiple instances for struct perf_buffer and poll them independently or in a loop, each approach having its own problems (e.g., inability to use a common poll timeout). struct ring_buffer eliminates this problem by aggregating many independent ring buffer instances under the single "ring buffer manager". Second, perf_buffer's callback can't return error, so applications that need to stop polling due to error in data or data signalling the end, have to use extra mechanisms to signal that polling has to stop. ring_buffer's callback can return error, which will be passed through back to user code and can be acted upon appropariately. Two APIs allow to consume ring buffer data: - ring_buffer__poll(), which will wait for data availability notification and will consume data only from reported ring buffer(s); this API allows to efficiently use resources by reading data only when it becomes available; - ring_buffer__consume(), will attempt to read new records regardless of data availablity notification sub-system. This API is useful for cases when lowest latency is required, in expense of burning CPU resources. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200529075424.3139988-3-andriin@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2020-06-01 22:22:32 -07:00
Andrii Nakryiko	17a6d61898	bpf: Implement BPF ring buffer and verifier support for it This commit adds a new MPSC ring buffer implementation into BPF ecosystem, which allows multiple CPUs to submit data to a single shared ring buffer. On the consumption side, only single consumer is assumed. Motivation ---------- There are two distinctive motivators for this work, which are not satisfied by existing perf buffer, which prompted creation of a new ring buffer implementation. - more efficient memory utilization by sharing ring buffer across CPUs; - preserving ordering of events that happen sequentially in time, even across multiple CPUs (e.g., fork/exec/exit events for a task). These two problems are independent, but perf buffer fails to satisfy both. Both are a result of a choice to have per-CPU perf ring buffer. Both can be also solved by having an MPSC implementation of ring buffer. The ordering problem could technically be solved for perf buffer with some in-kernel counting, but given the first one requires an MPSC buffer, the same solution would solve the second problem automatically. Semantics and APIs ------------------ Single ring buffer is presented to BPF programs as an instance of BPF map of type BPF_MAP_TYPE_RINGBUF. Two other alternatives considered, but ultimately rejected. One way would be to, similar to BPF_MAP_TYPE_PERF_EVENT_ARRAY, make BPF_MAP_TYPE_RINGBUF could represent an array of ring buffers, but not enforce "same CPU only" rule. This would be more familiar interface compatible with existing perf buffer use in BPF, but would fail if application needed more advanced logic to lookup ring buffer by arbitrary key. HASH_OF_MAPS addresses this with current approach. Additionally, given the performance of BPF ringbuf, many use cases would just opt into a simple single ring buffer shared among all CPUs, for which current approach would be an overkill. Another approach could introduce a new concept, alongside BPF map, to represent generic "container" object, which doesn't necessarily have key/value interface with lookup/update/delete operations. This approach would add a lot of extra infrastructure that has to be built for observability and verifier support. It would also add another concept that BPF developers would have to familiarize themselves with, new syntax in libbpf, etc. But then would really provide no additional benefits over the approach of using a map. BPF_MAP_TYPE_RINGBUF doesn't support lookup/update/delete operations, but so doesn't few other map types (e.g., queue and stack; array doesn't support delete, etc). The approach chosen has an advantage of re-using existing BPF map infrastructure (introspection APIs in kernel, libbpf support, etc), being familiar concept (no need to teach users a new type of object in BPF program), and utilizing existing tooling (bpftool). For common scenario of using a single ring buffer for all CPUs, it's as simple and straightforward, as would be with a dedicated "container" object. On the other hand, by being a map, it can be combined with ARRAY_OF_MAPS and HASH_OF_MAPS map-in-maps to implement a wide variety of topologies, from one ring buffer for each CPU (e.g., as a replacement for perf buffer use cases), to a complicated application hashing/sharding of ring buffers (e.g., having a small pool of ring buffers with hashed task's tgid being a look up key to preserve order, but reduce contention). Key and value sizes are enforced to be zero. max_entries is used to specify the size of ring buffer and has to be a power of 2 value. There are a bunch of similarities between perf buffer (BPF_MAP_TYPE_PERF_EVENT_ARRAY) and new BPF ring buffer semantics: - variable-length records; - if there is no more space left in ring buffer, reservation fails, no blocking; - memory-mappable data area for user-space applications for ease of consumption and high performance; - epoll notifications for new incoming data; - but still the ability to do busy polling for new data to achieve the lowest latency, if necessary. BPF ringbuf provides two sets of APIs to BPF programs: - bpf_ringbuf_output() allows to copy data from one place to a ring buffer, similarly to bpf_perf_event_output(); - bpf_ringbuf_reserve()/bpf_ringbuf_commit()/bpf_ringbuf_discard() APIs split the whole process into two steps. First, a fixed amount of space is reserved. If successful, a pointer to a data inside ring buffer data area is returned, which BPF programs can use similarly to a data inside array/hash maps. Once ready, this piece of memory is either committed or discarded. Discard is similar to commit, but makes consumer ignore the record. bpf_ringbuf_output() has disadvantage of incurring extra memory copy, because record has to be prepared in some other place first. But it allows to submit records of the length that's not known to verifier beforehand. It also closely matches bpf_perf_event_output(), so will simplify migration significantly. bpf_ringbuf_reserve() avoids the extra copy of memory by providing a memory pointer directly to ring buffer memory. In a lot of cases records are larger than BPF stack space allows, so many programs have use extra per-CPU array as a temporary heap for preparing sample. bpf_ringbuf_reserve() avoid this needs completely. But in exchange, it only allows a known constant size of memory to be reserved, such that verifier can verify that BPF program can't access memory outside its reserved record space. bpf_ringbuf_output(), while slightly slower due to extra memory copy, covers some use cases that are not suitable for bpf_ringbuf_reserve(). The difference between commit and discard is very small. Discard just marks a record as discarded, and such records are supposed to be ignored by consumer code. Discard is useful for some advanced use-cases, such as ensuring all-or-nothing multi-record submission, or emulating temporary malloc()/free() within single BPF program invocation. Each reserved record is tracked by verifier through existing reference-tracking logic, similar to socket ref-tracking. It is thus impossible to reserve a record, but forget to submit (or discard) it. bpf_ringbuf_query() helper allows to query various properties of ring buffer. Currently 4 are supported: - BPF_RB_AVAIL_DATA returns amount of unconsumed data in ring buffer; - BPF_RB_RING_SIZE returns the size of ring buffer; - BPF_RB_CONS_POS/BPF_RB_PROD_POS returns current logical possition of consumer/producer, respectively. Returned values are momentarily snapshots of ring buffer state and could be off by the time helper returns, so this should be used only for debugging/reporting reasons or for implementing various heuristics, that take into account highly-changeable nature of some of those characteristics. One such heuristic might involve more fine-grained control over poll/epoll notifications about new data availability in ring buffer. Together with BPF_RB_NO_WAKEUP/BPF_RB_FORCE_WAKEUP flags for output/commit/discard helpers, it allows BPF program a high degree of control and, e.g., more efficient batched notifications. Default self-balancing strategy, though, should be adequate for most applications and will work reliable and efficiently already. Design and implementation ------------------------- This reserve/commit schema allows a natural way for multiple producers, either on different CPUs or even on the same CPU/in the same BPF program, to reserve independent records and work with them without blocking other producers. This means that if BPF program was interruped by another BPF program sharing the same ring buffer, they will both get a record reserved (provided there is enough space left) and can work with it and submit it independently. This applies to NMI context as well, except that due to using a spinlock during reservation, in NMI context, bpf_ringbuf_reserve() might fail to get a lock, in which case reservation will fail even if ring buffer is not full. The ring buffer itself internally is implemented as a power-of-2 sized circular buffer, with two logical and ever-increasing counters (which might wrap around on 32-bit architectures, that's not a problem): - consumer counter shows up to which logical position consumer consumed the data; - producer counter denotes amount of data reserved by all producers. Each time a record is reserved, producer that "owns" the record will successfully advance producer counter. At that point, data is still not yet ready to be consumed, though. Each record has 8 byte header, which contains the length of reserved record, as well as two extra bits: busy bit to denote that record is still being worked on, and discard bit, which might be set at commit time if record is discarded. In the latter case, consumer is supposed to skip the record and move on to the next one. Record header also encodes record's relative offset from the beginning of ring buffer data area (in pages). This allows bpf_ringbuf_commit()/bpf_ringbuf_discard() to accept only the pointer to the record itself, without requiring also the pointer to ring buffer itself. Ring buffer memory location will be restored from record metadata header. This significantly simplifies verifier, as well as improving API usability. Producer counter increments are serialized under spinlock, so there is a strict ordering between reservations. Commits, on the other hand, are completely lockless and independent. All records become available to consumer in the order of reservations, but only after all previous records where already committed. It is thus possible for slow producers to temporarily hold off submitted records, that were reserved later. Reservation/commit/consumer protocol is verified by litmus tests in Documentation/litmus-test/bpf-rb. One interesting implementation bit, that significantly simplifies (and thus speeds up as well) implementation of both producers and consumers is how data area is mapped twice contiguously back-to-back in the virtual memory. This allows to not take any special measures for samples that have to wrap around at the end of the circular buffer data area, because the next page after the last data page would be first data page again, and thus the sample will still appear completely contiguous in virtual memory. See comment and a simple ASCII diagram showing this visually in bpf_ringbuf_area_alloc(). Another feature that distinguishes BPF ringbuf from perf ring buffer is a self-pacing notifications of new data being availability. bpf_ringbuf_commit() implementation will send a notification of new record being available after commit only if consumer has already caught up right up to the record being committed. If not, consumer still has to catch up and thus will see new data anyways without needing an extra poll notification. Benchmarks (see tools/testing/selftests/bpf/benchs/bench_ringbuf.c) show that this allows to achieve a very high throughput without having to resort to tricks like "notify only every Nth sample", which are necessary with perf buffer. For extreme cases, when BPF program wants more manual control of notifications, commit/discard/output helpers accept BPF_RB_NO_WAKEUP and BPF_RB_FORCE_WAKEUP flags, which give full control over notifications of data availability, but require extra caution and diligence in using this API. Comparison to alternatives -------------------------- Before considering implementing BPF ring buffer from scratch existing alternatives in kernel were evaluated, but didn't seem to meet the needs. They largely fell into few categores: - per-CPU buffers (perf, ftrace, etc), which don't satisfy two motivations outlined above (ordering and memory consumption); - linked list-based implementations; while some were multi-producer designs, consuming these from user-space would be very complicated and most probably not performant; memory-mapping contiguous piece of memory is simpler and more performant for user-space consumers; - io_uring is SPSC, but also requires fixed-sized elements. Naively turning SPSC queue into MPSC w/ lock would have subpar performance compared to locked reserve + lockless commit, as with BPF ring buffer. Fixed sized elements would be too limiting for BPF programs, given existing BPF programs heavily rely on variable-sized perf buffer already; - specialized implementations (like a new printk ring buffer, [0]) with lots of printk-specific limitations and implications, that didn't seem to fit well for intended use with BPF programs. [0] https://lwn.net/Articles/779550/ Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200529075424.3139988-2-andriin@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2020-06-01 22:22:32 -07:00
Eelco Chaudron	ff2322b879	libbpf: Fix perf_buffer__free() API for sparse allocs In case the cpu_bufs are sparsely allocated they are not all free'ed. These changes will fix this. Fixes: fb84b8224655 ("libbpf: add perf buffer API") Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/159056888305.330763.9684536967379110349.stgit@ebuild Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2020-06-01 22:22:32 -07:00
John Fastabend	ab1b4f3844	bpf, sk_msg: Add get socket storage helpers Add helpers to use local socket storage. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/159033907577.12355.14740125020572756560.stgit@john-Precision-5820-Tower Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2020-06-01 22:22:32 -07:00
Eelco Chaudron	df9a526f99	libbpf: Add API to consume the perf ring buffer content This new API, perf_buffer__consume, can be used as follows: - When you have a perf ring where wakeup_events is higher than 1, and you have remaining data in the rings you would like to pull out on exit (or maybe based on a timeout). - For low latency cases where you burn a CPU that constantly polls the queues. Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/159048487929.89441.7465713173442594608.stgit@ebuild Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2020-06-01 22:22:32 -07:00
Andrii Nakryiko	3b23942542	ci: blacklist bpf_iter tests Disable a bunch of new kernel selftests that can't succeed on 5.5 kernel. Flatten Travis tests into a single stage to parallelize and speed them up. Signed-off-by: Andrii Nakryiko <andriin@fb.com>	2020-05-20 01:00:06 -07:00
Andrii Nakryiko	90941cde5f	sync: latest libbpf changes from kernel Syncing latest libbpf commits from kernel repository. Baseline bpf-next commit: c321022244708aec4675de4f032ef1ba9ff0c640 Checkpoint bpf-next commit: dda18a5c0b75461d1ed228f80b59c67434b8d601 Baseline bpf commit: 7f645462ca01d01abb94d75e6768c8b3ed3a188b Checkpoint bpf commit: f85c1598ddfe83f61d0656bd1d2025fa3b148b99 Alexei Starovoitov (1): tools/bpf: sync bpf.h Andrey Ignatov (2): bpf: Support narrow loads from bpf_sock_addr.user_port bpf: Introduce bpf_sk_{, ancestor_}cgroup_id helpers Daniel Borkmann (2): bpf: Add get{peer, sock}name attach types for sock_addr bpf, libbpf: Enable get{peer, sock}name attach types Eelco Chaudron (1): libbpf: Fix probe code to return EPERM if encountered Gustavo A. R. Silva (1): bpf, libbpf: Replace zero-length array with flexible-array Horatiu Vultur (1): net: bridge: Add port attribute IFLA_BRPORT_MRP_RING_OPEN Ian Rogers (2): libbpf, hashmap: Remove unused #include libbpf, hashmap: Fix signedness warnings Quentin Monnet (1): tools, bpf: Synchronise BPF UAPI header with tools Song Liu (2): bpf: Sharing bpf runtime stats with BPF_ENABLE_STATS libbpf: Add support for command BPF_ENABLE_STATS Stanislav Fomichev (2): bpf: Bpf_{g,s}etsockopt for struct bpf_sock_addr bpf: Allow any port in bpf_bind helper Sumanth Korikkar (1): libbpf: Fix register naming in PT_REGS s390 macros Yonghong Song (7): bpf: Allow loading of a bpf_iter program bpf: Support bpf tracing/iter programs for BPF_LINK_CREATE bpf: Create anonymous bpf iterator bpf: Add bpf_seq_printf and bpf_seq_write helpers tools/libbpf: Add bpf_iter support tools/libpf: Add offsetof/container_of macro in bpf_helpers.h bpf: Change btf_iter func proto prefix to "bpf_iter_" include/uapi/linux/bpf.h \| 208 +++++++++++++++++++++++++++-------- include/uapi/linux/if_link.h \| 1 + src/bpf.c \| 20 ++++ src/bpf.h \| 3 + src/bpf_helpers.h \| 14 +++ src/bpf_tracing.h \| 20 +++- src/hashmap.c \| 5 +- src/hashmap.h \| 1 - src/libbpf.c \| 98 +++++++++++++++-- src/libbpf.h \| 9 ++ src/libbpf.map \| 3 + src/libbpf_internal.h \| 2 +- 12 files changed, 322 insertions(+), 62 deletions(-) -- 2.24.1	2020-05-20 01:00:06 -07:00
Andrii Nakryiko	97a0d1e7b5	sync: auto-generate latest BPF helpers Latest changes to BPF helper definitions.	2020-05-20 01:00:06 -07:00
Alexei Starovoitov	d650751a9b	tools/bpf: sync bpf.h Sync tools/include/uapi/linux/bpf.h from include/uapi. Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2020-05-20 01:00:06 -07:00
Daniel Borkmann	dcb0c5ac44	bpf, libbpf: Enable get{peer, sock}name attach types Trivial patch to add the new get{peer,sock}name attach types to the section definitions in order to hook them up to sock_addr cgroup program type. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Andrey Ignatov <rdna@fb.com> Link: https://lore.kernel.org/bpf/7fcd4b1e41a8ebb364754a5975c75a7795051bd2.1589841594.git.daniel@iogearbox.net	2020-05-20 01:00:06 -07:00
Daniel Borkmann	2c892f1aa1	bpf: Add get{peer, sock}name attach types for sock_addr As stated in 983695fa6765 ("bpf: fix unconnected udp hooks"), the objective for the existing cgroup connect/sendmsg/recvmsg/bind BPF hooks is to be transparent to applications. In Cilium we make use of these hooks [0] in order to enable E-W load balancing for existing Kubernetes service types for all Cilium managed nodes in the cluster. Those backends can be local or remote. The main advantage of this approach is that it operates as close as possible to the socket, and therefore allows to avoid packet-based NAT given in connect/sendmsg/recvmsg hooks we only need to xlate sock addresses. This also allows to expose NodePort services on loopback addresses in the host namespace, for example. As another advantage, this also efficiently blocks bind requests for applications in the host namespace for exposed ports. However, one missing item is that we also need to perform reverse xlation for inet{,6}_getname() hooks such that we can return the service IP/port tuple back to the application instead of the remote peer address. The vast majority of applications does not bother about getpeername(), but in a few occasions we've seen breakage when validating the peer's address since it returns unexpectedly the backend tuple instead of the service one. Therefore, this trivial patch allows to customise and adds a getpeername() as well as getsockname() BPF cgroup hook for both IPv4 and IPv6 in order to address this situation. Simple example: # ./cilium/cilium service list ID Frontend Service Type Backend 1 1.2.3.4:80 ClusterIP 1 => 10.0.0.10:80 Before; curl's verbose output example, no getpeername() reverse xlation: # curl --verbose 1.2.3.4 * Rebuilt URL to: 1.2.3.4/ * Trying 1.2.3.4... * TCP_NODELAY set * Connected to 1.2.3.4 (10.0.0.10) port 80 (#0) > GET / HTTP/1.1 > Host: 1.2.3.4 > User-Agent: curl/7.58.0 > Accept: / [...] After; with getpeername() reverse xlation: # curl --verbose 1.2.3.4 * Rebuilt URL to: 1.2.3.4/ * Trying 1.2.3.4... * TCP_NODELAY set * Connected to 1.2.3.4 (1.2.3.4) port 80 (#0) > GET / HTTP/1.1 > Host: 1.2.3.4 > User-Agent: curl/7.58.0 > Accept: / [...] Originally, I had both under a BPF_CGROUP_INET{4,6}_GETNAME type and exposed peer to the context similar as in inet{,6}_getname() fashion, but API-wise this is suboptimal as it always enforces programs having to test for ctx->peer which can easily be missed, hence BPF_CGROUP_INET{4,6}_GET{PEER,SOCK}NAME split. Similarly, the checked return code is on tnum_range(1, 1), but if a use case comes up in future, it can easily be changed to return an error code instead. Helper and ctx member access is the same as with connect/sendmsg/etc hooks. [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Andrey Ignatov <rdna@fb.com> Link: https://lore.kernel.org/bpf/61a479d759b2482ae3efb45546490bacd796a220.1589841594.git.daniel@iogearbox.net	2020-05-20 01:00:06 -07:00
Ian Rogers	46407182c7	libbpf, hashmap: Fix signedness warnings Fixes the following warnings: hashmap.c: In function ‘hashmap__clear’: hashmap.h:150:20: error: comparison of integer expressions of different signedness: ‘int’ and ‘size_t’ {aka ‘long unsigned int’} [-Werror=sign-compare] 150 \| for (bkt = 0; bkt < map->cap; bkt++) \ hashmap.c: In function ‘hashmap_grow’: hashmap.h:150:20: error: comparison of integer expressions of different signedness: ‘int’ and ‘size_t’ {aka ‘long unsigned int’} [-Werror=sign-compare] 150 \| for (bkt = 0; bkt < map->cap; bkt++) \ Signed-off-by: Ian Rogers <irogers@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200515165007.217120-4-irogers@google.com	2020-05-20 01:00:06 -07:00
Ian Rogers	a00d463bb9	libbpf, hashmap: Remove unused #include Remove #include of libbpf_internal.h that is unused. Discussed in this thread: https://lore.kernel.org/lkml/CAEf4BzZRmiEds_8R8g4vaAeWvJzPb4xYLnpF0X2VNY8oTzkphQ@mail.gmail.com/ Signed-off-by: Ian Rogers <irogers@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200515165007.217120-3-irogers@google.com	2020-05-20 01:00:06 -07:00

1 2 3 4 5 ...

598 Commits