We do an ugly copying of options in bpf_object__open_skeleton() just to
be able to set object name from skeleton's recorded name (while still
allowing user to override it through opts->object_name).
This is not just ugly, but it also is broken due to memcpy() that
doesn't take into account potential skel_opts' and user-provided opts'
sizes differences due to backward and forward compatibility. This leads
to copying over extra bytes and then failing to validate options
properly. It could, technically, lead also to SIGSEGV, if we are unlucky.
So just get rid of that memory copy completely and instead pass
default object name into bpf_object_open() directly, simplifying all
this significantly. The rule now is that obj_name should be non-NULL for
bpf_object_open() when called with in-memory buffer, so validate that
explicitly as well.
We adopt bpf_object__open_mem() to this as well and generate default
name (based on buffer memory address and size) outside of bpf_object_open().
Fixes: d66562fba1ce ("libbpf: Add BPF object skeleton support")
Reported-by: Daniel Müller <deso@posteo.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Daniel Müller <deso@posteo.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240827203721.1145494-1-andrii@kernel.org
This adds a kfunc wrapper around strncpy_from_user,
which can be called from sleepable BPF programs.
This matches the non-sleepable 'bpf_probe_read_user_str'
helper except it includes an additional 'flags'
param, which allows consumers to clear the entire
destination buffer on success or failure.
Signed-off-by: Jordan Rome <linux@jordanrome.com>
Link: https://lore.kernel.org/r/20240823195101.3621028-1-linux@jordanrome.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In `elf_close`, we get this with GCC 15 -O3 (at least):
```
In function ‘elf_close’,
inlined from ‘elf_close’ at elf.c:53:6,
inlined from ‘elf_find_func_offset_from_file’ at elf.c:384:2:
elf.c:57:9: warning: ‘elf_fd.elf’ may be used uninitialized [-Wmaybe-uninitialized]
57 | elf_end(elf_fd->elf);
| ^~~~~~~~~~~~~~~~~~~~
elf.c: In function ‘elf_find_func_offset_from_file’:
elf.c:377:23: note: ‘elf_fd.elf’ was declared here
377 | struct elf_fd elf_fd;
| ^~~~~~
In function ‘elf_close’,
inlined from ‘elf_close’ at elf.c:53:6,
inlined from ‘elf_find_func_offset_from_file’ at elf.c:384:2:
elf.c:58:9: warning: ‘elf_fd.fd’ may be used uninitialized [-Wmaybe-uninitialized]
58 | close(elf_fd->fd);
| ^~~~~~~~~~~~~~~~~
elf.c: In function ‘elf_find_func_offset_from_file’:
elf.c:377:23: note: ‘elf_fd.fd’ was declared here
377 | struct elf_fd elf_fd;
| ^~~~~~
```
In reality, our use is fine, it's just that GCC doesn't model errno
here (see linked GCC bug). Suppress -Wmaybe-uninitialized accordingly
by initializing elf_fd.fd to -1 and elf_fd.elf to NULL.
I've done this in two other functions as well given it could easily
occur there too (same access/use pattern).
Signed-off-by: Sam James <sam@gentoo.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://gcc.gnu.org/PR114952
Link: https://lore.kernel.org/bpf/14ec488a1cac02794c2fa2b83ae0cef1bce2cb36.1723578546.git.sam@gentoo.org
In struct bpf_struct_ops, we have take a pointer to a BTF type name, and
a struct btf_type. This was presumably done for convenience, but can
actually result in subtle and confusing bugs given that BTF data can be
invalidated before a program is loaded. For example, in sched_ext, we
may sometimes resize a data section after a skeleton has been opened,
but before the struct_ops scheduler map has been loaded. This may cause
the BTF data to be realloc'd, which can then cause a UAF when loading
the program because the struct_ops map has pointers directly into the
BTF data.
We're already storing the BTF type_id in struct bpf_struct_ops. Because
type_id is stable, we can therefore just update the places where we were
looking at those pointers to instead do the lookups we need from the
type_id.
Fixes: 590a00888250 ("bpf: libbpf: Add STRUCT_OPS support")
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240724171459.281234-1-void@manifault.com
For all these years libbpf's BTF dumper has been emitting not strictly
valid syntax for function prototypes that have no input arguments.
Instead of `int (*blah)()` we should emit `int (*blah)(void)`.
This is not normally a problem, but it manifests when we get kfuncs in
vmlinux.h that have no input arguments. Due to compiler internal
specifics, we get no BTF information for such kfuncs, if they are not
declared with proper `(void)`.
The fix is trivial. We also need to adjust a few ancient tests that
happily assumed `()` is correct.
Fixes: 351131b51c7a ("libbpf: add btf_dump API for BTF-to-C conversion")
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://lore.kernel.org/bpf/20240712224442.282823-1-andrii@kernel.org
A new PEBS data source format is introduced for the p-core of Lunar
Lake. The data source field is extended to 8 bits with new encodings.
A new layout is introduced into the union intel_x86_pebs_dse.
Introduce the lnl_latency_data() to parse the new format.
Enlarge the pebs_data_source[] accordingly to include new encodings.
Only the mem load and the mem store events can generate the data source.
Introduce INTEL_HYBRID_LDLAT_CONSTRAINT and
INTEL_HYBRID_STLAT_CONSTRAINT to mark them.
Add two new bits for the new cache-related data src, L2_MHB and MSC.
The L2_MHB is short for L2 Miss Handling Buffer, which is similar to
LFB (Line Fill Buffer), but to track the L2 Cache misses.
The MSC stands for the memory-side cache.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://lkml.kernel.org/r/20240626143545.480761-6-kan.liang@linux.intel.com
As this will change to a Ubuntu 24.04 runner, we want this to automatically detect
which ubuntu version it is running on.
Signed-off-by: Manu Bretelle <chantr4@gmail.com>
I am working on upgrading to 24.04 runners. In order to make sure that current jobs are scheduled
on Ubuntu 20.04, we need to ask for runners with tag `docker-main`, which is currently
set by those old runners.
Later, we will be able to switch this tag to `docker-noble-main` which are Ubuntu 24.04 runners.
Signed-off-by: Manu Bretelle <chantr4@gmail.com>
Update .mailmap based on libbpf's list of contributors and on the latest
.mailmap version in the upstream repository.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Improve how we handle old BPF skeletons when it comes to BPF map
auto-attachment. Emit one warn-level message per each struct_ops map
that could have been auto-attached, if user provided recent enough BPF
skeleton version. Don't spam log if there are no relevant struct_ops
maps, though.
This should help users realize that they probably need to regenerate BPF
skeleton header with more recent bpftool/libbpf-cargo (or whatever other
means of BPF skeleton generation).
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240708204540.4188946-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
BPF skeleton was designed from day one to be extensible. Generated BPF
skeleton code specifies actual sizes of map/prog/variable skeletons for
that reason and libbpf is supposed to work with newer/older versions
correctly.
Unfortunately, it was missed that we implicitly embed hard-coded most
up-to-date (according to libbpf's version of libbpf.h header used to
compile BPF skeleton header) sizes of those structs, which can differ
from the actual sizes at runtime when libbpf is used as a shared
library.
We have a few places were we just index array of maps/progs/vars, which
implicitly uses these potentially invalid sizes of structs.
This patch aims to fix this problem going forward. Once this lands,
we'll backport these changes in Github repo to create patched releases
for older libbpfs.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Fixes: d66562fba1ce ("libbpf: Add BPF object skeleton support")
Fixes: 430025e5dca5 ("libbpf: Add subskeleton scaffolding")
Fixes: 08ac454e258e ("libbpf: Auto-attach struct_ops BPF maps in BPF skeleton")
Co-developed-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240708204540.4188946-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In the current state, an erroneous call to
bpf_object__find_map_by_name(NULL, ...) leads to a segmentation
fault through the following call chain:
bpf_object__find_map_by_name(obj = NULL, ...)
-> bpf_object__for_each_map(pos, obj = NULL)
-> bpf_object__next_map((obj = NULL), NULL)
-> return (obj = NULL)->maps
While calling bpf_object__find_map_by_name with obj = NULL is
obviously incorrect, this should not lead to a segmentation
fault but rather be handled gracefully.
As __bpf_map__iter already handles this situation correctly, we
can delegate the check for the regular case there and only add
a check in case the prev or next parameter is NULL.
Signed-off-by: Andreas Ziegler <ziegler.andreas@siemens.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240703083436.505124-1-ziegler.andreas@siemens.com
Update .mailmap based on libbpf's list of contributors and on the latest
.mailmap version in the upstream repository.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
When building with clang for ARCH=i386, the following errors are
observed:
CC kernel/bpf/btf_relocate.o
./tools/lib/bpf/btf_relocate.c:206:23: error: implicit truncation from 'int' to a one-bit wide bit-field changes value from 1 to -1 [-Werror,-Wsingle-bit-bitfield-constant-conversion]
206 | info[id].needs_size = true;
| ^ ~
./tools/lib/bpf/btf_relocate.c:256:25: error: implicit truncation from 'int' to a one-bit wide bit-field changes value from 1 to -1 [-Werror,-Wsingle-bit-bitfield-constant-conversion]
256 | base_info.needs_size = true;
| ^ ~
2 errors generated.
The problem is we use 1-bit, 31-bit bitfields in a signed int.
Changing to
bool needs_size: 1;
unsigned int size:31;
...resolves the error and pahole reports that 4 bytes are used
for the underlying representation:
$ pahole btf_name_info tools/lib/bpf/btf_relocate.o
struct btf_name_info {
const char * name; /* 0 8 */
unsigned int needs_size:1; /* 8: 0 4 */
unsigned int size:31; /* 8: 1 4 */
__u32 id; /* 12 4 */
/* size: 16, cachelines: 1, members: 4 */
/* last cacheline: 16 bytes */
};
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240624192903.854261-1-alan.maguire@oracle.com
When upgrading to libbpf 1.3 we noticed a big performance hit while
loading programs using CORE on non base-BTF symbols. This was tracked
down to the new BTF sanity check logic. The issue is the base BTF
definitions are checked first for the base BTF and then again for every
module BTF.
Loading 5 dummy programs (using libbpf-rs) that are using CORE on a
non-base BTF symbol on my system:
- Before this fix: 3s.
- With this fix: 0.1s.
Fix this by only checking the types starting at the BTF start id. This
should ensure the base BTF is still checked as expected but only once
(btf->start_id == 1 when creating the base BTF), and then only
additional types are checked for each module BTF.
Fixes: 3903802bb99a ("libbpf: Add basic BTF sanity validation")
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Link: https://lore.kernel.org/bpf/20240624090908.171231-1-atenart@kernel.org
Share relocation implementation with the kernel. As part of this,
we also need the type/string iteration functions so also share
btf_iter.c file. Relocation code in kernel and userspace is identical
save for the impementation of the reparenting of split BTF to the
relocated base BTF and retrieval of the BTF header from "struct btf";
these small functions need separate user-space and kernel implementations
for the separate "struct btf"s they operate upon.
One other wrinkle on the kernel side is we have to map .BTF.ids in
modules as they were generated with the type ids used at BTF encoding
time. btf_relocate() optionally returns an array mapping from old BTF
ids to relocated ids, so we use that to fix up these references where
needed for kfuncs.
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240620091733.1967885-5-alan.maguire@oracle.com
I encountered an issue when building the test_progs from the repository [1]:
$ pwd
/work/Qemu/x86_64/linux-6.10-rc2/tools/testing/selftests/bpf/
$ make test_progs V=1
[...]
./tools/sbin/bpftool gen object ./ip_check_defrag.bpf.linked2.o ./ip_check_defrag.bpf.linked1.o
libbpf: failed to find symbol for variable 'bpf_dynptr_slice' in section '.ksyms'
Error: failed to link './ip_check_defrag.bpf.linked1.o': No such file or directory (2)
[...]
Upon investigation, I discovered that the btf_types referenced in the '.ksyms'
section had a kind of BTF_KIND_FUNC instead of BTF_KIND_VAR:
$ bpftool btf dump file ./ip_check_defrag.bpf.linked1.o
[...]
[2] DATASEC '.ksyms' size=0 vlen=2
type_id=16 offset=0 size=0 (FUNC 'bpf_dynptr_from_skb')
type_id=17 offset=0 size=0 (FUNC 'bpf_dynptr_slice')
[...]
[16] FUNC 'bpf_dynptr_from_skb' type_id=82 linkage=extern
[17] FUNC 'bpf_dynptr_slice' type_id=85 linkage=extern
[...]
For a detailed analysis, please refer to [2]. We can add a kind checking to
fix the issue.
[1] https://github.com/eddyz87/bpf/tree/binsort-btf-dedup
[2] https://lore.kernel.org/all/0c0ef20c-c05e-4db9-bad7-2cbc0d6dfae7@oracle.com/
Fixes: 8fd27bf69b86 ("libbpf: Add BPF static linker BTF and BTF.ext support")
Signed-off-by: Donglin Peng <dolinux.peng@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240619122355.426405-1-dolinux.peng@gmail.com
Update btf_parse_elf() to check if .BTF.base section is present.
The logic is as follows:
if .BTF.base section exists:
distilled_base := btf_new(.BTF.base)
if distilled_base:
btf := btf_new(.BTF, .base_btf=distilled_base)
if base_btf:
btf_relocate(btf, base_btf)
else:
btf := btf_new(.BTF)
return btf
In other words:
- if .BTF.base section exists, load BTF from it and use it as a base
for .BTF load;
- if base_btf is specified and .BTF.base section exist, relocate newly
loaded .BTF against base_btf.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240613095014.357981-6-alan.maguire@oracle.com
Map distilled base BTF type ids referenced in split BTF and their
references to the base BTF passed in, and if the mapping succeeds,
reparent the split BTF to the base BTF.
Relocation is done by first verifying that distilled base BTF
only consists of named INT, FLOAT, ENUM, FWD, STRUCT and
UNION kinds; then we sort these to speed lookups. Once sorted,
the base BTF is iterated, and for each relevant kind we check
for an equivalent in distilled base BTF. When found, the
mapping from distilled -> base BTF id and string offset is recorded.
In establishing mappings, we need to ensure we check STRUCT/UNION
size when the STRUCT/UNION is embedded in a split BTF STRUCT/UNION,
and when duplicate names exist for the same STRUCT/UNION. Otherwise
size is ignored in matching STRUCT/UNIONs.
Once all mappings are established, we can update type ids
and string offsets in split BTF and reparent it to the new base.
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240613095014.357981-4-alan.maguire@oracle.com
To support more robust split BTF, adding supplemental context for the
base BTF type ids that split BTF refers to is required. Without such
references, a simple shuffling of base BTF type ids (without any other
significant change) invalidates the split BTF. Here the attempt is made
to store additional context to make split BTF more robust.
This context comes in the form of distilled base BTF providing minimal
information (name and - in some cases - size) for base INTs, FLOATs,
STRUCTs, UNIONs, ENUMs and ENUM64s along with modified split BTF that
points at that base and contains any additional types needed (such as
TYPEDEF, PTR and anonymous STRUCT/UNION declarations). This
information constitutes the minimal BTF representation needed to
disambiguate or remove split BTF references to base BTF. The rules
are as follows:
- INT, FLOAT, FWD are recorded in full.
- if a named base BTF STRUCT or UNION is referred to from split BTF, it
will be encoded as a zero-member sized STRUCT/UNION (preserving
size for later relocation checks). Only base BTF STRUCT/UNIONs
that are either embedded in split BTF STRUCT/UNIONs or that have
multiple STRUCT/UNION instances of the same name will _need_ size
checks at relocation time, but as it is possible a different set of
types will be duplicates in the later to-be-resolved base BTF,
we preserve size information for all named STRUCT/UNIONs.
- if an ENUM[64] is named, a ENUM forward representation (an ENUM
with no values) of the same size is used.
- in all other cases, the type is added to the new split BTF.
Avoiding struct/union/enum/enum64 expansion is important to keep the
distilled base BTF representation to a minimum size.
When successful, new representations of the distilled base BTF and new
split BTF that refers to it are returned. Both need to be freed by the
caller.
So to take a simple example, with split BTF with a type referring
to "struct sk_buff", we will generate distilled base BTF with a
0-member STRUCT sk_buff of the appropriate size, and the split BTF
will refer to it instead.
Tools like pahole can utilize such split BTF to populate the .BTF
section (split BTF) and an additional .BTF.base section. Then
when the split BTF is loaded, the distilled base BTF can be used
to relocate split BTF to reference the current (and possibly changed)
base BTF.
So for example if "struct sk_buff" was id 502 when the split BTF was
originally generated, we can use the distilled base BTF to see that
id 502 refers to a "struct sk_buff" and replace instances of id 502
with the current (relocated) base BTF sk_buff type id.
Distilled base BTF is small; when building a kernel with all modules
using distilled base BTF as a test, overall module size grew by only
5.3Mb total across ~2700 modules.
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240613095014.357981-2-alan.maguire@oracle.com
Similarly to `bpf_program`, support `bpf_map` automatic attachment in
`bpf_object__attach_skeleton`. Currently only struct_ops maps could be
attached.
On bpftool side, code-generate links in skeleton struct for struct_ops maps.
Similarly to `bpf_program_skeleton`, set links in `bpf_map_skeleton`.
On libbpf side, extend `bpf_map` with new `autoattach` field to support
enabling or disabling autoattach functionality, introducing
getter/setter for this field.
`bpf_object__(attach|detach)_skeleton` is extended with
attaching/detaching struct_ops maps logic.
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240605175135.117127-1-yatsenko@meta.com
Implement iterator-based type ID and string offset BTF field iterator.
This is used extensively in BTF-handling code and BPF linker code for
various sanity checks, rewriting IDs/offsets, etc. Currently this is
implemented as visitor pattern calling custom callbacks, which makes the
logic (especially in simple cases) unnecessarily obscure and harder to
follow.
Having equivalent functionality using iterator pattern makes for simpler
to understand and maintain code. As we add more code for BTF processing
logic in libbpf, it's best to switch to iterator pattern before adding
more callback-based code.
The idea for iterator-based implementation is to record offsets of
necessary fields within fixed btf_type parts (which should be iterated
just once), and, for kinds that have multiple members (based on vlen
field), record where in each member necessary fields are located.
Generic iteration code then just keeps track of last offset that was
returned and handles N members correctly. Return type is just u32
pointer, where NULL is returned when all relevant fields were already
iterated.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Alan Maguire <alan.maguire@oracle.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20240605001629.4061937-2-andrii@kernel.org
pahole staging workflow is using the same old VM image as BPF selftests
stages. It doesn't have recent enough glibc, so we can't yet switch to
newer Ubuntu, unfortunately.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Recent commit 0cfe71f45f42 ("netdev: add queue stats") added
a lot of useful stats, but only those immediately needed by virtio.
Presumably virtio does not support CHECKSUM_COMPLETE,
so statistic for that form of checksumming wasn't included.
Other drivers will definitely need it, in fact we expect it
to be needed in net-next soon (mlx5). So let's add the definition
of the counter for CHECKSUM_COMPLETE to uAPI in net already,
so that the counters are in a more natural order (all subsequent
counters have not been present in any released kernel, yet).
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Fixes: 0cfe71f45f42 ("netdev: add queue stats")
Link: https://lore.kernel.org/r/20240529163547.3693194-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Make sure to preserve and/or enforce FD_CLOEXEC flag on duped FDs.
Use dup3() with O_CLOEXEC flag for that.
Without this fix libbpf effectively clears FD_CLOEXEC flag on each of BPF
map/prog FD, which is definitely not the right or expected behavior.
Reported-by: Lennart Poettering <lennart@poettering.net>
Fixes: bc308d011ab8 ("libbpf: call dup2() syscall directly")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20240529223239.504241-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Track ubuntu-latest where relevant and possible.
We can't update to ubuntu-latest when building and running BPF
selftests, though, because our QEMU image has too old of an GLIBC.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Update .mailmap based on libbpf's list of contributors and on the latest
.mailmap version in the upstream repository.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Libbpf is automatically (and transparently to user) detecting
multi-uprobe support in the kernel, and, if supported, uses
multi-uprobes to improve USDT attachment speed.
USDTs can be attached system-wide or for the specific process by PID. In
the latter case, we rely on correct kernel logic of not triggering USDT
for unrelated processes.
As such, on older kernels that do support multi-uprobes, but still have
broken PID filtering logic, we need to fall back to singular uprobes.
Unfortunately, whether user is using PID filtering or not is known at
the attachment time, which happens after relevant BPF programs were
loaded into the kernel. Also unfortunately, we need to make a call
whether to use multi-uprobes or singular uprobe for SEC("usdt") programs
during BPF object load time, at which point we have no information about
possible PID filtering.
The distinction between single and multi-uprobes is small, but important
for the kernel. Multi-uprobes get BPF_TRACE_UPROBE_MULTI attach type,
and kernel internally substitiute different implementation of some of
BPF helpers (e.g., bpf_get_attach_cookie()) depending on whether uprobe
is multi or singular. So, multi-uprobes and singular uprobes cannot be
intermixed.
All the above implies that we have to make an early and conservative
call about the use of multi-uprobes. And so this patch modifies libbpf's
existing feature detector for multi-uprobe support to also check correct
PID filtering. If PID filtering is not yet fixed, we fall back to
singular uprobes for USDTs.
This extension to feature detection is simple thanks to kernel's -EINVAL
addition for pid < 0.
Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240521163401.3005045-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
tstamp_type is now set based on actual clockid_t compressed
into 2 bits.
To make the design scalable for future needs this commit bring in
the change to extend the tstamp_type:1 to tstamp_type:2 to support
other clockid_t timestamp.
We now support CLOCK_TAI as part of tstamp_type as part of this
commit with existing support CLOCK_MONOTONIC and CLOCK_REALTIME.
Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240509211834.3235191-3-quic_abchauha@quicinc.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
These were used to build perf to provide defines not available in older
distros, but this was back in 2017, nowadays all the distros that are
supported and I have build containers for work using just the system
headers, so ditch them.
Some of these older distros may not have things that are used in 'perf
trace', but then they also don't have libtraceevent packages, so don't
build 'perf trace'.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/lkml/20240315204835.748716-5-acme@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Adjust `union bpf_attr` size passed to kernel in two feature-detecting
functions to take into account prog_token_fd field.
Libbpf is avoiding memset()'ing entire `union bpf_attr` by only using
minimal set of bpf_attr's fields. Two places have been missed when
wiring BPF token support in libbpf's feature detection logic.
Fix them trivially.
Fixes: f3dcee938f48 ("libbpf: Wire up token_fd into feature probing logic")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240513180804.403775-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
[Changes from V1:
- Use a default branch in the switch statement to initialize `val'.]
GCC warns that `val' may be used uninitialized in the
BPF_CRE_READ_BITFIELD macro, defined in bpf_core_read.h as:
[...]
unsigned long long val; \
[...] \
switch (__CORE_RELO(s, field, BYTE_SIZE)) { \
case 1: val = *(const unsigned char *)p; break; \
case 2: val = *(const unsigned short *)p; break; \
case 4: val = *(const unsigned int *)p; break; \
case 8: val = *(const unsigned long long *)p; break; \
} \
[...]
val; \
} \
This patch adds a default entry in the switch statement that sets
`val' to zero in order to avoid the warning, and random values to be
used in case __builtin_preserve_field_info returns unexpected values
for BPF_FIELD_BYTE_SIZE.
Tested in bpf-next master.
No regressions.
Signed-off-by: Jose E. Marchesi <jose.marchesi@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240508101313.16662-1-jose.marchesi@oracle.com
Extend libbpf's pre-load checks for BPF programs, detecting more typical
conditions that are destinated to cause BPF program failure. This is an
opportunity to provide more helpful and actionable error message to
users, instead of potentially very confusing BPF verifier log and/or
error.
In this case, we detect struct_ops BPF program that was not referenced
anywhere, but still attempted to be loaded (according to libbpf logic).
Suggest that the program might need to be used in some struct_ops
variable. User will get a message of the following kind:
libbpf: prog 'test_1_forgotten': SEC("struct_ops") program isn't referenced anywhere, did you forget to use it?
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240507001335.1445325-6-andrii@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
strerror_r(), used from libbpf-specific libbpf_strerror_r() wrapper is
documented to return error in two different ways, depending on glibc
version. Take that into account when handling strerror_r()'s own errors,
which happens when we pass some non-standard (internal) kernel error to
it. Before this patch we'd have "ERROR: strerror_r(524)=22", which is
quite confusing. Now for the same situation we'll see a bit less
visually scary "unknown error (-524)".
At least we won't confuse user with irrelevant EINVAL (22).
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240507001335.1445325-5-andrii@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
There is yet another corner case where user can set STRUCT_OPS program
reference in STRUCT_OPS map to NULL, but libbpf will fail to disable
autoload for such BPF program. This time it's the case of "new" kernel
which has type information about callback field, but user explicitly
nulled-out program reference from user-space after opening BPF object.
Fix, hopefully, the last remaining unhandled case.
Fixes: 0737df6de946 ("libbpf: better fix for handling nulled-out struct_ops program")
Fixes: f973fccd43d3 ("libbpf: handle nulled-out program in struct_ops correctly")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240507001335.1445325-3-andrii@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
libbpf ensures that BPF program references set in map->st_ops->progs[i]
during open phase are always valid STRUCT_OPS programs. This is done in
bpf_object__collect_st_ops_relos(). So there is no need to double-check
that in bpf_map__init_kern_struct_ops().
Simplify the code by removing unnecessary check. Also, we avoid using
local prog variable to keep code similar to the upcoming fix, which adds
similar logic in another part of bpf_map__init_kern_struct_ops().
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240507001335.1445325-2-andrii@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
[Differences from V1:
- Do not introduce a global typedef, as this is a public header.
- Keep the void* casts in BPF_KPROBE_READ_RET_IP and
BPF_KRETPROBE_READ_RET_IP, as these are necessary
for converting to a const void* argument of
bpf_probe_read_kernel.]
The BPF_PROG, BPF_KPROBE and BPF_KSYSCALL macros defined in
tools/lib/bpf/bpf_tracing.h use a clever hack in order to provide a
convenient way to define entry points for BPF programs as if they were
normal C functions that get typed actual arguments, instead of as
elements in a single "context" array argument.
For example, PPF_PROGS allows writing:
SEC("struct_ops/cwnd_event")
void BPF_PROG(cwnd_event, struct sock *sk, enum tcp_ca_event event)
{
bbr_cwnd_event(sk, event);
dctcp_cwnd_event(sk, event);
cubictcp_cwnd_event(sk, event);
}
That expands into a pair of functions:
void ____cwnd_event (unsigned long long *ctx, struct sock *sk, enum tcp_ca_event event)
{
bbr_cwnd_event(sk, event);
dctcp_cwnd_event(sk, event);
cubictcp_cwnd_event(sk, event);
}
void cwnd_event (unsigned long long *ctx)
{
_Pragma("GCC diagnostic push")
_Pragma("GCC diagnostic ignored \"-Wint-conversion\"")
return ____cwnd_event(ctx, (void*)ctx[0], (void*)ctx[1]);
_Pragma("GCC diagnostic pop")
}
Note how the 64-bit unsigned integers in the incoming CTX get casted
to a void pointer, and then implicitly converted to whatever type of
the actual argument in the wrapped function. In this case:
Arg1: unsigned long long -> void * -> struct sock *
Arg2: unsigned long long -> void * -> enum tcp_ca_event
The behavior of GCC and clang when facing such conversions differ:
pointer -> pointer
Allowed by the C standard.
GCC: no warning nor error.
clang: no warning nor error.
pointer -> integer type
[C standard says the result of this conversion is implementation
defined, and it may lead to unaligned pointer etc.]
GCC: error: integer from pointer without a cast [-Wint-conversion]
clang: error: incompatible pointer to integer conversion [-Wint-conversion]
pointer -> enumerated type
GCC: error: incompatible types in assigment (*)
clang: error: incompatible pointer to integer conversion [-Wint-conversion]
These macros work because converting pointers to pointers is allowed,
and converting pointers to integers also works provided a suitable
integer type even if it is implementation defined, much like casting a
pointer to uintptr_t is guaranteed to work by the C standard. The
conversion errors emitted by both compilers by default are silenced by
the pragmas.
However, the GCC error marked with (*) above when assigning a pointer
to an enumerated value is not associated with the -Wint-conversion
warning, and it is not possible to turn it off.
This is preventing building the BPF kernel selftests with GCC.
This patch fixes this by avoiding intermediate casts to void*,
replaced with casts to `unsigned long long', which is an integer type
capable of safely store a BPF pointer, much like the standard
uintptr_t.
Testing performed in bpf-next master:
- vmtest.sh -- ./test_verifier
- vmtest.sh -- ./test_progs
- make M=samples/bpf
No regressions.
Signed-off-by: Jose E. Marchesi <jose.marchesi@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240502170925.3194-1-jose.marchesi@oracle.com
The macro bpf_ksym_exists is defined in bpf_helpers.h as:
#define bpf_ksym_exists(sym) ({ \
_Static_assert(!__builtin_constant_p(!!sym), #sym " should be marked as __weak"); \
!!sym; \
})
The purpose of the macro is to determine whether a given symbol has
been defined, given the address of the object associated with the
symbol. It also has a compile-time check to make sure the object
whose address is passed to the macro has been declared as weak, which
makes the check on `sym' meaningful.
As it happens, the check for weak doesn't work in GCC in all cases,
because __builtin_constant_p not always folds at parse time when
optimizing. This is because optimizations that happen later in the
compilation process, like inlining, may make a previously non-constant
expression a constant. This results in errors like the following when
building the selftests with GCC:
bpf_helpers.h:190:24: error: expression in static assertion is not constant
190 | _Static_assert(!__builtin_constant_p(!!sym), #sym " should be marked as __weak"); \
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
Fortunately recent versions of GCC support a __builtin_has_attribute
that can be used to directly check for the __weak__ attribute. This
patch changes bpf_helpers.h to use that builtin when building with a
recent enough GCC, and to omit the check if GCC is too old to support
the builtin.
The macro used for GCC becomes:
#define bpf_ksym_exists(sym) ({ \
_Static_assert(__builtin_has_attribute (*sym, __weak__), #sym " should be marked as __weak"); \
!!sym; \
})
Note that since bpf_ksym_exists is designed to get the address of the
object associated with symbol SYM, we pass *sym to
__builtin_has_attribute instead of sym. When an expression is passed
to __builtin_has_attribute then it is the type of the passed
expression that is checked for the specified attribute. The
expression itself is not evaluated. This accommodates well with the
existing usages of the macro:
- For function objects:
struct task_struct *bpf_task_acquire(struct task_struct *p) __ksym __weak;
[...]
bpf_ksym_exists(bpf_task_acquire)
- For variable objects:
extern const struct rq runqueues __ksym __weak; /* typed */
[...]
bpf_ksym_exists(&runqueues)
Note also that BPF support was added in GCC 10 and support for
__builtin_has_attribute in GCC 9.
Locally tested in bpf-next master branch.
No regressions.
Signed-of-by: Jose E. Marchesi <jose.marchesi@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20240428112559.10518-1-jose.marchesi@oracle.com
ringbuf_process_ring() return int64_t, while ring__consume_n() assigns
it to int. It's highly unlikely, but possible for ringbuf_process_ring()
to return value larger than INT_MAX, so use int64_t. ring__consume_n()
does check INT_MAX before returning int result to the user.
Fixes: 4d22ea94ea33 ("libbpf: Add ring__consume_n / ring_buffer__consume_n")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20240430201952.888293-1-andrii@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
In commit 4794f18bf4 ("sync: Sync .mailmap entries"), we updated the
sync-up script to automatically update libbpf's .mailmap; however, the
script would not take care of committing the changes. Let's address
this.
The code is copied and adapted from the part where we commit changes to
src/bpf_helper_defs.h.
Signed-off-by: Quentin Monnet <qmo@kernel.org>
Previous attempt to fix the handling of nulled-out (from skeleton)
struct_ops program is working well only if struct_ops program is defined
as non-autoloaded by default (i.e., has SEC("?struct_ops") annotation,
with question mark).
Unfortunately, that fix is incomplete due to how
bpf_object_adjust_struct_ops_autoload() is marking referenced or
non-referenced struct_ops program as autoloaded (or not). Because
bpf_object_adjust_struct_ops_autoload() is run after
bpf_map__init_kern_struct_ops() step, which sets program slot to NULL,
such programs won't be considered "referenced", and so its autoload
property won't be changed.
This all sounds convoluted and it is, but the desire is to have as
natural behavior (as far as struct_ops usage is concerned) as possible.
This fix is redoing the original fix but makes it work for
autoloaded-by-default struct_ops programs as well. We achieve this by
forcing prog->autoload to false if prog was declaratively set for some
struct_ops map, but then nulled-out from skeleton (programmatically).
This achieves desired effect of not autoloading it. If such program is
still referenced somewhere else (different struct_ops map or different
callback field), it will get its autoload property adjusted by
bpf_object_adjust_struct_ops_autoload() later.
We also fix selftest, which accidentally used SEC("?struct_ops")
annotation. It was meant to use autoload-by-default program from the
very beginning.
Fixes: f973fccd43d3 ("libbpf: handle nulled-out program in struct_ops correctly")
Cc: Kui-Feng Lee <thinker.li@gmail.com>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240501041706.3712608-1-andrii@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
In some situations, it is useful to explicitly specify a kernel module
to search for a tracing program target (e.g. when a function of the same
name exists in multiple modules or in vmlinux).
This patch enables that by allowing the "module:function" syntax for the
find_kernel_btf_id function. Thanks to this, the syntax can be used both
from a SEC macro (i.e. `SEC(fentry/module:function)`) and via the
bpf_program__set_attach_target API call.
Signed-off-by: Viktor Malik <vmalik@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/9085a8cb9a552de98e554deb22ff7e977d025440.1714469650.git.vmalik@redhat.com
Adding support to attach program in kprobe session mode
with bpf_program__attach_kprobe_multi_opts function.
Adding session bool to bpf_kprobe_multi_opts struct that allows
to load and attach the bpf program via kprobe session.
the attachment to create kprobe multi session.
Also adding new program loader section that allows:
SEC("kprobe.session/bpf_fentry_test*")
and loads/attaches kprobe program as kprobe session.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240430112830.1184228-5-jolsa@kernel.org
Adding support to attach bpf program for entry and return probe
of the same function. This is common use case which at the moment
requires to create two kprobe multi links.
Adding new BPF_TRACE_KPROBE_SESSION attach type that instructs
kernel to attach single link program to both entry and exit probe.
It's possible to control execution of the bpf program on return
probe simply by returning zero or non zero from the entry bpf
program execution to execute or not the bpf program on return
probe respectively.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240430112830.1184228-2-jolsa@kernel.org
If struct_ops has one of program callbacks set declaratively and host
kernel is old and doesn't support this callback, libbpf will allow to
load such struct_ops as long as that callback was explicitly nulled-out
(presumably through skeleton). This is all working correctly, except we
won't reset corresponding program slot to NULL before bailing out, which
will lead to libbpf not detecting that BPF program has to be not
auto-loaded. Fix this by unconditionally resetting corresponding program
slot to NULL.
Fixes: c911fc61a7ce ("libbpf: Skip zeroed or null fields if not found in the kernel type.")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240428030954.3918764-1-andrii@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
The definition of bpf_tail_call_static in tools/lib/bpf/bpf_helpers.h
is guarded by a preprocessor check to assure that clang is recent
enough to support it. This patch updates the guard so the function is
compiled when using GCC 13 or later as well.
Tested in bpf-next master. No regressions.
Signed-off-by: Jose E. Marchesi <jose.marchesi@oracle.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240426145158.14409-1-jose.marchesi@oracle.com
Two important arguments in RTT estimation, mrtt and srtt, are passed to
tcp_bpf_rtt(), so that bpf programs get more information about RTT
computation in BPF_SOCK_OPS_RTT_CB.
The difference between bpf_sock_ops->srtt_us and the srtt here is: the
former is an old rtt before update, while srtt passed by tcp_bpf_rtt()
is that after update.
Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240425161724.73707-2-lulie@linux.alibaba.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
The kernel repository has a .mailmap file to remap author names and
email addresses to their desired format in Git logs (for details, see
gitmailmap documentation [0]). Alas, this is only visible for author
information when looking at the logs locally, as GitHub does not support
mailmaps at the moment [1].
This commit adds a .mailmap file for libbpf, automatically generated
from the kernel's version. The script to generate the .mailmap is added,
too: it works by grepping email addresses from authors in the
repository, and collecting all lines ending with this address in the
kernel's .mailmap - in other words, all lines where this address is used
as a pattern for a remapping.
To keep the .mailmap up-to-date, add a call to the script to
sync-kernel.sh.
[0] https://git-scm.com/docs/gitmailmap
[1] https://github.com/orgs/community/discussions/22518
Signed-off-by: Quentin Monnet <qmo@kernel.org>
When dumping a character array, libbpf will watch for a '\0' and set
is_array_terminated=true if found. This prevents libbpf from printing
the remaining characters of the array, treating it as a nul-terminated
string.
However, once this flag is set, it's never reset, leading to subsequent
characters array not being printed properly:
.str_multi = (__u8[2][16])[
[
'H',
'e',
'l',
],
],
This patch saves the is_array_terminated flag and restores its
default (false) value before looping over the elements of an array,
then restores it afterward. This way, libbpf's behavior is unchanged
when dumping the characters of an array, but subsequent arrays are
printed properly:
.str_multi = (__u8[2][16])[
[
'H',
'e',
'l',
],
[
'l',
'o',
],
],
Signed-off-by: Quentin Deslandes <qde@naccy.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240413211258.134421-3-qde@naccy.de
In btf_dump_array_data(), libbpf will call btf_dump_dump_type_data() for
each element. For an array of characters, each element will be
processed the following way:
- btf_dump_dump_type_data() is called to print the character
- btf_dump_data_pfx() prefixes the current line with the proper number
of indentations
- btf_dump_int_data() is called to print the character
- After the last character is printed, btf_dump_dump_type_data() calls
btf_dump_data_pfx() before writing the closing bracket
However, for an array containing characters, btf_dump_int_data() won't
print any '\0' and subsequent characters. This leads to situations where
the line prefix is written, no character is added, then the prefix is
written again before adding the closing bracket:
(struct sk_metadata){
.str_array = (__u8[14])[
'H',
'e',
'l',
'l',
'o',
],
This change solves this issue by printing the '\0' character, which
has two benefits:
- The bracket closing the array is properly aligned
- It's clear from a user point of view that libbpf uses '\0' as a
terminator for arrays of characters.
Signed-off-by: Quentin Deslandes <qde@naccy.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240413211258.134421-2-qde@naccy.de
Add bpf_link support for sk_msg and sk_skb programs. We have an
internal request to support bpf_link for sk_msg programs so user
space can have a uniform handling with bpf_link based libbpf
APIs. Using bpf_link based libbpf API also has a benefit which
makes system robust by decoupling prog life cycle and
attachment life cycle.
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240410043527.3737160-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In some cases, instead of always consuming all items from ring buffers
in a greedy way, we may want to consume up to a certain amount of items,
for example when we need to copy items from the BPF ring buffer to a
limited user buffer.
This change allows to set an upper limit to the amount of items consumed
from one or more ring buffers.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240406092005.92399-3-andrea.righi@canonical.com
The struct bpf_fib_lookup is supposed to be of size 64. A recent commit
59b418c7063d ("bpf: Add a check for struct bpf_fib_lookup size") added
a static assertion to check this property so that future changes to the
structure will not accidentally break this assumption.
As it immediately turned out, on some 32-bit arm systems, when AEABI=n,
the total size of the structure was equal to 68, see [1]. This happened
because the bpf_fib_lookup structure contains a union of two 16-bit
fields:
union {
__u16 tot_len;
__u16 mtu_result;
};
which was supposed to compile to a 16-bit-aligned 16-bit field. On the
aforementioned setups it was instead both aligned and padded to 32-bits.
Declare this inner union as __attribute__((packed, aligned(2))) such
that it always is of size 2 and is aligned to 16 bits.
[1] https://lore.kernel.org/all/CA+G9fYtsoP51f-oP_Sp5MOq-Ffv8La2RztNpwvE6+R1VtFiLrw@mail.gmail.com/#t
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Fixes: e1850ea9bd9e ("bpf: bpf_fib_lookup return MTU value as output when looked up")
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240403123303.1452184-1-aspsk@isovalent.com
With CONFIG_LTO_CLANG_THIN enabled, with some of previous
version of kernel code base ([1]), I hit the following
error:
test_ksyms:PASS:kallsyms_fopen 0 nsec
test_ksyms:FAIL:ksym_find symbol 'bpf_link_fops' not found
#118 ksyms:FAIL
The reason is that 'bpf_link_fops' is renamed to
bpf_link_fops.llvm.8325593422554671469
Due to cross-file inlining, the static variable 'bpf_link_fops'
in syscall.c is used by a function in another file. To avoid
potential duplicated names, the llvm added suffix
'.llvm.<hash>' ([2]) to 'bpf_link_fops' variable.
Such renaming caused a problem in libbpf if 'bpf_link_fops'
is used in bpf prog as a ksym but 'bpf_link_fops' does not
match any symbol in /proc/kallsyms.
To fix this issue, libbpf needs to understand that suffix '.llvm.<hash>'
is caused by clang lto kernel and to process such symbols properly.
With latest bpf-next code base built with CONFIG_LTO_CLANG_THIN,
I cannot reproduce the above failure any more. But such an issue
could happen with other symbols or in the future for bpf_link_fops symbol.
For example, with my current kernel, I got the following from
/proc/kallsyms:
ffffffff84782154 d __func__.net_ratelimit.llvm.6135436931166841955
ffffffff85f0a500 d tk_core.llvm.726630847145216431
ffffffff85fdb960 d __fs_reclaim_map.llvm.10487989720912350772
ffffffff864c7300 d fake_dst_ops.llvm.54750082607048300
I could not easily create a selftest to test newly-added
libbpf functionality with a static C test since I do not know
which symbol is cross-file inlined. But based on my particular kernel,
the following test change can run successfully.
> diff --git a/tools/testing/selftests/bpf/prog_tests/ksyms.c b/tools/testing/selftests/bpf/prog_tests/ksyms.c
> index 6a86d1f07800..904a103f7b1d 100644
> --- a/tools/testing/selftests/bpf/prog_tests/ksyms.c
> +++ b/tools/testing/selftests/bpf/prog_tests/ksyms.c
> @@ -42,6 +42,7 @@ void test_ksyms(void)
> ASSERT_EQ(data->out__bpf_link_fops, link_fops_addr, "bpf_link_fops");
> ASSERT_EQ(data->out__bpf_link_fops1, 0, "bpf_link_fops1");
> ASSERT_EQ(data->out__btf_size, btf_size, "btf_size");
> + ASSERT_NEQ(data->out__fake_dst_ops, 0, "fake_dst_ops");
> ASSERT_EQ(data->out__per_cpu_start, per_cpu_start_addr, "__per_cpu_start");
>
> cleanup:
> diff --git a/tools/testing/selftests/bpf/progs/test_ksyms.c b/tools/testing/selftests/bpf/progs/test_ksyms.c
> index 6c9cbb5a3bdf..fe91eef54b66 100644
> --- a/tools/testing/selftests/bpf/progs/test_ksyms.c
> +++ b/tools/testing/selftests/bpf/progs/test_ksyms.c
> @@ -9,11 +9,13 @@ __u64 out__bpf_link_fops = -1;
> __u64 out__bpf_link_fops1 = -1;
> __u64 out__btf_size = -1;
> __u64 out__per_cpu_start = -1;
> +__u64 out__fake_dst_ops = -1;
>
> extern const void bpf_link_fops __ksym;
> extern const void __start_BTF __ksym;
> extern const void __stop_BTF __ksym;
> extern const void __per_cpu_start __ksym;
> +extern const void fake_dst_ops __ksym;
> /* non-existing symbol, weak, default to zero */
> extern const void bpf_link_fops1 __ksym __weak;
>
> @@ -23,6 +25,7 @@ int handler(const void *ctx)
> out__bpf_link_fops = (__u64)&bpf_link_fops;
> out__btf_size = (__u64)(&__stop_BTF - &__start_BTF);
> out__per_cpu_start = (__u64)&__per_cpu_start;
> + out__fake_dst_ops = (__u64)&fake_dst_ops;
>
> out__bpf_link_fops1 = (__u64)&bpf_link_fops1;
This patch fixed the issue in libbpf such that
the suffix '.llvm.<hash>' will be ignored during comparison of
bpf prog ksym vs. symbols in /proc/kallsyms, this resolved the issue.
Currently, only static variables in /proc/kallsyms are checked
with '.llvm.<hash>' suffix since in bpf programs function ksyms
with '.llvm.<hash>' suffix are most likely kfunc's and unlikely
to be cross-file inlined.
Note that currently kernel does not support gcc build with lto.
[1] https://lore.kernel.org/bpf/20240302165017.1627295-1-yonghong.song@linux.dev/
[2] https://github.com/llvm/llvm-project/blob/release/18.x/llvm/include/llvm/IR/ModuleSummaryIndex.h#L1714-L1719
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240326041458.1198161-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
BPF verifier emits "unknown func" message when given BPF program type
does not support BPF helper. This message may be confusing for users, as
important context that helper is unknown only to current program type is
not provided.
This patch changes message to "program of this type cannot use helper "
and aligns dependent code in libbpf and tests. Any suggestions on
improving/changing this message are welcome.
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Quentin Monnet <qmo@kernel.org>
Link: https://lore.kernel.org/r/20240325152210.377548-1-yatsenko@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Since its going directly to the syscall to avoid not having
memfd_create() available in some systems, do the same for its
MFD_CLOEXEC flags, defining it if not available.
This fixes the build in those systems, noticed while building perf on a
set of build containers.
Fixes: 9fa5e1a180aa639f ("libbpf: Call memfd_create() syscall directly")
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/ZfxZ9nCyKvwmpKkE@x1
It's been reported that (void *)map->map_extra is causing compilation
warnings on 32-bit architectures. It's easy enough to fix this by
casting to long first.
Fixes: 79ff13e99169 ("libbpf: Add support for bpf_arena.")
Reported-by: Ryan Eatmon <reatmon@ti.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Message-ID: <20240319215143.1279312-1-andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The selftests use
to tell LLVM about special pointers. For LLVM there is nothing "arena"
about them. They are simply pointers in a different address space.
Hence LLVM diff https://github.com/llvm/llvm-project/pull/85161 renamed:
. macro __BPF_FEATURE_ARENA_CAST -> __BPF_FEATURE_ADDR_SPACE_CAST
. global variables in __attribute__((address_space(N))) are now
placed in section named ".addr_space.N" instead of ".arena.N".
Adjust libbpf, bpftool, and selftests to match LLVM.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20240315021834.62988-3-alexei.starovoitov@gmail.com
Wire up BPF cookie for raw tracepoint programs (both BTF and non-BTF
aware variants). This brings them up to part w.r.t. BPF cookie usage
with classic tracepoint and fentry/fexit programs.
Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Message-ID: <20240319233852.1977493-4-andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
libbpf creates bpf_program/bpf_map structs for each program/map that
user defines, but it allows to disable creating/loading those objects in
kernel, in that case they won't have associated file descriptor
(fd < 0). Such functionality is used for backward compatibility
with some older kernels.
Nothing prevents users from passing these maps or programs with no
kernel counterpart to libbpf APIs. This change introduces explicit
checks for kernel objects existence, aiming to improve visibility of
those edge cases and provide meaningful warnings to users.
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240318131808.95959-1-yatsenko@meta.com
Accept additional fields of a struct_ops type with all zero values even if
these fields are not in the corresponding type in the kernel. This provides
a way to be backward compatible. User space programs can use the same map
on a machine running an old kernel by clearing fields that do not exist in
the kernel.
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240313214139.685112-2-thinker.li@gmail.com
In bpf_objec_load_prog(), there's no guarantee that obj->btf is non-NULL
when passing it to btf__fd(), and this function does not perform any
check before dereferencing its argument (as bpf_object__btf_fd() used to
do). As a consequence, we get segmentation fault errors in bpftool (for
example) when trying to load programs that come without BTF information.
v2: Keep btf__fd() in the fix instead of reverting to bpf_object__btf_fd().
Fixes: df7c3f7d3a3d ("libbpf: make uniform use of btf__fd() accessor inside libbpf")
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Quentin Monnet <qmo@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240314150438.232462-1-qmo@kernel.org
LLVM automatically places __arena variables into ".arena.1" ELF section.
In order to use such global variables bpf program must include definition
of arena map in ".maps" section, like:
struct {
__uint(type, BPF_MAP_TYPE_ARENA);
__uint(map_flags, BPF_F_MMAPABLE);
__uint(max_entries, 1000); /* number of pages */
__ulong(map_extra, 2ull << 44); /* start of mmap() region */
} arena SEC(".maps");
libbpf recognizes both uses of arena and creates single `struct bpf_map *`
instance in libbpf APIs.
".arena.1" ELF section data is used as initial data image, which is exposed
through skeleton and bpf_map__initial_value() to the user, if they need to tune
it before the load phase. During load phase, this initial image is copied over
into mmap()'ed region corresponding to arena, and discarded.
Few small checks here and there had to be added to make sure this
approach works with bpf_map__initial_value(), mostly due to hard-coded
assumption that map->mmaped is set up with mmap() syscall and should be
munmap()'ed. For arena, .arena.1 can be (much) smaller than maximum
arena size, so this smaller data size has to be tracked separately.
Given it is enforced that there is only one arena for entire bpf_object
instance, we just keep it in a separate field. This can be generalized
if necessary later.
All global variables from ".arena.1" section are accessible from user space
via skel->arena->name_of_var.
For bss/data/rodata the skeleton/libbpf perform the following sequence:
1. addr = mmap(MAP_ANONYMOUS)
2. user space optionally modifies global vars
3. map_fd = bpf_create_map()
4. bpf_update_map_elem(map_fd, addr) // to store values into the kernel
5. mmap(addr, MAP_FIXED, map_fd)
after step 5 user spaces see the values it wrote at step 2 at the same addresses
arena doesn't support update_map_elem. Hence skeleton/libbpf do:
1. addr = malloc(sizeof SEC ".arena.1")
2. user space optionally modifies global vars
3. map_fd = bpf_create_map(MAP_TYPE_ARENA)
4. real_addr = mmap(map->map_extra, MAP_SHARED | MAP_FIXED, map_fd)
5. memcpy(real_addr, addr) // this will fault-in and allocate pages
At the end look and feel of global data vs __arena global data is the same from
bpf prog pov.
Another complication is:
struct {
__uint(type, BPF_MAP_TYPE_ARENA);
} arena SEC(".maps");
int __arena foo;
int bar;
ptr1 = &foo; // relocation against ".arena.1" section
ptr2 = &arena; // relocation against ".maps" section
ptr3 = &bar; // relocation against ".bss" section
Fo the kernel ptr1 and ptr2 has point to the same arena's map_fd
while ptr3 points to a different global array's map_fd.
For the verifier:
ptr1->type == unknown_scalar
ptr2->type == const_ptr_to_map
ptr3->type == ptr_to_map_value
After verification, from JIT pov all 3 ptr-s are normal ld_imm64 insns.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20240308010812.89848-11-alexei.starovoitov@gmail.com
mmap() bpf_arena right after creation, since the kernel needs to
remember the address returned from mmap. This is user_vm_start.
LLVM will generate bpf_arena_cast_user() instructions where
necessary and JIT will add upper 32-bit of user_vm_start
to such pointers.
Fix up bpf_map_mmap_sz() to compute mmap size as
map->value_size * map->max_entries for arrays and
PAGE_SIZE * map->max_entries for arena.
Don't set BTF at arena creation time, since it doesn't support it.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240308010812.89848-9-alexei.starovoitov@gmail.com
Introduce bpf_arena, which is a sparse shared memory region between the bpf
program and user space.
Use cases:
1. User space mmap-s bpf_arena and uses it as a traditional mmap-ed
anonymous region, like memcached or any key/value storage. The bpf
program implements an in-kernel accelerator. XDP prog can search for
a key in bpf_arena and return a value without going to user space.
2. The bpf program builds arbitrary data structures in bpf_arena (hash
tables, rb-trees, sparse arrays), while user space consumes it.
3. bpf_arena is a "heap" of memory from the bpf program's point of view.
The user space may mmap it, but bpf program will not convert pointers
to user base at run-time to improve bpf program speed.
Initially, the kernel vm_area and user vma are not populated. User space
can fault in pages within the range. While servicing a page fault,
bpf_arena logic will insert a new page into the kernel and user vmas. The
bpf program can allocate pages from that region via
bpf_arena_alloc_pages(). This kernel function will insert pages into the
kernel vm_area. The subsequent fault-in from user space will populate that
page into the user vma. The BPF_F_SEGV_ON_FAULT flag at arena creation time
can be used to prevent fault-in from user space. In such a case, if a page
is not allocated by the bpf program and not present in the kernel vm_area,
the user process will segfault. This is useful for use cases 2 and 3 above.
bpf_arena_alloc_pages() is similar to user space mmap(). It allocates pages
either at a specific address within the arena or allocates a range with the
maple tree. bpf_arena_free_pages() is analogous to munmap(), which frees
pages and removes the range from the kernel vm_area and from user process
vmas.
bpf_arena can be used as a bpf program "heap" of up to 4GB. The speed of
bpf program is more important than ease of sharing with user space. This is
use case 3. In such a case, the BPF_F_NO_USER_CONV flag is recommended.
It will tell the verifier to treat the rX = bpf_arena_cast_user(rY)
instruction as a 32-bit move wX = wY, which will improve bpf prog
performance. Otherwise, bpf_arena_cast_user is translated by JIT to
conditionally add the upper 32 bits of user vm_start (if the pointer is not
NULL) to arena pointers before they are stored into memory. This way, user
space sees them as valid 64-bit pointers.
Diff https://github.com/llvm/llvm-project/pull/84410 enables LLVM BPF
backend generate the bpf_addr_space_cast() instruction to cast pointers
between address_space(1) which is reserved for bpf_arena pointers and
default address space zero. All arena pointers in a bpf program written in
C language are tagged as __attribute__((address_space(1))). Hence, clang
provides helpful diagnostics when pointers cross address space. Libbpf and
the kernel support only address_space == 1. All other address space
identifiers are reserved.
rX = bpf_addr_space_cast(rY, /* dst_as */ 1, /* src_as */ 0) tells the
verifier that rX->type = PTR_TO_ARENA. Any further operations on
PTR_TO_ARENA register have to be in the 32-bit domain. The verifier will
mark load/store through PTR_TO_ARENA with PROBE_MEM32. JIT will generate
them as kern_vm_start + 32bit_addr memory accesses. The behavior is similar
to copy_from_kernel_nofault() except that no address checks are necessary.
The address is guaranteed to be in the 4GB range. If the page is not
present, the destination register is zeroed on read, and the operation is
ignored on write.
rX = bpf_addr_space_cast(rY, 0, 1) tells the verifier that rX->type =
unknown scalar. If arena->map_flags has BPF_F_NO_USER_CONV set, then the
verifier converts such cast instructions to mov32. Otherwise, JIT will emit
native code equivalent to:
rX = (u32)rY;
if (rY)
rX |= clear_lo32_bits(arena->user_vm_start); /* replace hi32 bits in rX */
After such conversion, the pointer becomes a valid user pointer within
bpf_arena range. The user process can access data structures created in
bpf_arena without any additional computations. For example, a linked list
built by a bpf program can be walked natively by user space.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Barret Rhoden <brho@google.com>
Link: https://lore.kernel.org/bpf/20240308010812.89848-2-alexei.starovoitov@gmail.com
__uint() macro that is used to specify map attributes like:
__uint(type, BPF_MAP_TYPE_ARRAY);
__uint(map_flags, BPF_F_MMAPABLE);
It is limited to 32-bit, since BTF_KIND_ARRAY has u32 "number of elements"
field in "struct btf_array".
Introduce __ulong() macro that allows specifying values bigger than 32-bit.
In map definition "map_extra" is the only u64 field, so far.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20240307031228.42896-5-alexei.starovoitov@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
The ethtool-nl family does a good job exposing various protocol
related and IEEE/IETF statistics which used to get dumped under
ethtool -S, with creative names. Queue stats don't have a netlink
API, yet, and remain a lion's share of ethtool -S output for new
drivers. Not only is that bad because the names differ driver to
driver but it's also bug-prone. Intuitively drivers try to report
only the stats for active queues, but querying ethtool stats
involves multiple system calls, and the number of stats is
read separately from the stats themselves. Worse still when user
space asks for values of the stats, it doesn't inform the kernel
how big the buffer is. If number of stats increases in the meantime
kernel will overflow user buffer.
Add a netlink API for dumping queue stats. Queue information is
exposed via the netdev-genl family, so add the stats there.
Support per-queue and sum-for-device dumps. Latter will be useful
when subsequent patches add more interesting common stats than
just bytes and packets.
The API does not currently distinguish between HW and SW stats.
The expectation is that the source of the stats will either not
matter much (good packets) or be obvious (skb alloc errors).
Acked-by: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Amritha Nambiar <amritha.nambiar@intel.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240306195509.1502746-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
bpf tree has fixes for xdp_bonding selftests which are not yet in
bpf-next, so add them as temporary CI-only patches.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Optional struct_ops maps are defined using question mark at the start
of the section name, e.g.:
SEC("?.struct_ops")
struct test_ops optional_map = { ... };
This commit teaches libbpf to detect if kernel allows '?' prefix
in datasec names, and if it doesn't then to rewrite such names
by replacing '?' with '_', e.g.:
DATASEC ?.struct_ops -> DATASEC _.struct_ops
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240306104529.6453-13-eddyz87@gmail.com
The next patch would add two new section names for struct_ops maps.
To make working with multiple struct_ops sections more convenient:
- remove fields like elf_state->st_ops_{shndx,link_shndx};
- mark section descriptions hosting struct_ops as
elf_sec_desc->sec_type == SEC_ST_OPS;
After these changes struct_ops sections could be processed uniformly
by iterating bpf_object->efile.secs entries.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240306104529.6453-11-eddyz87@gmail.com
Automatically select which struct_ops programs to load depending on
which struct_ops maps are selected for automatic creation.
E.g. for the BPF code below:
SEC("struct_ops/test_1") int BPF_PROG(foo) { ... }
SEC("struct_ops/test_2") int BPF_PROG(bar) { ... }
SEC(".struct_ops.link")
struct test_ops___v1 A = {
.foo = (void *)foo
};
SEC(".struct_ops.link")
struct test_ops___v2 B = {
.foo = (void *)foo,
.bar = (void *)bar,
};
And the following libbpf API calls:
bpf_map__set_autocreate(skel->maps.A, true);
bpf_map__set_autocreate(skel->maps.B, false);
The autoload would be enabled for program 'foo' and disabled for
program 'bar'.
During load, for each struct_ops program P, referenced from some
struct_ops map M:
- set P.autoload = true if M.autocreate is true for some M;
- set P.autoload = false if M.autocreate is false for all M;
- don't change P.autoload, if P is not referenced from any map.
Do this after bpf_object__init_kern_struct_ops_maps()
to make sure that shadow vars assignment is done.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240306104529.6453-9-eddyz87@gmail.com
Skip load steps for struct_ops maps not marked for automatic creation.
This should allow to load bpf object in situations like below:
SEC("struct_ops/foo") int BPF_PROG(foo) { ... }
SEC("struct_ops/bar") int BPF_PROG(bar) { ... }
struct test_ops___v1 {
int (*foo)(void);
};
struct test_ops___v2 {
int (*foo)(void);
int (*does_not_exist)(void);
};
SEC(".struct_ops.link")
struct test_ops___v1 map_for_old = {
.test_1 = (void *)foo
};
SEC(".struct_ops.link")
struct test_ops___v2 map_for_new = {
.test_1 = (void *)foo,
.does_not_exist = (void *)bar
};
Suppose program is loaded on old kernel that does not have definition
for 'does_not_exist' struct_ops member. After this commit it would be
possible to load such object file after the following tweaks:
bpf_program__set_autoload(skel->progs.bar, false);
bpf_map__set_autocreate(skel->maps.map_for_new, false);
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/20240306104529.6453-4-eddyz87@gmail.com
Enforce the following existing limitation on struct_ops programs based
on kernel BTF id instead of program-local BTF id:
struct_ops BPF prog can be re-used between multiple .struct_ops &
.struct_ops.link as long as it's the same struct_ops struct
definition and the same function pointer field
This allows reusing same BPF program for versioned struct_ops map
definitions, e.g.:
SEC("struct_ops/test")
int BPF_PROG(foo) { ... }
struct some_ops___v1 { int (*test)(void); };
struct some_ops___v2 { int (*test)(void); };
SEC(".struct_ops.link") struct some_ops___v1 a = { .test = foo }
SEC(".struct_ops.link") struct some_ops___v2 b = { .test = foo }
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240306104529.6453-3-eddyz87@gmail.com
E.g. allow the following struct_ops definitions:
struct bpf_testmod_ops___v1 { int (*test)(void); };
struct bpf_testmod_ops___v2 { int (*test)(void); };
SEC(".struct_ops.link")
struct bpf_testmod_ops___v1 a = { .test = ... }
SEC(".struct_ops.link")
struct bpf_testmod_ops___v2 b = { .test = ... }
Where both bpf_testmod_ops__v1 and bpf_testmod_ops__v2 would be
resolved as 'struct bpf_testmod_ops' from kernel BTF.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240306104529.6453-2-eddyz87@gmail.com
Introduce may_goto instruction that from the verifier pov is similar to
open coded iterators bpf_for()/bpf_repeat() and bpf_loop() helper, but it
doesn't iterate any objects.
In assembly 'may_goto' is a nop most of the time until bpf runtime has to
terminate the program for whatever reason. In the current implementation
may_goto has a hidden counter, but other mechanisms can be used.
For programs written in C the later patch introduces 'cond_break' macro
that combines 'may_goto' with 'break' statement and has similar semantics:
cond_break is a nop until bpf runtime has to break out of this loop.
It can be used in any normal "for" or "while" loop, like
for (i = zero; i < cnt; cond_break, i++) {
The verifier recognizes that may_goto is used in the program, reserves
additional 8 bytes of stack, initializes them in subprog prologue, and
replaces may_goto instruction with:
aux_reg = *(u64 *)(fp - 40)
if aux_reg == 0 goto pc+off
aux_reg -= 1
*(u64 *)(fp - 40) = aux_reg
may_goto instruction can be used by LLVM to implement __builtin_memcpy,
__builtin_strcmp.
may_goto is not a full substitute for bpf_for() macro.
bpf_for() doesn't have induction variable that verifiers sees,
so 'i' in bpf_for(i, 0, 100) is seen as imprecise and bounded.
But when the code is written as:
for (i = 0; i < 100; cond_break, i++)
the verifier see 'i' as precise constant zero,
hence cond_break (aka may_goto) doesn't help to converge the loop.
A static or global variable can be used as a workaround:
static int zero = 0;
for (i = zero; i < 100; cond_break, i++) // works!
may_goto works well with arena pointers that don't need to be bounds
checked on access. Load/store from arena returns imprecise unbounded
scalar and loops with may_goto pass the verifier.
Reserve new opcode BPF_JMP | BPF_JCOND for may_goto insn.
JCOND stands for conditional pseudo jump.
Since goto_or_nop insn was proposed, it may use the same opcode.
may_goto vs goto_or_nop can be distinguished by src_reg:
code = BPF_JMP | BPF_JCOND
src_reg = 0 - may_goto
src_reg = 1 - goto_or_nop
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240306031929.42666-2-alexei.starovoitov@gmail.com
Convert st_ops->data to the shadow type of the struct_ops map. The shadow
type of a struct_ops type is a variant of the original struct type
providing a way to access/change the values in the maps of the struct_ops
type.
bpf_map__initial_value() will return st_ops->data for struct_ops types. The
skeleton is going to use it as the pointer to the shadow type of the
original struct type.
One of the main differences between the original struct type and the shadow
type is that all function pointers of the shadow type are converted to
pointers of struct bpf_program. Users can replace these bpf_program
pointers with other BPF programs. The st_ops->progs[] will be updated
before updating the value of a map to reflect the changes made by users.
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240229064523.2091270-3-thinker.li@gmail.com
For a struct_ops map, btf_value_type_id is the type ID of it's struct
type. This value is required by bpftool to generate skeleton including
pointers of shadow types. The code generator gets the type ID from
bpf_map__btf_value_type_id() in order to get the type information of the
struct type of a map.
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240229064523.2091270-2-thinker.li@gmail.com
Replace deprecated 0-length array in struct bpf_lpm_trie_key with
flexible array. Found with GCC 13:
../kernel/bpf/lpm_trie.c:207:51: warning: array subscript i is outside array bounds of 'const __u8[0]' {aka 'const unsigned char[]'} [-Warray-bounds=]
207 | *(__be16 *)&key->data[i]);
| ^~~~~~~~~~~~~
../include/uapi/linux/swab.h:102:54: note: in definition of macro '__swab16'
102 | #define __swab16(x) (__u16)__builtin_bswap16((__u16)(x))
| ^
../include/linux/byteorder/generic.h:97:21: note: in expansion of macro '__be16_to_cpu'
97 | #define be16_to_cpu __be16_to_cpu
| ^~~~~~~~~~~~~
../kernel/bpf/lpm_trie.c:206:28: note: in expansion of macro 'be16_to_cpu'
206 | u16 diff = be16_to_cpu(*(__be16 *)&node->data[i]
^
| ^~~~~~~~~~~
In file included from ../include/linux/bpf.h:7:
../include/uapi/linux/bpf.h:82:17: note: while referencing 'data'
82 | __u8 data[0]; /* Arbitrary size */
| ^~~~
And found at run-time under CONFIG_FORTIFY_SOURCE:
UBSAN: array-index-out-of-bounds in kernel/bpf/lpm_trie.c:218:49
index 0 is out of range for type '__u8 [*]'
Changing struct bpf_lpm_trie_key is difficult since has been used by
userspace. For example, in Cilium:
struct egress_gw_policy_key {
struct bpf_lpm_trie_key lpm_key;
__u32 saddr;
__u32 daddr;
};
While direct references to the "data" member haven't been found, there
are static initializers what include the final member. For example,
the "{}" here:
struct egress_gw_policy_key in_key = {
.lpm_key = { 32 + 24, {} },
.saddr = CLIENT_IP,
.daddr = EXTERNAL_SVC_IP & 0Xffffff,
};
To avoid the build time and run time warnings seen with a 0-sized
trailing array for struct bpf_lpm_trie_key, introduce a new struct
that correctly uses a flexible array for the trailing bytes,
struct bpf_lpm_trie_key_u8. As part of this, include the "header"
portion (which is just the "prefixlen" member), so it can be used
by anything building a bpf_lpr_trie_key that has trailing members that
aren't a u8 flexible array (like the self-test[1]), which is named
struct bpf_lpm_trie_key_hdr.
Unfortunately, C++ refuses to parse the __struct_group() helper, so
it is not possible to define struct bpf_lpm_trie_key_hdr directly in
struct bpf_lpm_trie_key_u8, so we must open-code the union directly.
Adjust the kernel code to use struct bpf_lpm_trie_key_u8 through-out,
and for the selftest to use struct bpf_lpm_trie_key_hdr. Add a comment
to the UAPI header directing folks to the two new options.
Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Closes: https://paste.debian.net/hidden/ca500597/
Link: https://lore.kernel.org/all/202206281009.4332AA33@keescook/ [1]
Link: https://lore.kernel.org/bpf/20240222155612.it.533-kees@kernel.org
Add support for the independent control state machine per IEEE
802.1AX-2008 5.4.15 in addition to the existing implementation of the
coupled control state machine.
Introduces two new states, AD_MUX_COLLECTING and AD_MUX_DISTRIBUTING in
the LACP MUX state machine for separated handling of an initial
Collecting state before the Collecting and Distributing state. This
enables a port to be in a state where it can receive incoming packets
while not still distributing. This is useful for reducing packet loss when
a port begins distributing before its partner is able to collect.
Added new functions such as bond_set_slave_tx_disabled_flags and
bond_set_slave_rx_enabled_flags to precisely manage the port's collecting
and distributing states. Previously, there was no dedicated method to
disable TX while keeping RX enabled, which this patch addresses.
Note that the regular flow process in the kernel's bonding driver remains
unaffected by this patch. The extension requires explicit opt-in by the
user (in order to ensure no disruptions for existing setups) via netlink
support using the new bonding parameter coupled_control. The default value
for coupled_control is set to 1 so as to preserve existing behaviour.
Signed-off-by: Aahil Awatramani <aahila@google.com>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20240202175858.1573852-1-aahila@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The batch lookup and lookup_and_delete APIs have two parameters,
in_batch and out_batch, to facilitate iterative
lookup/lookup_and_deletion operations for supported maps. Except NULL
for in_batch at the start of these two batch operations, both parameters
need to point to memory equal or larger than the respective map key
size, except for various hashmaps (hash, percpu_hash, lru_hash,
lru_percpu_hash) where the in_batch/out_batch memory size should be
at least 4 bytes.
Document these semantics to clarify the API.
Signed-off-by: Martin Kelly <martin.kelly@crowdstrike.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240221211838.1241578-1-martin.kelly@crowdstrike.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
In some situations, if you fail to zero-initialize the
bpf_{prog,map,btf,link}_info structs supplied to the set of LIBBPF
helpers bpf_{prog,map,btf,link}_get_info_by_fd(), you can expect the
helper to return an error. This can possibly leave people in a
situation where they're scratching their heads for an unnnecessary
amount of time. Make an explicit remark about the requirement of
zero-initializing the supplied bpf_{prog,map,btf,link}_info structs
for the respective LIBBPF helpers.
Internally, LIBBPF helpers bpf_{prog,map,btf,link}_get_info_by_fd()
call into bpf_obj_get_info_by_fd() where the bpf(2)
BPF_OBJ_GET_INFO_BY_FD command is used. This specific command is
effectively backed by restrictions enforced by the
bpf_check_uarg_tail_zero() helper. This function ensures that if the
size of the supplied bpf_{prog,map,btf,link}_info structs are larger
than what the kernel can handle, trailing bits are zeroed. This can be
a problem when compiling against UAPI headers that don't necessarily
match the sizes of the same underlying types known to the kernel.
Signed-off-by: Matt Bobrowski <mattbobrowski@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/ZcyEb8x4VbhieWsL@google.com
Due to internal differences between LLVM and GCC the current
implementation for the CO-RE macros does not fit GCC parser, as it will
optimize those expressions even before those would be accessible by the
BPF backend.
As examples, the following would be optimized out with the original
definitions:
- As enums are converted to their integer representation during
parsing, the IR would not know how to distinguish an integer
constant from an actual enum value.
- Types need to be kept as temporary variables, as the existing type
casts of the 0 address (as expanded for LLVM), are optimized away by
the GCC C parser, never really reaching GCCs IR.
Although, the macros appear to add extra complexity, the expanded code
is removed from the compilation flow very early in the compilation
process, not really affecting the quality of the generated assembly.
Signed-off-by: Cupertino Miranda <cupertino.miranda@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240213173543.1397708-1-cupertino.miranda@oracle.com
We get:
libbpf: struct_ops init_kern: struct bpf_dummy_ops is not found in kernel BTF
So even though it's irrelevant to the subtests we do want to test,
entire test has to be skipped, unfortunately.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
When the feature_flags and xdp_zc_max_segs fields were added to the libbpf
bpf_xdp_query_opts, the code writing them did not use the OPTS_SET() macro.
This causes libbpf to write to those fields unconditionally, which means
that programs compiled against an older version of libbpf (with a smaller
size of the bpf_xdp_query_opts struct) will have its stack corrupted by
libbpf writing out of bounds.
The patch adding the feature_flags field has an early bail out if the
feature_flags field is not part of the opts struct (via the OPTS_HAS)
macro, but the patch adding xdp_zc_max_segs does not. For consistency, this
fix just changes the assignments to both fields to use the OPTS_SET()
macro.
Fixes: 13ce2daa259a ("xsk: add new netlink attribute dedicated for ZC max frags")
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240206125922.1992815-1-toke@redhat.com
If PERF_EVENT program has __arg_ctx argument with matching
architecture-specific pt_regs/user_pt_regs/user_regs_struct pointer
type, libbpf should still perform type rewrite for old kernels, but not
emit the warning. Fix copy/paste from kernel code where 0 is meant to
signify "no error" condition. For libbpf we need to return "true" to
proceed with type rewrite (which for PERF_EVENT program will be
a canonical `struct bpf_perf_event_data *` type).
Fixes: 9eea8fafe33e ("libbpf: fix __arg_ctx type enforcement for perf_event programs")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240206002243.1439450-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Allowlist test_global_funcs/arg_tag_ctx* and a few of
verifier_global_subprogs subtests that validate libbpf's logic for
rewriting __arg_ctx globl subprog argument types on kernels that don't
natively support __arg_ctx.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Another API that was declared in libbpf.map but actual implementation
was missing. btf_ext__get_raw_data() was intended as a discouraged alias
to consistently-named btf_ext__raw_data(), so make this an actuality.
Fixes: 20eccf29e297 ("libbpf: hide and discourage inconsistently named getters")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240201172027.604869-5-andrii@kernel.org
LIBBPF_API annotation seems missing on libbpf_set_memlock_rlim API, so
add it to make this API callable from libbpf's shared library version.
Fixes: e542f2c4cd16 ("libbpf: Auto-bump RLIMIT_MEMLOCK if kernel needs it for BPF")
Fixes: ab9a5a05dc48 ("libbpf: fix up few libbpf.map problems")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240201172027.604869-3-andrii@kernel.org
After recent changes, Coverity complained about inconsistent null checks
in kernel_supports() function:
kernel_supports(const struct bpf_object *obj, ...)
[...]
// var_compare_op: Comparing obj to null implies that obj might be null
if (obj && obj->gen_loader)
return true;
// var_deref_op: Dereferencing null pointer obj
if (obj->token_fd)
return feat_supported(obj->feat_cache, feat_id);
[...]
- The original null check was introduced by commit [0], which introduced
a call `kernel_supports(NULL, ...)` in function bump_rlimit_memlock();
- This call was refactored to use `feat_supported(NULL, ...)` in commit [1].
Looking at all places where kernel_supports() is called:
- There is either `obj->...` access before the call;
- Or `obj` comes from `prog->obj` expression, where `prog` comes from
enumeration of programs in `obj`;
- Or `obj` comes from `prog->obj`, where `prog` is a parameter to one
of the API functions:
- bpf_program__attach_kprobe_opts;
- bpf_program__attach_kprobe;
- bpf_program__attach_ksyscall.
Assuming correct API usage, it appears that `obj` can never be null when
passed to kernel_supports(). Silence the Coverity warning by removing
redundant null check.
[0] e542f2c4cd16 ("libbpf: Auto-bump RLIMIT_MEMLOCK if kernel needs it for BPF")
[1] d6dd1d49367a ("libbpf: Further decouple feature checking logic from bpf_object")
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240131212615.20112-1-eddyz87@gmail.com
Add bpf_core_cast() macro that wraps bpf_rdonly_cast() kfunc. It's more
ergonomic than kfunc, as it automatically extracts btf_id with
bpf_core_type_id_kernel(), and works with type names. It also casts result
to (T *) pointer. See the definition of the macro, it's self-explanatory.
libbpf declares bpf_rdonly_cast() extern as __weak __ksym and should be
safe to not conflict with other possible declarations in user code.
But we do have a conflict with current BPF selftests that declare their
externs with first argument as `void *obj`, while libbpf opts into more
permissive `const void *obj`. This causes conflict, so we fix up BPF
selftests uses in the same patch.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240130212023.183765-2-andrii@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Add __arg_trusted to annotate global func args that accept trusted
PTR_TO_BTF_ID arguments.
Also add __arg_nullable to combine with __arg_trusted (and maybe other
tags in the future) to force global subprog itself (i.e., callee) to do
NULL checks, as opposed to default non-NULL semantics (and thus caller's
responsibility to ensure non-NULL values).
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240130000648.2144827-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
As CONFIG_DEBUG_INFO_BTF is default off the existing "failed to find
valid kernel BTF" message makes diagnosing the kernel build issue somewhat
cryptic. Add a little more detail with the hope of helping users.
Before:
```
libbpf: failed to find valid kernel BTF
libbpf: Error loading vmlinux BTF: -3
```
After not accessible:
```
libbpf: kernel BTF is missing at '/sys/kernel/btf/vmlinux', was CONFIG_DEBUG_INFO_BTF enabled?
libbpf: failed to find valid kernel BTF
libbpf: Error loading vmlinux BTF: -3
```
After not readable:
```
libbpf: failed to read kernel BTF from (/sys/kernel/btf/vmlinux): -1
```
Closes: https://lore.kernel.org/bpf/CAP-5=fU+DN_+Y=Y4gtELUsJxKNDDCOvJzPHvjUVaUoeFAzNnig@mail.gmail.com/
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240125231840.1647951-1-irogers@google.com
This replicates kernel upstream setup and brings READ_ONCE() and
WRITE_ONCE() macros anywhere where linux/kernel.h is included, which is
assumption libbpf code makes.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Libbpf got new source code file, features.c, we need to add it to
Makefile here on Github version as well.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Add BPF_CALL_REL() macro implementation into include/linux/filter.h
header, which is now used by libbpf code for feature detection.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
To allow external admin authority to override default BPF FS location
(/sys/fs/bpf) for implicit BPF token creation, teach libbpf to recognize
LIBBPF_BPF_TOKEN_PATH envvar. If it is specified and user application
didn't explicitly specify bpf_token_path option, it will be treated
exactly like bpf_token_path option, overriding default /sys/fs/bpf
location and making BPF token mandatory.
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-29-andrii@kernel.org
Add BPF token support to BPF object-level functionality.
BPF token is supported by BPF object logic either as an explicitly
provided BPF token from outside (through BPF FS path), or implicitly
(unless prevented through bpf_object_open_opts).
Implicit mode is assumed to be the most common one for user namespaced
unprivileged workloads. The assumption is that privileged container
manager sets up default BPF FS mount point at /sys/fs/bpf with BPF token
delegation options (delegate_{cmds,maps,progs,attachs} mount options).
BPF object during loading will attempt to create BPF token from
/sys/fs/bpf location, and pass it for all relevant operations
(currently, map creation, BTF load, and program load).
In this implicit mode, if BPF token creation fails due to whatever
reason (BPF FS is not mounted, or kernel doesn't support BPF token,
etc), this is not considered an error. BPF object loading sequence will
proceed with no BPF token.
In explicit BPF token mode, user provides explicitly custom BPF FS mount
point path. In such case, BPF object will attempt to create BPF token
from provided BPF FS location. If BPF token creation fails, that is
considered a critical error and BPF object load fails with an error.
Libbpf provides a way to disable implicit BPF token creation, if it
causes any troubles (BPF token is designed to be completely optional and
shouldn't cause any problems even if provided, but in the world of BPF
LSM, custom security logic can be installed that might change outcome
depending on the presence of BPF token). To disable libbpf's default BPF
token creation behavior user should provide either invalid BPF token FD
(negative), or empty bpf_token_path option.
BPF token presence can influence libbpf's feature probing, so if BPF
object has associated BPF token, feature probing is instructed to use
BPF object-specific feature detection cache and token FD.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-26-andrii@kernel.org
Adjust feature probing callbacks to take into account optional token_fd.
In unprivileged contexts, some feature detectors would fail to detect
kernel support just because BPF program, BPF map, or BTF object can't be
loaded due to privileged nature of those operations. So when BPF object
is loaded with BPF token, this token should be used for feature probing.
This patch is setting support for this scenario, but we don't yet pass
non-zero token FD. This will be added in the next patch.
We also switched BPF cookie detector from using kprobe program to
tracepoint one, as tracepoint is somewhat less dangerous BPF program
type and has higher likelihood of being allowed through BPF token in the
future. This change has no effect on detection behavior.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-25-andrii@kernel.org
Add feat_supported() helper that accepts feature cache instead of
bpf_object. This allows low-level code in bpf.c to not know or care
about higher-level concept of bpf_object, yet it will be able to utilize
custom feature checking in cases where BPF token might influence the
outcome.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-23-andrii@kernel.org
Allow user to specify token_fd for bpf_btf_load() API that wraps
kernel's BPF_BTF_LOAD command. This allows loading BTF from unprivileged
process as long as it has BPF token allowing BPF_BTF_LOAD command, which
can be created and delegated by privileged process.
Wire through new btf_flags as well, so that user can provide
BPF_F_TOKEN_FD flag, if necessary.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-15-andrii@kernel.org
Add basic support of BPF token to BPF_PROG_LOAD. BPF_F_TOKEN_FD flag
should be set in prog_flags field when providing prog_token_fd.
Wire through a set of allowed BPF program types and attach types,
derived from BPF FS at BPF token creation time. Then make sure we
perform bpf_token_capable() checks everywhere where it's relevant.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-7-andrii@kernel.org
Accept BPF token FD in BPF_BTF_LOAD command to allow BTF data loading
through delegated BPF token. BPF_F_TOKEN_FD flag has to be specified
when passing BPF token FD. Given BPF_BTF_LOAD command didn't have flags
field before, we also add btf_flags field.
BTF loading is a pretty straightforward operation, so as long as BPF
token is created with allow_cmds granting BPF_BTF_LOAD command, kernel
proceeds to parsing BTF data and creating BTF object.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-6-andrii@kernel.org
Allow providing token_fd for BPF_MAP_CREATE command to allow controlled
BPF map creation from unprivileged process through delegated BPF token.
New BPF_F_TOKEN_FD flag is added to specify together with BPF token FD
for BPF_MAP_CREATE command.
Wire through a set of allowed BPF map types to BPF token, derived from
BPF FS at BPF token creation time. This, in combination with allowed_cmds
allows to create a narrowly-focused BPF token (controlled by privileged
agent) with a restrictive set of BPF maps that application can attempt
to create.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-5-andrii@kernel.org
Add new kind of BPF kernel object, BPF token. BPF token is meant to
allow delegating privileged BPF functionality, like loading a BPF
program or creating a BPF map, from privileged process to a *trusted*
unprivileged process, all while having a good amount of control over which
privileged operations could be performed using provided BPF token.
This is achieved through mounting BPF FS instance with extra delegation
mount options, which determine what operations are delegatable, and also
constraining it to the owning user namespace (as mentioned in the
previous patch).
BPF token itself is just a derivative from BPF FS and can be created
through a new bpf() syscall command, BPF_TOKEN_CREATE, which accepts BPF
FS FD, which can be attained through open() API by opening BPF FS mount
point. Currently, BPF token "inherits" delegated command, map types,
prog type, and attach type bit sets from BPF FS as is. In the future,
having an BPF token as a separate object with its own FD, we can allow
to further restrict BPF token's allowable set of things either at the
creation time or after the fact, allowing the process to guard itself
further from unintentionally trying to load undesired kind of BPF
programs. But for now we keep things simple and just copy bit sets as is.
When BPF token is created from BPF FS mount, we take reference to the
BPF super block's owning user namespace, and then use that namespace for
checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN}
capabilities that are normally only checked against init userns (using
capable()), but now we check them using ns_capable() instead (if BPF
token is provided). See bpf_token_capable() for details.
Such setup means that BPF token in itself is not sufficient to grant BPF
functionality. User namespaced process has to *also* have necessary
combination of capabilities inside that user namespace. So while
previously CAP_BPF was useless when granted within user namespace, now
it gains a meaning and allows container managers and sys admins to have
a flexible control over which processes can and need to use BPF
functionality within the user namespace (i.e., container in practice).
And BPF FS delegation mount options and derived BPF tokens serve as
a per-container "flag" to grant overall ability to use bpf() (plus further
restrict on which parts of bpf() syscalls are treated as namespaced).
Note also, BPF_TOKEN_CREATE command itself requires ns_capable(CAP_BPF)
within the BPF FS owning user namespace, rounding up the ns_capable()
story of BPF token. Also creating BPF token in init user namespace is
currently not supported, given BPF token doesn't have any effect in init
user namespace anyways.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-4-andrii@kernel.org
The commit 9e926acda0c2 ("libbpf: Find correct module BTFs for struct_ops maps and progs.")
sets a newly added field (value_type_btf_obj_fd) to -1 in libbpf when
the caller of the libbpf's bpf_map_create did not define this field by
passing a NULL "opts" or passing in a "opts" that does not cover this
new field. OPT_HAS(opts, field) is used to decide if the field is
defined or not:
((opts) && opts->sz >= offsetofend(typeof(*(opts)), field))
Once OPTS_HAS decided the field is not defined, that field should
be set to 0. For this particular new field (value_type_btf_obj_fd),
its corresponding map_flags "BPF_F_VTYPE_BTF_OBJ_FD" is not set.
Thus, the kernel does not treat it as an fd field.
Fixes: 9e926acda0c2 ("libbpf: Find correct module BTFs for struct_ops maps and progs.")
Reported-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240124224418.2905133-1-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Locate the module BTFs for struct_ops maps and progs and pass them to the
kernel. This ensures that the kernel correctly resolves type IDs from the
appropriate module BTFs.
For the map of a struct_ops object, the FD of the module BTF is set to
bpf_map to keep a reference to the module BTF. The FD is passed to the
kernel as value_type_btf_obj_fd when the struct_ops object is loaded.
For a bpf_struct_ops prog, attach_btf_obj_fd of bpf_prog is the FD of a
module BTF in the kernel.
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240119225005.668602-13-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Pass the fd of a btf from the userspace to the bpf() syscall, and then
convert the fd into a btf. The btf is generated from the module that
defines the target BPF struct_ops type.
In order to inform the kernel about the module that defines the target
struct_ops type, the userspace program needs to provide a btf fd for the
respective module's btf. This btf contains essential information on the
types defined within the module, including the target struct_ops type.
A btf fd must be provided to the kernel for struct_ops maps and for the bpf
programs attached to those maps.
In the case of the bpf programs, the attach_btf_obj_fd parameter is passed
as part of the bpf_attr and is converted into a btf. This btf is then
stored in the prog->aux->attach_btf field. Here, it just let the verifier
access attach_btf directly.
In the case of struct_ops maps, a btf fd is passed as value_type_btf_obj_fd
of bpf_attr. The bpf_struct_ops_map_alloc() function converts the fd to a
btf and stores it as st_map->btf. A flag BPF_F_VTYPE_BTF_OBJ_FD is added
for map_flags to indicate that the value of value_type_btf_obj_fd is set.
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-9-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Include btf object id (btf_obj_id) in bpf_map_info so that tools (ex:
bpftools struct_ops dump) know the correct btf from the kernel to look up
type information of struct_ops types.
Since struct_ops types can be defined and registered in a module. The
type information of a struct_ops type are defined in the btf of the
module defining it. The userspace tools need to know which btf is for
the module defining a struct_ops type.
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-7-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
At the moment we don't store cookie for perf_event probes,
while we do that for the rest of the probes.
Adding cookie fields to struct bpf_link_info perf event
probe records:
perf_event.uprobe
perf_event.kprobe
perf_event.tracepoint
perf_event.perf_event
And the code to store that in bpf_link_info struct.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20240119110505.400573-2-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
We've ran into issues with using dup2() API in production setting, where
libbpf is linked into large production environment and ends up calling
unintended custom implementations of dup2(). These custom implementations
don't provide atomic FD replacement guarantees of dup2() syscall,
leading to subtle and hard to debug issues.
To prevent this in the future and guarantee that no libc implementation
will do their own custom non-atomic dup2() implementation, call dup2()
syscall directly with syscall(SYS_dup2).
Note that some architectures don't seem to provide dup2 and have dup3
instead. Try to detect and pick best syscall.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240119210201.1295511-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch allows to auto create BPF_MAP_TYPE_ARRAY_OF_MAPS and
BPF_MAP_TYPE_HASH_OF_MAPS with values of BPF_MAP_TYPE_PERF_EVENT_ARRAY
by bpf_object__load().
Previous behaviour created a zero filled btf_map_def for inner maps and
tried to use it for a map creation but the linux kernel forbids to create
a BPF_MAP_TYPE_PERF_EVENT_ARRAY map with max_entries=0.
Fixes: 646f02ffdd49 ("libbpf: Add BTF-defined map-in-map support")
Signed-off-by: Andrey Grafin <conquistador@yandex-team.ru>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20240117130619.9403-1-conquistador@yandex-team.ru
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
On kernel that don't support arg:ctx tag, before adjusting global
subprog BTF information to match kernel's expected canonical type names,
make sure that types used by user are meaningful, and if not, warn and
don't do BTF adjustments.
This is similar to checks that kernel performs, but narrower in scope,
as only a small subset of BPF program types can be accommodated by
libbpf using canonical type names.
Libbpf unconditionally allows `struct pt_regs *` for perf_event program
types, unlike kernel, which supports that conditionally on architecture.
This is done to keep things simple and not cause unnecessary false
positives. This seems like a minor and harmless deviation, which in
real-world programs will be caught by kernels with arg:ctx tag support
anyways. So KISS principle.
This logic is hard to test (especially on latest kernels), so manual
testing was performed instead. Libbpf emitted the following warning for
perf_event program with wrong context argument type:
libbpf: prog 'arg_tag_ctx_perf': subprog 'subprog_ctx_tag' arg#0 is expected to be of `struct bpf_perf_event_data *` type
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240118033143.3384355-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add feature detector of kernel-side arg:ctx (__arg_ctx) tag support. If
this is detected, libbpf will avoid doing any __arg_ctx-related BTF
rewriting and checks in favor of letting kernel handle this completely.
test_global_funcs/ctx_arg_rewrite subtest is adjusted to do the same
feature detection (albeit in much simpler, though round-about and
inefficient, way), and skip the tests. This is done to still be able to
execute this test on older kernels (like in libbpf CI).
Note, BPF token series ([0]) does a major refactor and code moving of
libbpf-internal feature detection "framework", so to avoid unnecessary
conflicts we keep newly added feature detection stand-alone with ad-hoc
result caching. Once things settle, there will be a small follow up to
re-integrate everything back and move code into its final place in
newly-added (by BPF token series) features.c file.
[0] https://patchwork.kernel.org/project/netdevbpf/list/?series=814209&state=*
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240118033143.3384355-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The branch counters logging (A.K.A LBR event logging) introduces a
per-counter indication of precise event occurrences in LBRs. It can
provide a means to attribute exposed retirement latency to combinations
of events across a block of instructions. It also provides a means of
attributing Timed LBR latencies to events.
The feature is first introduced on SRF/GRR. It is an enhancement of the
ARCH LBR. It adds new fields in the LBR_INFO MSRs to log the occurrences
of events on the GP counters. The information is displayed by the order
of counters.
The design proposed in this patch requires that the events which are
logged must be in a group with the event that has LBR. If there are
more than one LBR group, the counters logging information only from the
current group (overflowed) are stored for the perf tool, otherwise the
perf tool cannot know which and when other groups are scheduled
especially when multiplexing is triggered. The user can ensure it uses
the maximum number of counters that support LBR info (4 by now) by
making the group large enough.
The HW only logs events by the order of counters. The order may be
different from the order of enabling which the perf tool can understand.
When parsing the information of each branch entry, convert the counter
order to the enabled order, and store the enabled order in the extension
space.
Unconditionally reset LBRs for an LBR event group when it's deleted. The
logged counter information is only valid for the current LBR group. If
another LBR group is scheduled later, the information from the stale
LBRs would be otherwise wrongly interpreted.
Add a sanity check in intel_pmu_hw_config(). Disable the feature if other
counter filters (inv, cmask, edge, in_tx) are set or LBR call stack mode
is enabled. (For the LBR call stack mode, we cannot simply flush the
LBR, since it will break the call stack. Also, there is no obvious usage
with the call stack mode for now.)
Only applying the PERF_SAMPLE_BRANCH_COUNTERS doesn't require any branch
stack setup.
Expose the maximum number of supported counters and the width of the
counters into the sysfs. The perf tool can use the information to parse
the logged counters in each branch.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231025201626.3000228-5-kan.liang@linux.intel.com
Currently, the additional information of a branch entry is stored in a
u64 space. With more and more information added, the space is running
out. For example, the information of occurrences of events will be added
for each branch.
Two places were suggested to append the counters.
https://lore.kernel.org/lkml/20230802215814.GH231007@hirez.programming.kicks-ass.net/
One place is right after the flags of each branch entry. It changes the
existing struct perf_branch_entry. The later ARCH specific
implementation has to be really careful to consistently pick
the right struct.
The other place is right after the entire struct perf_branch_stack.
The disadvantage is that the pointer of the extra space has to be
recorded. The common interface perf_sample_save_brstack() has to be
updated.
The latter is much straightforward, and should be easily understood and
maintained. It is implemented in the patch.
Add a new branch sample type, PERF_SAMPLE_BRANCH_COUNTERS, to indicate
the event which is recorded in the branch info.
The "u64 counters" may store the occurrences of several events. The
information regarding the number of events/counters and the width of
each counter should be exposed via sysfs as a reference for the perf
tool. Define the branch_counter_nr and branch_counter_width ABI here.
The support will be implemented later in the Intel-specific patch.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231025201626.3000228-1-kan.liang@linux.intel.com
Please note that the `libbpf` authoritative source code is developed as part of bpf-next Linux source tree under tools/lib/bpf subdirectory and is periodically synced to Github. As such, all the libbpf changes should be sent to BPF mailing list, please don't open PRs here unless you are changing Github-specific parts of libbpf (e.g., Github-specific Makefile).
test_ima # All of CI is broken on it following 6.3-rc1 merge
lwt_reroute # crashes kernel after netnext merge from 2ab1efad60ad "net/sched: cls_api: complement tcf_tfilter_dump_policy"
tc_links_ingress # started failing after net-next merge from 2ab1efad60ad "net/sched: cls_api: complement tcf_tfilter_dump_policy"
xdp_bonding/xdp_bonding_features # started failing after net merge from 359e54a93ab4 "l2tp: pass correct message length to ip6_append_data"
tc_redirect/tc_redirect_dtime # uapi breakage after net-next commit 885c36e59f46 ("net: Re-use and set mono_delivery_time bit for userspace tstamp packets")
migrate_reuseport/IPv4 TCP_NEW_SYN_RECV reqsk_timer_handler # flaky, under investigation
migrate_reuseport/IPv6 TCP_NEW_SYN_RECV reqsk_timer_handler # flaky, under investigation
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.