.. _skiboot-5.11:
skiboot-5.11
============
skiboot v5.11 was released on Friday April 6th 2018. It is the first
release of skiboot 5.11, which is now the new stable release
of skiboot following the 5.10 release, first released February 23rd 2018.
We do *not* expect to keep the 5.11 branch around for long, and instead
expect to move quickly onto 6.0, which will mark the basis for op-build v2.0
and will be required for POWER9 systems.
It is expected that skiboot 6.0 will follow very shortly. Consider 5.11
more of a beta release to 6.0 than anything. For POWER9 systems it should
certainly be more solid than previous releases though.
skiboot v5.11 contains all bug fixes as of :ref:`skiboot-5.10.4`
and :ref:`skiboot-5.4.9` (the currently maintained stable releases). There
may be more 5.10.x stable releases; it will depend on demand.
See :ref:`stable-rules` for how the skiboot stable releases work.
Over skiboot-5.10, we have the following changes:
New Platforms
-------------
- Add VESNIN platform support
The Vesnin platform from YADRO is a 4 socket POWER8 system with up to 8TB
of memory with 460GB/s of memory bandwidth in only 2U. Many kudos to the
team from YADRO for submitting their code upstream!
New Features
------------
- fast-reboot: enable by default for POWER9
- Fast reboot is disabled if NPU2 is present or CAPI2/OpenCAPI is used
- PCI tunneled operations on PHB4
- phb4: set PBCQ Tunnel BAR for tunneled operations
P9 supports PCI tunneled operations (atomics and as_notify) that are
initiated by devices.
A subset of the tunneled operations require a response, which must be
sent back from the host to the device. For example, an atomic compare
and swap will return the compare status, as the swap will only be performed
in case of success. Similarly, as_notify reports if the target thread
has been woken up or not, because the operation may fail.
To enable tunneled operations, a device driver must tell the host where
it expects tunneled operation responses, by setting the PBCQ Tunnel BAR
Response register with a specific value within the range of its BARs.
This register is currently initialized by enable_capi_mode(). But, as
tunneled operations may also operate in PCI mode, a new API is required
to set the PBCQ Tunnel BAR Response register, without switching to CAPI
mode.
This patch provides two new OPAL calls to get/set the PBCQ Tunnel
BAR Response register.
Note: as there is only one PBCQ Tunnel BAR register, shared between
all the devices connected to the same PHB, only one of these devices
will be able to use tunneled operations, at any time.
- phb4: set PHB CMPM registers for tunneled operations
P9 supports PCI tunneled operations (atomics and as_notify) that require
setting the PHB ASN Compare/Mask register with a 16-bit indication.
This register is currently initialized by enable_capi_mode(). But, as
tunneled operations may also work in PCI mode, the ASN Compare/Mask
register should rather be initialized in phb4_init_ioda3().
This patch also adds "ibm,phb-indications" to the device tree, to tell
Linux the values of CAPI, ASN, and NBW indications, when supported.
Tunneled operations tested by IBM in CAPI mode, by Mellanox Technologies
in PCI mode.
- Tie tm-suspend fw-feature and opal_reinit_cpus() together
Currently opal_reinit_cpus(OPAL_REINIT_CPUS_TM_SUSPEND_DISABLED)
always returns OPAL_UNSUPPORTED.
This ties the tm suspend fw-feature to the
opal_reinit_cpus(OPAL_REINIT_CPUS_TM_SUSPEND_DISABLED) so that when tm
suspend is disabled, we correctly report it to the kernel. For
backwards compatibility, it's assumed tm suspend is available if the
fw-feature is not present.
Currently hostboot will clear fw-feature(TM_SUSPEND_ENABLED) on P9N
DD2.1. P9N DD2.2 will set fw-feature(TM_SUSPEND_ENABLED). DD2.0 and
below has TM disabled completely (not just suspend).
We are using opal_reinit_cpus() to determine this setting (rather than
the device tree/HDAT) as some future firmware may let us change this
dynamically after boot. That is not the case currently though.
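The fw-feature fallback described above can be sketched in C. This is only an illustration of the decision, not skiboot's code: the type `fw_feature_state`, the functions `tm_suspend_enabled()` and `reinit_tm_suspend_disabled()`, and the two return codes are hypothetical stand-ins invented for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the OPAL return codes (not the real values). */
enum { SKETCH_SUCCESS = 0, SKETCH_UNSUPPORTED = -1 };

/* Hypothetical three-way state of the tm-suspend fw-feature. */
enum fw_feature_state {
	FW_FEATURE_ABSENT,    /* firmware did not describe the feature */
	FW_FEATURE_ENABLED,
	FW_FEATURE_DISABLED
};

/* Backwards compatibility: a missing fw-feature means TM suspend is available. */
bool tm_suspend_enabled(enum fw_feature_state state)
{
	assert(state == FW_FEATURE_ABSENT || state == FW_FEATURE_ENABLED ||
	       state == FW_FEATURE_DISABLED);
	return state != FW_FEATURE_DISABLED;
}

/*
 * Model of a "reinit with TM suspend disabled" request: it can only
 * succeed when TM suspend really is disabled.
 */
int reinit_tm_suspend_disabled(enum fw_feature_state state)
{
	return tm_suspend_enabled(state) ? SKETCH_UNSUPPORTED : SKETCH_SUCCESS;
}
```

With this model, a system whose firmware predates the fw-feature behaves as before (the request is reported as unsupported), while a system with TM suspend disabled reports success.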
Power Management
----------------
- SLW: Increase stop4-5 residency by 10x
Using the DGEMM benchmark we observed a throughput drop of 5-9% with
stop4/5 enabled compared to without. In this benchmark the GPU waits on the
CPU to wake up and provide the subsequent data block to compute. The wakeup
latency accumulates over the run and shows up as a performance drop.
Linux enters stop4/5 more aggressively for its wakeup latency. Increasing
the residency from 1ms to 10ms makes the performance drop <1%.
- occ: Set up OCC messaging even if we fail to setup pstates
This means that we no longer hit this bug if we fail to get valid pstates
from the OCC. ::
[console-pexpect]#echo 1 > //sys/firmware/opal/sensor_groups//occ-csm0/clear
echo 1 > //sys/firmware/opal/sensor_groups//occ-csm0/clear
[ 94.019971181,5] CPU ATTEMPT TO RE-ENTER FIRMWARE! PIR=083d cpu @0x33cf4000 -> pir=083d token=8
[ 94.020098392,5] CPU ATTEMPT TO RE-ENTER FIRMWARE! PIR=083d cpu @0x33cf4000 -> pir=083d token=8
[ 10.318805] Disabling lock debugging due to kernel taint
[ 10.318808] Severe Machine check interrupt [Not recovered]
[ 10.318812] NIP [000000003003e434]: 0x3003e434
[ 10.318813] Initiator: CPU
[ 10.318815] Error type: Real address [Load/Store (foreign)]
[ 10.318817] opal: Hardware platform error: Unrecoverable Machine Check exception
[ 10.318821] CPU: 117 PID: 2745 Comm: sh Tainted: G M 4.15.9-openpower1 #3
[ 10.318823] NIP: 000000003003e434 LR: 000000003003025c CTR: 0000000030030240
[ 10.318825] REGS: c00000003fa7bd80 TRAP: 0200 Tainted: G M (4.15.9-openpower1)
[ 10.318826] MSR: 9000000000201002 <SF,HV,ME,RI> CR: 48002888 XER: 20040000
[ 10.318831] CFAR: 0000000030030258 DAR: 394a00147d5a03a6 DSISR: 00000008 SOFTE: 1
mbox based platforms
^^^^^^^^^^^^^^^^^^^^
For platforms using the mbox protocol for host flash access (all BMC based
OpenPOWER systems, most OpenBMC based systems) there have been some hardening
efforts in the event of the BMC being poorly behaved.
- mbox: Reduce default BMC timeouts
Rebooting a BMC can take 70 seconds. Skiboot cannot possibly spin for
70 seconds waiting for a BMC to come back. This also makes the current
default of 30 seconds a bit pointless: it is far too short to be a
worst case wait time but too long to avoid hitting hardlockup detectors
and wreaking havoc inside host Linux.
Just change it to three seconds so that host Linux will survive.
Reads and writes will fail, but at least the host stays up.
Also refactored the waiting loop just a bit so that it's easier to read.
- mbox: Harden against BMC daemon errors
Bugs present in the BMC daemon mean that skiboot gets presented with
mbox windows of size zero. These windows cannot be valid and skiboot
already detects these conditions.
Currently skiboot warns quite strongly about the occurrence of these
problems. The problem for skiboot is that it doesn't take any action.
Initially I wanted to avoid putting policy like this into skiboot, but
since these bugs aren't going away and skiboot barfing is leading to
lockups and ultimately the host going down, something needs to be done.
I propose that when we detect the problem we fail the mbox call and punt
the problem back up to Linux. I don't like it but at least it will cause
errors to cascade and won't bring the host down. I'm not sure how Linux
is supposed to detect this or what it can even do but this is better
than a crash.
Diagnosing a failure to boot if skiboot itself fails to read flash may
be marginally more difficult with this patch. This is because skiboot
will now only print one warning about the zero sized window rather than
continuously spitting it out.
Fast Reboot Improvements
------------------------
Around fast-reboot we have made several improvements to harden the fast
reboot code paths and resort to a full IPL if something doesn't look right.
- core/fast-reboot: zero memory after fast reboot
This improves the security and predictability of the fast reboot
environment.
There can not be a secure fence between fast reboots, because a
malicious OS can modify the firmware itself. However a well-behaved
OS can have a reasonable expectation that OS memory regions it has
modified will be cleared upon fast reboot.
The memory is zeroed after all other CPUs come up from fast reboot,
just before the new kernel is loaded and booted into. This allows
image preloading to run concurrently, and will allow parallelisation
of the clearing in future.
- core/fast-reboot: verify mem regions before fast reboot
Run the mem_region sanity checkers before proceeding with fast
reboot.
This is the beginning of proactive sanity checks on opal data
for fast reboot (which complements the reactive disable_fast_reboot
cases). Re-using and sharing any kind of debug code and unit test
code here is encouraged.
- fast-reboot: occ: Only delete /ibm,opal/power-mgt nodes if they exist
- core/fast-reboot: disable fast reboot upon fundamental entry/exit/locking errors
This disables fast reboot in several more cases where serious errors
like lock corruption or call re-entrancy are detected.
- capp: Disable fast-reboot whenever enable_capi_mode() is called
This patch updates phb4_set_capi_mode() to disable fast-reboot
whenever enable_capi_mode() is called, irrespective of its return
value. This guards against the possibility of fast-reboot not being
disabled when some change to enable_capi_mode() causes it to return
an error while leaving CAPP in enabled mode.
- fast-reboot: occ: Delete OCC child nodes in /ibm,opal/power-mgt
Fast-reboot in P8 fails to re-init OCC data as there are chipwise OCC
nodes which are already present in the /ibm,opal/power-mgt node. These
per-chip nodes hold the voltage IDs for each pstate and these can be
changed on OCC pstate table biasing. So delete these before calling
the re-init code to re-parse and populate the pstate data.
Debugging/SRESET improvements
-----------------------------
Since :ref:`skiboot-5.11-rc1`:
- core/cpu: Prevent clobbering of stack guard for boot-cpu
Commit 90d53934c2da ("core/cpu: discover stack region size before
initialising memory regions") introduced memzero for struct cpu_thread
in init_cpu_thread(). This has an unintended side effect of clobbering
the stack-guard canary of the boot_cpu stack. This results in opal
failing to init with this failure message: ::
CPU: P9 generation processor (max 4 threads/core)
CPU: Boot CPU PIR is 0x0004 PVR is 0x004e1200
Guard skip = 0
Stack corruption detected !
Aborting!
CPU 0004 Backtrace:
S: 0000000031c13ab0 R: 0000000030013b0c .backtrace+0x5c
S: 0000000031c13b50 R: 000000003001bd18 ._abort+0x60
S: 0000000031c13be0 R: 0000000030013bbc .__stack_chk_fail+0x54
S: 0000000031c13c60 R: 00000000300c5b70 .memset+0x12c
S: 0000000031c13d00 R: 0000000030019aa8 .init_cpu_thread+0x40
S: 0000000031c13d90 R: 000000003001b520 .init_boot_cpu+0x188
S: 0000000031c13e30 R: 0000000030015050 .main_cpu_entry+0xd0
S: 0000000031c13f00 R: 0000000030002700 boot_entry+0x1c0
So the patch provides a fix by tweaking the memset() call in
init_cpu_thread() to skip over the stack-guard canary.
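The idea of zeroing a structure while leaving its canary intact can be sketched as follows. This is a minimal illustration under stated assumptions: `struct cpu_thread_sketch` and `init_cpu_thread_sketch()` are hypothetical, and the layout (canary as the first field) is chosen for the example rather than taken from skiboot's real `struct cpu_thread`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-CPU structure; the real struct cpu_thread differs. */
struct cpu_thread_sketch {
	uint64_t stack_guard;   /* canary, must survive re-initialisation */
	uint32_t pir;
	uint32_t state;
	uint64_t scratch[4];
};

/* Zero every field that follows the canary, leaving the canary untouched. */
void init_cpu_thread_sketch(struct cpu_thread_sketch *t, uint32_t pir)
{
	/* Start at the first field after the canary. */
	const size_t skip = offsetof(struct cpu_thread_sketch, pir);

	assert(t != NULL);
	memset((char *)t + skip, 0, sizeof(*t) - skip);
	t->pir = pir;
}
```

Using `offsetof()` on the field that follows the canary, rather than a hard-coded byte count, keeps the skip correct if padding or field sizes change.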
- core/lock.c: ensure valid start value for lock spin duration warning
The previous fix in a8e6cc3f4 only addressed half of the problem, as
we could also get an invalid value for start, causing us to fail
in a weird way.
This was caught by the testcases.OpTestHMIHandling.HMI_TFMR_ERRORS
test in op-test-framework.
You'd get to this part of the test and get the erroneous lock
spinning warnings: ::
PATH=/usr/local/sbin:$PATH putscom -c 00000000 0x2b010a84 0003080000000000
0000080000000000
[ 790.140976993,4] WARNING: Lock has been spinning for 790275ms
[ 790.140976993,4] WARNING: Lock has been spinning for 790275ms
[ 790.140976918,4] WARNING: Lock has been spinning for 790275ms
This patch checks the validity of timebase before setting start,
and only checks the lock timeout if we got a valid start value.
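A sketch of that check, assuming a start value of 0 is used to mean "no valid timestamp was captured". The function name `lock_spin_timed_out()` and the millisecond units are assumptions made for this example; the 5 second threshold comes from the lock timeout warnings change in this release.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LOCK_TIMEOUT_MS 5000   /* warn after more than 5 seconds of spinning */

/*
 * Returns true when a spin-duration warning should be emitted.
 * start_ms == 0 stands for "the timebase was not valid when we started
 * spinning", in which case the elapsed time is meaningless and no
 * timeout check is made.
 */
bool lock_spin_timed_out(uint64_t start_ms, uint64_t now_ms)
{
	if (start_ms == 0)
		return false;

	assert(now_ms >= start_ms);   /* the clock is assumed monotonic */
	return (now_ms - start_ms) > LOCK_TIMEOUT_MS;
}
```

Without the validity check, an invalid start of 0 makes the computed duration equal to the time since boot, which matches the bogus "spinning for 790275ms" warnings printed roughly 790 seconds after boot in the log above.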
Since :ref:`skiboot-5.10`:
- core/opal: allow some re-entrant calls
This allows a small number of OPAL calls to succeed despite re-entering
the firmware, and rejects others rather than aborting.
This allows a system reset interrupt that interrupts OPAL to do something
useful: sreset other CPUs, use the console (which allows xmon to work or
stack traces to be printed), or reboot the system.
Use OPAL_INTERNAL_ERROR when rejecting, rather than OPAL_BUSY, which is
used for many other things that do not mean a serious permanent error.
- core/opal: abort in case of re-entrant OPAL call
The stack is already destroyed by the time we get here, so there
is not much point continuing.
- core/lock: Add lock timeout warnings
There are currently no timeout warnings for locks in skiboot. We assume
that the lock will eventually become free, which may not always be the
case.
This patch adds timeout warnings for locks. Any lock which spins for more
than 5 seconds will throw a warning and stacktrace for that thread. This is
useful for debugging situations where a thread hangs, waiting for a
lock to be freed.
- core/lock: Add deadlock detection
This adds simple deadlock detection. The detection looks for circular
dependencies in the lock requests. It will abort and display a stack trace
when a deadlock occurs.
The detection is enabled by DEBUG_LOCKS (enabled by default).
While the detection may have a slight performance overhead, as there are
not a huge number of locks in skiboot this overhead isn't significant.
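One way to look for a circular dependency in lock requests is to follow the chain "lock, its owner, the lock that owner is waiting on" and see whether it leads back to the requester. The sketch below illustrates that general technique only; the data layout (`struct lock_state`) and the function `would_deadlock()` are hypothetical and not taken from skiboot.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS  4
#define NR_LOCKS 4
#define NONE     (-1)

struct lock_state {
	int lock_owner[NR_LOCKS];   /* CPU holding each lock, or NONE */
	int cpu_waiting[NR_CPUS];   /* lock each CPU is spinning on, or NONE */
};

/*
 * Would CPU `cpu` deadlock by waiting on `lock`? Follow the chain
 * lock -> owner -> lock that owner waits on -> ... and report a cycle
 * if it leads back to `cpu`. The walk is bounded by NR_CPUS steps so
 * that a cycle not involving `cpu` cannot loop forever.
 */
bool would_deadlock(const struct lock_state *s, int cpu, int lock)
{
	int steps;

	assert(cpu >= 0 && cpu < NR_CPUS);

	for (steps = 0; steps < NR_CPUS && lock != NONE; steps++) {
		int owner = s->lock_owner[lock];

		if (owner == NONE)
			return false;   /* lock is free: the chain ends */
		if (owner == cpu)
			return true;    /* chain returned to the requester */
		lock = s->cpu_waiting[owner];
	}
	return false;
}
```

For example, if CPU 0 holds lock 0, CPU 1 holds lock 1 and is waiting on lock 0, then CPU 0 requesting lock 1 closes the cycle, while a third CPU requesting lock 1 does not.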
- core/hmi: report processor recovery reason from core FIR bits on P9
When an error is encountered that causes processor recovery, HMI is
generated if the recovery was successful. The reason is recorded in
the core FIR, which gets copied into the WOF.
In this case dump the WOF register and an error string into the OPAL
msglog.
A broken init setting led to HMIs reported in Linux as: ::
[ 3.591547] Harmless Hypervisor Maintenance interrupt [Recovered]
[ 3.591648] Error detail: Processor Recovery done
[ 3.591714] HMER: 2040000000000000
This patch would have been useful because it tells us exactly that
the problem is in the d-side ERAT: ::
[ 414.489690798,7] HMI: Received HMI interrupt: HMER = 0x2040000000000000
[ 414.489693339,7] HMI: [Loc: UOPWR.0000000-Node0-Proc0]: P:0 C:1 T:1: Processor recovery occurred.
[ 414.489699837,7] HMI: Core WOF = 0x0000000410000000 recovered error:
[ 414.489701543,7] HMI: LSU - SRAM (DCACHE parity, etc)
[ 414.489702341,7] HMI: LSU - ERAT multi hit
In future it will be good to unify this reporting, so Linux could
print something more useful. Until then, this gives some good data.
NPU2/NVLink2 Fixes
------------------
- npu2: Add performance tuning SCOM inits
Peer-to-peer GPU bandwidth latency testing has produced some tunable
values that improve performance. Add them to our device initialization.
File these under things that need to be cleaned up with nice #defines
for the register names and bitfields when we get time.
A few of the settings are dependent on the system's particular NVLink
topology, so introduce a helper to determine how many links go to a
single GPU.
- hw/npu2: Assign a unique LPARSHORTID per GPU
This gets used elsewhere to index items in the XTS tables.
- NPU2: dump NPU2 registers on npu2 HMI
Due to the nature of debugging npu2 issues, folk are wanting the
full list of NPU2 registers dumped when there's a problem.
- npu2: Remove DD1 support
Major changes in the NPU between DD1 and DD2 necessitated a fair bit of
revision-specific code.
Now that all our lab machines are DD2, we no longer test anything on DD1
and it's time to get rid of it.
Remove DD1-specific code and abort probe if we're running on a DD1 machine.
- npu2: Disable fast reboot
Fast reboot does not yet work right with the NPU. It's been disabled on
NVLink and OpenCAPI machines. Do the same for NVLink2.
This amounts to a port of 3e4577939bbf ("npu: Fix broken fast reset")
from the npu code to npu2.
- npu2: Use unfiltered mode in XTS tables
The XTS_PID context table is limited to 256 possible pids/contexts. To
relieve this limitation, make use of "unfiltered mode" instead.
If an entry in the XTS_BDF table has the bit for unfiltered mode set, we
can just use one context for that entire bdf/lpar, regardless of pid.
Instead of searching the XTS_PID table, the NMMU checkout request
will simply use the entry indexed by lparshort id instead.
Change opal_npu_init_context() to create these lparshort-indexed
wildcard entries (0-15) instead of allocating one for each pid. Check
that multiple calls for the same bdf all specify the same msr value.
In opal_npu_destroy_context(), continue validating the bdf argument,
ensuring that it actually maps to an lpar, but no longer remove anything
from the XTS_PID table. If/when we start supporting virtualized GPUs, we
might consider actually removing these wildcard entries by keeping a
refcount, but keep things simple for now.
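The wildcard-entry bookkeeping described for unfiltered mode can be modelled roughly as below. This is a simplified sketch, not the skiboot implementation: `struct xts_sketch`, `init_context_sketch()` and `destroy_context_sketch()` are hypothetical names, the bdf-to-lpar lookup is reduced to a range check on the lparshort id, and -1 is an arbitrary error value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define XTS_WILDCARD_ENTRIES 16   /* lparshort-indexed entries 0-15 */

struct wildcard_entry {
	bool     valid;
	uint64_t msr;
};

struct xts_sketch {
	struct wildcard_entry entry[XTS_WILDCARD_ENTRIES];
};

/* Returns the entry index on success, or -1 on a bad id or MSR mismatch. */
int init_context_sketch(struct xts_sketch *x, unsigned int lparshort,
			uint64_t msr)
{
	struct wildcard_entry *e;

	assert(x != NULL);
	if (lparshort >= XTS_WILDCARD_ENTRIES)
		return -1;

	e = &x->entry[lparshort];
	if (e->valid && e->msr != msr)
		return -1;   /* repeated call with a different MSR value */

	e->valid = true;
	e->msr = msr;
	return (int)lparshort;
}

/* Validates the id but deliberately leaves the wildcard entry in place. */
int destroy_context_sketch(const struct xts_sketch *x, unsigned int lparshort)
{
	assert(x != NULL);
	if (lparshort >= XTS_WILDCARD_ENTRIES || !x->entry[lparshort].valid)
		return -1;
	return 0;
}
```

Because one entry serves every pid of an lpar, repeated init calls are idempotent as long as the MSR matches, and destroy has nothing to remove.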
CAPI/OpenCAPI
-------------
Since :ref:`skiboot-5.11-rc1`:
- capi: Poll Err/Status register during CAPP recovery
This patch updates do_capp_recovery_scoms() to poll the CAPP
Err/Status control register, check for CAPP-Recovery to complete/fail
based on indications of BITS-1,5,9 and then proceed with the
CAPP-Recovery scoms only if recovery completed successfully. This would
prevent cases where we bring-up the PCIe link while recovery sequencer
on CAPP is still busy with casting out cache lines.
In case CAPP-Recovery didn't complete successfully an error is returned
from do_capp_recovery_scoms() asking phb4_creset() to keep the phb4
fenced and mark it as broken.
The loop that implements polling of Err/Status register will also log
an error on the PHB when it continues for more than 168ms which is the
max time to failure for CAPP-Recovery.
Since :ref:`skiboot-5.10`:
- npu2-opencapi: Add OpenCAPI OPAL API calls
Add three OPAL API calls that are required by the ocxl driver.
- OPAL_NPU_SPA_SETUP
The Shared Process Area (SPA) is a table containing one entry (a
"Process Element") per memory context which can be accessed by the
OpenCAPI device.
- OPAL_NPU_SPA_CLEAR_CACHE
The NPU keeps a cache of recently accessed memory contexts. When a
Process Element is removed from the SPA, the cache for the link must be
cleared.
- OPAL_NPU_TL_SET
The Transaction Layer specification defines several templates for
messages to be exchanged on the link. During link setup, the host and
device must negotiate what templates are supported on both sides and at
what rates those messages can be sent.
- npu2-opencapi: Train OpenCAPI links and setup devices
Scan the OpenCAPI links under the NPU, and for each link, reset the card,
set up a device, train the link and register a PHB.
Implement the necessary operations for the OpenCAPI PHB type.
For bringup, test and debug purposes, we allow an NVRAM setting,
"opencapi-link-training" that can be set to either disable link training
completely or to use the prbs31 test pattern.
To disable link training: ::
nvram -p ibm,skiboot --update-config opencapi-link-training=none
To use prbs31: ::
nvram -p ibm,skiboot --update-config opencapi-link-training=prbs31
- npu2-hw-procedures: Add support for OpenCAPI PHY link training
Unlike NVLink, which uses the pci-virt framework to fake a PCI
configuration space for NVLink devices, the OpenCAPI device model presents
us with a real configuration space handled by the device over the OpenCAPI
link.
As a result, we have to train the OpenCAPI link in skiboot before we do PCI
probing, so that config space can be accessed, rather than having link
training being triggered by the Linux driver.
- npu2-opencapi: Configure NPU for OpenCAPI
Scan the device tree for NPUs with OpenCAPI links and configure the NPU per
the initialisation sequence in the NPU OpenCAPI workbook.
- capp: Make error in capp timebase sync a non-fatal error
Presently when we encounter an error while synchronizing capp timebase
with chip-tod at the end of enable_capi_mode() we return an
error. This has two unintended consequences. First, this will prevent
disabling of fast-reboot even though CAPP is already enabled by this
point. Secondly, failure during timebase sync is a non-fatal error for
capp initialization, as CAPP/PSL can continue working after this and an
AFU will only see an error when it tries to read the timebase value
from PSL.
So this patch updates enable_capi_mode() to not return an error in
case the call to chiptod_capp_timebase_sync() fails. The function will now
just log an error and continue with the capp init sequence. This
makes the current implementation align with the one in the kernel 'cxl'
driver, which also treats PSL timebase sync errors as non-fatal
init errors.
- npu2-opencapi: Fix assert on link reset during init
We don't support resetting an opencapi link yet.
Commit fe6d86b9 ("pci: Make fast reboot creset PHBs in parallel")
tries resetting any PHB whose slot defines a 'run_sm' callback. It
raises an assert when applied to an opencapi PHB, as 'run_sm' calls
the 'freset' callback, which is not yet defined for opencapi.
Fix it for now by removing the currently useless definition of
'run_sm' on the opencapi slot. It will print a message in the skiboot
log because the PHB cannot be reset, which is correct. It will all go
away when we add support for resetting an opencapi link.
- capp: Add lid definition for P9 DD-2.2
Update fsp_lid_map to include CAPP ucode lid for phb4-chipid ==
0x202d1 that corresponds to P9 DD-2.2 chip.
- capp: Disable fast-reboot when capp is enabled
PCI
---
Since :ref:`skiboot-5.11-rc1`:
- phb4: Reset FIR/NFIR registers before PHB4 probe
The function phb4_probe_stack() resets "ETU Reset Register" to
unfreeze the PHB before it performs mmio access on the PHB. However in
case the FIR/NFIR registers are set while entering this function,
the reset of "ETU Reset Register" won't unfreeze the PHB and it will
remain fenced. This leads to failure during initial CRESET of the PHB
as mmio access is still not enabled and an error message of the form
below is logged: ::
PHB#0000[0:0]: Initializing PHB4...
PHB#0000[0:0]: Default system config: 0xffffffffffffffff
PHB#0000[0:0]: New system config : 0xffffffffffffffff
PHB#0000[0:0]: Initial PHB CRESET is 0xffffffffffffffff
PHB#0000[0:0]: Waiting for DLP PG reset to complete...
<snip>
PHB#0000[0:0]: Timeout waiting for DLP PG reset !
PHB#0000[0:0]: Initialization failed
This is especially seen during the MPIPL flow, where the SBE
quiesces and fences the PHB so that it doesn't stomp on the main
memory. However when skiboot enters phb4_probe_stack() after MPIPL,
the FIR/NFIR registers are set forcing PHB to re-enter fence after ETU
reset is done.
So to fix this issue the patch introduces new xscom writes to
phb4_probe_stack() to reset the FIR/NFIR registers before performing
ETU reset to enable mmio access to the PHB.
Since :ref:`skiboot-5.10`:
- pci: Reduce log level of error message
If a link doesn't train, we can end up with error messages like this: ::
[ 63.027261959,3] PHB#0032[8:2]: LINK: Timeout waiting for electrical link
[ 63.027265573,3] PHB#0032:00:00.0 Error -6 resetting
The first message is useful but the second message is just debug from
the core PCI code and is confusing to print to the console.
This reduces the second print to debug level so it's not seen by the
console by default.
- Revert "platforms/astbmc/slots.c: Allow comparison of bus numbers when matching slots"
This reverts commit bda7cc4d0354eb3f66629d410b2afc08c79f795f.
Ben says:
It's on purpose that we do NOT compare the bus numbers,
they are always 0 in the slot table
we do a hierarchical walk of the tree, matching only the
devfn's along the way bcs the bus numbering isn't fixed
this breaks all slot naming etc... stuff on anything using
the "skiboot" slot tables (P8 opp typically)
- core/pci-dt-slot: Fix booting with no slot map
Currently if you don't have a slot map in the device tree in
/ibm,pcie-slots, you can crash with a back trace like this: ::
CPU 0034 Backtrace:
S: 0000000031cd3370 R: 000000003001362c .backtrace+0x48
S: 0000000031cd3410 R: 0000000030019e38 ._abort+0x4c
S: 0000000031cd3490 R: 000000003002760c .exception_entry+0x180
S: 0000000031cd3670 R: 0000000000001f10 *
S: 0000000031cd3850 R: 00000000300b4f3e * cpu_features_table+0x1d9e
S: 0000000031cd38e0 R: 000000003002682c .dt_node_is_compatible+0x20
S: 0000000031cd3960 R: 0000000030030e08 .map_pci_dev_to_slot+0x16c
S: 0000000031cd3a30 R: 0000000030091054 .dt_slot_get_slot_info+0x28
S: 0000000031cd3ac0 R: 000000003001e27c .pci_scan_one+0x2ac
S: 0000000031cd3ba0 R: 000000003001e588 .pci_scan_bus+0x70
S: 0000000031cd3cb0 R: 000000003001ee74 .pci_scan_phb+0x100
S: 0000000031cd3d40 R: 0000000030017ff0 .cpu_process_jobs+0xdc
S: 0000000031cd3e00 R: 0000000030014cb0 .__secondary_cpu_entry+0x44
S: 0000000031cd3e80 R: 0000000030014d04 .secondary_cpu_entry+0x34
S: 0000000031cd3f00 R: 0000000030002770 secondary_wait+0x8c
[ 73.016947149,3] Fatal MCE at 0000000030026054 .dt_find_property+0x30
[ 73.017073254,3] CFAR : 0000000030026040
[ 73.017138048,3] SRR0 : 0000000030026054 SRR1 : 9000000000201000
[ 73.017198375,3] HSRR0: 0000000000000000 HSRR1: 0000000000000000
[ 73.017263210,3] DSISR: 00000008 DAR : 7c7b1b7848002524
[ 73.017352517,3] LR : 000000003002602c CTR : 000000003009102c
[ 73.017419778,3] CR : 20004204 XER : 20040000
[ 73.017502425,3] GPR00: 000000003002682c GPR16: 0000000000000000
[ 73.017586924,3] GPR01: 0000000031c23670 GPR17: 0000000000000000
[ 73.017643873,3] GPR02: 00000000300fd500 GPR18: 0000000000000000
[ 73.017767091,3] GPR03: fffffffffffffff8 GPR19: 0000000000000000
[ 73.017855707,3] GPR04: 00000000300b3dc6 GPR20: 0000000000000000
[ 73.017943944,3] GPR05: 0000000000000000 GPR21: 00000000300bb6d2
[ 73.018024709,3] GPR06: 0000000031c23910 GPR22: 0000000000000000
[ 73.018117716,3] GPR07: 0000000031c23930 GPR23: 0000000000000000
[ 73.018195974,3] GPR08: 0000000000000000 GPR24: 0000000000000000
[ 73.018278350,3] GPR09: 0000000000000000 GPR25: 0000000000000000
[ 73.018353795,3] GPR10: 0000000000000028 GPR26: 00000000300be6fb
[ 73.018424362,3] GPR11: 0000000000000000 GPR27: 0000000000000000
[ 73.018533159,3] GPR12: 0000000020004208 GPR28: 0000000030767d38
[ 73.018642725,3] GPR13: 0000000031c20000 GPR29: 00000000300b3dc6
[ 73.018737925,3] GPR14: 0000000000000000 GPR30: 0000000000000010
[ 73.018794428,3] GPR15: 0000000000000000 GPR31: 7c7b1b7848002514
This has been seen in the lab on a witherspoon using the device tree
entry point (ie. no HDAT).
This fixes the null pointer deref.
Bugs Fixed
----------
Since :ref:`skiboot-5.11-rc1`:
- cpufeatures: Fix setting DARN and SCV HWCAP feature bits
DARN and SCV have been assigned AT_HWCAP2 (32-63) bits: ::
#define PPC_FEATURE2_DARN 0x00200000 /* darn random number insn */
#define PPC_FEATURE2_SCV 0x00100000 /* scv syscall */
A cpufeatures-aware OS will not advertise these to userspace without
this patch.
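From userspace, testing for these features is a mask check against the AT_HWCAP2 value. The sketch below repeats the two defines quoted above; the helper `hwcap2_has()` is a hypothetical name introduced for this example.

```c
#include <assert.h>
#include <stdbool.h>

#define PPC_FEATURE2_DARN 0x00200000 /* darn random number insn */
#define PPC_FEATURE2_SCV  0x00100000 /* scv syscall */

/* True when every bit of `feature` is set in the AT_HWCAP2 value. */
bool hwcap2_has(unsigned long hwcap2, unsigned long feature)
{
	assert(feature != 0);   /* an empty mask would be a caller bug */
	return (hwcap2 & feature) == feature;
}
```

On Linux with glibc, the value to pass in can be read with `getauxval(AT_HWCAP2)` from `<sys/auxv.h>`, for example `hwcap2_has(getauxval(AT_HWCAP2), PPC_FEATURE2_DARN)`.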
- xive: disable store EOI support
Hardware has limitations which would require putting a sync after each
store EOI to make sure the MMIO operations that change the ESB state
are ordered. This is a killer for performance and the PHBs do not
support the sync. So remove the store EOI for the moment, until
hardware is improved.
Also, while we are changing the XIVE source flags, let's fix the
settings for the PHB4s, which should follow these rules:
- SHIFT_BUG for DD10
- STORE_EOI for DD20 and if enabled
- TRIGGER_PAGE for DDx0 and if not STORE_EOI
Since :ref:`skiboot-5.10`:
- xive: fix opal_xive_set_vp_info() error path
In case of error, opal_xive_set_vp_info() will return without
unlocking the xive object. This is most certainly a typo.
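This class of bug is commonly avoided by routing every return through a single exit label that drops the lock. The sketch below shows that general pattern only; the lock model and the function `set_vp_info_sketch()` are hypothetical and much simpler than the real XIVE code.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_VPS_SKETCH 4

/* Toy single-threaded lock that only tracks whether it is held. */
struct lock_sketch {
	bool held;
};

void lock_take(struct lock_sketch *l)
{
	assert(!l->held);
	l->held = true;
}

void lock_release(struct lock_sketch *l)
{
	assert(l->held);
	l->held = false;
}

struct xive_sketch {
	struct lock_sketch lock;
	unsigned long vp_flags[NR_VPS_SKETCH];
};

/* Returns 0 on success, -1 on a bad VP index; the lock is always released. */
int set_vp_info_sketch(struct xive_sketch *x, unsigned int vp,
		       unsigned long flags)
{
	int rc = 0;

	lock_take(&x->lock);
	if (vp >= NR_VPS_SKETCH) {
		rc = -1;
		goto out;   /* the error path must still release the lock */
	}
	x->vp_flags[vp] = flags;
out:
	lock_release(&x->lock);
	return rc;
}
```

An early `return` inside the locked region would leave `held` set, and the next call would trip the assertion in `lock_take()`.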
- hw/imc: don't access homer memory if it was not initialised
This can happen under mambo, at least.
- nvram: run nvram_validate() after nvram_reformat()
nvram_reformat() sets nvram_valid = true, but it does not set
skiboot_part_hdr. Call nvram_validate() instead, which sets
everything up properly.
- dts: Zero struct to avoid using uninitialised value
- hw/imc: Don't dereference possible NULL
- libstb/create-container: munmap() signature file address
- npu2-opencapi: Fix memory leak
- npu2: Fix possible NULL dereference
- occ-sensors: Remove NULL checks after dereference
- core/ipmi-opal: Add interrupt-parent property for ipmi node on P9 and above.
dtc complains with the warning below on newer 4.2+ kernels. ::
dts: Warning (interrupts_property): Missing interrupt-parent for /ibm,opal/ipmi
This fix adds interrupt-parent property under /ibm,opal/ipmi DT node on P9
and above, which allows ipmi-opal to properly use the OPAL irqchip.
Other fixes and improvements
----------------------------
- core/cpu: discover stack region size before initialising memory regions
Stack allocation first allocates a memory region sized to hold stacks
for all possible CPUs up to the maximum PIR of the architecture, zeros
the region, then initialises all stacks. Max PIR is 32768 on POWER9,
which is 512MB for stacks.
The stack region is then shrunk after CPUs are discovered, but this is
a bit of a hack, and it leaves a hole in the memory allocation regions
as it's done after mem regions are initialised. ::
0x000000000000..00002fffffff : ibm,os-reserve - OS
0x000030000000..0000303fffff : ibm,firmware-code - OPAL
0x000030400000..000030ffffff : ibm,firmware-heap - OPAL
0x000031000000..000031bfffff : ibm,firmware-data - OPAL
0x000031c00000..000031c0ffff : ibm,firmware-stacks - OPAL
*** gap ***
0x000051c00000..000051d01fff : ibm,firmware-allocs-memory@0 - OPAL
0x000051d02000..00007fffffff : ibm,firmware-allocs-memory@0 - OS
0x000080000000..000080b3cdff : initramfs - OPAL
0x000080b3ce00..000080b7cdff : ibm,fake-nvram - OPAL
0x000080b7ce00..0000ffffffff : ibm,firmware-allocs-memory@0 - OS
This change moves zeroing into the per-cpu stack setup. The boot CPU
stack is set up based on the current PIR. Then the size of the stack
region is set, by discovering the maximum PIR of the system from the
device tree, before mem regions are initialised.
This results in all memory being accounted within memory regions,
and less memory fragmentation of OPAL allocations.
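The sizing arithmetic above can be sketched in C. This is a minimal illustration, not skiboot's code: ``STACK_SIZE`` and ``stack_region_size()`` are hypothetical names, and the 16KB per-CPU stack is inferred from the figures quoted above (512MB spread over 32768 possible PIRs).

```c
#include <stdint.h>

/* Assumed per-CPU stack size: 512MB / 32768 PIRs = 16KB (inferred, not quoted). */
#define STACK_SIZE (16u * 1024u)

/*
 * Size of a stack region holding one stack for every PIR from 0 up to
 * and including max_pir. Hypothetical helper, for illustration only.
 */
static uint64_t stack_region_size(uint32_t max_pir)
{
	return ((uint64_t)max_pir + 1) * STACK_SIZE;
}
```

With the architectural maximum (PIRs 0 to 32767) this gives 512MB, the distance between the start of ibm,firmware-stacks and the next region in the map above. A discovered maximum PIR of 3 would give 64KB, which is the size of the ibm,firmware-stacks region shown there.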
- Make gard display show that a record is cleared
When clearing gard records, Hostboot only modifies the record_id
portion to be 0xFFFFFFFF. The remainder of the entry remains.
Without this change, users cannot easily tell that
the record they are looking at is no longer valid.
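A check for a cleared record might look like the sketch below. The struct and function names are hypothetical and heavily trimmed; the only detail taken from the text above is that a cleared record has its record_id set to 0xFFFFFFFF. The field is assumed to have been converted to host byte order already.

```c
#include <stdbool.h>
#include <stdint.h>

/* Value Hostboot writes to record_id when clearing a record (per the text above). */
#define GARD_RECORD_ID_CLEARED 0xFFFFFFFFu

/* Hypothetical, trimmed-down record: only the field the check needs. */
struct example_gard_record {
	uint32_t record_id; /* assumed to be in host byte order */
};

/* True if the record was cleared; the rest of the entry may still hold stale data. */
static bool example_gard_is_cleared(const struct example_gard_record *rec)
{
	return rec->record_id == GARD_RECORD_ID_CLEARED;
}
```

A display routine could use such a check to mark the entry as cleared instead of printing the stale fields as if they were still valid.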
- Reserve OPAL API number for opal_handle_hmi2 function.
- dts: spl_wakeup: Remove all workarounds in the spl wakeup logic
We coded a few workarounds into the special wakeup logic to handle
buggy firmware. Now that the firmware is fixed, remove them, as they break the
special wakeup protocol. As per the spec, we should not de-assert
before the assert is complete, so follow this protocol.
- build: use thin archives rather than incremental linking
This changes the build system to use thin archives rather than
incremental linking for built-in.o, similar to a recent change to Linux.
built-in.o is renamed to built-in.a, and is created as a thin archive
with no index, for speed and size. All built-in.a are aggregated into
a skiboot.tmp.a which is a thin archive built with an index, making it
suitable for linking. This is input into the final link.
The advantages of build size and linker code placement flexibility are
not as great with skiboot as with a bigger project like Linux, but it's a
conceptually better way to build, and is more compatible with link
time optimisation in toolchains, which might be interesting for skiboot,
particularly for size reductions.
Size of build tree before this patch is 34.4MB, afterwards 23.1MB.
- core/init: Assert when kernel not found
If the kernel doesn't load out of flash or there is nothing at
KERNEL_LOAD_BASE, we end up with an esoteric message as we try to
branch out of skiboot into nothing ::
[ 0.007197688,3] INIT: ELF header not found. Assuming raw binary.
[ 0.014035267,5] INIT: Starting kernel at 0x0, fdt at 0x3044ad90 13029
[ 0.014042254,3] ***********************************************
[ 0.014069947,3] Fatal Exception 0xe40 at 0000000000000000
[ 0.014085574,3] CFAR : 00000000300051c4
[ 0.014090118,3] SRR0 : 0000000000000000 SRR1 : 0000000000000000
[ 0.014096243,3] HSRR0: 0000000000000000 HSRR1: 9000000000001000
[ 0.014102546,3] DSISR: 00000000 DAR : 0000000000000000
[ 0.014108538,3] LR : 00000000300144c8 CTR : 0000000000000000
[ 0.014114756,3] CR : 40002202 XER : 00000000
[ 0.014120301,3] GPR00: 000000003001447c GPR16: 0000000000000000
This improves the message and asserts in this case: ::
[ 0.014042685,5] INIT: Starting kernel at 0x0, fdt at 0x3044ad90 13049 bytes)
[ 0.014049556,0] FATAL: Kernel is zeros, can't execute!
[ 0.014054237,0] Assert fail: core/init.c:566:0
[ 0.014060472,0] Aborting!
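The kind of check behind the "Kernel is zeros" message can be sketched as follows. This is an assumption about the general approach, not the code in core/init.c: the function name is hypothetical, and how many bytes skiboot actually inspects is not stated above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Returns true if every byte in [start, start + len) is zero.
 * Hypothetical helper: a caller would pass the kernel entry point
 * and some small length chosen by the firmware.
 */
static bool example_region_is_zero(const uint8_t *start, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		if (start[i] != 0)
			return false;
	}
	return true;
}
```

When such a check returns true, the caller can log a fatal message and assert instead of branching into empty memory.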
- core: Fix 'opal-runtime-size' property
We are populating 'opal-runtime-size' before calculating the actual stack size.
Hence we end up with the wrong runtime size (e.g. on P9 it shows ~540MB while
the actual size is around 40MB). Note that only the device tree property shows
the wrong value; reserved-memory reflects the correct size.
init_all_cpus() calculates and updates the actual stack size. Hence move this
function call before add_opal_node().
- mambo: Add fw-feature flags for security related settings
Newer firmwares report some feature flags related to security
settings via HDAT. On real hardware skiboot translates these into
device tree properties. For testing purposes just create the
properties manually in the tcl.
These values don't exactly match any actual chip revision, but the
code should not rely on any exact set of values anyway. We just define
the most interesting flags, that if toggled to "disable" will change
Linux behaviour. You can see the actual values in the hostboot source
in src/usr/hdat/hdatiplparms.H.
Also add an environment variable for easily toggling the top-level
"security on" setting.
- direct-controls: mambo fix for multiple chips
- libflash/blocklevel: Correct miscalculation in blocklevel_smart_erase()
If blocklevel_smart_erase() detects that the smart erase fits entirely in
one erase block, it has an early bail path. In this path it miscalculates
where in the buffer the backend needs to read from to perform the final
write.
- libstb/secureboot: Fix logging of secure verify messages.
Currently we are logging secure verify/enforce messages at the PR_EMERG
level even when secureboot mode is not enabled. So reduce the
log level to PR_ERR when secureboot mode is OFF.
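The level selection described above can be sketched as below. The enum and function are hypothetical stand-ins; the numeric values are assumed to mirror skiboot's PR_EMERG and PR_ERR constants and should be checked against the real headers.

```c
#include <stdbool.h>

/* Assumed to mirror skiboot's PR_EMERG / PR_ERR values; illustrative only. */
enum example_log_level {
	EXAMPLE_PR_EMERG = 0,
	EXAMPLE_PR_ERR = 3,
};

/* Pick the log level for secure verify/enforce messages. */
static int example_secure_log_level(bool secure_mode)
{
	return secure_mode ? EXAMPLE_PR_EMERG : EXAMPLE_PR_ERR;
}
```

Verification failures stay at the most severe level when secureboot is enforced, and are logged as ordinary errors otherwise.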
Testing / Code coverage improvements
------------------------------------
Improvements in gcov support include support for newer GCCs as well
as easily exporting the area of memory you need to dump to feed to
`extract-gcov`.
- cpu_idle_job: relax a bit
This *dramatically* improves kernel boot time with GCOV builds,
from ~3 minutes between loading the kernel and switching the HILE
bit down to around 10 seconds.
- gcov: Another GCC, another gcov tweak
- Keep constructors with priorities
Fixes GCOV builds with gcc7, which uses this.
- gcov: Add gcov data struct to sysfs
Extracting the skiboot gcov data is currently a tedious process which
involves taking a mem dump of skiboot and searching for the gcov_info
struct.
This patch adds the gcov struct to sysfs under /opal/exports, allowing the
data to be copied directly into userspace and processed.
|