实验地址

Part A: Multiprocessor Support and Cooperative Multitasking

主要实现这些功能:

  • JOS 支持多 CPU
  • 实现允许普通进程创建额外环境
  • 实现 cooperative round-robin scheduling,允许当前进程让出 CPU 资源的情况下内核切换进程

Multiprocessor Support

JOS 支持 symmetric multiprocessing(SMP) 这种 CPU 平等访问系统资源的多处理模型,在启动阶段 CPU 会分成两类:

  • bootstrap processor(BSP): 初始化系统和启动系统
  • application processors(APs): 系统正常运行时被硬件与 BIOS 的 BSP 激活,JOS 其余代码运行在 BSP 上

SMP 内每个 CPU 都有对应的 local APIC(LAPIC)单元用于传递中断,LAPIC 提供以下功能:

  1. 读取 LAPIC 标识值(APIC ID)获取当前运行的 CPU
  2. 从 BSP 到 APs 发送 STARTUP interprocessor interrput(IPI)
  3. LAPIC 内在计时器去触发时钟中断,用于支持抢占式多任务

APIC 详细资料可参考 Intel 64 and IA-32 Architechtures Software Developer’s Manual 第 8 章

进程通过 memory-mapped I/O(MMIO)访问 LAPIC,LAPIC 起始地址为 0xFE000000 ,而 JOS 通过 MMIOBASE 进行访问

Exercisxe 1

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
//
// Reserve size bytes in the MMIO region and map [pa,pa+size) at this
// location.  Return the base of the reserved region.  size does *not*
// have to be multiple of PGSIZE.
//
void *
mmio_map_region(physaddr_t pa, size_t size)
{
  // Where to start the next region.  Initially, this is the
  // beginning of the MMIO region.  Because this is static, its
  // value will be preserved between calls to mmio_map_region
  // (just like nextfree in boot_alloc).
  static uintptr_t base = MMIOBASE;

  if (base + size > MMIOLIM) {
    panic("mmio_map_region not implemented");
  }

  // Reserve size bytes of virtual memory starting at base and
  // map physical pages [pa,pa+size) to virtual addresses
  // [base,base+size).  Since this is device memory and not
  // regular DRAM, you'll have to tell the CPU that it isn't
  // safe to cache access to this memory.  Luckily, the page
  // tables provide bits for this purpose; simply create the
  // mapping with PTE_PCD|PTE_PWT (cache-disable and
  // write-through) in addition to PTE_W.  (If you're interested
  // in more details on this, see section 10.5 of IA32 volume
  // 3A.)
  //
  // Be sure to round size up to a multiple of PGSIZE and to
  // handle if this reservation would overflow MMIOLIM (it's
  // okay to simply panic if this happens).
  //
  // Hint: The staff solution uses boot_map_region.
  //
  // Your code here:

  size = ROUNDUP((uint32_t)size, PGSIZE);
  pa = ROUNDDOWN((uint32_t)pa, PGSIZE);

  boot_map_region(kern_pgdir, base, size, pa, PTE_PCD|PTE_PWT|PTE_W);

  base += size;

  return (void *)(base - size);
}

Application Processor Bootstrap

BSP 在启动 APs 前会收集各种信息, kern/mpconfig.c:mp_init() 通过读取 BIOS 内存块的 MP configuration table 获取信息

kern/init.c:boot_aps() 驱动 AP 启动过程,APs 在实模式下工作, boot_aps() 主要将 AP entry codec 复制到实模式下的一个内存地址中;将 AP entry code 复制到 0x7000(MPENTRY_PADDR)

boot_aps() 发送 STARTUP 中断信号到 APs 对应的 LAPIC 单元用于激活,并初始化 CS:IP (即 MPENTRY_PADDR ),使 AP 开启保护模式,分页机制并调用 mp_main()boot_aps() 等待 AP 发送 Cpuinfo 结构体中的 CPU_STATED flag,然后启动剩下的 CPU

Exercise 2

boot_aps()mp_main() 了解 APs 的启动流程,并修改 page_init()MPENTRY_PADDRfree_list 删除

CPU 在实模式下只能访问低 640K 的内容,而 MPENTRY_PADDR 位于低 640K 处,因此只要在低处不添加即可

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
void
page_init(void)
{
  // ...
  for (i = 1; i < npages_basemem; ++i) {
    // 当前页地址与 MPENTRY_PADDR 地址一致
    if (page2pa(&pages[i]) == MPENTRY_PADDR) {
      continue;
    }
    pages[i].pp_ref = 0;
    pages[i].pp_link = page_free_list;
    page_free_list = &pages[i];
  }

  // ...
}

Q1

Compare kern/mpentry.S side by side with boot/boot.S. Bearing in mind that kern/mpentry.S is compiled and linked to run above KERNBASE just like everything else in the kernel, what is the purpose of macro MPBOOTPHYS? Why is it necessary in kern/mpentry.S but not in boot/boot.S? In other words, what could go wrong if it were omitted in kern/mpentry.S?

Hint: recall the differences between the link address and the load address that we have discussed in Lab 1.

MPBOOTPHYS() 用于计算绝地地址

由于 boot/boot.S 直接在实模式下工作,VMA 为 0x7c00 ,地址都位于低位所以不用转化;kern/mpentry.S 的 VMA 为 0xf0000000 地址都位于高位所以要转换

Per-CPU State and Initialization

CpuInfo 用于记录每个 CPU 的状态

1
2
3
4
5
6
7
// Per-CPU state
struct CpuInfo {
  uint8_t cpu_id;                 // Local APIC ID; index into cpus[] below
  volatile unsigned cpu_status;   // The status of the CPU
  struct Env *cpu_env;            // The currently-running environment.
  struct Taskstate cpu_ts;        // Used by x86 to find stack for interrupt
};

CPU 状态描述:

  • kernel stack: percpu_kstacks[NCPU][KSTKSIZE] 用于隔离每个 CPU 的 kernel stack,大小为 KSTKSIZE ,kernel stack 呈线性分布:CPU0 从 KSTACKTOP 开始,CPU1 从 CPU0 后面 KSTKSIZE 开始
  • TSS 和 TSS 描述符: Task state segment(TSS) 用于分辨 CPU 对应的 kernel stack ,TSS 存放在 cpus[i].cpu_ts ,TSS 描述符 存放在 gdt[(GD_TSS0 >> 3) + i]
  • 当前环境指针: curenv 用于指向当前 CPU 的当前运行进程 ( cpu[cpunum()].cpu_env / thiscpu->cpuy_env )
  • 系统寄存器: lcr3(), ltr() 等寄存器状态

Exercise 3

根据注释修改 mem_init_mp() 完成 kernel stack 的物理映射

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
// Modify mappings in kern_pgdir to support SMP
//   - Map the per-CPU stacks in the region [KSTACKTOP-PTSIZE, KSTACKTOP)
//
static void
mem_init_mp(void)
{
  // Map per-CPU stacks starting at KSTACKTOP, for up to 'NCPU' CPUs.
  //
  // For CPU i, use the physical memory that 'percpu_kstacks[i]' refers
  // to as its kernel stack. CPU i's kernel stack grows down from virtual
  // address kstacktop_i = KSTACKTOP - i * (KSTKSIZE + KSTKGAP), and is
  // divided into two pieces, just like the single stack you set up in
  // mem_init:
  //     * [kstacktop_i - KSTKSIZE, kstacktop_i)
  //          -- backed by physical memory
  //     * [kstacktop_i - (KSTKSIZE + KSTKGAP), kstacktop_i - KSTKSIZE)
  //          -- not backed; so if the kernel overflows its stack,
  //             it will fault rather than overwrite another CPU's stack.
  //             Known as a "guard page".
  //     Permissions: kernel RW, user NONE
  //
  // LAB 4: Your code here:
  size_t i;
  uint32_t kstacktop_i = KSTACKTOP;

  for (i = 0; i < NCPU; ++i) {
    // stack
    boot_map_region(kern_pgdir, kstacktop_i - KSTKSIZE, KSTKSIZE,
        PADDR(percpu_kstacks[i]), PTE_W | PTE_P);
    kstacktop_i -= KSTKSIZE;
    // guard page
    kstacktop_i -= KSTKGAP;
  }
}

Exercise 4

修改 trap_init_percpu() 对每个 CPU 进行 TSS 和 TSS 描述符初始化

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
// Initialize and load the per-CPU TSS and IDT
void
trap_init_percpu(void)
{
  // The example code here sets up the Task State Segment (TSS) and
  // the TSS descriptor for CPU 0. But it is incorrect if we are
  // running on other CPUs because each CPU has its own kernel stack.
  // Fix the code so that it works for all CPUs.
  //
  // Hints:
  //   - The macro "thiscpu" always refers to the current CPU's
  //     struct CpuInfo;
  //   - The ID of the current CPU is given by cpunum() or
  //     thiscpu->cpu_id;
  //   - Use "thiscpu->cpu_ts" as the TSS for the current CPU,
  //     rather than the global "ts" variable;
  //   - Use gdt[(GD_TSS0 >> 3) + i] for CPU i's TSS descriptor;
  //   - You mapped the per-CPU kernel stacks in mem_init_mp()
  //   - Initialize cpu_ts.ts_iomb to prevent unauthorized environments
  //     from doing IO (0 is not the correct value!)
  //
  // ltr sets a 'busy' flag in the TSS selector, so if you
  // accidentally load the same TSS on more than one CPU, you'll
  // get a triple fault.  If you set up an individual CPU's TSS
  // wrong, you may not get a fault until you try to return from
  // user space on that CPU.
  //
  // LAB 4: Your code here:

  // Setup a TSS so that we get the right stack
  // when we trap to the kernel.
  thiscpu->cpu_ts.ts_esp0 = (uintptr_t)(percpu_kstacks[thiscpu->cpu_id] + KSTKSIZE);
  thiscpu->cpu_ts.ts_ss0 = GD_KD;
  thiscpu->cpu_ts.ts_iomb = sizeof(struct Taskstate);

  // Initialize the TSS slot of the gdt.
  gdt[(GD_TSS0 >> 3) + thiscpu->cpu_id] = SEG16(STS_T32A,
                                                (uint32_t) (&(thiscpu->cpu_ts)),
                                                sizeof(struct Taskstate) - 1,
                                                0);
  gdt[(GD_TSS0 >> 3) + thiscpu->cpu_id].sd_s = 0;

  // Load the TSS selector (like other segment selectors, the
  // bottom three bits are special; we leave them 0)
  ltr(GD_TSS0 + (thiscpu->cpu_id << 3));

  // Load the IDT
  lidt(&idt_pd);
}

Locking

初始化 AP 后已经多个 CPU 运行内核代码,而为了处理多 CPU 同时运行下的竞争条件需要设计一个 big kernel lock 全局锁用于当且仅当一个任何何用户态进程下的 CPU 切换到内核态

big kernel lock 定义如下

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
struct spinlock {
  unsigned locked;       // Is the lock held?

#ifdef DEBUG_SPINLOCK
  // For debugging:
  char *name;            // Name of lock.
  struct CpuInfo *cpu;   // The CPU holding the lock.
  uintptr_t pcs[10];     // The call stack (an array of program counters)
  // that locked the lock.
#endif
};

extern struct spinlock kernel_lock;

Exercise 5

在合适的位置调用 lock_kernel()unlock_kernel() 调用锁

i386_init() 在 BSP 唤醒 CPU 前获取锁

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
void
i386_init(void)
{
  // ...
  // Lab 4 multitasking initialization functions
  pic_init();

  // Acquire the big kernel lock before waking up APs
  // Your code here:
  lock_kernel();

  // Starting non-boot CPUs
  boot_aps();

  // ...
}

mp_main() 在初始化 AP 后获取锁并调用 sched_yield() 在 AP 上运行对应进程

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
void
mp_main(void)
{
  // ...
  // Now that we have finished some basic setup, call sched_yield()
  // to start running processes on this CPU.  But make sure that
  // only one CPU can enter the scheduler at a time!
  //
  // Your code here:
  lock_kernel();
  sched_yield();
}

trap() 从用户态中转换获取锁

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
void
trap(struct Trapframe *tf)
{
  // ...
  if ((tf->tf_cs & 3) == 3) {
    // Trapped from user mode.
    // Acquire the big kernel lock before doing any
    // serious kernel work.
    // LAB 4: Your code here.
    assert(curenv);
    lock_kernel();
  }
}

env_run() 在切换回用户态前释放锁

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
void
env_run(struct Env *e)
{
  // ...
  lcr3(PADDR(curenv->env_pgdir));

  unlock_kernel();

  env_pop_tf(&curenv->env_tf);

  // ...
}

Q2

It seems that using the big kernel lock guarantees that only one CPU can run the kernel code at a time. Why do we still need separate kernel stacks for each CPU? Describe a scenario in which using a shared kernel stack will go wrong, even with the protection of the big kernel lock.

CPU0 处于正在进入内核态并保存栈信息,CPU1 发生中断也能进入内核态,内核未上锁会造成栈信息混乱

Round-Robin Scheduling

使用循环机制(Round-Robin)使内核中对进程进行切换

  • sched_yield(): 用于选择新的进程,遍历 envs[] 查找 ENV_RUNABLE 进程并调用 env_run() 切换
  • sys_yield(): 能让进程在用户态通知内核,主动让出 CPU 给另外的进程

Exercise 6

实现调度算法

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
// Choose a user environment to run and run it.
void
sched_yield(void)
{
  struct Env *idle;

  // Implement simple round-robin scheduling.
  //
  // Search through 'envs' for an ENV_RUNNABLE environment in
  // circular fashion starting just after the env this CPU was
  // last running.  Switch to the first such environment found.
  //
  // If no envs are runnable, but the environment previously
  // running on this CPU is still ENV_RUNNING, it's okay to
  // choose that environment.
  //
  // Never choose an environment that's currently running on
  // another CPU (env_status == ENV_RUNNING). If there are
  // no runnable environments, simply drop through to the code
  // below to halt the cpu.

  // LAB 4: Your code here.

  int curenv_id = 0;
  int i;

  if (curenv) {
    curenv_id = ENVX(curenv->env_id) + 1;
  }

  for (i = 0; i < NENV; ++i) {
    int mod = (i + curenv_id) % NENV;
    if (envs[mod].env_status == ENV_RUNNABLE) {
      env_run(&envs[mod]);
      break;
    }
  }

  if (curenv && curenv->env_status == ENV_RUNNING) {
    env_run(curenv);
  }

  // sched_halt never returns
  sched_halt();
}

为了测试相应代码需要添加 sys_yield() 系统调用,同时添加进程

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
// Dispatches to the correct kernel function, passing the arguments.
int32_t
syscall(uint32_t syscallno, uint32_t a1, uint32_t a2, uint32_t a3, uint32_t a4, uint32_t a5)
{
  // ...
  switch (syscallno) {
    // ...
    case SYS_yield:
      sys_yield();
      break;
    // ...
  }
  // ...
}
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
void
i386_init(void)
{
  // ...
#if defined(TEST)
  // Don't touch -- used by grading script!
  ENV_CREATE(TEST, ENV_TYPE_USER);
#else
  // Touch all you want.
  ENV_CREATE(user_yield, ENV_TYPE_USER);
  ENV_CREATE(user_yield, ENV_TYPE_USER);
  ENV_CREATE(user_yield, ENV_TYPE_USER);
  ENV_CREATE(user_yield, ENV_TYPE_USER);
#endif // TEST*

  //...
}

运行结果

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
6828 decimal is 15254 octal!
Physical memory: 131072K available, base = 640K, extended = 130432K
check_page_free_list() succeeded!
check_page_alloc() succeeded!
check_page() succeeded!
check_kern_pgdir() succeeded!
check_page_free_list() succeeded!
check_page_installed_pgdir() succeeded!
SMP: CPU 0 found 2 CPU(s)
enabled interrupts: 1 2
SMP: CPU 1 starting
[00000000] new env 00001000
[00000000] new env 00001001
[00000000] new env 00001002
[00000000] new env 00001003
Hello, I am environment 00001000.
Hello, I am environment 00001001.
Hello, I am environment 00001002.
Hello, I am environment 00001003.
Back in environment 00001000, iteration 0.
Back in environment 00001001, iteration 0.
Back in environment 00001002, iteration 0.
Back in environment 00001003, iteration 0.
Back in environment 00001000, iteration 1.
Back in environment 00001001, iteration 1.
Back in environment 00001002, iteration 1.
Back in environment 00001003, iteration 1.
Back in environment 00001000, iteration 2.
Back in environment 00001001, iteration 2.
Back in environment 00001002, iteration 2.
Back in environment 00001003, iteration 2.
Back in environment 00001000, iteration 3.
Back in environment 00001001, iteration 3.
Back in environment 00001002, iteration 3.
Back in environment 00001003, iteration 3.
Back in environment 00001000, iteration 4.
Back in environment 00001001, iteration 4.
All done in environment 00001000.
All done in environment 00001001.
[00001000] exiting gracefully
[00001000] free env 00001000
[00001001] exiting gracefully
[00001001] free env 00001001
Back in environment 00001002, iteration 4.
Back in environment 00001003, iteration 4.
All done in environment 00001002.
All done in environment 00001003.
[00001002] exiting gracefully
[00001002] free env 00001002
[00001003] exiting gracefully
[00001003] free env 00001003
No runnable environments in the system!
Welcome to the JOS kernel monitor!
Type 'help' for a list of commands.
K>

Q3

In your implementation of env_run() you should have called lcr3(). Before and after the call to lcr3(), your code makes references (at least it should) to the variable e, the argument to env_run. Upon loading the %cr3 register, the addressing context used by the MMU is instantly changed. But a virtual address (namely e) has meaning relative to a given address context–the address context specifies the physical address to which the virtual address maps. Why can the pointer e be dereferenced both before and after the addressing switch?

e 在创建时会从 kern_pgdir 复制到进程页表中

Q4

Whenever the kernel switches from one environment to another, it must ensure the old environment’s registers are saved so they can be restored properly later. Why? Where does this happen?

trap() 处将 tf 保存到 env_tf

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
void
trap(struct Trapframe *tf)
{
  // ...
  if ((tf->tf_cs & 3) == 3) {
    // ...

    // Copy trap frame (which is currently on the stack)
    // into 'curenv->env_tf', so that running the environment
    // will restart at the trap point.
    curenv->env_tf = *tf;
    // The trapframe on the stack should be ignored from here on.
    tf = &curenv->env_tf;
  }

  // Record that tf is the last real trapframe so
  // print_trapframe can print some additional information.
  last_tf = tf;

  // Dispatch based on what type of trap occurred
  trap_dispatch(tf);

  // If we made it to this point, then no other environment was
  // scheduled, so we should return to the current environment
  // if doing so makes sense.
  if (curenv && curenv->env_status == ENV_RUNNING)
    env_run(curenv);
  else
    sched_yield();
}

System Calls for Environment Creation

通过下面的系统调用实现 fork()

  • sys_exofork(): 创建一个不可运行的且与父环境寄存器相同的进程
  • sys_env_set_status(): 设置进程状态,用于进程中寄存器与内存完全初始化后进行准备标记
  • sys_page_alloc(): 进程中占用物理内存并与对应虚拟内存映射
  • sys_page_map(): 从一个进程中复制页到另一个进程中
  • sys_page_unmap(): 取消页的映射

Exercise 7

  • sys_exofork()

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    
    // Allocate a new environment.
    // Returns envid of new environment, or < 0 on error.  Errors are:
    //  -E_NO_FREE_ENV if no free environment is available.
    //  -E_NO_MEM on memory exhaustion.
    static envid_t
    sys_exofork(void)
    {
      // Create the new environment with env_alloc(), from kern/env.c.
      // It should be left as env_alloc created it, except that
      // status is set to ENV_NOT_RUNNABLE, and the register set is copied
      // from the current environment -- but tweaked so sys_exofork
      // will appear to return 0.
    
      // LAB 4: Your code here.
    
      struct Env *e = NULL;
      int ret = 0;
      if ((ret = env_alloc(&e, curenv->env_id)) < 0) {
        return ret;
      }
    
      e->env_status = ENV_NOT_RUNNABLE;
      e->env_tf = curenv->env_tf;
    
      // Child environment returns zero
      e->env_tf.tf_regs.reg_eax = 0;
    
      return e->env_id;
    }
  • sys_env_set_status()

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    
    // Set envid's env_status to status, which must be ENV_RUNNABLE
    // or ENV_NOT_RUNNABLE.
    //
    // Returns 0 on success, < 0 on error.  Errors are:
    //  -E_BAD_ENV if environment envid doesn't currently exist,
    //      or the caller doesn't have permission to change envid.
    //  -E_INVAL if status is not a valid status for an environment.
    static int
    sys_env_set_status(envid_t envid, int status)
    {
      // Hint: Use the 'envid2env' function from kern/env.c to translate an
      // envid to a struct Env.
      // You should set envid2env's third argument to 1, which will
      // check whether the current environment has permission to set
      // envid's status.
    
      // LAB 4: Your code here.
      if ((status != ENV_RUNNABLE) ||
          (status != ENV_NOT_RUNNABLE)) {
        return -E_INVAL;
      }
    
      struct Env *e = NULL;
      int ret = 0;
    
      if ((ret = envid2env(envid, &e, 1)) < 0) {
        return -E_BAD_ENV;
      }
    
      e->env_status = status;
    
      return 0;
    }
  • sys_page_alloc()

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    
    // Allocate a page of memory and map it at 'va' with permission
    // 'perm' in the address space of 'envid'.
    // The page's contents are set to 0.
    // If a page is already mapped at 'va', that page is unmapped as a
    // side effect.
    //
    // perm -- PTE_U | PTE_P must be set, PTE_AVAIL | PTE_W may or may not be set,
    //         but no other bits may be set.  See PTE_SYSCALL in inc/mmu.h.
    //
    // Return 0 on success, < 0 on error.  Errors are:
    //  -E_BAD_ENV if environment envid doesn't currently exist,
    //      or the caller doesn't have permission to change envid.
    //  -E_INVAL if va >= UTOP, or va is not page-aligned.
    //  -E_INVAL if perm is inappropriate (see above).
    //  -E_NO_MEM if there's no memory to allocate the new page,
    //      or to allocate any necessary page tables.
    static int
    sys_page_alloc(envid_t envid, void *va, int perm)
    {
      // Hint: This function is a wrapper around page_alloc() and
      //   page_insert() from kern/pmap.c.
      //   Most of the new code you write should be to check the
      //   parameters for correctness.
      //   If page_insert() fails, remember to free the page you
      //   allocated!
    
      // LAB 4: Your code here.
      int ret;
      struct Env *e = NULL;
      struct PageInfo *pg = NULL;
    
      if ((ret = envid2env(envid, &e, 1)) < 0) {
        return -E_BAD_ENV;
      }
    
      if (((uintptr_t)va >= UTOP) ||
          (ROUNDDOWN((uintptr_t)va, PGSIZE) != (uintptr_t)va)) {
        return -E_INVAL;
      }
    
      if (perm & ~(PTE_U | PTE_P | PTE_AVAIL | PTE_W)) {
        return -E_INVAL;
      }
    
      if ((ret = page_insert(e->env_pgdir, pg, (void *)va, perm)) < 0) {
        page_free(pg);
        return ret;
      }
    
      return 0;
    }
  • sys_page_map()

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    54
    55
    56
    57
    58
    59
    60
    
    // Map the page of memory at 'srcva' in srcenvid's address space
    // at 'dstva' in dstenvid's address space with permission 'perm'.
    // Perm has the same restrictions as in sys_page_alloc, except
    // that it also must not grant write access to a read-only
    // page.
    //
    // Return 0 on success, < 0 on error.  Errors are:
    //  -E_BAD_ENV if srcenvid and/or dstenvid doesn't currently exist,
    //      or the caller doesn't have permission to change one of them.
    //  -E_INVAL if srcva >= UTOP or srcva is not page-aligned,
    //      or dstva >= UTOP or dstva is not page-aligned.
    //  -E_INVAL is srcva is not mapped in srcenvid's address space.
    //  -E_INVAL if perm is inappropriate (see sys_page_alloc).
    //  -E_INVAL if (perm & PTE_W), but srcva is read-only in srcenvid's
    //      address space.
    //  -E_NO_MEM if there's no memory to allocate any necessary page tables.
    static int
    sys_page_map(envid_t srcenvid, void *srcva,
           envid_t dstenvid, void *dstva, int perm)
    {
      // Hint: This function is a wrapper around page_lookup() and
      //   page_insert() from kern/pmap.c.
      //   Again, most of the new code you write should be to check the
      //   parameters for correctness.
      //   Use the third argument to page_lookup() to
      //   check the current permissions on the page.
    
      // LAB 4: Your code here.
      struct Env *srce, *dste;
      int ret = 0;
    
      if (((ret = envid2env(srcenvid, &srce, 1)) < 0) ||
          ((ret = envid2env(dstenvid, &dste, 1)) < 0)) {
        return -E_BAD_ENV;
      }
    
      if (((uintptr_t)srcva >= UTOP) ||
          (ROUNDDOWN((uintptr_t)srcva, PGSIZE) != (uintptr_t)srcva) ||
          ((uintptr_t)dstva >= UTOP) ||
          (ROUNDDOWN((uintptr_t)dstva, PGSIZE) != (uintptr_t)dstva)) {
        return -E_INVAL;
      }
    
            struct PageInfo *pg = NULL;
            pte_t *pte = NULL;
    
      if (!(pg = page_lookup(srce->env_pgdir, (void *)srcva, &pte))) {
        return -E_INVAL;
      }
    
      if (perm & ~(PTE_P | PTE_U | PTE_AVAIL | PTE_W)) {
        return -E_INVAL;
      }
    
      if ((perm & PTE_W) && !(*pte & PTE_W)) {
        return -E_INVAL;
      }
    
      return page_insert(dste->env_pgdir, pg, (void *)dstva, perm);
    }
  • sys_page_unmap()

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    
    // Unmap the page of memory at 'va' in the address space of 'envid'.
    // If no page is mapped, the function silently succeeds.
    //
    // Return 0 on success, < 0 on error.  Errors are:
    //  -E_BAD_ENV if environment envid doesn't currently exist,
    //      or the caller doesn't have permission to change envid.
    //  -E_INVAL if va >= UTOP, or va is not page-aligned.
    static int
    sys_page_unmap(envid_t envid, void *va)
    {
      // Hint: This function is a wrapper around page_remove().
    
      // LAB 4: Your code here.
      struct Env *e = NULL;
      int ret = 0;
    
      if ((ret = envid2env(envid, &e, 1)) < 0) {
        return -E_BAD_ENV;
      }
    
      if ((uintptr_t)va >= UTOP ||
          (ROUNDDOWN((uintptr_t)va, PGSIZE) != (uintptr_t)va)) {
        return -E_INVAL;
      }
    
      page_remove(e->env_pgdir, (void *)va);
    
      return 0;
    }
  • syscall()

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    
    int32_t
    syscall(uint32_t syscallno, uint32_t a1, uint32_t a2, uint32_t a3, uint32_t a4, uint32_t a5)
    {
      // ...
      switch (syscallno) {
        // ...
        case SYS_exofork:
          ret = sys_exofork();
          break;
        case SYS_env_set_status:
          ret = sys_env_set_status(a1, a2);
          break;
        case SYS_page_alloc:
          ret = sys_page_alloc(a1, (void *)a2, a3);
          break;
        case SYS_page_map:
          ret = sys_page_map(a1, (void *)a2, a3, (void *)a4, a5);
          break;
        case SYS_page_unmap:
          ret = sys_page_unmap(a1, (void *)a2);
          break;
          // ...
      }
    
      return ret;
    }
  • 结果

    运行 make run-dumbfork 后输出结果

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    
    6828 decimal is 15254 octal!
    Physical memory: 131072K available, base = 640K, extended = 130432K
    check_page_free_list() succeeded!
    check_page_alloc() succeeded!
    check_page() succeeded!
    check_kern_pgdir() succeeded!
    check_page_free_list() succeeded!
    check_page_installed_pgdir() succeeded!
    SMP: CPU 0 found 1 CPU(s)
    enabled interrupts: 1 2
    [00000000] new env 00001000
    [00001000] new env 00001001
    0: I am the parent!
    0: I am the child!
    1: I am the parent!
    1: I am the child!
    2: I am the parent!
    2: I am the child!
    3: I am the parent!
    3: I am the child!
    4: I am the parent!
    4: I am the child!
    5: I am the parent!
    5: I am the child!
    6: I am the parent!
    6: I am the child!
    7: I am the parent!
    7: I am the child!
    8: I am the parent!
    8: I am the child!
    9: I am the parent!
    9: I am the child!
    [00001000] exiting gracefully
    [00001000] free env 00001000
    10: I am the child!
    11: I am the child!
    12: I am the child!
    13: I am the child!
    14: I am the child!
    15: I am the child!
    16: I am the child!
    17: I am the child!
    18: I am the child!
    19: I am the child!
    [00001001] exiting gracefully
    [00001001] free env 00001001
    No runnable environments in the system!
    Welcome to the JOS kernel monitor!
    Type 'help' for a list of commands.
    K>

Part B: Copy-on-Write Fork

fork() 会把父进程数据复制至子进程中,但大多数情况下子进程会调用 exec() 启动新程序,上面的复制过程耗时且耗空间

为了解决这个问题提出了 写时拷贝(copy-on-write) 想法:父子进程共享页目录和页表,只有一方执行修改操作才复制对应的物理页: fork() 把父进程中的页表和页目录复制到子进程,并标记对应页为只读(read-only),当其中一方需要进行修改时,内核给对应线程复制新的私有的可写的对应页;从而降低 fork() 在子进程中执行 exec() 的开销

User-level page fault handling

实现写时拷贝的 fork() 需要知道被保护的页

Setting the Page Fault Handler

为了进程能处理自身的页错误,进程自身需要注册对应的处理函数入口点,该操作由 sys_env_set_pgfault_upcall() 提供

  • Exercise 8

    实现系统调用 sys_env_set_pgfault_upcall()

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    
    // Set the page fault upcall for 'envid' by modifying the corresponding struct
    // Env's 'env_pgfault_upcall' field.  When 'envid' causes a page fault, the
    // kernel will push a fault record onto the exception stack, then branch to
    // 'func'.
    //
    // Returns 0 on success, < 0 on error.  Errors are:
    //  -E_BAD_ENV if environment envid doesn't currently exist,
    //      or the caller doesn't have permission to change envid.
    static int
    sys_env_set_pgfault_upcall(envid_t envid, void *func)
    {
      // LAB 4: Your code here.
      int ret = 0;
      struct Env *e = NULL;
    
      if ((ret = envid2env(envid, &e, 1)) < 0) {
        return -E_BAD_ENV;
      }
    
      e->env_pgfault_upcall = func;
    
      return 0;
    }

Normal and Exception Stacks in User Environments

在用户态发生页错误的时候,内核会在 用户异常栈(user exception stack) 重启进程对应的页错误处理函数

用户异常栈开始与 UXSTACKTOP 虚拟地址,其范围在 [UXSTACKTOP-PGSIZE, UXSTACKTOP-1] ,用户级别页错误处理函数能调用对应的系统调用从而重新映射导致页错误的页,然后通过 stub 处理函数返回至导致错误的代码中

Invoking the User Page Fault Handler

JOS 对于已经注册了异常处理函数的进程在异常栈中设置这样的 trap frame

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
                    <-- UXSTACKTOP
trap-time esp
trap-time eflags
trap-time eip
trap-time eax       start of struct PushRegs
trap-time ecx
trap-time edx
trap-time ebx
trap-time esp
trap-time ebp
trap-time esi
trap-time edi       end of struct PushRegs
tf_err (error code)
fault_va            <-- %esp when handler is run

如果进程在执行异常处理函数时发生页错误,会将 trap-time frame 压栈到 tf->tf_esp 之下,需要压入 1 个空的 32-bit 字然后压入 trap-time frame

通过检查 tf->tf_esp 是否在 [UXSTACKTOP-PGSIZE, UXSTACKTOP-1] 这个范围可以判断是否在用户异常栈

  • Exercise 9

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    54
    55
    56
    57
    58
    59
    60
    61
    62
    63
    64
    65
    66
    67
    68
    69
    
    void
    page_fault_handler(struct Trapframe *tf)
    {
      // ...
    
      // Call the environment's page fault upcall, if one exists.  Set up a
      // page fault stack frame on the user exception stack (below
      // UXSTACKTOP), then branch to curenv->env_pgfault_upcall.
      //
      // The page fault upcall might cause another page fault, in which case
      // we branch to the page fault upcall recursively, pushing another
      // page fault stack frame on top of the user exception stack.
      //
      // It is convenient for our code which returns from a page fault
      // (lib/pfentry.S) to have one word of scratch space at the top of the
      // trap-time stack; it allows us to more easily restore the eip/esp. In
      // the non-recursive case, we don't have to worry about this because
      // the top of the regular user stack is free.  In the recursive case,
      // this means we have to leave an extra word between the current top of
      // the exception stack and the new stack frame because the exception
      // stack _is_ the trap-time stack.
      //
      // If there's no page fault upcall, the environment didn't allocate a
      // page for its exception stack or can't write to it, or the exception
      // stack overflows, then destroy the environment that caused the fault.
      // Note that the grade script assumes you will first check for the page
      // fault upcall and print the "user fault va" message below if there is
      // none.  The remaining three checks can be combined into a single test.
      //
      // Hints:
      //   user_mem_assert() and env_run() are useful here.
      //   To change what the user environment runs, modify 'curenv->env_tf'
      //   (the 'tf' variable points at 'curenv->env_tf').
    
      // LAB 4: Your code here.
    
      // check env_pgfault_upcall
      if (curenv->env_pgfault_upcall) {
        uintptr_t uesp;
    
        // fault happens on user exception stack
        if ((tf->tf_esp < UXSTACKTOP) &&
            (tf->tf_esp >= UXSTACKTOP - PGSIZE)) {
          uesp = tf->tf_esp - sizeof(struct UTrapframe) - 4;
        } else {
          uesp = UXSTACKTOP - sizeof(struct UTrapframe);
        }
    
        user_mem_assert(curenv, (void *)uesp, sizeof(struct UTrapframe), PTE_U|PTE_P|PTE_W);
    
        struct UTrapframe *utf = (struct UTrapframe *)uesp;
    
        utf->utf_fault_va = fault_va;
        utf->utf_err = tf->tf_err;
        utf->utf_regs = tf->tf_regs;
        utf->utf_eip = tf->tf_eip;
        utf->utf_eflags = tf->tf_eflags;
        utf->utf_esp = tf->tf_esp;
    
        tf->tf_eip = (uintptr_t)curenv->env_pgfault_upcall;
        tf->tf_esp = uesp;
    
        // re-entry user mode
        env_run(curenv);
      }
    
      // Destroy the environment that caused the fault.
      // ...
    }

User-mode Page Fault Entrypoint

_pgfault_upcall() 作为 sys_env_set_pgfault_upcall() 的参数

  • Exercise 10

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    54
    55
    56
    57
    58
    59
    60
    61
    62
    63
    64
    65
    66
    67
    68
    
    .text
    .globl _pgfault_upcall
    _pgfault_upcall:
      // Call the C page fault handler.
      pushl %esp            // function argument: pointer to UTF
      movl _pgfault_handler, %eax
      call *%eax
      addl $4, %esp         // pop function argument
    
      // Now the C page fault handler has returned and you must return
      // to the trap time state.
      // Push trap-time %eip onto the trap-time stack.
      //
      // Explanation:
      //   We must prepare the trap-time stack for our eventual return to
      //   re-execute the instruction that faulted.
      //   Unfortunately, we can't return directly from the exception stack:
      //   We can't call 'jmp', since that requires that we load the address
      //   into a register, and all registers must have their trap-time
      //   values after the return.
      //   We can't call 'ret' from the exception stack either, since if we
      //   did, %esp would have the wrong value.
      //   So instead, we push the trap-time %eip onto the *trap-time* stack!
      //   Below we'll switch to that stack and call 'ret', which will
      //   restore %eip to its pre-fault value.
      //
      //   In the case of a recursive fault on the exception stack,
      //   note that the word we're pushing now will fit in the
      //   blank word that the kernel reserved for us.
      //
      // Throughout the remaining code, think carefully about what
      // registers are available for intermediate calculations.  You
      // may find that you have to rearrange your code in non-obvious
      // ways as registers become unavailable as scratch space.
      //
      // LAB 4: Your code here.
      // 跳过 utf_fault_va  utf_err
      addl $8, %esp
      // 中断发生的 esp(utf_esp)
      movl 40(%esp), %eax
      // 中断放生的 eip(utf_eip)
      movl 32(%esp), %ebx
      // 将中断发生的 eip 压入原来栈中
      movl %ebx, -4(%eax)
    
      // Restore the trap-time registers.  After you do this, you
      // can no longer modify any general-purpose registers.
      // LAB 4: Your code here.
      // 普通寄存器(regular registers)
      popal
      // 跳过 eip
      addl $4, %esp
    
      // Restore eflags from the stack.  After you do this, you can
      // no longer use arithmetic operations or anything else that
      // modifies eflags.
      // LAB 4: Your code here.
      popfl
    
      // Switch back to the adjusted trap-time stack.
      // LAB 4: Your code here.
      popl %esp
    
      // Return to re-execute the instruction that faulted.
      // LAB 4: Your code here.
      // 压入 eip 没有减 esp 的值,需要 esp-4
      lea -4(%esp), %esp
      ret
  • Exercise 11

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    
    //
    // Set the page fault handler function.
    // If there isn't one yet, _pgfault_handler will be 0.
    // The first time we register a handler, we need to
    // allocate an exception stack (one page of memory with its top
    // at UXSTACKTOP), and tell the kernel to call the assembly-language
    // _pgfault_upcall routine when a page fault occurs.
    //
    void
    set_pgfault_handler(void (*handler)(struct UTrapframe *utf))
    {
      int r;
    
      if (_pgfault_handler == 0) {
        // First time through!
        // LAB 4: Your code here.
    
        /* sys_page_alloc(envid_t envid, void *va, int perm) */
    
        // 当前进程分配异常栈
        if ((r = sys_page_alloc(0, (void *)(UXSTACKTOP - PGSIZE), PTE_W |PTE_U | PTE_P)) < 0) {
          panic("set_pgfault_handler: sys_page_alloc() failed.");
        }
        // 系统调用,设置进程 env_pgfault_upcall
        sys_env_set_pgfault_upcall(0, _pgfault_upcall);
      }
    
      // Save handler pointer for assembly to call.
      _pgfault_handler = handler;
    }

Implementing Copy-on-Write Fork

fork() 流程:

  1. 父进程调用 set_pgfault_handler()pgfault() 设置为 page fault 处理函数
  2. 父进程调用 sys_exofork() 创建子进程
  3. 父进程调用 duppage() 映射子进程内存空间,并把可写页的页状态标记为 COW 状态;异常栈不映射,父进程会为子进程异常栈分配新页
  4. 父进程为子进程设置 page fault 入口点
  5. 父进程标记子进程 RUNNABLE

page fault 处理函数流程:

  1. 内核把 page fault 传递给 _pgfault_upcall ,即调用 pgfault()
  2. pgfault() 检查页错误的错误代码( FEC_WR ),若因为写造成的,且页中 PTE_COW = 1 则继续;否则 panic
  3. pgfault() 分配新页并复制错误页,随后重新映射并改为可读写权限,替换旧页

Exercise 12

实现 fork(), pgfault()duppage()

  • fork()

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    54
    55
    56
    57
    58
    59
    
    //
    // User-level fork with copy-on-write.
    // Set up our page fault handler appropriately.
    // Create a child.
    // Copy our address space and page fault handler setup to the child.
    // Then mark the child as runnable and return.
    //
    // Returns: child's envid to the parent, 0 to the child, < 0 on error.
    // It is also OK to panic on error.
    //
    // Hint:
    //   Use uvpd, uvpt, and duppage.
    //   Remember to fix "thisenv" in the child process.
    //   Neither user exception stack should ever be marked copy-on-write,
    //   so you must allocate a new page for the child's user exception stack.
    //
    envid_t
    fork(void)
    {
      // LAB 4: Your code here.
      // 设置 page fault handler
      set_pgfault_handler(pgfault);
      // 创建子进程,复制当前进程寄存器状态
      envid_t envid = sys_exofork();
    
      if (envid < 0) {  // 错误
        panic("fork(): sys_exofork: %e\n", envid);
      } else if (envid == 0) {  // 子进程
        thisenv = &envs[ENVX(sys_getenvid())];
        return 0;
      }
    
      // 页复制
      uintptr_t va;
    
      for (va = 0; va < USTACKTOP; va += PGSIZE) {
        if ((uvpd[PDX(va)] & PTE_P) &&
            (uvpt[PGNUM(va)] & (PTE_P | PTE_U))) {
          duppage(envid, PGNUM(va));
        }
      }
    
      int ret = 0;
      // 分配异常栈
      if ((ret = sys_page_alloc(envid, (void *)(UXSTACKTOP - PGSIZE), PTE_P | PTE_W | PTE_U)) < 0) {
        panic("fork(): sys_page_alloc: %e\n", ret);
      }
    
      // 设置 _pgfault_upcall
      extern void _pgfault_upcall(void);
      sys_env_set_pgfault_upcall(envid, _pgfault_upcall);
    
      // 设置 ENV_RUNNABLE
      if ((ret = sys_env_set_status(envid, ENV_RUNNABLE)) < 0) {
        panic("fork(): sys_env_set_status: %e\n", ret);
      }
    
      return envid;
    }
  • pgfault()

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    
    //
    // Custom page fault handler - if faulting page is copy-on-write,
    // map in our own private writable copy.
    //
    static void
    pgfault(struct UTrapframe *utf)
    {
      void *addr = (void *) utf->utf_fault_va;
      uint32_t err = utf->utf_err;
      int r;
    
      // Check that the faulting access was (1) a write, and (2) to a
      // copy-on-write page.  If not, panic.
      // Hint:
      //   Use the read-only page table mappings at uvpt
      //   (see <inc/memlayout.h>).
    
      // LAB 4: Your code here.
      // 只处理写入 copy-on-write 地址情况
      if (!(err & FEC_WR) &&
          (uvpt[PGNUM(addr)] & PTE_COW)) {
        panic("pgfault(): write to non-copy-on-write page.");
      }
    
      // Allocate a new page, map it at a temporary location (PFTEMP),
      // copy the data from the old page to the new page, then move the new
      // page to the old page's address.
      // Hint:
      //   You should make three system calls.
    
      // LAB 4: Your code here.
      addr = ROUNDDOWN(addr, PGSIZE);
    
      // 当前进程 PFTEMP 分配新物理页
      if ((r = sys_page_alloc(0, (void *)PFTEMP, PTE_P | PTE_U | PTE_W)) < 0) {
        panic("pgfault(): alloc error %e.\n", r);
      }
      // 将 addr 物理页拷贝到新 PFTEMP 页中
      memmove((void *)PFTEMP, addr, PGSIZE);
    
      // 将 PFTEMP 映射到 addr 中
      if ((r = sys_page_map(0, (void *)PFTEMP,
                            0, addr,
                            PTE_P | PTE_U | PTE_W)) < 0) {
        panic("pgfault(): map error %e.\n", r);
      }
    
      // 解除当前进程与 PFTEMP 的映射
      if ((r = sys_page_unmap(0, (void *)PFTEMP)) < 0) {
        panic("pgfault(): unmap error %e.\n", r);
      }
    
    }
  • duppage()

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    
    //
    // Map our virtual page pn (address pn*PGSIZE) into the target envid
    // at the same virtual address.  If the page is writable or copy-on-write,
    // the new mapping must be created copy-on-write, and then our mapping must be
    // marked copy-on-write as well.  (Exercise: Why do we need to mark ours
    // copy-on-write again if it was already copy-on-write at the beginning of
    // this function?)
    //
    // Returns: 0 on success, < 0 on error.
    // It is also OK to panic on error.
    //
    static int
    duppage(envid_t envid, unsigned pn)
    {
      int r;
    
      // LAB 4: Your code here.
      uintptr_t va = pn * PGSIZE;
    
      // 可写或写时拷贝页
      if ((uvpt[pn] & PTE_W) || (uvpt[pn] & PTE_COW)) {
        if ((r = sys_page_map(thisenv->env_id, (void *)va,
                              envid, (void *)va,
                              PTE_P | PTE_U | PTE_COW)) < 0) {
          panic("duppage(): map from parent to child env error: %e.\n", r);
        }
        if ((r = sys_page_map(thisenv->env_id, (void *)va,
                              thisenv->env_id, (void *)va,
                              PTE_P | PTE_U | PTE_COW)) < 0) {
          panic("duppage(): remap from parent to parent env error: %e.\n", r);
        }
      } else {  // 只读页
        if ((r = sys_page_map(thisenv->env_id, (void *)va,
                              envid, (void *)va,
                              PTE_U | PTE_P)) < 0) {
          panic("duppage(): read-only from parent to child env error: %e.\n", r);
        }
      }
    
      return 0;
    }

Part C: Preemptive Multitasking and Inter-Process communication (IPC)

Clock Interrupts and Preemption

Interrupt discipline

外部中断又称为 IRQ,共有 16 种 IRQ,在 pic_init() 中把 IDQ 映射到 IDT 的 IRQ_OFFSET-IRQ_OFFSET+15

外部中断通过寄存器 eflagsFL_IF 位设置

  • Exercise 13

    初始化 IRQ 在 idt 入口点和 IRQ 处理函数

    • trap_init()

       1
       2
       3
       4
       5
       6
       7
       8
       9
      10
      11
      12
      13
      14
      15
      16
      17
      18
      19
      20
      21
      22
      23
      24
      25
      26
      27
      28
      29
      30
      31
      32
      33
      34
      35
      36
      37
      38
      39
      40
      41
      42
      
      void
      trap_init(void)
      {
        // ...
      
        // irq
        void irq_handler0();
        void irq_handler1();
        void irq_handler2();
        void irq_handler3();
        void irq_handler4();
        void irq_handler5();
        void irq_handler6();
        void irq_handler7();
        void irq_handler8();
        void irq_handler9();
        void irq_handler10();
        void irq_handler11();
        void irq_handler12();
        void irq_handler13();
        void irq_handler14();
        void irq_handler15();
      
        SETGATE(idt[IRQ_OFFSET + 0], 0, GD_KT, irq_handler0, 3);
        SETGATE(idt[IRQ_OFFSET + 1], 0, GD_KT, irq_handler1, 3);
        SETGATE(idt[IRQ_OFFSET + 2], 0, GD_KT, irq_handler2, 3);
        SETGATE(idt[IRQ_OFFSET + 3], 0, GD_KT, irq_handler3, 3);
        SETGATE(idt[IRQ_OFFSET + 4], 0, GD_KT, irq_handler4, 3);
        SETGATE(idt[IRQ_OFFSET + 5], 0, GD_KT, irq_handler5, 3);
        SETGATE(idt[IRQ_OFFSET + 6], 0, GD_KT, irq_handler6, 3);
        SETGATE(idt[IRQ_OFFSET + 7], 0, GD_KT, irq_handler7, 3);
        SETGATE(idt[IRQ_OFFSET + 8], 0, GD_KT, irq_handler8, 3);
        SETGATE(idt[IRQ_OFFSET + 9], 0, GD_KT, irq_handler9, 3);
        SETGATE(idt[IRQ_OFFSET + 10], 0, GD_KT, irq_handler10, 3);
        SETGATE(idt[IRQ_OFFSET + 11], 0, GD_KT, irq_handler11, 3);
        SETGATE(idt[IRQ_OFFSET + 12], 0, GD_KT, irq_handler12, 3);
        SETGATE(idt[IRQ_OFFSET + 13], 0, GD_KT, irq_handler13, 3);
        SETGATE(idt[IRQ_OFFSET + 14], 0, GD_KT, irq_handler14, 3);
        SETGATE(idt[IRQ_OFFSET + 15], 0, GD_KT, irq_handler15, 3);
      
        // ...
      }
    • trapentry.S

       1
       2
       3
       4
       5
       6
       7
       8
       9
      10
      11
      12
      13
      14
      15
      16
      17
      18
      19
      20
      21
      
      .text
      
        /*
         * Lab 3: Your code here for generating entry points for the different traps.
         */
        TRAPHANDLER_NOEC(irq_handler0, IRQ_OFFSET);
        TRAPHANDLER_NOEC(irq_handler1, IRQ_OFFSET + 1);
        TRAPHANDLER_NOEC(irq_handler2, IRQ_OFFSET + 2);
        TRAPHANDLER_NOEC(irq_handler3, IRQ_OFFSET + 3);
        TRAPHANDLER_NOEC(irq_handler4, IRQ_OFFSET + 4);
        TRAPHANDLER_NOEC(irq_handler5, IRQ_OFFSET + 5);
        TRAPHANDLER_NOEC(irq_handler6, IRQ_OFFSET + 6);
        TRAPHANDLER_NOEC(irq_handler7, IRQ_OFFSET + 7);
        TRAPHANDLER_NOEC(irq_handler8, IRQ_OFFSET + 8);
        TRAPHANDLER_NOEC(irq_handler9, IRQ_OFFSET + 9);
        TRAPHANDLER_NOEC(irq_handler10, IRQ_OFFSET + 10);
        TRAPHANDLER_NOEC(irq_handler11, IRQ_OFFSET + 11);
        TRAPHANDLER_NOEC(irq_handler12, IRQ_OFFSET + 12);
        TRAPHANDLER_NOEC(irq_handler13, IRQ_OFFSET + 13);
        TRAPHANDLER_NOEC(irq_handler14, IRQ_OFFSET + 14);
        TRAPHANDLER_NOEC(irq_handler15, IRQ_OFFSET + 15);
    • env_alloc()

      1
      2
      3
      4
      5
      
      int env_alloc(struct Env **newenv_store, envid_t parent_id) {
        // Enable interrupts while in user mode.
        // LAB 4: Your code here.
        e->env_tf.tf_eflags |= FL_IF;
      }
    • sched_halt()

       1
       2
       3
       4
       5
       6
       7
       8
       9
      10
      11
      12
      13
      14
      15
      16
      17
      18
      19
      20
      21
      
      // Halt this CPU when there is nothing to do. Wait until the
      // timer interrupt wakes it up. This function never returns.
      //
      void
      sched_halt(void)
      {
        // ...
      
        // Reset stack pointer, enable interrupts and then halt.
        asm volatile (
          "movl $0, %%ebp\n"
          "movl %0, %%esp\n"
          "pushl $0\n"
          "pushl $0\n"
          // Uncomment the following line after completing exercise 13
          "sti\n"
          "1:\n"
          "hlt\n"
          "jmp 1b\n"
        : : "a" (thiscpu->cpu_ts.ts_esp0));
      }
    • 结果

      在重新编译运行后有断言错误

      1
      
      kernel panic on CPU 0 at kern/trap.c:317: assertion failed: !(read_eflags() & FL_IF)

      通过 简书一份实验报告 得知这个断言错误是由于把 SETGATE()istrap 设置为 1 ,而根据 SETGATE() 中注释中看到当 istrap = 1 会再开始处理中断时会将 FL_IF 重新设置为 1 ,而在 Lab 3 Exercise 4 中 SETGATE()istrap 均为 1 ,所以设置为 0

       1
       2
       3
       4
       5
       6
       7
       8
       9
      10
      11
      12
      13
      14
      15
      16
      17
      18
      19
      20
      21
      22
      23
      24
      25
      26
      
      // Set up a normal interrupt/trap gate descriptor.
      // - istrap: 1 for a trap (= exception) gate, 0 for an interrupt gate.
      //   see section 9.6.1.3 of the i386 reference: "The difference between
      //   an interrupt gate and a trap gate is in the effect on IF (the
      //   interrupt-enable flag). An interrupt that vectors through an
      //   interrupt gate resets IF, thereby preventing other interrupts from
      //   interfering with the current interrupt handler. A subsequent IRET
      //   instruction restores IF to the value in the EFLAGS image on the
      //   stack. An interrupt through a trap gate does not change IF."
      // - sel: Code segment selector for interrupt/trap handler
      // - off: Offset in code segment for interrupt/trap handler
      // - dpl: Descriptor Privilege Level -
      //    the privilege level required for software to invoke
      //    this interrupt/trap gate explicitly using an int instruction.
      #define SETGATE(gate, istrap, sel, off, dpl)          \
        {                                                   \
          (gate).gd_off_15_0 = (uint32_t) (off) & 0xffff;     \
          (gate).gd_sel = (sel);                            \
          (gate).gd_args = 0;                               \
          (gate).gd_rsv1 = 0;                               \
          (gate).gd_type = (istrap) ? STS_TG32 : STS_IG32;    \
          (gate).gd_s = 0;                                  \
          (gate).gd_dpl = (dpl);                            \
          (gate).gd_p = 1;                                  \
          (gate).gd_off_31_16 = (uint32_t) (off) >> 16;     \
        }

      修改后运行结果

       1
       2
       3
       4
       5
       6
       7
       8
       9
      10
      11
      12
      13
      14
      15
      16
      17
      18
      19
      20
      21
      22
      23
      24
      25
      26
      27
      28
      29
      30
      31
      32
      33
      34
      35
      36
      
      6828 decimal is 15254 octal!
      Physical memory: 131072K available, base = 640K, extended = 130432K
      check_page_free_list() succeeded!
      check_page_alloc() succeeded!
      check_page() succeeded!
      check_kern_pgdir() succeeded!
      check_page_free_list() succeeded!
      check_page_installed_pgdir() succeeded!
      SMP: CPU 0 found 1 CPU(s)
      enabled interrupts: 1 2
      [00000000] new env 00001000
      I am the parent.  Forking the child...
      [00001000] new env 00001001
      TRAP frame at 0xf02b7000 from CPU 0
      edi  0x00001001
      esi  0x00802000
      ebp  0xeebfdfb0
      oesp 0xf0234fdc
      ebx  0x09a61000
      edx  0x00000000
      ecx  0x00802000
      eax  0x00000000
      es   0x----0023
      ds   0x----0023
      trap 0x00000020 Hardware Interrupt
      err  0x00000000
      eip  0x00800f6e
      cs   0x----001b
      flag 0x00000246
      esp  0xeebfdf88
      ss   0x----0023
      [00001000] free env 00001000
      No runnable environments in the system!
      Welcome to the JOS kernel monitor!
      Type 'help' for a list of commands.
      K>

Handling Clock Interrupts

程序进入用户模式后,除了发生中断,否则 CPU 不会在执行内核代码,因此需要开启时钟中断:在切换程序的时候强迫进入内核

  • Exercise 14

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    
    static void
    trap_dispatch(struct Trapframe *tf)
    {
      // Handle clock interrupts. Don't forget to acknowledge the
      // interrupt using lapic_eoi() before calling the scheduler!
      // LAB 4: Your code here.
      if (tf->tf_trapno == IRQ_OFFSET + IRQ_TIMER) {
        lapic_eoi();
        sched_yield();
        return ;
      }
    }

Inter-Process communication (IPC)

JOS 通过 sys_ipc_recv()sys_ipc_try_send() 两个系统调用让进程能发送一个 32 位的值与一个可选的 page 映射给另外的进程

Sending and Receiving Messages && Transferring Pages

当某进程调用 sys_ipc_recv() 后,该进程会阻塞并等待另外的进程发送信息,而接收单页(page)只要设置 dstva 参数,函数会把 dstva 映射到接收地址中

进程可以通过 sys_ipc_try_send() 给指定进程发送信息,根据指定进程的 sys_ipc_recv() 调用情况,若调用则发送并在最后返回 0 ,否则返回 -E_IPC_NOT_RECV ;传入 srcva 参数表明发送单页(page)

Implementing IPC

  • Exercise 15

    • syscall()

       1
       2
       3
       4
       5
       6
       7
       8
       9
      10
      11
      12
      13
      14
      15
      
      int32_t
      syscall(uint32_t syscallno, uint32_t a1, uint32_t a2, uint32_t a3, uint32_t a4, uint32_t a5)
      {
        // ...
        switch (syscallno) {
          // ...
          case SYS_ipc_try_send:
            ret = sys_ipc_try_send(a1, a2, (void *)a3, (unsigned int)a4);
            break;
          case SYS_ipc_recv:
            ret = sys_ipc_recv((void *)a1);
            break;
        }
        // ...
      }
    • sys_ipc_recv()

       1
       2
       3
       4
       5
       6
       7
       8
       9
      10
      11
      12
      13
      14
      15
      16
      17
      18
      19
      20
      21
      22
      23
      24
      25
      26
      27
      
      // Block until a value is ready.  Record that you want to receive
      // using the env_ipc_recving and env_ipc_dstva fields of struct Env,
      // mark yourself not runnable, and then give up the CPU.
      //
      // If 'dstva' is < UTOP, then you are willing to receive a page of data.
      // 'dstva' is the virtual address at which the sent page should be mapped.
      //
      // This function only returns on error, but the system call will eventually
      // return 0 on success.
      // Return < 0 on error.  Errors are:
      //  -E_INVAL if dstva < UTOP but dstva is not page-aligned.
      static int
      sys_ipc_recv(void *dstva)
      {
        // LAB 4: Your code here.
        if ((uintptr_t)dstva < (uintptr_t)UTOP &&
            (uintptr_t)dstva != ROUNDDOWN((uintptr_t)dstva, PGSIZE)) {
          return -E_INVAL;
        }
      
        // set status
        curenv->env_ipc_recving = true;
        curenv->env_ipc_dstva = dstva;
        curenv->env_status = ENV_NOT_RUNNABLE;
        sched_yield();
        return 0;
      }
    • sys_ipc_try_send()

       1
       2
       3
       4
       5
       6
       7
       8
       9
      10
      11
      12
      13
      14
      15
      16
      17
      18
      19
      20
      21
      22
      23
      24
      25
      26
      27
      28
      29
      30
      31
      32
      33
      34
      35
      36
      37
      38
      39
      40
      41
      42
      43
      44
      45
      46
      47
      48
      49
      50
      51
      52
      53
      54
      55
      56
      57
      58
      59
      60
      61
      62
      63
      64
      65
      66
      67
      68
      69
      70
      71
      72
      73
      74
      75
      76
      77
      78
      79
      80
      81
      82
      83
      84
      85
      86
      87
      
      // Try to send 'value' to the target env 'envid'.
      // If srcva < UTOP, then also send page currently mapped at 'srcva',
      // so that receiver gets a duplicate mapping of the same page.
      //
      // The send fails with a return value of -E_IPC_NOT_RECV if the
      // target is not blocked, waiting for an IPC.
      //
      // The send also can fail for the other reasons listed below.
      //
      // Otherwise, the send succeeds, and the target's ipc fields are
      // updated as follows:
      //    env_ipc_recving is set to 0 to block future sends;
      //    env_ipc_from is set to the sending envid;
      //    env_ipc_value is set to the 'value' parameter;
      //    env_ipc_perm is set to 'perm' if a page was transferred, 0 otherwise.
      // The target environment is marked runnable again, returning 0
      // from the paused sys_ipc_recv system call.  (Hint: does the
      // sys_ipc_recv function ever actually return?)
      //
      // If the sender wants to send a page but the receiver isn't asking for one,
      // then no page mapping is transferred, but no error occurs.
      // The ipc only happens when no errors occur.
      //
      // Returns 0 on success, < 0 on error.
      // Errors are:
      //  -E_BAD_ENV if environment envid doesn't currently exist.
      //      (No need to check permissions.)
      //  -E_IPC_NOT_RECV if envid is not currently blocked in sys_ipc_recv,
      //      or another environment managed to send first.
      //  -E_INVAL if srcva < UTOP but srcva is not page-aligned.
      //  -E_INVAL if srcva < UTOP and perm is inappropriate
      //      (see sys_page_alloc).
      //  -E_INVAL if srcva < UTOP but srcva is not mapped in the caller's
      //      address space.
      //  -E_INVAL if (perm & PTE_W), but srcva is read-only in the
      //      current environment's address space.
      //  -E_NO_MEM if there's not enough memory to map srcva in envid's
      //      address space.
      static int
      sys_ipc_try_send(envid_t envid, uint32_t value, void *srcva, unsigned perm)
      {
        // LAB 4: Your code here.
        int ret = 0;
        struct Env *dstenv = NULL;
      
        if ((ret = envid2env(envid, &dstenv, 0)) < 0) {
          return -E_BAD_ENV;
        }
      
        if (!dstenv->env_ipc_recving) {
          return -E_IPC_NOT_RECV;
        }
      
        if ((uintptr_t)srcva < UTOP) {
          if ((uintptr_t)srcva != ROUNDDOWN((uintptr_t)srcva, PGSIZE)) {
            return -E_INVAL;
          }
      
          if (perm & ~PTE_SYSCALL) {
            return -E_INVAL;
          }
      
          struct PageInfo *pg = NULL;
          pte_t *pte = NULL;
          if (!(pg = page_lookup(curenv->env_pgdir, srcva, &pte))) {
            return -E_INVAL;
          }
      
          if ((perm & PTE_W) &&
              !(*pte & PTE_W)) {
            return -E_INVAL;
          }
      
          if ((ret = page_insert(dstenv->env_pgdir, pg, dstenv->env_ipc_dstva, perm)) < 0) {
            return -E_NO_MEM;
          }
        }
      
        dstenv->env_ipc_recving = false;
        dstenv->env_ipc_from = curenv->env_id;
        dstenv->env_ipc_value = value;
        dstenv->env_ipc_perm = perm;
        dstenv->env_status = ENV_RUNNABLE;
        dstenv->env_tf.tf_regs.reg_eax = 0;
      
        return 0;
      }
    • ipc_recv()

       1
       2
       3
       4
       5
       6
       7
       8
       9
      10
      11
      12
      13
      14
      15
      16
      17
      18
      19
      20
      21
      22
      23
      24
      25
      26
      27
      28
      29
      30
      31
      32
      33
      34
      35
      36
      37
      38
      39
      40
      41
      42
      43
      44
      45
      
      // Receive a value via IPC and return it.
      // If 'pg' is nonnull, then any page sent by the sender will be mapped at
      //  that address.
      // If 'from_env_store' is nonnull, then store the IPC sender's envid in
      //  *from_env_store.
      // If 'perm_store' is nonnull, then store the IPC sender's page permission
      //  in *perm_store (this is nonzero iff a page was successfully
      //  transferred to 'pg').
      // If the system call fails, then store 0 in *fromenv and *perm (if
      //  they're nonnull) and return the error.
      // Otherwise, return the value sent by the sender
      //
      // Hint:
      //   Use 'thisenv' to discover the value and who sent it.
      //   If 'pg' is null, pass sys_ipc_recv a value that it will understand
      //   as meaning "no page".  (Zero is not the right value, since that's
      //   a perfectly valid place to map a page.)
      int32_t
      ipc_recv(envid_t *from_env_store, void *pg, int *perm_store)
      {
        // LAB 4: Your code here.
        int ret;
        void *va;
      
        if (!pg) {
          pg = (void *)-1;
        } else {
          va = pg;
        }
      
        if ((ret = sys_ipc_recv(va)) < 0) {
          *from_env_store = 0;
          *perm_store = 0;
        }
      
        if (from_env_store) {
          *from_env_store = thisenv->env_ipc_from;
        }
      
        if (perm_store) {
          *perm_store = thisenv->env_ipc_perm;
        }
      
        return thisenv->env_ipc_value;
      }
    • ipc_send()

       1
       2
       3
       4
       5
       6
       7
       8
       9
      10
      11
      12
      13
      14
      15
      16
      17
      18
      19
      20
      21
      22
      23
      24
      25
      26
      27
      28
      29
      30
      31
      32
      33
      
      // Send 'val' (and 'pg' with 'perm', if 'pg' is nonnull) to 'toenv'.
      // This function keeps trying until it succeeds.
      // It should panic() on any error other than -E_IPC_NOT_RECV.
      //
      // Hint:
      //   Use sys_yield() to be CPU-friendly.
      //   If 'pg' is null, pass sys_ipc_try_send a value that it will understand
      //   as meaning "no page".  (Zero is not the right value.)
      void
      ipc_send(envid_t to_env, uint32_t val, void *pg, int perm)
      {
        // LAB 4: Your code here.
        int ret = 0;
        void *va = NULL;
      
        if (!pg) {
          va = (void *)-1;
        } else {
          va = pg;
        }
      
        while (1) {
          ret = sys_ipc_try_send(to_env, val, va, perm);
      
          if (!ret) {
            break;
          } else if (ret < 0 && ret != -E_IPC_NOT_RECV) {
            panic("ipc_send(): sending error %e\n", ret);
          } else if (ret == -E_IPC_NOT_RECV) {
            sys_yield();
          }
        }
      }