实验地址

File system preliminaries

主要实现一个具有基本特点(创建、读取、写入和删除文件的且具备层级结构)的文件系统

On-Disk File System Structure

JOS 直接把元数据存储至 directory entry 内

文件和目录逻辑上均由 data blocks 组成,这些 blocks 分散在硬盘上,文件系统隐藏这些 blocks 的细节,只提供相应的读写接口。JOS 允许用户直接读目录的元数据,而这种使应用程序过度依赖目录元数据是当前操作系统不采用的原因

Sectors and Blocks

大部分硬盘都是以 sector 为粒度进行读写,JOS 中每个 sector 大小为 512 bytes。文件系统以 block 为单位分配和使用硬盘。sector size 是硬盘的属性,block size 是操作系统使用硬盘的粒度。

Superblocks

文件系统在硬盘中“易于查找”的位置保留某些数据块用于保存文件系统属性的元数据,如块大小、硬盘大小等,这些块叫做 superblock。

1
2
3
4
5
struct Super {
  uint32_t s_magic;		// Magic number: FS_MAGIC
  uint32_t s_nblocks;		// Total number of blocks on disk
  struct File s_root;		// Root directory node
};

JOS 中 superblock 位于 block 1,block 0 用于保存 boot loader 和分区表。

Figure 1: 硬盘结构

Figure 1: 硬盘结构

File Meta-data

描述文件的结构如下

其中 f_direct 数组用于保存前 NDIRECT (10)个 block 号,对于 10*4096=40KB 的小文件而言无需额外空间记录内容 block 号;对于大文件而言需要额外分派 block 来保存,最多保存 40964=1024 个 block 号,所以文件系统支持 1034 个 blocks。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
struct File {
  char f_name[MAXNAMELEN];	// filename
  off_t f_size;			// file size in bytes
  uint32_t f_type;		// file type

  // Block pointers.
  // A block is allocated iff its value is != 0.
  uint32_t f_direct[NDIRECT];	// direct blocks
  uint32_t f_indirect;		// indirect block

  // Pad out to 256 bytes; must do arithmetic in case we're compiling
  // fsformat on a 64-bit machine.
  uint8_t f_pad[256 - MAXNAMELEN - 8 - 4*NDIRECT - 4];
} __attribute__((packed));	// required only on some 64-bit machines
Figure 2: 文件结构

Figure 2: 文件结构

Directories versus Regular Files

File 即是文件也是目录,由 f_type 做区别,文件系统除把目录文件内容视为一系列 File 结构外其他的管理方式与文件一致。

The File System

Disk Access

跟其他操作系统把硬盘单独加入内核并提供系统调用从而让文件系统访问,JOS 通过文件系统进程来作为硬盘驱动。

x86 处理器用 EFLAGS 寄存器的 IOPL 控制保护模式下能否执行设备 I/O 指令

Exercise 1

修改 env_create()typeENV_TYPE_FS 则给该进程 IO 权限

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
void
env_create(uint8_t *binary, enum EnvType type)
{
  // ...
  // If this is the file server (type == ENV_TYPE_FS) give it I/O privileges.
  // LAB 5: Your code here.
  if (type == ENV_TYPE_FS) {
    env->env_tf.tf_eflags |= FL_IOPL_MASK;
  }
}

Q1

Do you have to do anything else to ensure that this I/O privilege setting is saved and restored properly when you subsequently switch from one environment to another? Why?

不用,当陷入中断的时候, env_tf 会保存所以寄存器当前状态

The Block Cache

整个文件系统最大支持 3GB,文件系统进程保留 0x10000000(DISKMAP) - 0xD0000000(DISKMAP+DISKMAX) 内存空间作为硬盘缓存;由于把整个硬盘读取到内存会花费很长时间,所以只会按需加载:只会加载出现页错误的 block 到对应内存区域中。

Exercise 2

主要实现 bc_pgfault()flush_block()

bc_pgfault() 是页错误处理函数,主要硬盘页错误 block 加载到内存中

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
static void
bc_pgfault(struct UTrapframe *utf)
{
  // ...
  // Allocate a page in the disk map region, read the contents
  // of the block from the disk into that page.
  // Hint: first round addr to page boundary. fs/ide.c has code to read
  // the disk.
  //
  // LAB 5: you code here:

  // addr my not be aligned to a block boundary
  addr = ROUNDDOWN(addr, PGSIZE);

  if ((r = sys_page_alloc(0, addr, PTE_U | PTE_W | PTE_P)) < 0) {
    panic("in bc_pgfault, sys_page_alloc: %e", r);
  }

  if ((r = ide_read(blockno * BLKSECTS, addr, BLKSECTS)) < 0) {
    panic("in bc_pgfault, ide_write: %e", r);
  }
}

flush_block() 把 block 写入硬盘中,根据提示实现

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
// Flush the contents of the block containing VA out to disk if
// necessary, then clear the PTE_D bit using sys_page_map.
// If the block is not in the block cache or is not dirty, does
// nothing.
// Hint: Use va_is_mapped, va_is_dirty, and ide_write.
// Hint: Use the PTE_SYSCALL constant when calling sys_page_map.
// Hint: Don't forget to round addr down.
void
flush_block(void *addr)
{
  uint32_t blockno = ((uint32_t)addr - DISKMAP) / BLKSIZE;
  int r = 0;

  if (addr < (void*)DISKMAP || addr >= (void*)(DISKMAP + DISKSIZE))
    panic("flush_block of bad va %08x", addr);

  // LAB 5: Your code here.
  // round addr down
  addr = ROUNDDOWN(addr, PGSIZE);
  // do nothing if the block isn't even in the block cache (that is, the
  // page isn't mapped) or if it's not dirty.
  if (!va_is_mapped(addr) || !va_is_dirty(addr)) {
    return ;
  }

  if ((r = ide_write(blockno * BLKSECTS, addr, BLKSECTS)) < 0) {
    panic("in flush_block, ide_write: %e", r);
  }

  // clear the PTE_D bit using sys_page_map.
  if ((r = sys_page_map(0, addr,
                        0, addr,
                        uvpt[PGNUM(addr)] & PTE_SYSCALL)) < 0) {
    panic("in flush_block, sys_page_map: %e", r);
  }
}

The Block Bitmap

Exercise 3

实现 alloc_block() 分配从 bitmap 位数组找空闲的 block

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
// Search the bitmap for a free block and allocate it.  When you
// allocate a block, immediately flush the changed bitmap block
// to disk.
//
// Return block number allocated on success,
// -E_NO_DISK if we are out of blocks.
//
// Hint: use free_block as an example for manipulating the bitmap.
int
alloc_block(void)
{
  // The bitmap consists of one or more blocks.  A single bitmap block
  // contains the in-use bits for BLKBITSIZE blocks.  There are
  // super->s_nblocks blocks in the disk altogether.
  // LAB 5: Your code here.
  uint32_t blockno = 1;

  for (blockno = 1; blockno < super->s_nblocks; ++blockno) {
    if (block_is_free(blockno)) {
      bitmap[blockno / 32] &= ~(1 << (blockno % 32));
      flush_block(&bitmap[blockno / 32]);
      return blockno;
    }
  }

  return -E_NO_DISK;
}

File Operations

fs/fs.c 提供了各种文件系统的基本方法

Exercise 4

file_block_walk() 通过文件内偏移量映射 File 结构体对应的块

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
// Find the disk block number slot for the 'filebno'th block in file 'f'.
// Set '*ppdiskbno' to point to that slot.
// The slot will be one of the f->f_direct[] entries,
// or an entry in the indirect block.
// When 'alloc' is set, this function will allocate an indirect block
// if necessary.
//
// Returns:
//	0 on success (but note that *ppdiskbno might equal 0).
//	-E_NOT_FOUND if the function needed to allocate an indirect block, but
//		alloc was 0.
//	-E_NO_DISK if there's no space on the disk for an indirect block.
//	-E_INVAL if filebno is out of range (it's >= NDIRECT + NINDIRECT).
//
// Analogy: This is like pgdir_walk for files.
// Hint: Don't forget to clear any block you allocate.
static int
file_block_walk(struct File *f, uint32_t filebno, uint32_t **ppdiskbno, bool alloc)
{
  // LAB 5: Your code here.
  int blockno = 0;
  uint32_t *indirects = NULL;

  // out of range (it's >= NDIRECT + NINDIRECT).
  if (filebno >= NDIRECT + NINDIRECT)
    return -E_INVAL;

  if (filebno < NDIRECT) {
    *ppdiskbno = &(f->f_direct[filebno]);
    return 0;
  }

  if (!f->f_indirect) {
    if (!alloc) {
      return -E_NOT_FOUND;
    }

    if ((blockno = alloc_block()) < 0) {
      return blockno;
    }

    flush_block(diskaddr(blockno));
    f->f_indirect = blockno;
  }

  indirects = diskaddr(f->f_indirect);
  *ppdiskbno = &(indirects[filebno - NDIRECT]);

  return 0;
}

file_get_block() 映射实际的硬盘块

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
// Set *blk to the address in memory where the filebno'th
// block of file 'f' would be mapped.
//
// Returns 0 on success, < 0 on error.  Errors are:
//	-E_NO_DISK if a block needed to be allocated but the disk is full.
//	-E_INVAL if filebno is out of range.
//
// Hint: Use file_block_walk and alloc_block.
int
file_get_block(struct File *f, uint32_t filebno, char **blk)
{
  // LAB 5: Your code here.
  int r = 0;
  uint32_t *pdiskbno = NULL;

  if ((r = file_block_walk(f, filebno, &pdiskbno, 1)) < 0) {
    return r;
  }

  if (!*pdiskbno) {
    if ((r = alloc_block()) < 0) {
      return r;
    }

    *pdiskbno = r;
    flush_block(diskaddr(r));
  }

  *blk = diskaddr(*pdiskbno);
  return 0;
}

The file system interface

当前其他用户程序不能直接调用文件系统进程提供得功能,需要设计一套 RPC 为其他进程提供文件系统服务,其原理如下:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
      Regular env           FS env
   +---------------+   +---------------+
   |      read     |   |   file_read   |
   |   (lib/fd.c)  |   |   (fs/fs.c)   |
...|.......|.......|...|.......^.......|...............
   |       v       |   |       |       | RPC mechanism
   |  devfile_read |   |  serve_read   |
   |  (lib/file.c) |   |  (fs/serv.c)  |
   |       |       |   |       ^       |
   |       v       |   |       |       |
   |     fsipc     |   |     serve     |
   |  (lib/file.c) |   |  (fs/serv.c)  |
   |       |       |   |       ^       |
   |       v       |   |       |       |
   |   ipc_send    |   |   ipc_recv    |
   |       |       |   |       ^       |
   +-------|-------+   +-------|-------+
           |                   |
           +-------------------+

RPC 与借助 IPC 机制实现,普通进程通过 IPC 向文件系统简称发送操作命令及对应数据,文件系统执行相应操作并把结果反馈给普通进程。

文件系统通过 serve 函数内的无限循环不断监听接收的 IPC,并把接收到的信息进行分发,最后反馈结果。

客户端即普通进程通过 32 位作为请求类型,并发送 union Fsipc 参数作为请求参数,这个参数通过 IPC 页共享传递,客户端中这个参数位于 fsipcbuf ,服务端即文件系统可以通过 fsreq(0x0fffff000) 进行访问。

服务端即文件系统给客户端返回 32 位值作为返回码, FSREQ_READFSREQ_STAT 两种类型会通过 IPC 返回额外数据。

Exercise 5

serve_read()

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
// Read at most ipc->read.req_n bytes from the current seek position
// in ipc->read.req_fileid.  Return the bytes read from the file to
// the caller in ipc->readRet, then update the seek position.  Returns
// the number of bytes successfully read, or < 0 on error.
int
serve_read(envid_t envid, union Fsipc *ipc)
{
  struct Fsreq_read *req = &ipc->read;
  struct Fsret_read *ret = &ipc->readRet;

  if (debug)
    cprintf("serve_read %08x %08x %08x\n", envid, req->req_fileid, req->req_n);

  // Lab 5: Your code here:
  int r = 0;
  struct OpenFile *o = NULL;

  if ((r = openfile_lookup(envid, req->req_fileid, &o)) < 0) {
    return r;
  }

  if ((r = file_read(o->o_file, ret->ret_buf, req->req_n,
                     o->o_fd->fd_offset)) < 0) {
    return r;
  }

  o->o_fd->fd_offset += r;

  return r;
}

Exercise 6

serve_write()

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
// Write req->req_n bytes from req->req_buf to req_fileid, starting at
// the current seek position, and update the seek position
// accordingly.  Extend the file if necessary.  Returns the number of
// bytes written, or < 0 on error.
int
serve_write(envid_t envid, struct Fsreq_write *req)
{
  if (debug)
    cprintf("serve_write %08x %08x %08x\n", envid, req->req_fileid, req->req_n);

  // LAB 5: Your code here.
  int r = 0;
  struct OpenFile *o = NULL;

  if ((r = openfile_lookup(envid, req->req_fileid, &o)) < 0) {
    return r;
  }


  if ((r = file_write(o->o_file, req->req_buf, req->req_n,
                      o->o_fd->fd_offset)) < 0) {
    return r;
  }

  o->o_fd->fd_offset += r;

  return r;
}

devfile_write()

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
// Write at most 'n' bytes from 'buf' to 'fd' at the current seek position.
//
// Returns:
//	 The number of bytes successfully written.
//	 < 0 on error.
static ssize_t
devfile_write(struct Fd *fd, const void *buf, size_t n)
{
  // Make an FSREQ_WRITE request to the file system server.  Be
  // careful: fsipcbuf.write.req_buf is only so large, but
  // remember that write is always allowed to write *fewer*
  // bytes than requested.
  // LAB 5: Your code here

  fsipcbuf.write.req_fileid = fd->fd_file.id;
  fsipcbuf.write.req_n = n;

  memmove(fsipcbuf.write.req_buf, buf, n);

  return fsipc(FSREQ_WRITE, NULL);
}

Spawning Processes

spawn() 创建新的子进程,从文件系统中加载程序并运行子进程, spawn()fork() 后立即执行 exec() 相似。

Exercise 7

实现 sys_env_set_trapframe() 用于初始化进程状态

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
// Set envid's trap frame to 'tf'.
// tf is modified to make sure that user environments always run at code
// protection level 3 (CPL 3), interrupts enabled, and IOPL of 0.
//
// Returns 0 on success, < 0 on error.  Errors are:
//	-E_BAD_ENV if environment envid doesn't currently exist,
//		or the caller doesn't have permission to change envid.
static int
sys_env_set_trapframe(envid_t envid, struct Trapframe *tf)
{
  // LAB 5: Your code here.
  // Remember to check whether the user has supplied us with a good
  // address!
  int r = 0;
  struct Env *e = NULL;

  if ((r = envid2env(envid, &e, 1)) < 0) {
    return -E_BAD_ENV;
  }

  // run at code protection level 3(CPL 3)
  tf->tf_cs = GD_UT | 3;
  // interrupts enabled
  tf->tf_eflags = FL_IF;
  // IOPL of 0
  tf->tf_eflags &= ~FL_IOPL_MASK;

  e->env_tf = *tf;
  return 0;

}

Sharing library state across fork and spawn

Unix 中文件描述符包含管道,控制台 I/O 等,JOS 把每种设备都对应到 struct Dev 中,里面包含指向读写等功能的函数指针。

struct Fd 代表设备类型, lib/fd.c 是对 struct Dev 的简单封装。

lib/fd.c 为每个进程都维护率文件描述符表,从 FDTABLE 开始,大小为 4KB,最多有 MAXFD(32) 个文件描述符。在任意时刻,一个文件描述符当且仅当处于使用情况才被映射。

这里主要是把 PTE_SHARE 状态的 PTE 在 fork()spawn() 过程中直接把 PTE 拷贝到子进程页表中,从而让父进程和子进程共享相同的页映射关系。

Exercise 8

duppage()

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
static int
duppage(envid_t envid, unsigned pn)
{
  // ...
  if (uvpt[pn] & PTE_SHARE) {
    if ((r = sys_page_map(thisenv->env_id, (void *)va,
                          envid, (void *)va,
                          uvpt[pn] & PTE_SYSCALL)) < 0) {
      panic("duppage(): PTE_SHARE copy error: %e.\n", r);
    }
    return 0;
  }
}

copy_shared_pages()

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
// Copy the mappings for shared pages into the child address space.
static int
copy_shared_pages(envid_t child)
{
  // LAB 5: Your code here.
  int r = 0;
  uintptr_t va = 0;

  for (va = 0; va < USTACKTOP; va += PGSIZE) {
    if ((uvpd[PDX(va)] & PTE_P) &&
        (uvpt[PGNUM(va)] & (PTE_P | PTE_U)) &&
        (uvpt[PGNUM(va)] & PTE_SHARE)) {
      if ((r = sys_page_map(0, (void *)va,
                            child, (void *)va,
                            uvpt[PGNUM(va)] & PTE_SYSCALL)) < 0) {
        panic("copy_shared_pages: sys_page_map: %e", r);
        return r;
      }
    }
  }
  return 0;
}

The keyboard interface

这里主要处理键盘输入时显示到控制台对应的系统中断

Exercise 9

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
static void
trap_dispatch(struct Trapframe *tf)
{
  // Handle keyboard and serial interrupts.
  // LAB 5: Your code here.
  if (tf->tf_trapno == IRQ_OFFSET + IRQ_KBD) {
    kbd_intr();
    return ;
  }

  if (tf->tf_trapno == IRQ_OFFSET + IRQ_SERIAL) {
    serial_intr();
    return ;
  }

}

The Shell

主要完成 shell 的 I/O 重定向功能

Exercise 10

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
void
runcmd(char* s)
{
  // ...
  while (1) {
    switch ((c = gettoken(0, &t))) {

    case '<':	// Input redirection
      // Grab the filename from the argument list
      if (gettoken(0, &t) != 'w') {
        cprintf("syntax error: < not followed by word\n");
        exit();
      }
      // Open 't' for reading as file descriptor 0
      // (which environments use as standard input).
      // We can't open a file onto a particular descriptor,
      // so open the file as 'fd',
      // then check whether 'fd' is 0.
      // If not, dup 'fd' onto file descriptor 0,
      // then close the original 'fd'.

      // LAB 5: Your code here.
      if ((fd = open(t, O_RDONLY)) < 0) {
        cprintf("open %s for write: %e", t, fd);
        exit();
      }
      if (fd) {
        dup(fd, 0);
        close(fd);
      }
      break;
    }
  }
}