零基础学Python:文件(2)

上一节,对文件有了初步认识。要牢记,文件无非也是一种类型的数据。

文件的状态

很多时候,我们需要获取一个文件的有关状态(也称为属性),比如创建日期,访问日期,修改日期,大小,等等。在os模块中,有这样一个方法,专门让我们查看文件的这些状态参数的。

>>> import os
>>> file_stat = os.stat("131.txt")      #查看这个文件的状态
>>> file_stat                           #文件状态是这样的。从下面的内容,有不少从英文单词中可以猜测出来。
posix.stat_result(st_mode=33204, st_ino=5772566L, st_dev=2049L, st_nlink=1, st_uid=1000, st_gid=1000, st_size=69L, st_atime=1407897031, st_mtime=1407734600, st_ctime=1407734600)

>>> file_stat.st_ctime                  #这个是文件创建时间
1407734600.0882277

这是什么时间?看不懂!别着急,换一种方式。在python中,有一个模块time,是专门针对时间设计的。

>>> import time                         
>>> time.localtime(file_stat.st_ctime)  #这回看清楚了。
time.struct_time(tm_year=2014, tm_mon=8, tm_mday=11, tm_hour=13, tm_min=23, tm_sec=20, tm_wday=0, tm_yday=223, tm_isdst=0)

read/readline/readlines

上节中,简单演示了如何读取文件内容,但是,在用dir(file)的时候,会看到三个函数:read/readline/readlines,它们各自有什么特点,为什么要三个?一个不行吗?

在读者向下看下面内容之前,请想一想,如果要回答这个问题,你要用什么方法?注意,我问的是用什么方法能够找到答案,不是问答案内容是什么。因为内容,肯定是在某个地方存放着呢,关键是用什么方法找到。

搜索?是一个不错的方法。

还有一种,就是在交互模式下使用的,你肯定也想到了。

>>> help(file.read)

用这样的方法,可以分别得到三个函数的说明:

read(...)
    read([size]) -> read at most size bytes, returned as a string.

    If the size argument is negative or omitted, read until EOF is reached.
    Notice that when in non-blocking mode, less data than what was requested
    may be returned, even if no size parameter was given.

readline(...)
    readline([size]) -> next line from the file, as a string.

    Retain newline.  A non-negative size argument limits the maximum
    number of bytes to return (an incomplete line may be returned then).
    Return an empty string at EOF.

readlines(...)
    readlines([size]) -> list of strings, each a line from the file.

    Call readline() repeatedly and return a list of the lines so read.
    The optional size argument, if given, is an approximate bound on the
    total number of bytes in the lines returned.

对照一下上面的说明,三个的异同就显现了。

EOF什么意思?End-of-file。在维基百科中居然有对它的解释:

In computing, End Of File (commonly abbreviated EOF[1]) is a condition in a computer operating system where no more data can be read from a data source. The data source is usually called a file or stream. In general, the EOF is either determined when the reader returns null as seen in Java's BufferedReader,[2] or sometimes people will manually insert an EOF character of their choosing to signal when the file has ended.

明白EOF之后,就对比一下:

  • read:如果指定了参数size,就按照该指定长度从文件中读取内容,否则,就读取全文。被读出来的内容,全部塞到一个字符串里面。这样有好处,就是东西都到内存里面了,随时取用,比较快捷;“成也萧何败萧何”,也是因为这点,如果文件内容太多了,内存会吃不消的。文档中已经提醒注意在“non-blocking”模式下的问题,关于这个问题,不是本节的重点,暂时不讨论。
  • readline:那个可选参数size的含义同上。它则是以行为单位返回字符串,也就是每次读一行,依次循环,如果不限定size,直到最后一个返回的是空字符串,意味着到文件末尾了(EOF)。
  • readlines:size同上。它返回的是以行为单位的列表,即相当于先执行readline(),得到每一行,然后把这一行的字符串作为列表中的元素塞到一个列表中,最后将此列表返回。

依次演示操作,即可明了。有这样一个文档,名曰:you.md,其内容和基本格式如下:

分别用上述三种函数读取这个文件。

>>> f = open("you.md")
>>> content = f.read()
>>> content
'You Raise Me Up\nWhen I am down and, oh my soul, so weary;\nWhen troubles come and my heart burdened be;\nThen, I am still and wait here in the silence,\nUntil you come and sit awhile with me.\nYou raise me up, so I can stand on mountains;\nYou raise me up, to walk on stormy seas;\nI am strong, when I am on your shoulders;\nYou raise me up: To more than I can be.\n'
>>> f.close()

提示:养成一个好习惯,只要打开文件,不用该文件了,就一定要随手关闭它。如果不关闭它,它还驻留在内存中,后面又没有对它的操作,是不是浪费内存空间了呢?同时也增加了文件安全的风险。

请仔细观察,得到的就是一个大大的字符串,但是这个字符串里面包含着一些符号\n,因为原文中有换行符。如果用print输出这个字符串,就是这样的了,其中的\n起作用了。

>>> print content
You Raise Me Up
When I am down and, oh my soul, so weary;
When troubles come and my heart burdened be;
Then, I am still and wait here in the silence,
Until you come and sit awhile with me.
You raise me up, so I can stand on mountains;
You raise me up, to walk on stormy seas;
I am strong, when I am on your shoulders;
You raise me up: To more than I can be.

readline()读取,则是这样的:

>>> f = open("you.md")
>>> f.readline()
'You Raise Me Up\n'
>>> f.readline()
'When I am down and, oh my soul, so weary;\n'
>>> f.readline()
'When troubles come and my heart burdened be;\n'
>>> f.close()

显示出一行一行读取了,每操作一次f.readline(),就读取一行,并且将指针向下移动一行,如此循环。显然,这种是一种循环,或者说可迭代的。因此,就可以用循环语句来完成对全文的读取。

#!/usr/bin/env python
# coding=utf-8

f = open("you.md")

while True:
    line = f.readline()
    if not line:         #到EOF,返回空字符串,则终止循环
        break
    print line ,         #注意后面的逗号,去掉print语句后面的'\n',保留原文件中的换行

f.close()                #别忘记关闭文件

将其和文件”you.md”保存在同一个目录中,我这里命名的文件名是12701.py,然后在该目录中运行python 12701.py,就看到下面的效果了:

~/Documents$ python 12701.py 
You Raise Me Up
When I am down and, oh my soul, so weary;
When troubles come and my heart burdened be;
Then, I am still and wait here in the silence,
Until you come and sit awhile with me.
You raise me up, so I can stand on mountains;
You raise me up, to walk on stormy seas;
I am strong, when I am on your shoulders;
You raise me up: To more than I can be.

也用readlines()来读取此文件:

>>> f = open("you.md")
>>> content = f.readlines()
>>> content
['You Raise Me Up\n', 'When I am down and, oh my soul, so weary;\n', 'When troubles come and my heart burdened be;\n', 'Then, I am still and wait here in the silence,\n', 'Until you come and sit awhile with me.\n', 'You raise me up, so I can stand on mountains;\n', 'You raise me up, to walk on stormy seas;\n', 'I am strong, when I am on your shoulders;\n', 'You raise me up: To more than I can be.\n']

返回的是一个列表,列表中每个元素都是一个字符串,每个字符串中的内容就是文件的一行文字,含行末的符号。显而易见,它是可以用for来循环的。

>>> for line in content:
...     print line ,
... 
You Raise Me Up
When I am down and, oh my soul, so weary;
When troubles come and my heart burdened be;
Then, I am still and wait here in the silence,
Until you come and sit awhile with me.
You raise me up, so I can stand on mountains;
You raise me up, to walk on stormy seas;
I am strong, when I am on your shoulders;
You raise me up: To more than I can be.
>>> f.close()

读很大的文件

前面已经说明了,如果文件太大,就不能用read()或者readlines()一次性将全部内容读入内存,可以使用while循环和readlin()来完成这个任务。

此外,还有一个方法:fileinput模块

>>> import fileinput
>>> for line in fileinput.input("you.md"):
...     print line ,
... 
You Raise Me Up
When I am down and, oh my soul, so weary;
When troubles come and my heart burdened be;
Then, I am still and wait here in the silence,
Until you come and sit awhile with me.
You raise me up, so I can stand on mountains;
You raise me up, to walk on stormy seas;
I am strong, when I am on your shoulders;
You raise me up: To more than I can be.

我比较喜欢这个,用起来是那么得心应手,简洁明快,还用for。

对于这个模块的更多内容,读者可以自己在交互模式下利用dir()help()去查看明白。

还有一种方法,更为常用:

>>> for line in f:
...     print line ,
... 
You Raise Me Up
When I am down and, oh my soul, so weary;
When troubles come and my heart burdened be;
Then, I am still and wait here in the silence,
Until you come and sit awhile with me.
You raise me up, so I can stand on mountains;
You raise me up, to walk on stormy seas;
I am strong, when I am on your shoulders;
You raise me up: To more than I can be.

之所以能够如此,是因为file是可迭代的数据类型,直接用for来迭代即可。

seek

这个函数的功能就是让指针移动。特别注意,它是以字节为单位进行移动的。比如:

>>> f = open("you.md")
>>> f.readline()
'You Raise Me Up\n'
>>> f.readline()
'When I am down and, oh my soul, so weary;\n'

现在已经移动到第四行末尾了,看seek()的能力:

>>> f.seek(0)

意图是要回到文件的最开头,那么如果用f.readline()应该读取第一行。

>>> f.readline()
'You Raise Me Up\n'

果然如此。此时指针所在的位置,还可以用tell()来显示,如

>>> f.tell()
17L
>>> f.seek(4)

f.seek(4)就将位置定位到从开头算起的第四个字符后面,也就是”You “之后,字母”R”之前的位置。

>>> f.tell()
4L

tell()也是这么说的。这时候如果使用readline(),得到就是从当前位置开始到行末。

>>> f.readline()
'Raise Me Up\n'
>>> f.close()

seek()还有别的参数,具体如下:

whence的值:

  • 默认值是0,表示从文件开头开始计算指针偏移的量(简称偏移量)。这是offset必须是大于等于0的整数。
  • 是1时,表示从当前位置开始计算偏移量。offset如果是负数,表示从当前位置向前移动,整数表示向后移动。
  • 是2时,表示相对文件末尾移动。