Inexplicable blank lines

jackphil · 2023-02-13 13:21:32

#!/usr/bin/env python
# -*- coding: utf-8 -*-

def updat_file(ss):
    with open('test.txt', 'w', encoding='utf-8') as f:
        f.write('\r\n'.join(ss))

k = 0

if __name__ == "__main__":
    while k < 6:
        with open('test.txt', 'r', encoding='utf-8') as f:
            ss = [i[0:7] for i in f.readlines()]

        ss.pop(0)
        updat_file(ss)

        k += 1

test.txt:

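When test.txt contains no blank lines, everything works. If it contains a single blank line, many consecutive blank lines appear, and I cannot work out why. Could someone take a look?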

Watermelon.Rei · 2023-02-14 11:30:20

Official documentation: https://docs.python.org/3/library/functions.html#open

open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)

(middle part omitted)

newline determines how to parse newline characters from the stream. It can be None, '', '\n', '\r', and '\r\n'. It works as follows:
When reading input from the stream, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If it is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated. If it has any of the other legal values, input lines are only terminated by the given string, and the line ending is returned to the caller untranslated.
When writing output to the stream, if newline is None, any '\n' characters written are translated to the system default line separator, os.linesep. If newline is '' or '\n', no translation takes place. If newline is any of the other legal values, any '\n' characters written are translated to the given string.

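In text mode, if newline is not specified, open uses the default None. On reading, the line endings of all three platforms, '\n', '\r', and '\r\n', are translated to '\n'. On writing, every '\n' is translated to os.linesep, the line ending of your platform. This keeps behavior consistent across platforms. Note that inside Python (in text mode) line endings are always handled as '\n': https://docs.python.org/3/library/os.html#os.linesep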
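To make the read-side rule concrete, here is a small sketch that writes raw bytes containing all three line-ending styles and reads them back twice, once with the default newline=None and once with newline=''. The sample bytes and variable names are made up for illustration.

```python
import os
import tempfile

# Write raw bytes so that no newline translation happens on the way in.
raw = b"a\nb\rc\r\nd"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # newline=None (the default): every line-ending style comes back as '\n'.
    with open(path, "r", encoding="utf-8") as f:
        translated = f.readlines()

    # newline='': lines are still split on all three styles,
    # but the endings are returned untranslated.
    with open(path, "r", encoding="utf-8", newline="") as f:
        untranslated = f.readlines()
finally:
    os.remove(path)

print(translated)    # ['a\n', 'b\n', 'c\n', 'd']
print(untranslated)  # ['a\n', 'b\r', 'c\r\n', 'd']
```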

os.linesep
The string used to separate (or, rather, terminate) lines on the current platform. This may be a single character, such as '\n' for POSIX, or multiple characters, for example, '\r\n' for Windows. Do not use os.linesep as a line terminator when writing files opened in text mode (the default); use a single '\n' instead, on all platforms.

So when you write the updated ss list back, the line separator you join with should be

 f.write('\n'.join(ss))

In my experiment with Python 3.11 on Windows, if the text contains '\r\n', write turns that '\r\n' into two line breaks.
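The Windows observation can be reproduced on any platform with a short sketch. It assumes that passing newline='\r\n' when writing is a fair stand-in for the Windows default, where os.linesep is '\r\n'. The two-item list is invented for the demonstration.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    # newline="\r\n" emulates the Windows default translation on any platform:
    # every '\n' written is expanded to '\r\n'.
    with open(path, "w", encoding="utf-8", newline="\r\n") as f:
        f.write("\r\n".join(["a", "b"]))

    # Look at the bytes that actually reached the disk.
    with open(path, "rb") as f:
        on_disk = f.read()

    # Read back in the default universal-newlines mode.
    with open(path, "r", encoding="utf-8") as f:
        lines = f.readlines()
finally:
    os.remove(path)

print(on_disk)  # b'a\r\r\nb'
print(lines)    # ['a\n', '\n', 'b']
```

The '\n' inside '\r\n' is expanded again, so the file holds '\r\r\n', which universal-newlines reading treats as two line endings. With the POSIX default nothing is translated on write, so '\r\n' reaches the disk unchanged and reads back as a single line break.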

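Why the blank lines multiply:
Print what your code gets from readlines() for a file containing a blank line (after the i[0:7] slice). Here is an example from my side: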

['0000001', '0000002', '0000003', '\n', '0000004', '0000005', '0000006', '0000007', '0000008', '0000009', '0000010', '0000011', '0000012']

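The blank line ends up in the ss list as a '\n' element, and once the list is joined with .join() the number of blank lines grows: on every pass through the loop, a separator is added on each side of every such '\n' element.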

jackphil · 2023-02-16 14:58:03

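@Watermelon.Rei Thank you very much.

That write turns '\r\n' into two line breaks would explain the problem I actually ran into, although it does not seem to happen every time, so I had to add a check to my production program for whether an entry is empty. The original data file has no blank lines, and right after each read I use rstrip rather than slicing to remove the newline. Even so, blank lines sometimes show up in the file, although the program does nothing to the list other than pop and passing it between functions.

Thanks again!

Last edited by jackphil (2023-02-16 15:02:20)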

Watermelon.Rei · 2023-02-16 16:54:27

Let's look more closely at what readlines() actually does. In the CPython source, Lib/_pyio.py defines readlines():

    def readlines(self, hint=None):
        """Return a list of lines from the stream.
        hint can be specified to control the number of lines read: no more
        lines will be read if the total size (in bytes/characters) of all
        lines so far exceeds hint.
        """
        if hint is None or hint <= 0:
            return list(self)
        n = 0
        lines = []
        for line in self:
            lines.append(line)
            n += len(line)
            if n >= hint:
                break
        return lines

So without an argument, readlines() is simply list(IOBase).

Doc/library/io.rst describes the IOBase class in detail:

:class:`IOBase` (and its subclasses) supports the iterator protocol, meaning that an :class:`IOBase` object can be iterated over yielding the lines in a stream. Lines are defined slightly differently depending on whether the stream is a binary stream (yielding bytes), or a text stream (yielding character strings). See :meth:`~IOBase.readline` below.

Iterating over an IOBase object calls readline(), which splits the stream at line endings and yields each line to the loop variable.
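As a quick check of that equivalence, the sketch below reads the same in-memory text three ways, using io.StringIO in place of a real file. The sample text and variable names are invented for illustration.

```python
import io

text = "0000001\n0000002\n\n0000003"

# 1. readlines() without a hint
lines_from_readlines = io.StringIO(text).readlines()

# 2. iterating over the stream object directly
lines_from_iteration = list(io.StringIO(text))

# 3. calling readline() by hand until the stream is exhausted
lines_from_readline = []
stream = io.StringIO(text)
while True:
    line = stream.readline()
    if not line:  # readline() returns '' only at end of stream
        break
    lines_from_readline.append(line)

print(lines_from_readlines)
# ['0000001\n', '0000002\n', '\n', '0000003']
print(lines_from_readlines == lines_from_iteration == lines_from_readline)
# True
```

Each line keeps its trailing '\n', and a blank line comes back as the one-character string '\n'.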

The statement you use to read the lines,

ss = [i[0:7] for i in f.readlines()]

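takes the first 7 characters of each line and discards the eighth, the newline. When a line consists only of a newline (a blank line), the out-of-range slice i[0:7] simply returns whatever characters the string actually has, so a newline ends up in the ss list.
It would be simpler to read each line whole, then handle blank lines and strip the newlines when popping.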
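The slicing behavior is easy to verify in isolation. The sketch below compares the slice with rstrip('\n') on a normal line, a blank line, and a final line without a newline. The helper names first_seven and without_newline are hypothetical and only serve the demonstration.

```python
def first_seven(line):
    # Slicing past the end of a string is not an error:
    # Python returns whatever characters exist.
    return line[0:7]

def without_newline(line):
    # Removes only trailing '\n' characters, whatever the line length.
    return line.rstrip("\n")

samples = ["0000001\n", "\n", "0000012"]

sliced = [first_seven(s) for s in samples]
stripped = [without_newline(s) for s in samples]

print(sliced)    # ['0000001', '\n', '0000012']
print(stripped)  # ['0000001', '', '0000012']

# Blank lines are now empty strings and can be filtered out if unwanted.
non_blank = [s for s in stripped if s]
print(non_blank)  # ['0000001', '0000012']
```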

Alternatively, consider your file:

0000001
0000002
0000003
...

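It is stored on disk as
"0000001\n0000002\n0000003\n...."
A blank line shows up as several consecutive '\n', so a script that replaces repeated '\n' in the input file with a single one should get rid of the blank lines.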
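A minimal sketch of such a cleanup, assuming the whole file fits in memory and that lines containing only spaces do not need special handling. The function name collapse_blank_lines is hypothetical.

```python
import re

def collapse_blank_lines(text):
    # Replace every run of consecutive newline characters with a single one.
    return re.sub(r"\n+", "\n", text)

before = "0000001\n0000002\n\n\n0000003\n"
after = collapse_blank_lines(before)

print(repr(after))  # '0000001\n0000002\n0000003\n'
```

To apply it to a file, read the content with f.read(), pass it through the function, and write the result back. If the text starts with blank lines, one leading '\n' survives and would need to be stripped separately.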

Last edited by Watermelon.Rei (2023-02-16 16:58:43)

xiao80 · 2023-02-18 18:18:07

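jackphil wrote:

@Watermelon.Rei Thank you very much.
That write turns '\r\n' into two line breaks would explain the problem I actually ran into, although it does not seem to happen every time, so I had to add a check to my production program for whether an entry is empty. The original data file has no blank lines, and right after each read I use rstrip rather than slicing to remove the newline. Even so, blank lines sometimes show up in the file, although the program does nothing to the list other than pop and passing it between functions.
Thanks again!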

The extra lines are not caused by '\r\n' being written as two line breaks. They come from the fact that reading a blank line yields '\n'.

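To see this, suppose the original text is read into the list
['1', '2', '\n', '3']
Even if you join it with '\n', the result is
'1\n2\n\n\n3'
and after writing it out and reading it back you get
['1', '2', '\n', '\n', '3']
Notice that one blank line has become two. On the next pass the two become four, and so on.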

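A simple fix is to check each line as it is read and replace '\n' with '':
ss = ['' if i == '\n' else i[0:7] for i in f.readlines()]

This way, however many blank lines the original had is how many get written back, instead of the number doubling.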

依云 · 2023-02-19 10:41:12

Why not just use i.strip()? Also, .readlines() is unnecessary: the file object itself is iterable.
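Combining both suggestions, one pass of the original loop might look like the sketch below. It uses a temporary file instead of test.txt, joins with '\n' as recommended earlier in the thread, and drop_first_line is a hypothetical name. Note that strip() also removes leading and trailing spaces and does not truncate to 7 characters, so it is not an exact replacement for the slice.

```python
import os
import tempfile

def drop_first_line(path):
    # Iterate over the file object directly. strip() removes the trailing
    # newline and turns a blank line into ''.
    with open(path, "r", encoding="utf-8") as f:
        ss = [line.strip() for line in f]
    if ss:
        ss.pop(0)
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(ss))
    return ss

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with open(path, "w", encoding="utf-8") as f:
        f.write("0000001\n0000002\n\n0000003\n0000004\n")
    for _ in range(2):
        remaining = drop_first_line(path)
finally:
    os.remove(path)

print(remaining)  # ['', '0000003', '0000004']
```

After two passes the single blank line is still a single blank line.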

xiao80 · 2023-02-19 22:10:45

i.strip() does look cleaner, but there is no need to run a comparatively expensive strip on every line merely to detect blank lines.

依云 · 2023-02-20 10:50:52

20048 ~tmp
>>> python -m timeit -s 'a = "123456\n"' 'a.strip()'
5000000 loops, best of 5: 47.7 nsec per loop
20049 ~tmp
>>> python -m timeit -s 'a = "123456\n"' 'a[0:7]'
5000000 loops, best of 5: 46.8 nsec per loop
20050 ~tmp
>>> python -m timeit -s 'a = "123456\n"' '"" if a == "\n" else a[0:7]'
5000000 loops, best of 5: 64.6 nsec per loop

Arch Linux
