
Python: Most Optimal Way To Read File Line By Line

I have a large input file I need to read from, so I don't want to use enumerate or fo.readlines(). Reading with for line in fo: in the traditional way won't work either, and I'll state why, but I feel s…

Solution 1:

This might not be the shortest possible solution, but I believe it is “pretty optimal”.

def parse_number(stream):
    # Read one line, drop any trailing '#' comment, and return it as an int.
    return int(next(stream).partition('#')[0].strip())

def parse_coords(stream, count):
    # Eagerly read `count` numbers, one per line.
    return [parse_number(stream) for i in range(count)]

def parse_test(stream):
    # A test is a count, then `count` x-coordinates, then `count` y-coordinates.
    count = parse_number(stream)
    return list(zip(parse_coords(stream, count), parse_coords(stream, count)))

def parse_file(stream):
    # Lazily yield one test at a time.
    for i in range(parse_number(stream)):
        yield parse_test(stream)

It eagerly parses all coordinates of a single test, but the tests themselves are parsed lazily, one at a time, as you ask for them.
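For reference, here is the kind of input this parser expects, as a hypothetical file whose layout is inferred from the code (the '#' comments are optional, since parse_number strips them):

2           # number of tests
3           # test 1 has 3 points
1           # x-coordinates first ...
2
3
2           # ... then y-coordinates
4
6
1           # test 2 has 1 point
5
7

Parsing this yields [(1, 2), (2, 4), (3, 6)] for the first test and [(5, 7)] for the second.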

You can use it like this to iterate over the tests:

if __name__ == '__main__':
    with open('input.txt') as istr:
        for test in parse_file(istr):
            print(test)

Better function names might be desirable to distinguish the eager functions from the lazy ones; I'm experiencing a lack of naming creativity right now.

Solution 2:

How about this, using the grouper recipe from the itertools docs:

from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    """Collect data into fixed-length chunks or blocks
        grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx"""
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

with open(input_file) as archi:
    T = int(next(archi))   # number of tests
    N = int(next(archi))   # number of points in the test
    points = list(grouper(map(int, archi), N))
    print(points)   # [(1, 2, 3), (2, 4, 6)]
    result = list(zip(*points))
    print(result)   # [(1, 2), (2, 4), (3, 6)]

Here I use grouper to read N lines at a time, getting a list of tuples with all the x values and all the y values, then use zip to pair them together.
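The snippet above handles a single test; if the file really contains T of them, the same idea extends to a small generator. A sketch, assuming each test repeats the count-then-coordinates layout and the lines hold bare integers (input_file is a placeholder):

def read_tests(stream):
    T = int(next(stream))                              # number of tests (assumed layout)
    for _ in range(T):
        N = int(next(stream))                          # points in this test
        coords = (int(next(stream)) for _ in range(2 * N))
        xs, ys = grouper(coords, N)                    # first N values, then the next N
        yield list(zip(xs, ys))

with open(input_file) as archi:
    for test in read_tests(archi):
        print(test)   # e.g. [(1, 2), (2, 4), (3, 6)]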

Solution 3:

It sounds like you're not really trying to "read a file line by line" so much as skip around the file, treating it like a large list/array without triggering excessive memory consumption.

Have you looked at the mmap module? With it you can use methods like .find() to locate newlines, optionally starting at some offset (such as just past your current test header), .seek() to move the file pointer to the nth item you've found, and then .readline(), and so on.

An mmap object shares some methods and properties with strings/byte arrays and some with file-like objects, so you can use a mixture of methods like .find() (normal for strings and byte arrays) and .seek() (for files).

Additionally, Python's memory mapping uses your operating system's facilities for mapping files into memory. (On Linux and similar systems this is the same mechanism by which shared libraries are mapped into the address space of all your running processes, for example.) The key point is that your memory is only used as a cache for the contents of the file, and the operating system transparently performs the I/O needed to load and release buffers holding the file's contents.

I don't see a method for finding the nth occurrence of some character or string, so there's no way to jump straight to a given line. As far as I can tell you'll have to loop over .find(), but then you can skip back to any such line using Python's slice notation. You could write a utility class/object that scans ahead for, say, 1000 line terminators at a time, storing the offsets in an index/list; values from that index can then be used in slices of the memory mapping.
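A minimal sketch of that index-building idea, not from the original answer ('input.txt' and the line number below are placeholders): scan the mapping for newline offsets with .find(), then pull any line out by number with slice notation.

import mmap

def build_line_index(mm):
    # Offsets at which each line starts.
    offsets = [0]
    pos = mm.find(b'\n')
    while pos != -1:
        offsets.append(pos + 1)            # the next line starts just past '\n'
        pos = mm.find(b'\n', pos + 1)
    return offsets

with open('input.txt', 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        index = build_line_index(mm)
        n = 4                              # fetch the 5th line (0-based)
        start = index[n]
        end = index[n + 1] - 1 if n + 1 < len(index) else len(mm)
        print(mm[start:end])               # a bytes slice, no full-file read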
