
What Is A Very General Way To Read-in .csv In Python And Pandas?

I have a .csv file whose rows have different numbers of columns.

import pandas as pd
df = pd.read_csv(infile, header=None)

raises ParserError: Error tokenizing data. C error: Ex
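For reference, the failure can be reproduced with a tiny in-memory file. This is only a sketch: the two-line sample string is made up for illustration and stands in for `infile`. With `header=None`, pandas takes the column count from the first line, so a later line with more fields trips the tokenizer.

```python
import io
import pandas as pd

# Hypothetical sample: the second row has more fields than the first.
ragged = "a,b\nc,d,e,f\n"

try:
    pd.read_csv(io.StringIO(ragged), header=None)
    raised = False
except pd.errors.ParserError as exc:
    # pandas sized the table from line 1 and then met a longer line.
    raised = True
    print(type(exc).__name__)
```

Rows that are shorter than the first one do not raise; only longer rows do.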

Solution 1:

OK, somewhat inspired by this related question: Pandas variable numbers of columns to binary matrix

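So read in the csv, but override the separator with a tab so that pandas doesn't try to split the names and each line lands in a single column (this assumes the file contains no tab characters):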

In[7]:
import pandas as pd
import io
t="""Anne,Beth,Caroline,Ernie,Frank,Hannah
Beth,Caroline,David,Ernie
Caroline,Hannah
David,,Anne,Beth,Caroline,Ernie
Ernie,Anne,Beth,Frank,George
Frank,Anne,Caroline,Hannah
George,
Hannah,Anne,Beth,Caroline,David,Ernie,Frank,George"""
df = pd.read_csv(io.StringIO(t), sep='\t', header=None)
df

Out[7]: 
                                                   0
0              Anne,Beth,Caroline,Ernie,Frank,Hannah
1                          Beth,Caroline,David,Ernie
2                                    Caroline,Hannah
3                    David,,Anne,Beth,Caroline,Ernie
4                       Ernie,Anne,Beth,Frank,George
5                         Frank,Anne,Caroline,Hannah
6                                            George,
7  Hannah,Anne,Beth,Caroline,David,Ernie,Frank,Ge...

We can now use str.split with expand=True to expand the names into their own columns:

In[8]:
df[0].str.split(',', expand=True)

Out[8]: 
          0         1         2         3         4       5      6       7
0      Anne      Beth  Caroline     Ernie     Frank  Hannah   None    None
1      Beth  Caroline     David     Ernie      None    None   None    None
2  Caroline    Hannah      None      None      None    None   None    None
3     David                Anne      Beth  Caroline   Ernie   None    None
4     Ernie      Anne      Beth     Frank    George    None   None    None
5     Frank      Anne  Caroline    Hannah      None    None   None    None
6    George                None      None      None    None   None    None
7    Hannah      Anne      Beth  Caroline     David   Ernie  Frank  George

So, to be clear, change your read_csv line to this:

df = pd.read_csv(infile, header=None, sep='\t')

and then apply str.split to column 0 as above.
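Putting the two steps of Solution 1 together, a minimal sketch might look like the following. The helper name `read_ragged` and the short sample string are hypothetical. The `dtype=str` argument is an addition not in the answer above; it is there so the `.str` accessor still works when every line happens to look numeric. As before, this assumes the data contains no tab characters.

```python
import io
import pandas as pd

def read_ragged(source, sep=","):
    """Read each line as a single field, then split it into columns."""
    # A tab separator keeps pandas from splitting on commas (assumes no tabs in the data).
    raw = pd.read_csv(source, sep="\t", header=None, dtype=str)
    # expand=True spreads the pieces over columns; short rows are padded with missing values.
    return raw[0].str.split(sep, expand=True)

# Hypothetical sample with 3, 1 and 2 fields per row.
sample = "a,b,c\nd\ne,f\n"
wide = read_ragged(io.StringIO(sample))
print(wide)
```

`source` can be a file path or any file-like object, just as with a plain `pd.read_csv` call.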

Solution 2:

You can pad the csv so that every row has the same number of fields before handing it to pandas.

# load data into a list
with open('new_data.txt', 'r') as fil:
    data = fil.readlines()

# remove line breaks from string entries
data = [ x.replace('\r\n', '') for x in data]
data = [ x.replace('\n', '') for x in data]

# calculate the number of columns
total_cols = max([x.count(',') for x in data])

# add ',' to end of list depending on how many are needed
new_data = [x + ','*(total_cols-x.count(',')) for x in data]

# save data
with open('save_data.txt', 'w') as outp:
    outp.write('\n'.join(new_data))

# read it in as you did (needs: import pandas as pd)
df = pd.read_csv('save_data.txt', header=None)

This is rough Python, but it should work. I'll clean it up when I have time.

Or use the other answer; it's neat as it is.
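The padding idea from Solution 2 can also be done in memory, without writing an intermediate file. The sketch below is one possible variant, not the answer's own code: the helper name `read_padded` and the sample string are hypothetical. Like the original, it counts separators per line, so it would miscount if a quoted field contained a comma.

```python
import io
import pandas as pd

def read_padded(text, sep=","):
    """Pad every line to the same number of separators, then parse."""
    # Drop empty lines so they do not turn into all-missing rows.
    lines = [line for line in text.splitlines() if line]
    # The widest row decides how many separators each line needs.
    most = max(line.count(sep) for line in lines)
    padded = [line + sep * (most - line.count(sep)) for line in lines]
    return pd.read_csv(io.StringIO("\n".join(padded)), header=None, sep=sep)

# Hypothetical sample with 3, 2 and 2 fields per row.
sample = "Anne,Beth,Caroline\nGeorge,\nHannah,Anne\n"
padded_df = read_padded(sample)
print(padded_df.shape)  # (3, 3)
```

Unlike the str.split approach, empty and padded fields come back as NaN here, because `read_csv` treats empty strings as missing by default.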
