What Is A Very General Way To Read-in .csv In Python And Pandas?
I have a .csv file whose rows have differing numbers of columns.
import pandas as pd
df = pd.read_csv(infile, header=None)
returns the ParserError: Error tokenizing data. C error: Ex
Solution 1:
OK, somewhat inspired by this related question: Pandas variable numbers of columns to binary matrix
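Read in the CSV but override the separator with a tab, so that pandas does not try to split the names and each line lands in a single column (this assumes the data itself contains no tab characters):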
In[7]:
import pandas as pd
import io
t="""Anne,Beth,Caroline,Ernie,Frank,Hannah
Beth,Caroline,David,Ernie
Caroline,Hannah
David,,Anne,Beth,Caroline,Ernie
Ernie,Anne,Beth,Frank,George
Frank,Anne,Caroline,Hannah
George,
Hannah,Anne,Beth,Caroline,David,Ernie,Frank,George"""
df = pd.read_csv(io.StringIO(t), sep='\t', header=None)
df
Out[7]:
  0
0 Anne,Beth,Caroline,Ernie,Frank,Hannah
1 Beth,Caroline,David,Ernie
2 Caroline,Hannah
3 David,,Anne,Beth,Caroline,Ernie
4 Ernie,Anne,Beth,Frank,George
5 Frank,Anne,Caroline,Hannah
6 George,
7 Hannah,Anne,Beth,Caroline,David,Ernie,Frank,Ge...
We can now use str.split with expand=True to expand the names into their own columns:
In[8]:
df[0].str.split(',', expand=True)
Out[8]:
          0         1         2         3         4       5      6       7
0      Anne      Beth  Caroline     Ernie     Frank  Hannah   None    None
1      Beth  Caroline     David     Ernie      None    None   None    None
2  Caroline    Hannah      None      None      None    None   None    None
3     David                Anne      Beth  Caroline   Ernie   None    None
4     Ernie      Anne      Beth     Frank    George    None   None    None
5     Frank      Anne  Caroline    Hannah      None    None   None    None
6    George                None      None      None    None   None    None
7    Hannah      Anne      Beth  Caroline     David   Ernie  Frank  George
So, just to be clear, modify your read_csv line to this:
df = pd.read_csv(infile, header=None, sep='\t')
and then do the str.split as above.
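The two steps above can be wrapped in a small helper. This is only a minimal sketch: the name read_ragged_csv is hypothetical, the sample data is made up for illustration, and it assumes the file contains no tab characters and that fields are never quoted strings with embedded commas. dtype=str is added so that lines consisting only of digits are still treated as text before splitting.

```python
import io
import pandas as pd

def read_ragged_csv(source):
    """Hypothetical helper: read a comma-separated file with uneven row lengths."""
    # Use a tab as the separator (assumed absent from the data) so that
    # every line is read into a single string column.
    raw = pd.read_csv(source, sep='\t', header=None, dtype=str)
    # Split that single column on commas; shorter rows are padded with
    # missing values on the right.
    return raw[0].str.split(',', expand=True)

# Illustrative data only
sample = io.StringIO("Anne,Beth,Caroline\nBeth\nCaroline,Hannah\n")
df = read_ragged_csv(sample)
print(df)
```

The same function accepts a file path instead of a StringIO object, since both are passed straight to pd.read_csv.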
Solution 2:
One can do some manipulation with the csv before using pandas.
import pandas as pd

# load data into list
with open('new_data.txt', 'r') as fil:
    data = fil.readlines()
# remove line breaks from string entries
data = [x.replace('\r\n', '') for x in data]
data = [x.replace('\n', '') for x in data]
# find the largest number of separators in any row (number of columns - 1)
total_cols = max(x.count(',') for x in data)
# pad each row with trailing ',' so every row has the same number of fields
new_data = [x + ',' * (total_cols - x.count(',')) for x in data]
# save data
with open('save_data.txt', 'w') as outp:
    outp.write('\n'.join(new_data))
# read it in as you did.
df = pd.read_csv('save_data.txt', header=None)
This is some rough Python, but it should work. I'll clean this up when I have time.
Or use the other answer; it's neat as it is.
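The padding idea can also be done in memory, without writing an intermediate file. The following is a rough sketch rather than a tested drop-in: pad_ragged_csv is a hypothetical name, the sample text is invented, and counting commas only works if no field is a quoted string that itself contains a comma.

```python
import io
import pandas as pd

def pad_ragged_csv(text):
    """Hypothetical helper: pad every line with trailing commas to equal length."""
    lines = text.splitlines()
    # Largest number of separators found in any line
    total_cols = max(line.count(',') for line in lines)
    # Append the missing separators to the shorter lines
    padded = [line + ',' * (total_cols - line.count(',')) for line in lines]
    return '\n'.join(padded)

# Illustrative data only
sample = "Anne,Beth,Caroline\nBeth\nCaroline,Hannah"
padded = pad_ragged_csv(sample)
df = pd.read_csv(io.StringIO(padded), header=None)
print(df)
```

Unlike the str.split approach, the padded cells here come back as NaN because read_csv parses the empty fields itself.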