luwenjun

File handling • 2019-07-31 23:16:59

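Working with common data file types in Python

This post summarizes how to read and write common data file types with Python. Some of the material comes from other authors and has been collected and tidied up here; I hope it helps.

The common data file types are:
* txt
* csv
* Excel (xls/xlsx)
* online web page data
* PDF/Word
* formats from other data software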

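1 txt files

Reading a file

import os

# Input file (workdir is assumed to hold the project's root directory)
file_txt = os.path.join(workdir, 'Data/demo_text.txt')

# Open the file; the with block closes it automatically
with open(file_txt, encoding='utf-8') as f:
    # Read every line into a list
    # .rstrip() strips trailing whitespace, including the newline
    lines_raw = [x.rstrip() for x in f]
    # Alternatively:
    # lines_raw = [l.rstrip() for l in f.readlines()]

print(lines_raw)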

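pandas can read the file as well:

import pandas as pd

# One row per line; this assumes the lines contain no commas,
# because read_csv splits on ',' by default
df_txt = pd.read_csv(file_txt, names=['txt'], encoding='utf-8')
df_txt.head()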

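Writing a file

# Output file
file_out = os.path.join(workdir, 'Data/out_text.txt')

with open(file_out, mode='w', encoding='utf-8') as f_out:
    # writelines() adds no line endings, and rstrip() removed them when reading,
    # so append '\n' to each line
    f_out.writelines(line + '\n' for line in lines_raw)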
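Keep in mind that writelines() writes the strings exactly as given and adds no line terminators. As a minimal sketch of a complete write-then-read round trip, the helpers below use pathlib; write_lines, read_lines, the temporary directory, and the sample lines are illustrative names and data, not part of the original example.

```python
import tempfile
from pathlib import Path


def write_lines(path, lines):
    """Write each string in lines on its own line, encoded as UTF-8."""
    Path(path).write_text("".join(line + "\n" for line in lines), encoding="utf-8")


def read_lines(path):
    """Return the file's lines without their line endings."""
    return Path(path).read_text(encoding="utf-8").splitlines()


if __name__ == "__main__":
    # Use a throwaway directory so the sketch leaves no files behind
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "out_text.txt"
        write_lines(target, ["first line", "second line"])
        print(read_lines(target))
```

Reading the file back right after writing it is a cheap way to confirm that the line structure survived.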

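2 csv files

Reading and writing csv files is comparatively simple: call the pandas functions directly.

# File path
file_csv = os.path.join(workdir, 'Data/demo_csv.csv')

# Read the file with pandas.read_csv()
df_csv = pd.read_csv(file_csv, sep=',', encoding='utf-8')

# Save the csv file with DataFrame.to_csv()
df_csv.to_csv('out_csv.csv', index=False, encoding='utf-8')

# Show the first 3 rows of the DataFrame
df_csv.head(3)

A csv file can also be read as plain text, but the processing is somewhat more involved, especially when a field value contains the delimiter (a comma, for example), as the name field in the example data above does.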
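To illustrate the delimiter problem, here is a minimal sketch using the standard library's csv module, which honors quoted fields where a naive split(',') does not. The sample rows and column names are made up for illustration and are not the contents of demo_csv.csv.

```python
import csv
import io

# Illustrative data: the name field of the first record contains a comma
SAMPLE = 'id,name,city\n1,"Zhang, San",Beijing\n2,Li Si,Shanghai\n'


def parse_naive(text):
    """Split each line on commas; breaks when a quoted field contains one."""
    return [line.split(",") for line in text.splitlines()]


def parse_with_csv(text):
    """Parse with csv.reader, which keeps quoted fields intact."""
    return list(csv.reader(io.StringIO(text)))


if __name__ == "__main__":
    print(parse_naive(SAMPLE)[1])     # 4 pieces: the name is split in two
    print(parse_with_csv(SAMPLE)[1])  # 3 fields: ['1', 'Zhang, San', 'Beijing']
```

For most analysis work pandas.read_csv() remains the simpler choice; the csv module is handy when the rows need to be processed one at a time as text.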

3 Excel (xls/xlsx) files

pandas also provides functions for reading and writing Excel files (pandas.read_excel() and DataFrame.to_excel()).

Unlike a csv file, an xlsx file can contain several sheets; pandas.read_excel reads the first sheet by default.

# File path
file_excel = os.path.join(workdir, 'Data/demo_xlsx.xlsx')

# Read the file with pandas.read_excel()
# sheet_name=0 reads the first sheet; a sheet name (string) can be given instead
# header=0 uses the first row as the header (column names)
# If the data has no header row, set header=None and pass a list of column names via names
# Current pandas versions accept no encoding argument here, so none is passed
df_excel = pd.read_excel(file_excel, sheet_name=0, header=0)

# Save the Excel file with DataFrame.to_excel()
df_excel.to_excel('out_excel.xlsx', index=False)

# Show the first 3 rows of the DataFrame
df_excel.head(3)

To work at the level of individual cells, consider one of two packages; I covered them in an earlier post, which you can consult:

xlwings, https://www.xlwings.org/
openpyxl, https://openpyxl.readthedocs.io/en/stable/

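import os
import xlwings as xw

file_excel = os.path.join(workdir, 'Data/demo_填表.xlsx')

# Start Excel without showing its window
app = xw.App(visible=False)

# Open the workbook in that hidden instance
wb = app.books.open(file_excel)

# Select the worksheet
# by index or by sheet name
ws = wb.sheets[0]

# Read the value of a cell
print(ws.range('A1').value)

# Write values to cells ('男' means male)
ws.range('B1').value = 'Ahong'
ws.range('B2').value = '男'
ws.range('B3').value = 'Python'

# Save the workbook
wb.save()
# It can also be saved under a new name, e.g. wb.save('new.xlsx')

# Close the workbook and quit the hidden Excel instance
wb.close()
app.quit()

This approach is worth considering when you need to read several cells from, or write data to, a batch of Excel files that share one layout.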

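4 Online web page data

Online web page data usually has to be collected with a web crawler. Web pages are semi-structured data, so the result also has to be organized into structured data.

Note: for more on web crawlers, see the O'Reilly book Web Scraping with Python: Collecting More Data from the Modern Web.

Packages commonly used to fetch and parse web pages:

requests, https://2.python-requests.org//zh_CN/latest/user/quickstart.html
BeautifulSoup, https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.html
lxml, https://lxml.de/, one of my favorite packages
re, https://docs.python.org/3/library/re.html; regular expressions are an essential skill for data cleaning; see also https://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html
json, https://docs.python.org/3/library/json.html, for handling JSON data
pandas, https://pandas.pydata.org/pandas-docs/stable/index.html, for storing the data as a DataFrame

A web crawler usually works in these steps:

Analyze the page's request pattern: is it GET or POST, what is the request URL, what format does the response use (JSON? static HTML?), which header parameters are needed, which variables appear in the URL or the POST body, and so on;
Fetch the page data with the requests package;
Parse the page data (turn the semi-structured page into structured data) with BeautifulSoup, lxml, re and json as needed;
Consolidate and store the data, using pandas to merge it and do a first round of cleaning.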
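The parsing step can be sketched with just re and json from the package list. The HTML snippet, the window.__DATA__ variable, and the field names below are invented for illustration; a real page would be fetched first (for example with requests.get(url).text) and would need its own patterns.

```python
import json
import re

# Invented stand-in for a fetched page: data embedded as JSON plus an HTML list
PAGE = """
<html><body>
<script>window.__DATA__ = {"items": [{"id": 1, "title": "txt"}, {"id": 2, "title": "csv"}]};</script>
<ul>
  <li class="tag"> excel </li>
  <li class="tag">pdf</li>
</ul>
</body></html>
"""


def extract_embedded_json(html):
    """Pull the JSON object assigned to window.__DATA__ and decode it."""
    match = re.search(r"window\.__DATA__\s*=\s*(\{.*?\});", html, re.S)
    return json.loads(match.group(1)) if match else None


def extract_tags(html):
    """Collect the text of every <li class="tag"> element, stripped."""
    return [t.strip() for t in re.findall(r'<li class="tag">(.*?)</li>', html, re.S)]


if __name__ == "__main__":
    data = extract_embedded_json(PAGE)
    print(data["items"])
    print(extract_tags(PAGE))
```

Regular expressions are enough for a small, regular snippet like this one. For real pages, BeautifulSoup or lxml are usually more robust, and the resulting list of dictionaries can be passed to pandas.DataFrame() for the consolidation step.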

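5 PDF/Word

5.1 Reading PDF files

To manipulate a PDF document (merging, selecting or deleting pages, and so on), the recommended packages are:

PyPDF2, http://mstamy2.github.io/PyPDF2/
pdfrw, https://github.com/pmaupin/pdfrw

Further reference: https://www.binpress.com/manipulate-pdf-python/

When processing a PDF file, make sure it is decrypted or has no password; processing an encrypted file raises an error.

Recommended PDF decryption tools:

The following example shows how to use PyPDF2: it selects the odd-numbered pages and saves them as a new document.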

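import os
import PyPDF2

# Input file path
file_in = os.path.join(workdir, 'Data/demo_pdf.pdf')
# Open the PDF file to read
f_in = open(file_in, 'rb')

# Read the PDF document
# (these are the older PyPDF2 names; PyPDF2 3.x renamed them to
# PdfReader, len(reader.pages), reader.pages[i], PdfWriter and add_page)
pdfReader = PyPDF2.PdfFileReader(f_in)

# Number of pages in the PDF
page_cnt = pdfReader.getNumPages()

pdfWriter = PyPDF2.PdfFileWriter()

# Select the odd-numbered pages (indexes 0, 2, 4, ...)
for page_idx in range(0, page_cnt, 2):
    page = pdfReader.getPage(page_idx)
    pdfWriter.addPage(page)

# Write the output document
file_out = open('pdf_out.pdf', 'wb')
pdfWriter.write(file_out)

# Close the output file
file_out.close()

# Close the input file
f_in.close()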

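To extract the content of PDF pages (what the document actually says), the recommended packages are:

textract, https://textract.readthedocs.io/en/stable/, which extracts data from many file formats
pdfminer.six, https://github.com/pdfminer/pdfminer.six; it is used the same way as pdfminer. For pdfminer usage see http://www.unixuser.org/~euske/python/pdfminer/

Once pdfminer.six is installed, run the following on the command line:

pdf2txt.py demo_pdf.pdf -o demo_pdf.txt

Alternatively, following https://stackoverflow.com/questions/26494211/extracting-text-from-a-pdf-file-using-pdfminer-in-python, you can define a function to convert PDFs in batches (the function is included at the end of this post).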
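A batch run could look like the sketch below. batch_convert and its parameters are hypothetical names; convert stands for any function that takes a PDF path and returns its text, such as the convert_pdf_to_txt function at the end of this post.

```python
from pathlib import Path


def batch_convert(src_dir, dst_dir, convert):
    """Convert every .pdf in src_dir to a .txt file in dst_dir.

    convert is a callable taking a PDF path (str) and returning text.
    Returns the list of written .txt paths.
    """
    out_dir = Path(dst_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # Sort so the files are processed in a predictable order
    for pdf_path in sorted(Path(src_dir).glob("*.pdf")):
        text = convert(str(pdf_path))
        txt_path = out_dir / (pdf_path.stem + ".txt")
        txt_path.write_text(text, encoding="utf-8")
        written.append(txt_path)
    return written
```

A call might then read batch_convert('Data', 'Data/txt', convert_pdf_to_txt), where both directory names are placeholders. Passing the converter in as an argument keeps the loop independent of the extraction library.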

An example using textract:

import os
import textract

# File path
file_pdf = os.path.join(workdir, 'Data/demo_pdf.pdf')

# Extract the text (returned as bytes)
text_raw = textract.process(file_pdf)
# Decode the bytes to str
text = text_raw.decode('utf-8')

5.2 Reading Word files

The python-docx package can be used: https://python-docx.readthedocs.io/en/latest/

Working with Word files is comparatively rare; the examples on that site are enough to get started.

6 Files from other data software

For example, the data formats exported by analysis software such as SAS, SPSS and Stata.

The pyreadstat package can be used: https://github.com/Roche/pyreadstat

# Read a .sav file with Python
# https://github.com/Roche/pyreadstat
import os
import pyreadstat

# File path
file_data = os.path.join(workdir, 'Data/demo_sav.sav')

# Read the file
df, meta = pyreadstat.read_sav(file_data)
# df is the resulting DataFrame

# Check the file encoding
print(meta.file_encoding)

df.head()

# ref: https://stackoverflow.com/questions/26494211/extracting-text-from-a-pdf-file-using-pdfminer-in-python

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def convert_pdf_to_txt(path):
    """Extract the text of the PDF at path and return it as one string."""
    # Shared resource cache and an in-memory buffer for the extracted text
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    # The converter writes the text of each processed page into retstr
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""    # empty: the PDF is assumed to be unencrypted
    maxpages = 0     # 0 means no page limit
    caching = True
    pagenos = set()  # empty set: process every page

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                  caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    # Release the file handle and the converter's resources
    fp.close()
    device.close()
    retstr.close()
    return text