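I. Introduction
Working in Jupyter Notebook, the goal is to find the useful data in 20 PDF documents and automatically fill it into the corresponding cells of an Excel spreadsheet.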
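II. Problems, errors, and fixes
2.1 pandas raises an error while reading a CSV: ParserError: Error tokenizing data. C error: Expected 1 fields in line 29, saw 2
1. Problem
2. Solution
3. Result: the file was read successfully:
- Also, regarding the second item in the notes: give it a meaningful name.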
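This error typically means the parser inferred one column from the first line and later met a line with more fields. A minimal sketch of one common cause (free-text lines before the real header) and one possible fix follows; the sample data and the choice of `skiprows` are assumptions for illustration, not necessarily what was used here.

```python
import io

import pandas as pd

# Hypothetical file content: two free-text lines precede the real header,
# so the parser infers a single column and then meets a line with two fields.
raw = "Audit fee export\nGenerated for review\nname,fee\nA,100\nB,200\n"

try:
    pd.read_csv(io.StringIO(raw))
    error_message = None
except pd.errors.ParserError as exc:
    error_message = str(exc)

# Skipping the preamble lets the real header line define the columns.
df = pd.read_csv(io.StringIO(raw), skiprows=2)
print(error_message)
print(df)
```

If the cause is instead a different delimiter, passing `sep=` with the right character is the usual alternative.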
2.2 Converting to dates: convert all values in the column to Pandas dates
1. Problem
2. Solution
3. My result
Converted successfully:
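A minimal sketch of such a conversion with `pd.to_datetime`, assuming a hypothetical column named `FILE_DATE`; `errors="coerce"` turns unparseable entries into `NaT` instead of raising.

```python
import pandas as pd

# Hypothetical column name and values, for illustration only.
df = pd.DataFrame({"FILE_DATE": ["2019-12-31", "2020-06-30", "not a date"]})

# Convert every value in the column; entries that cannot be parsed become NaT.
df["FILE_DATE"] = pd.to_datetime(df["FILE_DATE"], errors="coerce")

print(df.dtypes)
print(df)
```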
2.3 Computing the 5th percentile: Compute the 5th percentile value of the AUDIT_FEES column
1. Problem
What is a percentile?
How is it computed?
2. Solution
- For example, a special quantile, the median:
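The p-th percentile is the value below which roughly p% of the observations fall, so the median is simply the 50th percentile. A minimal sketch with `Series.quantile`, using made-up fee values (pandas interpolates linearly between neighbouring values by default):

```python
import pandas as pd

# Made-up values; only the column name AUDIT_FEES comes from the task.
df = pd.DataFrame({"AUDIT_FEES": [100, 200, 300, 400, 500]})

p5 = df["AUDIT_FEES"].quantile(0.05)      # 5th percentile
median = df["AUDIT_FEES"].quantile(0.50)  # 50th percentile, same as .median()

print(p5, median)
```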
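2.4 Deleting all rows that satisfy a condition: the drop function, indexes, and boolean expressions
3. Solution
4. My result
The 9800 rows shown below became 9318 rows: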
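Two equivalent ways to do this are sketched below: keep the rows where the condition is false, or pass the index of the matching rows to `drop`. The threshold and data are hypothetical; the actual condition used to go from 9800 to 9318 rows is not assumed here.

```python
import pandas as pd

df = pd.DataFrame({"AUDIT_FEES": [50, 500, 20, 900]})
threshold = 100  # hypothetical cut-off

# Boolean expression marking the rows to delete.
to_delete = df["AUDIT_FEES"] < threshold

kept_by_mask = df[~to_delete]                 # keep rows where the condition is False
kept_by_drop = df.drop(df[to_delete].index)   # drop rows by their index labels

print(len(df), "->", len(kept_by_mask))
```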
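2.5 How do you look up a particular row? Nothing is found when the text does not match exactly
Cause: the value being searched for was not exactly identical to the value in the source data.
Fix: first search for it manually in the dataset (the Excel or txt file) to confirm the exact spelling.
Analysis: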
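A minimal sketch of this kind of mismatch, using a hypothetical `COMPANY` column: an exact comparison fails because of a trailing space, while stripping whitespace or matching a substring finds the row.

```python
import pandas as pd

# Hypothetical data; note the trailing space in the first name.
df = pd.DataFrame({"COMPANY": ["Acme Corp ", "Beta Ltd", "Gamma Inc"]})

exact = df[df["COMPANY"] == "Acme Corp"]                    # no rows: strings differ
stripped = df[df["COMPANY"].str.strip() == "Acme Corp"]     # one row after trimming
partial = df[df["COMPANY"].str.contains("acme", case=False, regex=False)]

print(len(exact), len(stripped), len(partial))
```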
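2.6 Converting floats to integers, and then to int64
1. Converting floats to integers
2. Converting the integers to int64
In the example below, just reverse line 7:
3. My steps: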
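A minimal sketch of what such a conversion might look like with `astype`, on made-up values. Note that a plain cast truncates the fractional part, and that a column containing `NaN` cannot be cast to `int64`; the nullable `Int64` dtype is one way around that.

```python
import numpy as np
import pandas as pd

values = pd.Series([1921.0, 2005.0, 1998.7])

truncated = values.astype("int64")          # 1998.7 becomes 1998
rounded = values.round().astype("int64")    # 1998.7 becomes 1999

# With missing values, the nullable integer dtype keeps the gap as <NA>.
with_gap = pd.Series([1921.0, np.nan])
nullable = with_gap.astype("Int64")

print(truncated.dtype, rounded.tolist(), nullable.tolist())
```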
2.7 How do you check whether any element of a column falls within a range? Use the between function.
1.
2.
3. My steps
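A minimal sketch of `Series.between` on made-up values: it returns a boolean Series (both bounds inclusive by default), and `.any()` answers whether at least one element is in range.

```python
import pandas as pd

values = pd.Series([5, 15, 25])

in_range = values.between(10, 20)   # element-wise test, bounds included
any_in_range = in_range.any()       # True if at least one element qualifies

print(in_range.tolist(), any_in_range)
```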
2.8 pandas: deleting rows where a given column is NaN
1.
2. My steps
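A minimal sketch with `DataFrame.dropna` and its `subset` parameter, on hypothetical data: only missing values in the named column cause a row to be removed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "AUDITOR": ["A", None, "C"],
        "AUDIT_FEES": [1.0, 2.0, np.nan],
    }
)

# Rows are dropped only when AUDIT_FEES is NaN; the missing AUDITOR is kept.
cleaned = df.dropna(subset=["AUDIT_FEES"])

print(cleaned)
```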
2.9 pandas: deleting entries that match a specific value. In essence, select what you want to keep with a condition
1.
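2.10 How to convert the units of many elements in a column
1.
2.
List comprehension: this did not work for me, because the result is still a list:
It could not be assigned back to the original DataFrame:
3. So I fell back to a basic loop: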
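As an alternative to an explicit loop, column arithmetic in pandas is vectorized, and a list of the same length as the DataFrame can generally be assigned as a column. A minimal sketch, assuming hypothetically that the fees are stored in millions and should be expressed in single units:

```python
import pandas as pd

# Hypothetical: values are in millions.
df = pd.DataFrame({"AUDIT_FEES": [1.5, 2.0, 0.25]})

# Vectorized conversion: the multiplication applies to every element.
df["AUDIT_FEES_UNITS"] = df["AUDIT_FEES"] * 1_000_000

# A list comprehension also works when the list length matches the row count.
df["VIA_LIST"] = [value * 1_000_000 for value in df["AUDIT_FEES"]]

print(df)
```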
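III. Working with PDF files in Python
3.1 Installing the PyPDF2 library from the Anaconda Prompt
This walkthrough uses the PyPDF2 library to work with PDF files, so it has to be installed first:
3.2 Opening a file in Python: error when the path contains '\t'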
1.
2.
3.
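The cause is that the backslash \ in Python string literals is already reserved as the escape character.
- For example, \n is the escape sequence for a newline.
- For example, \t is the escape sequence for a tab (indentation).
Given that rule, writing a file path with backslashes requires some extra handling, such as the four methods above.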
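A minimal sketch of the problem and of some common ways to write such a path safely. The path itself is hypothetical, and these options are the usual ones rather than necessarily the same four methods referred to above.

```python
from pathlib import PureWindowsPath

# Hypothetical Windows path. In a normal literal, \t and \n are escapes.
broken = "C:\temp\new.pdf"      # contains a tab and a newline, not backslashes

raw = r"C:\temp\new.pdf"        # raw string: backslashes are kept literally
escaped = "C:\\temp\\new.pdf"   # doubled backslashes
forward = "C:/temp/new.pdf"     # forward slashes are accepted by Windows APIs

# pathlib treats both separator styles as the same path components.
name_from_raw = PureWindowsPath(raw).name
name_from_forward = PureWindowsPath(forward).name

print(repr(broken))
print(raw, escaped, forward, name_from_raw)
```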
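3.3 Extracting text from a PDF (Extract Text), with a worked example
1.
- First get the page you want.
- Then extract the text of that page. (Text is extracted one page at a time.)
Reference: https://developer.aliyun.com/article/1056664
2. Example
Using a paper as an example, this shows how PyPDF2 extracts the content of a PDF file.
The paper "ImageNet Classification with Deep Convolutional Neural Networks" has 9 pages; its first page is laid out as follows:
Python script: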
from PyPDF2 import PdfReader
pdf_path = "paper.pdf"  # hypothetical path; replace with the location of your PDF
# 1.
# Earlier versions called this PdfFileReader; that name is deprecated in favor of PdfReader, see:
#https://pypdf2.readthedocs.io/en/latest/_modules/PyPDF2/_reader.html?highlight=PdfFileReader#
reader = PdfReader(pdf_path)
number_of_pages = len(reader.pages)
# Before version 1.28.0 this was numPages, which is now deprecated, see:
#https://pypdf2.readthedocs.io/en/latest/modules/PdfReader.html#PyPDF2.PdfReader.numPages
print(number_of_pages)  # print the number of pages
# 2.
page = reader.pages[0]
# Before version 1.28.0 this was getPage(pageNumber), which is now deprecated, see:
#https://pypdf2.readthedocs.io/en/latest/modules/PdfReader.html#PyPDF2.PdfReader.getPage
print(page)  # print the Page<PyPDF2._page.Page> object for the first page of the PDF
text = page.extract_text()
# Before version 1.28.0 this was extractText(), which is now deprecated, see:
#https://pypdf2.readthedocs.io/en/latest/modules/PageObject.html#PyPDF2._page.PageObject.extractText
print(text)  # print the text extracted from the first page
Output:
9
{'/Contents': IndirectObject(13, 0), '/Parent': IndirectObject(1, 0), '/Type': '/Page', '/Resources': IndirectObject(14, 0), '/MediaBox': [0, 0, 612, 792]}
ImageNet Classication with Deep Convolutional
Neural Networks
Alex Krizhevsky
University of Toronto
kriz@cs.utoronto.ca
Ilya Sutskever
University of Toronto
ilya@cs.utoronto.ca
Geoffrey E. Hinton
University of Toronto
hinton@cs.utoronto.ca
Abstract
We trained a large, deep convolutional neural network to classify the 1.2 million
high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 dif-
ferent classes. On the test data, we achieved top-1 and top-5 error rates of 37.5%
and 17.0% which is considerably better than the previous state-of-the-art. The
neural network, which has 60 million parameters and 650,000 neurons, consists
of ve convolutional layers, some of which are followed by max-pooling layers,
and three fully-connected layers with a nal 1000-way softmax. To make train-
ing faster, we used non-saturating neurons and a very efcient GPU implemen-
tation of the convolution operation. To reduce overtting in the fully-connected
layers we employed a recently-developed regularization method called fidropoutfl
that proved to be very effective. We also entered a variant of this model in the
ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%,
compared to 26.2% achieved by the second-best entry.
1 Introduction
Current approaches to object recognition make essential use of machine learning methods. To im-
prove their performance, we can collect larger datasets, learn more powerful models, and use bet-
ter techniques for preventing overtting. Until recently, datasets of labeled images were relatively
small Š on the order of tens of thousands of images (e.g., NORB [16], Caltech-101/256 [8, 9], and
CIFAR-10/100 [12]). Simple recognition tasks can be solved quite well with datasets of this size,
especially if they are augmented with label-preserving transformations. For example, the current-
best error rate on the MNIST digit-recognition task (<0.3%) approaches human performance [4].
But objects in realistic settings exhibit considerable variability, so to learn to recognize them it is
necessary to use much larger training sets. And indeed, the shortcomings of small image datasets
have been widely recognized (e.g., Pinto et al. [21]), but it has only recently become possible to col-
lect labeled datasets with millions of images. The new larger datasets include LabelMe [23], which
consists of hundreds of tho usands of fully-segmented images, and ImageNet [6], which consists of
over 15 million labeled high-resolution images in over 22,000 categories.
To learn about thousands of objects from millions of images, we need a model with a large learning
capacity. However, the immense complexity of the object recognition task means that this prob-
lem cannot be specied even by a dataset as large as ImageNet, so our model should also have lots
of prior knowledge to compensate for all the data we don't have. Convolutional neural networks
(CNNs) constitute one such class of models [16, 11, 13, 18, 15, 22, 26]. Their capacity can be con-
trolled by varying their depth and breadth, and they also make strong and mostly correct assumptions
about the nature of images (namely, stationarity of statistics and locality of pixel dependencies).
Thus, compared to standard feedforward neural networks with similarly-sized layers, CNNs have
much fewer connections and parameters and so they are easier to train, while their theoretically-best
performance is likely to be only slightly worse.
1
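As shown, the page count and the text of the PDF are both extracted, although ligatures such as "fi" are lost in the extracted text (for example, "Classication").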
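3.4 Regular expressions: matching the target text fragment
1.
- Question: I do not quite understand this. The asterisk and the question mark are both quantifiers, so are they not redundant together?
2.
My steps:
The year is extracted successfully: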
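On the question about `*` followed by `?`: when `?` comes directly after another quantifier, it does not mean "zero or one" but makes that quantifier lazy, so it matches as little as possible. A minimal sketch on a made-up sentence shows the difference when extracting a year:

```python
import re

text = "Annual Report 2019, for fiscal year 2020"  # made-up sample

# Greedy: .* grabs as much as possible, so the last four-digit run is captured.
greedy_year = re.search(r"Report.*(\d{4})", text).group(1)

# Lazy: .*? grabs as little as possible, so the first four-digit run is captured.
lazy_year = re.search(r"Report.*?(\d{4})", text).group(1)

print(greedy_year, lazy_year)
```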
3.4 A function for finding text in the PDF: match
1.
2.
Note 1:
Correct:
Note 2:
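One pitfall worth illustrating, which may or may not be what the notes above refer to: `re.match` only matches at the beginning of the string, whereas `re.search` scans the whole string. A minimal sketch on a made-up line:

```python
import re

line = "Fees paid to auditor: 1,200"  # made-up sample

at_start = re.match(r"Fees", line)          # matches: the string starts with "Fees"
not_at_start = re.match(r"auditor", line)   # None: "auditor" is not at position 0
anywhere = re.search(r"auditor", line)      # matches: search scans the whole string

print(at_start is not None, not_at_start is None, anywhere.start())
```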
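3.5 Python re across multiple lines: the re.S flag makes "." match newlines too
- By default, "." does not match a newline, so a pattern built on "." can only match text within a single line.
- With the re.S flag, "." also matches newlines, so the target text can span several lines.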
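A minimal sketch of the difference, on a made-up two-line string:

```python
import re

text = "Auditor:\nKPMG LLP"  # made-up sample spanning two lines

# Without re.S, "." stops at the newline, so the pattern cannot bridge the lines.
single_line = re.search(r"Auditor:(.*)LLP", text)

# With re.S (also spelled re.DOTALL), "." matches the newline as well.
multi_line = re.search(r"Auditor:(.*)LLP", text, re.S)

print(single_line, repr(multi_line.group(1)))
```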
3.6 Removing all spaces, tabs, and newlines from a string
- Whichever method works is a good method.
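Two common approaches are sketched below on a made-up string: splitting on whitespace and rejoining, or substituting with a regular expression. Both remove spaces, tabs, and newlines.

```python
import re

messy = " a b\tc\nd "  # made-up sample

# str.split() with no argument splits on any run of whitespace.
joined = "".join(messy.split())

# \s matches spaces, tabs, and newlines.
substituted = re.sub(r"\s+", "", messy)

print(joined, substituted)
```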
3.7 Accessing a tuple with slices
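A minimal sketch of indexing and slicing on a made-up tuple; slices return a new tuple and leave the original unchanged.

```python
record = (2019, "KPMG", 1200.0, "London")  # made-up values

first = record[0]        # single element by index
middle = record[1:3]     # elements at positions 1 and 2
last = record[-1]        # negative index counts from the end
head = record[:2]        # everything before position 2

print(first, middle, last, head)
```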
3.8 How to assign values to a dictionary in Python
- Create an empty dictionary.
- Then assign values to its keys.
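The two steps above can be sketched as follows, with hypothetical keys and values; assigning to an existing key overwrites its value.

```python
extracted = {}                    # start with an empty dictionary

extracted["year"] = 2019          # add a new key
extracted["auditor"] = "KPMG"     # add another key
extracted["year"] = 2020          # same key again: the value is replaced

print(extracted)
```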
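3.9 The at method: selecting, getting, and changing the value of a single element
1.
- That is, when the csv file is read in the first step, a numeric index is added by default as the data becomes a DataFrame.
- If you want to choose which column serves as the index instead, proceed as follows:
2.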
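A minimal sketch of both points, with made-up CSV content and a hypothetical `COMPANY` column used as the index: `index_col` picks the index at read time, and `.at[row_label, column_label]` reads or writes one cell.

```python
import io

import pandas as pd

raw = "COMPANY,AUDIT_FEES\nAcme,100\nBeta,200\n"  # made-up CSV content

default_df = pd.read_csv(io.StringIO(raw))               # default 0, 1, ... index
df = pd.read_csv(io.StringIO(raw), index_col="COMPANY")  # COMPANY becomes the index

fee = df.at["Acme", "AUDIT_FEES"]    # get a single value
df.at["Beta", "AUDIT_FEES"] = 250    # change a single value

print(fee)
print(df)
```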
3.10 In Python, is an empty list equal to None? No.
1.
2.
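A minimal sketch of the distinction: an empty list is a real object that is neither equal nor identical to `None`, although both are falsy. This matters when checking regex results, since `re.findall` returns an empty list when nothing matches while `re.search` returns `None`.

```python
import re

empty = []

equal_to_none = empty == None   # noqa: E711 (comparison shown on purpose)
same_object = empty is None
both_falsy = (not empty) and (not None)

no_hits = re.findall(r"\d{4}", "no year here")   # [] when nothing matches
no_match = re.search(r"\d{4}", "no year here")   # None when nothing matches

print(equal_to_none, same_object, both_falsy, no_hits, no_match)
```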
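3.11 Converting a Python string to an int
- Question: for a value like 1921, why not store it directly in a numeric type such as int, instead of as a string wrapped in double quotes?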
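One likely reason, assuming the value comes out of text extraction or a regex match: those always produce strings, so the number arrives as `"1921"` and has to be converted explicitly. A minimal sketch on a made-up sentence:

```python
import re

sentence = "The company was founded in 1921."  # made-up sample

year_text = re.search(r"\d{4}", sentence).group()  # regex results are strings
year = int(year_text)                              # explicit conversion to int

print(type(year_text).__name__, type(year).__name__, year + 1)
```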
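3.12 pandas: fixing DataFrame rows and columns that are not fully displayed (elided)
1. Problem:
Odd: the Jupyter Notebook on the right shows all the columns, but PyCharm on the left does not. Or is it that the data type is unusual and cannot be read?
2. In PyCharm the output is in fact truncated by default. Setting the following pandas display options fixes it.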
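A minimal sketch of display options that are commonly adjusted for this; the specific values are assumptions and can be tuned. Setting a limit to `None` removes it.

```python
import pandas as pd

pd.set_option("display.max_columns", None)  # show every column
pd.set_option("display.max_rows", None)     # show every row
pd.set_option("display.width", 200)         # allow wider lines before wrapping

# Made-up wide frame to check the effect.
wide = pd.DataFrame([range(30)], columns=[f"col_{i}" for i in range(30)])
print(wide)
```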
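3.13 Why could the target data not be extracted from the annual reports of the following companies? Because the format differs; in the end, the city was mistaken for the audit firm's name
1. Problem: odd, why could the target data not be extracted from the annual reports of the following companies?
2. Cause:
Their format is different from that of the earlier annual reports.
3.
It is better to apply the regular expression line by line and to avoid matching across lines,
because a cross-line pattern does not necessarily fit all of the several dozen PDFs.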
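A minimal sketch of line-by-line matching, on made-up page text with a hypothetical "Audit fees:" label. Each line is tested on its own, so neighbouring lines such as a city name cannot leak into the captured value.

```python
import re

# Made-up page text; real reports will use different labels and layouts.
page_text = "Independent Auditor\nKPMG LLP\nLondon\nAudit fees: 1,200\n"

pattern = re.compile(r"^Audit fees:\s*([\d,]+)$")

fees = None
for line in page_text.splitlines():
    match = pattern.match(line.strip())
    if match:
        # Remove thousands separators before converting to int.
        fees = int(match.group(1).replace(",", ""))
        break

print(fees)
```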
3.14 Converting a Series to an integer type: s1.astype(int)
1.
2.
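A minimal sketch of `s1.astype(int)` on a made-up Series of numeric strings, such as values collected from regex matches:

```python
import pandas as pd

s1 = pd.Series(["1921", "2005", "1998"])  # made-up numeric strings

s1_int = s1.astype(int)   # every element is parsed as an integer

print(s1_int.dtype, s1_int.tolist())
```

If any element is not a valid integer string, `astype(int)` raises an error, so cleaning the values first is usually necessary.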