Last edited by mikeee on 2018-11-25 18:08
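Here is one approach that should work: convert the pdf to docx with ABBYY FineReader first, then convert the docx to htm.
I don't have FineReader installed on my machine, so I ran ten pages through the online service at https://finereaderonline.com (it only allows ten pages of OCR per day). The result is good: the page headers disappear from the htm automatically, the two columns become a single column, bold is preserved, and the hyphens at line breaks in the original pdf all seem to have been removed. Paragraphs that span a page break in the original pdf, however, do not seem to have been merged.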
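The unmerged cross-page paragraphs could probably be patched up after extraction. Below is a minimal sketch of one possible heuristic, not something FineReader or pyquery provides: a paragraph that does not end in sentence-final punctuation, followed by one that starts with a lowercase letter, is treated as a single paragraph cut by a page break. The function name `merge_split_paragraphs` and the punctuation set are my own assumptions, and a trailing hyphen such as "out-" is not handled.

```python
import re

# Assumption: a paragraph ending in one of these characters is complete.
_SENTENCE_END = re.compile(r'[.!?:;"”’)\]]\s*$')


def merge_split_paragraphs(paragraphs):
    """Join paragraphs that look like one paragraph split by a page break."""
    merged = []
    for para in paragraphs:
        para = para.strip()
        if not para:
            continue  # skip empty paragraphs
        if (merged
                and not _SENTENCE_END.search(merged[-1])
                and para[0].islower()):
            # Previous paragraph is unfinished and this one starts in
            # lowercase: glue them together with a single space.
            merged[-1] = merged[-1] + ' ' + para
        else:
            merged.append(para)
    return merged


paras = [
    'The word came to mean thorough, downright, or',
    'outright in the end.',
    'Next entry starts here.',
]
result = merge_split_paragraphs(paras)
```

Verse lines and entries that legitimately start in lowercase (for example "around the horn.") follow a paragraph that ends in punctuation, so this rule leaves them alone, but it is still only a heuristic and the output should be spot-checked.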
A quick look in Chrome DevTools: the CSS selector p.Bodytext21 locates all the definitions,
and the CSS selector p.Bodytext21>span.Bodytext2Bold locates the bold text inside the definitions.
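To see what those two selectors pick out without installing anything, here is a stdlib-only sketch built on `html.parser`. The sample HTML is made up to mimic the FineReader export (the class name `Bodytext20` for the non-bold text is taken from the script in this post; `Other` is invented), and the `EntryParser` class is hypothetical. Unlike the `>` child combinator, it accepts a bold span at any depth inside the paragraph.

```python
from html.parser import HTMLParser


class EntryParser(HTMLParser):
    """Collect (bold_text, full_text) for every <p class="Bodytext21">."""

    def __init__(self):
        super().__init__()
        self.entries = []      # one (bold, full) tuple per matching <p>
        self._in_p = False     # currently inside p.Bodytext21
        self._in_bold = False  # currently inside span.Bodytext2Bold
        self._bold = []
        self._full = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'p' and 'Bodytext21' in classes:
            self._in_p, self._bold, self._full = True, [], []
        elif tag == 'span' and self._in_p and 'Bodytext2Bold' in classes:
            self._in_bold = True

    def handle_endtag(self, tag):
        if tag == 'span':
            self._in_bold = False
        elif tag == 'p' and self._in_p:
            self.entries.append((''.join(self._bold).strip(),
                                 ''.join(self._full).strip()))
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._full.append(data)
            if self._in_bold:
                self._bold.append(data)


sample = (
    '<p class="Bodytext21"><span class="Bodytext2Bold">A-Rod.</span>'
    '<span class="Bodytext20"> Short for Alex Rodriguez.</span></p>'
    '<p class="Other">page header</p>'
)
parser = EntryParser()
parser.feed(sample)
parser.close()
entries = parser.entries
```

Only the first paragraph ends up in `entries`; the `p.Other` paragraph is ignored because it lacks the `Bodytext21` class.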
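I can't post images, so here are the docx and htm files (10 pages only). Baidu Pan link: https://pan.baidu.com/s/15Qc4tQeWcePy7AhTJLiJXQ extraction code: encg
After some tinkering, I ended up with the python3 code below. It processes the htm mentioned above, and its output can more or less be turned into an mdx.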
    '''Word and Phrase Origins test.'''
    from pyquery import PyQuery as pq

    file = r'WordandPhraseOrigins.htm'

    # Try UTF-8 first, then fall back to GB2312.
    try:
        with open(file, 'rt', encoding='utf8') as fh:
            html = fh.read()
    except Exception as exc:
        print('error: {}. Trying gb2312...'.format(exc))
        try:
            with open(file, 'rt', encoding='gb2312') as fh:
                html = fh.read()
            print('Looks good')
        except Exception as exc:
            # Exit with the error message if both encodings fail.
            raise SystemExit('error: {}. Giving up...'.format(exc))

    doc = pq(html)
    css_text = 'p.Bodytext21'  # every definition paragraph
    css_bold = 'p.Bodytext21>span.Bodytext2Bold'  # bold text (headwords)
    items = doc(css_text)


    def render(idx, elm):
        '''Return the bold text tagged with (hw), then the body text.'''
        bold = pq(elm)('span.Bodytext2Bold').text()
        body = pq(elm)('span.Bodytext20').text()
        return bold + ('(hw)\n' if bold else '\n') + body


    text = items.map(render)
    print('\n\n'.join(text[:60]))
The output of the code above looks roughly like this: ...
A-Rod.(hw)
People who have little or no knowledge of baseball might have trouble with these initials. They are short for Alex Rodriguez, the famous Yankee baseball star.
around Cape Horn.(hw)
An expression once used in whaling communities to mean “being away on a whaling voyage.” One old poem went:
“I’ll tell your father, boys,” I cried To lads at play upon my lawn.
They chorused back, “You’ll have to go Around Cape Horn.”
around the horn.(hw)
In the days of the tall ships any sailor who had sailed around Cape Horn was entitled to spit to windward; otherwise, it was a serious infraction of nautical rules of conduct. Thus, the permissible practice of spitting to windward was called Cape Horn isn’t so named because it is shaped like a horn. Captain Schouten, the Dutch navigator who first rounded it in 1616, named it after Hoorn, his birthplace in northern Holland.
arrant thief; knight errant.(hw)
was originally just a variation of nomadic or vagabond, the word best known in a knight who roamed the country performing good deeds. But from its persistent use in expressions such as an a thief who roamed the countryside holding up victims, came to mean thorough, downright, or out-
。。。
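To get from output like this to something an mdx builder can read, the paragraphs still have to be grouped into entries. A rough sketch follows, assuming the usual MdxBuilder plain-text source layout as I understand it (headword on one line, the definition as HTML on the next, then a line containing only `</>`). The function `to_mdx_source` is hypothetical, and so are its choices to strip the trailing period from headwords and to join continuation paragraphs with `<br>`.

```python
import html


def to_mdx_source(paragraphs):
    """Build MdxBuilder-style source text from (headword, body) pairs.

    A pair with an empty headword is treated as a continuation of the
    previous entry (for example the verse lines under "around Cape Horn").
    """
    entries = []  # each item is [headword, [body paragraphs]]
    for headword, body in paragraphs:
        headword, body = headword.strip(), body.strip()
        if headword:
            # Assumption: the trailing period is not part of the headword.
            entries.append([headword.rstrip('.'), [body] if body else []])
        elif entries and body:
            entries[-1][1].append(body)
        # A continuation that appears before any headword is dropped.

    chunks = []
    for headword, bodies in entries:
        # Definitions are HTML, so escape &, < and > in the plain text.
        definition = '<br>'.join(html.escape(b, quote=False) for b in bodies)
        chunks.append('{}\n{}\n</>\n'.format(headword, definition))
    return ''.join(chunks)


src = to_mdx_source([
    ('A-Rod.', 'Short for Alex Rodriguez.'),
    ('around Cape Horn.', 'An expression.'),
    ('', 'One old poem.'),
])
```

The `(headword, body)` pairs would come from the same two selectors used in the script, with the bold text as the headword and the remaining text as the body.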
By the way, a plug for pyquery: doesn't it blow regex, bs4, and lxml out of the water?