The Latin Library [site mirror + macrons added]

许一诺 · Posted on 2020-3-6 08:47:24

Last edited by 许一诺 on 2020-3-6 11:07

The Latin Library (www.thelatinlibrary.com): every page of the site, with macrons added
Prepared by 许一诺, Department of German, Wuhan University

The Latin Library with macrons
Link: https://pan.baidu.com/s/1JtbB9vEfxr2wzO1Zq02MrQ Extraction code: 8whi

Usage notes:

0. This material is based on:
http://www.alatius.com/macronizer/
and
https://github.com/Alatius/latin-macronizer
Building on these, I set up a way to add macrons to Latin web pages locally, in batches.

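1. This folder was made by mirroring www.thelatinlibrary.com and then adding macrons. The easiest way to use it is to open the folder and browse by file name: any page with "macronized" in its name has had macrons added.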

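2. Here is a slightly more roundabout method:
Open index.htm in the www.thelatinlibrary.com folder. This is the Latin Library home page, and from it you can click through to any of the pages without macrons.
On any Latin-language page without macrons, for example:
file:///D:/www.thelatinlibrary.com/caesar/gall1.shtml.htm
(This is a locally stored page. Here www.thelatinlibrary.com is the name of the folder, not of the online site, and the example assumes the archive was extracted to drive D.)
insert "_macronized" before the last file extension
to see the version of the page with macrons.
For example, change the address to:
file:///D:/www.thelatinlibrary.com/caesar/gall1.shtml_macronized.htm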

3. This folder was tested in Chrome on Windows and in Firefox on Ubuntu; neither showed garbled characters.

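4. Link to the software used to add the macrons:
The Latin Library with macrons
Link: https://pan.baidu.com/s/1JtbB9vEfxr2wzO1Zq02MrQ Extraction code: 8whi
If the link stops working, email [email protected] to request a copy.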

5. The source code is rather messy; my apologies:

#!/usr/bin/env python3
import multiprocessing
import os
import re

import chardet
import langid
# Assumes macronizer.py from the latin-macronizer repository is importable.
from macronizer import Macronizer

SEPARATOR = '☺'  # joins text segments so one macronizer call handles them all


def method(text):
    """Macronize a string and return the result."""
    macron = Macronizer()
    macron.macronize(text)
    macroned = macron.tokenization.detokenize(True)
    del macron
    return macroned


def macron_paragraph(html):
    """Macronize only the text segments that sit directly between '>' and '<'."""
    string_list = re.findall(r'[^<>]+|<|>', html)
    positions = []
    for i in range(1, len(string_list) - 1):
        # Compare with '==': 'is' tests identity and is unreliable for strings.
        if string_list[i - 1] == '>' and string_list[i + 1] == '<':
            positions.append(i)
    smile_string = SEPARATOR.join(string_list[i] for i in positions)
    if langid.classify(smile_string)[0] == 'en':
        raise ValueError('The file is not Latin.')
    to_macron = method(smile_string).split(SEPARATOR)
    if len(to_macron) != len(positions):
        raise ValueError('Segment count changed during macronization.')
    # Write segments back by index; str.format would fail on stray braces in CSS or scripts.
    for i, segment in zip(positions, to_macron):
        string_list[i] = segment
    return ''.join(string_list)


def output_file_path(file_path):
    """Insert '_macronized' before the last extension of file_path."""
    filepath, tempfilename = os.path.split(file_path)
    filename, extension = os.path.splitext(tempfilename)
    return os.path.join(filepath, filename + '_macronized' + extension)


def macron_file(i, file_path):
    """Macronize one page and write the result next to it as UTF-8."""
    print(f'Processing file {i}: {file_path}')
    try:
        with open(file_path, mode='rb') as source:
            original = source.read()
        encoding = chardet.detect(original)['encoding']
        original = str(original, encoding=encoding)
        webpage = macron_paragraph(original)
        ofp = output_file_path(file_path)
        with open(ofp, mode='w', encoding='utf-8') as macronized_html:
            macronized_html.write(webpage)
        print(f'Finished file {i}: {ofp}')
    except Exception as error:
        # Report the input path: ofp may not be assigned yet when the failure occurs.
        print(f'File {i} failed: {file_path} ({error})')


def get_file_paths():
    """Read the list of pages to macronize, one path per line."""
    file_paths = []
    with open('/home/xyn/latin-macronizer/latin_test/index.txt', encoding='utf-8') as index:
        # index.txt is a text file listing the paths of the Latin pages to macronize
        for line in index:
            file_paths.append(line.rstrip('\r\n'))
    print(f'{len(file_paths)} files in total.')
    return file_paths


def main(file_paths):
    pool = multiprocessing.Pool(processes=8)
    for i, fp in enumerate(file_paths):
        pool.apply_async(macron_file, (i, fp))
    pool.close()
    pool.join()
    print('All done!')


if __name__ == '__main__':
    main(get_file_paths())
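
The script reads its list of pages from index.txt, but the post does not show how that file was produced. Below is a rough sketch of one way it could be built, assuming the mirrored pages all end in .htm as in the example paths above. The function names (macronized_path, collect_pages, write_index) are made up for illustration and are not part of the original script or of latin-macronizer.

```python
from pathlib import Path


def macronized_path(page: Path) -> Path:
    """Apply the naming rule from step 2: '_macronized' goes before the last extension."""
    return page.with_name(page.stem + "_macronized" + page.suffix)


def collect_pages(root: Path) -> list:
    """Return every .htm page under root that is not itself macronized output."""
    pages = []
    for page in sorted(root.rglob("*.htm")):
        if page.stem.endswith("_macronized"):
            continue  # skip pages produced by an earlier run
        pages.append(page)
    return pages


def write_index(root: Path, index_file: Path) -> int:
    """Write one page path per line to index_file and return the number of pages."""
    pages = collect_pages(root)
    index_file.write_text("".join(f"{page}\n" for page in pages), encoding="utf-8")
    return len(pages)
```

For example, write_index(Path("www.thelatinlibrary.com"), Path("index.txt")) would list every page in the mirror. The list is not filtered by language, so non-Latin pages still end up in it; the script above rejects pages that langid classifies as English.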


许一诺 · Posted on 2020-3-6 21:50:34

The macronized files can be used to build a frequency list of Latin word forms with their long vowels marked.
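
A rough sketch of how such a frequency list might be computed, not the method actually used here. It assumes the macronized pages are UTF-8 files named *_macronized.htm and that long vowels are stored as precomposed characters (ā, ē, and so on). The names count_words and frequency_table are illustrative only.

```python
import html
import re
from collections import Counter
from pathlib import Path

TAG = re.compile(r"<[^>]*>")
# Basic Latin letters plus the precomposed macron vowels.
WORD = re.compile(r"[A-Za-zĀāĒēĪīŌōŪūȲȳ]+")


def count_words(page_html: str) -> Counter:
    """Count lower-cased word forms in one page after removing tags."""
    text = html.unescape(TAG.sub(" ", page_html))
    return Counter(word.lower() for word in WORD.findall(text))


def frequency_table(root: Path) -> Counter:
    """Sum the word counts of every macronized page under root."""
    total = Counter()
    for page in root.rglob("*_macronized.htm"):
        total.update(count_words(page.read_text(encoding="utf-8")))
    return total
```

Calling frequency_table(Path("www.thelatinlibrary.com")).most_common(100) would then give the hundred most frequent forms. The tag stripping is crude: text inside script or style elements, and English navigation text, would be counted as well.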

Process · Posted on 2020-3-6 10:14:35

Last edited by Process on 2020-3-6 12:21

Many thanks to the OP for all the hard work!

ryanlo713 · Posted on 2020-3-7 04:52:09

Thanks, OP. Latin resources are still scarce in China (*´ω｀*)

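许一诺 · Posted on 2020-3-25 16:14:29

Last edited by 许一诺 on 2020-3-25 18:30

Today I found another way to generate a Latin word frequency list automatically, on the site below. I am uploading the frequency list here.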
http://perseus.uchicago.edu/Latin.html
