|
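Take the add entry below as an example: I would like to use a regular expression to extract the mp3 file name of each example sentence, together with the English example sentence itself.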
add
<link type="text/css" rel="stylesheet" href="LDAE5.css"/>
<div id="LDAE5_add_1"><span class="entry" id="add"><span class="entryhead"><span class="hwd">add</span><span class="hyphenation frequent">add</span> <proncodes><span class="neutral"> /</span><span class="pron">æd</span><span class="neutral">/</span></proncodes><span class="level"> ●●●</span><span class="pos"> verb</span><span class="gram"> [transitive]</span> <a class="jp-play" href="sound://hwd/ame/a/ad1.mp3"><img src="img/spkr_b.png"></a><span class="buttons"><a class="popup-button" href="entry://@etymologies_u2fc098491a42200a.-5b7eb3a7.13b877f5061.-6775">Word Origin</a> <a class="popup-button" href="entry://@verbs_u2fc098491a42200a.-5b7eb3a7.13b877f5061.-6775">Verb Table</a> <a class="popup-button" href="entry://@collocations_add">Collocations</a> <a class="popup-button" href="entry://@thesaurus_add">Thesaurus</a> </span></span><span class="sense"><span class="sensenum">1</span><span class="def">to put something with something else, or with a group of other things</span><span class="neutral">: </span><span class="example"><a class="jp-play" href="sound://exa/ame/e/p032-000480813.mp3"><img src="img/spkr_g.png"></a> Continue mixing, then add flour.</span><span class="example"><a class="jp-play" href="sound://exa/ame/9/p032-000063988.mp3"><img src="img/spkr_g.png"></a> Do you want to <span class="colloinexa">add</span> your name <span class="colloinexa">to</span> the mailing list?</span></span><span class="sense"><span class="sensenum">2</span><span cat="math" class="topic"><span class="topic">math</span></span><span class="def"> to put numbers or amounts together and then calculate the total</span><span class="neutral">: </span><span class="example"><a class="jp-play" href="sound://exa/ame/a/p032-000064001.mp3"><img src="img/spkr_g.png"></a> If you add 5 and 3, you get 8.</span><span class="example"><a class="jp-play" href="sound://exa/ame/5/p032-000480814.mp3"><img src="img/spkr_g.png"></a> The interest will be added to your savings every six 
months.</span></span><span class="sense"><span class="sensenum">3</span><span class="def">to say something extra about what you have just said</span><span class="neutral">: </span><span class="example"><a class="jp-play" href="sound://exa/ame/1/p032-000064004.mp3"><img src="img/spkr_g.png"></a> The judge <span class="colloinexa">added that</span> this case was one of the worst she had ever tried.</span><span class="thesbox display" type="auto" id="add_s1"><span class="heading">THESAURUS</span><span class="section last"><span class="exponent inline" chosen="u2fc098491a42200a.-5b7eb3a7.13b877f5061.-675b"><span class="exp display">say</span></span><span class="exponent inline" chosen="u2fc098491a42200a.-5b7eb3a7.13b877f5061.-6759"><span class="neutral">, </span><span class="exp display">mention</span></span><span class="exponent inline" chosen="u2fc098491a42200a.-5b7eb3a7.13b877f5061.-6757"><span class="neutral">, </span><span class="exp display">state</span></span><span class="thesref"><span class="thesaurus">►</span> see <span class="thesaurus">thesaurus</span> at <a goto="say_1+say_1_s1"><span class="refhwd">say</span><span
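Looking at the markup (excerpt below), the file name always ends in .mp3, so it is easy to match with a regular expression. The example sentences are much harder, though, especially deciding where a sentence ends: it may end with any of several punctuation marks or with none at all, and the various <span> tags nested inside the sentence add further confusion.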
<span class="example"><a class="jp-play" href="sound://exa/ame/e/p032-000480813.mp3"><img src="img/spkr_g.png"></a> Continue mixing, then add flour.</span><span class="example"><a class="jp-play" href="sound://exa/ame/9/p032-000063988.mp3"><img src="img/spkr_g.png"></a> Do you want to <span class="colloinexa">add</span> your name <span class="colloinexa">to</span> the mailing list?</span></span>
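A minimal Python sketch of the file-name part, assuming (from the excerpt only) that example audio links always have the form href="sound://exa/.../name.mp3" while headword audio uses sound://hwd/. The names SAMPLE, MP3_RE and mp3_names are illustrative, not part of the dictionary data.

```python
import re

# Fragment copied from the excerpt above (two example sentences).
SAMPLE = (
    '<span class="example"><a class="jp-play" '
    'href="sound://exa/ame/e/p032-000480813.mp3"><img src="img/spkr_g.png"></a>'
    ' Continue mixing, then add flour.</span>'
    '<span class="example"><a class="jp-play" '
    'href="sound://exa/ame/9/p032-000063988.mp3"><img src="img/spkr_g.png"></a>'
    ' Do you want to <span class="colloinexa">add</span> your name '
    '<span class="colloinexa">to</span> the mailing list?</span></span>'
)

# Skip any directory components after sound://exa/ and capture only the
# last path segment, which must end in .mp3.
MP3_RE = re.compile(r'href="sound://exa/(?:[^"/]+/)*([^"/]+\.mp3)"')

mp3_names = MP3_RE.findall(SAMPLE)
print(mp3_names)
```

Restricting the pattern to sound://exa/ is a guess based on this one entry; if other entries store example audio under a different prefix, the pattern would need to be loosened.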
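The only clumsy approach I have come up with so far is to first strip out as many of the <span> tags nested inside the example sentences as possible, and then treat everything between </a> and </span> as one complete example sentence. However, this approach can easily miss sentences or produce wrong results.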
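The two-step approach just described can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the only nested tag it unwraps is <span class="colloinexa">, because that is the only one visible in the excerpt, and the names SAMPLE, NESTED_RE, EXAMPLE_RE and extract_examples are hypothetical.

```python
import re

# Fragment copied from the excerpt above (two example sentences).
SAMPLE = (
    '<span class="example"><a class="jp-play" '
    'href="sound://exa/ame/e/p032-000480813.mp3"><img src="img/spkr_g.png"></a>'
    ' Continue mixing, then add flour.</span>'
    '<span class="example"><a class="jp-play" '
    'href="sound://exa/ame/9/p032-000063988.mp3"><img src="img/spkr_g.png"></a>'
    ' Do you want to <span class="colloinexa">add</span> your name '
    '<span class="colloinexa">to</span> the mailing list?</span></span>'
)

# Step 1: unwrap the nested highlight spans but keep the text inside them.
# Assumption: "colloinexa" is the only class nested inside example sentences.
NESTED_RE = re.compile(r'<span class="colloinexa">(.*?)</span>', re.DOTALL)

# Step 2: within each example span, capture the mp3 file name and
# everything between </a> and the next </span>.
EXAMPLE_RE = re.compile(
    r'<span class="example"><a class="jp-play" '
    r'href="sound://exa/(?:[^"/]+/)*([^"/]+\.mp3)">.*?</a>(.*?)</span>',
    re.DOTALL,
)


def extract_examples(html):
    """Return a list of (mp3 file name, sentence text) pairs."""
    flat = NESTED_RE.sub(r'\1', html)
    pairs = []
    for name, text in EXAMPLE_RE.findall(flat):
        # Collapse line breaks and repeated spaces left over from the markup.
        pairs.append((name, ' '.join(text.split())))
    return pairs


examples = extract_examples(SAMPLE)
for name, sentence in examples:
    print(name, '|', sentence)
```

Because the sentence is delimited by the closing </span> rather than by punctuation, it does not matter whether it ends with a period, a question mark, or nothing. The weak point is step 1: any nested span with a class other than colloinexa would survive and cut the captured sentence short, which is exactly the kind of omission this approach is prone to.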
I would appreciate any advice: how can I extract the example sentence text accurately and simply? Thank you very much! |
|