Python + Regular Expressions

I started over from scratch, because I could not get the extension for spaces to work with your solution, and my old solution became noticeably slower once there were two or three terms to replace.

Here is the current solution:

import re
from BeautifulSoup import BeautifulSoup
# slugify() is not defined in this snippet; it is presumably provided
# elsewhere in the project.


def ireplace(text, old, new, count=0):
    ''' Behaves like str.replace(), but does so in a case-insensitive
    fashion. '''
    pattern = re.compile(re.escape(old), re.I)
    return pattern.sub(new, text, count)


def parse_text(string, tags):
    soup = BeautifulSoup(string)

    # Swap existing links and images for placeholders so that the term
    # replacement below cannot touch them.
    link_soup = soup.findAll('a')
    for item in link_soup:
        item.replaceWith('[[[LINK]]]')

    img_soup = soup.findAll('img')
    for item in img_soup:
        item.replaceWith('[[[IMG]]]')

    soup2 = str(soup)

    # Mark every occurrence of a tag name, ignoring case.
    for i_tag, tag in enumerate(tags):
        soup2 = ireplace(soup2, tag.name, '[[[LINK%d]]]' % i_tag)

    # Restore the original links and images in document order.
    for item in link_soup:
        soup2 = soup2.replace('[[[LINK]]]', str(item), 1)

    for item in img_soup:
        soup2 = soup2.replace('[[[IMG]]]', str(item), 1)

    # Turn the numbered placeholders into glossary links.
    for i_tag, tag in enumerate(tags):
        slugname = slugify(tag.name)
        soup2 = soup2.replace(
            '[[[LINK%d]]]' % i_tag,
            '<a href="/glossary/%s-%s.html">%s</a>' % (tag.id, slugname, tag.name))

    return soup2

With BeautifulSoup I first replace all links and images with placeholders, which saves me from regex fiddling later on. Then the glossary links are added to the remaining content. Finally the placeholders are swapped back for the original links and images, and that's it... I hope no weaknesses show up in production 😀
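The mask, replace, restore idea described above can be sketched on its own. This is only an illustration under some assumptions: it is written for Python 3, it uses a regular expression instead of BeautifulSoup so that it runs without extra packages, it assumes links are not nested, and the names `mask`, `unmask` and the `[[[KEEP0]]]` placeholder format as well as the glossary id `1` are made up for the example.

```python
import re

# Matches complete <a>...</a> elements and standalone <img> tags.
# Assumption: a regex stands in for the BeautifulSoup parser here,
# and links are never nested.
PROTECTED = re.compile(r'<a\b.*?</a>|<img\b[^>]*>', re.IGNORECASE | re.DOTALL)


def mask(html):
    """Replace every link and image with a numbered placeholder."""
    saved = []

    def stash(match):
        saved.append(match.group(0))
        return '[[[KEEP%d]]]' % (len(saved) - 1)

    return PROTECTED.sub(stash, html), saved


def unmask(text, saved):
    """Put the stashed fragments back in place of their placeholders."""
    return re.sub(r'\[\[\[KEEP(\d+)\]\]\]',
                  lambda m: saved[int(m.group(1))], text)


html = '<p><a href="/x.html">Foo</a> foo <img src="foo.png"> foo</p>'
masked, saved = mask(html)

# Only the unprotected text is visible to the term replacement.
linked = re.sub('foo', '<a href="/glossary/1-foo.html">foo</a>',
                masked, flags=re.IGNORECASE)
result = unmask(linked, saved)
print(result)
```

Numbering the placeholders means the restore step does not depend on the order of the replacements, which is the main difference from counting `[[[LINK]]]` occurrences from left to right.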

snafu1

Joined: 5 September 2007

Posts: 2133

Location: Gelsenkirchen

10 September 2008 14:31

I have reworked my code as well, and spaces work now too. It would be great if you could give me feedback on how it performs in "everyday use" (if you decide to use it).

def set_anchors(soup, termlist):
    '''set_anchors(soup, termlist)
    
    Inspect all elements of a BeautifulSoup instance whether they are not 
    included by an a- or img-tag. In that case compare each word of the element 
    with the entries in termlist. Change each matching word to an anchor where 
    URL is /description/%s.html (%s = word).
    
    NOTE: You need to define a head tag, e.g.: soup.p 
    '''
    # Iterate over a copy, because replaceWith() modifies soup.contents.
    for elem in list(soup.contents):
        # Only plain strings are processed. Tag elements (a, img and any
        # other nested tag) are left untouched.
        if not isinstance(elem, basestring):
            continue
        anchors = _set_anchors(elem, termlist)
        elem.replaceWith(anchors)
    return soup


def _set_anchors(elem, termlist):
    for entry in termlist:
        if entry in elem:
            site = entry.lower().replace(' ', '_')
            anchor = '<a href="/description/%s.html">%s</a>' % (site, entry)
            elem = elem.replace(entry, anchor)
    return elem

>>> import souptool
>>> from BeautifulSoup import BeautifulSoup
>>> string = '<p><a href="link1.html">link1</a><a href=/bar/bar.html>bar</a>foo <a href=/bar/foo.html>bar</a>die Katze, die im Garten ist, faucht</p>'
>>> soup = BeautifulSoup(string)
>>> termlist = ['foo', 'bar', 'Katze', 'im Garten']
>>> souptool.set_anchors(soup, termlist) # nothing changes here, because the whole soup is treated as a single element
<p><a href="link1.html">link1</a><a href="/bar/bar.html">bar</a>foo <a href="/bar/foo.html">bar</a>die Katze, die im Garten ist, faucht</p>
>>> souptool.set_anchors(soup.p, termlist) # with the p tag as the "head" it works ;)
<p><a href="link1.html">link1</a><a href="/bar/bar.html">bar</a><a href="/description/foo.html">foo</a> <a href="/bar/foo.html">bar</a>die <a href="/description/katze.html">Katze</a>, die <a href="/description/im_garten.html">im Garten</a> ist, faucht</p>
>>> print souptool.set_anchors.__doc__
set_anchors(soup, termlist)
    
    Inspect all elements of a BeautifulSoup instance whether they are not 
    included by an a- or img-tag. In that case compare each word of the element 
    with the entries in termlist. Change each matching word to an anchor where 
    URL is /description/%s.html (%s = word).
    
    NOTE: You need to define a head tag, e.g.: soup.p
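One thing to keep in mind with `_set_anchors`: it replaces the terms one after another, so a later term can also match inside an anchor that was inserted for an earlier term (a term such as `html` would hit the generated URLs). A possible way around this, shown here only as a sketch for Python 3, is to build one alternation pattern and replace everything in a single pass. The function name `link_terms` is hypothetical; the URL scheme `/description/%s.html` is the one from the docstring above.

```python
import re


def link_terms(text, termlist):
    """Wrap every term found in text in an anchor, in a single pass."""
    if not termlist:
        return text
    # Longest terms first, so a multi-word term such as 'im Garten'
    # wins over a shorter term that overlaps with it.
    ordered = sorted(termlist, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(term) for term in ordered))

    def to_anchor(match):
        term = match.group(0)
        site = term.lower().replace(' ', '_')
        return '<a href="/description/%s.html">%s</a>' % (site, term)

    # re.sub() never rescans text it has just inserted, so the generated
    # anchors cannot be matched by another term.
    return pattern.sub(to_anchor, text)


text = 'die Katze, die im Garten ist, faucht'
linked = link_terms(text, ['Katze', 'im Garten', 'html'])
print(linked)
```

Like the original, this matches substrings rather than whole words; adding word boundaries to the pattern would be a further change that depends on what the term list looks like.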

Evolis2k2

(Thread starter)

Joined: 5 August 2007

Posts: 69

10 September 2008 14:57

Thanks for your help,

I will keep running my version for now (and if it buckles under load, I will try yours).
