Python docx





SebasSBM commented on Aug 16, 2014

Thanks for your reply, scanny. I’ve just read it and started researching. The command you posted revealed an inner file structure, so I opened the file with Ubuntu’s archive tool and noticed that all of the parts are XML files. I’m used to XML because I have made Android apps and used some SOAP web services, so XML is not new to me. Still, you have a point that the learning curve may be steep, since the XML structure seems quite complex.

Anyway, analyzing it I figured out some things in less than half an hour: it seems that styles are defined in “styles.xml”. There is also a file for the fonts. I don’t get why there seem to be 4 fonts in a test.docx file I created with LibreOffice, in which I only used the default font and a hyperlink (which seems to have its own style defined), but I don’t think the extra fonts are relevant for now.

I’ve taken a look at the “document.xml” file and noticed a difference between a normal paragraph and a hyperlink paragraph. This would be a normal paragraph structure: Prueba jajajjajajajaja On the other hand, this would be a hyperlink paragraph: In other words, it seems that the <w:hyperlink> tags contain the whole rich text structure that is supposed to be the hyperlink, with an id that points to the actual URL stored somewhere else in the XML file system, I guess.

It seems quite interesting. Unfortunately, I don’t have much spare time lately, because I’m very busy with web development. Anyway, if I ever have some spare time, I’d like to research how your Python API reads the paragraphs, and make their .text() method able to recognize the <w:hyperlink> tag as a text container. I'll keep you informed if I make any relevant progress.
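The observation above, that a .docx file is a compressed package of XML parts, can be reproduced with Python's standard zipfile module. This is only a minimal sketch: the helper names are made up for illustration, and the toy package built in memory (including the part name word/fontTable.xml) is an assumption standing in for a real document, not a valid Word file.

```python
import io
import zipfile


def list_parts(docx_bytes):
    """Return the names of the parts stored in a .docx package."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return package.namelist()


def read_part(docx_bytes, name):
    """Return one part of the package decoded as UTF-8 text."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return package.read(name).decode("utf-8")


def make_toy_package():
    """Build a tiny stand-in package in memory (NOT a valid Word file)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", "<w:document/>")
        package.writestr("word/styles.xml", "<w:styles/>")
        package.writestr("word/fontTable.xml", "<w:fonts/>")
    return buffer.getvalue()


toy = make_toy_package()
parts = list_parts(toy)
document_xml = read_part(toy, "word/document.xml")
```

For a real file, reading the bytes with open(path, "rb").read() and passing them to list_parts should show the same kind of layout.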




SebasSBM commented on Aug 17, 2014

I think here’s the problem: check the class CT_P at master/docx/oxml/ . If you take a look at the initial variables (lines 36 and 37), it seems this class (which I suppose handles <w:p> objects) doesn't handle <w:hyperlink> objects at all. I think that's why text inside hyperlinks is not returned by the text property. I don't know much about the structure of the whole project, not yet, but I think this is the way to go to resolve the problem.
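The effect described in this comment can be illustrated without python-docx at all. The sketch below uses the standard library's ElementTree on a hand-written paragraph; the sample XML, the relationship id rId1, and both helper names are assumptions made up for the illustration. Collecting text only from runs that are direct children of the paragraph skips the run wrapped in the hyperlink element.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
NS = {"w": W}

# Hand-written sample paragraph; "rId1" is a hypothetical relationship id.
PARAGRAPH_XML = (
    '<w:p xmlns:w="%s" xmlns:r="%s">'
    '<w:r><w:t xml:space="preserve">See </w:t></w:r>'
    '<w:hyperlink r:id="rId1"><w:r><w:t>the site</w:t></w:r></w:hyperlink>'
    '<w:r><w:t>.</w:t></w:r>'
    '</w:p>' % (W, R)
)


def text_of_direct_runs(p):
    """Text of runs that are direct children of the paragraph only."""
    return "".join(t.text or "" for t in p.findall("w:r/w:t", NS))


def text_of_all_runs(p):
    """Text of every w:t descendant, including those inside hyperlinks."""
    return "".join(t.text or "" for t in p.iter("{%s}t" % W))


p = ET.fromstring(PARAGRAPH_XML)
direct = text_of_direct_runs(p)    # "See ."
everything = text_of_all_runs(p)   # "See the site."
```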

The docx internals make heavy use of metaprogramming, which makes them hard to analyze.


_add_list_getter (d:\Users\cutep\AppData\Local\Programs\Python\Python38-32\Lib\site-packages\docx\oxml\
populate_class_members (d:\Users\cutep\AppData\Local\Programs\Python\Python38-32\Lib\site-packages\docx\oxml\
__init__ (d:\Users\cutep\AppData\Local\Programs\Python\Python38-32\Lib\site-packages\docx\oxml\
<module> (d:\Users\cutep\AppData\Local\Programs\Python\Python38-32\Lib\site-packages\docx\oxml\
<module> (d:\Users\cutep\AppData\Local\Programs\Python\Python38-32\Lib\site-packages\docx\oxml\
<module> (d:\Users\cutep\AppData\Local\Programs\Python\Python38-32\Lib\site-packages\docx\opc\
<module> (d:\Users\cutep\AppData\Local\Programs\Python\Python38-32\Lib\site-packages\docx\opc\
<module> (d:\Users\cutep\AppData\Local\Programs\Python\Python38-32\Lib\site-packages\docx\
<module> (d:\Users\cutep\AppData\Local\Programs\Python\Python38-32\Lib\site-packages\docx\
<module> (d:\Users\cutep\AppData\Local\Programs\Python\Python38-32\Lib\site-packages\docx\
<module> (g:\sw\Python36\
_run_code (d:\Users\cutep\AppData\Local\Programs\Python\Python38-32\Lib\
_run_module_code (d:\Users\cutep\AppData\Local\Programs\Python\Python38-32\Lib\
run_path (d:\Users\cutep\AppData\Local\Programs\Python\Python38-32\Lib\
_run_code (d:\Users\cutep\AppData\Local\Programs\Python\Python38-32\Lib\
_run_module_as_main (d:\Users\cutep\AppData\Local\Programs\Python\Python38-32\Lib\


class CT_Body(BaseOxmlElement):
    """
    ``<w:body>``, the container element for the main document story in
    ``document.xml``.
    """
    p = ZeroOrMore('w:p', successors=('w:sectPr',))
    tbl = ZeroOrMore('w:tbl', successors=('w:sectPr',))
    sectPr = ZeroOrOne('w:sectPr', successors=())

And class BaseOxmlElement is implemented as:

BaseOxmlElement = MetaOxmlElement(
    'BaseOxmlElement', (etree.ElementBase,), dict(_OxmlElementBase.__dict__)
)


class MetaOxmlElement(type):
    """
    Metaclass for BaseOxmlElement
    """
    def __init__(cls, clsname, bases, clsdict):
        dispatchable = (
            OneAndOnlyOne, OneOrMore, OptionalAttribute, RequiredAttribute,
            ZeroOrMore, ZeroOrOne, ZeroOrOneChoice
        )
        for key, value in clsdict.items():
            if isinstance(value, dispatchable):
                value.populate_class_members(cls, key)




class ZeroOrMore(_BaseChildElement):
    """
    Defines an optional repeating child element for MetaOxmlElement.
    """
    def populate_class_members(self, element_cls, prop_name):
        """
        Add the appropriate methods to *element_cls*.
        """
        super(ZeroOrMore, self).populate_class_members(
            element_cls, prop_name
        )
        delattr(element_cls, prop_name)


    def _add_list_getter(self):
        """
        Add a read-only ``{prop_name}_lst`` property to the element class to
        retrieve a list of child elements matching this type.
        """
        prop_name = '%s_lst' % self._prop_name
        property_ = property(self._list_getter, None, None)
        setattr(self._element_cls, prop_name, property_)
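The mechanism in the excerpts above is easier to follow in a stripped-down form. The sketch below is not python-docx code: ZeroOrMoreSketch, MetaSketch, Node and BodySketch are hypothetical stand-ins, and children are kept in a plain list instead of an lxml element. It only shows the pattern: the metaclass scans the class body for declaration objects, and each declaration installs a ``{prop_name}_lst`` property and then deletes itself from the class.

```python
class ZeroOrMoreSketch:
    """Declares a repeating child; a simplified stand-in for ZeroOrMore."""

    def __init__(self, tag):
        self._tag = tag

    def populate_class_members(self, element_cls, prop_name):
        tag = self._tag

        def list_getter(obj):
            # Return the children whose tag matches this declaration.
            return [child for child in obj.children if child.tag == tag]

        # Install the read-only ``{prop_name}_lst`` property ...
        setattr(element_cls, "%s_lst" % prop_name, property(list_getter))
        # ... and remove the declaration itself from the class.
        delattr(element_cls, prop_name)


class MetaSketch(type):
    """Dispatches each declaration found in the class body."""

    def __init__(cls, clsname, bases, clsdict):
        super().__init__(clsname, bases, clsdict)
        for key, value in clsdict.items():
            if isinstance(value, ZeroOrMoreSketch):
                value.populate_class_members(cls, key)


class Node:
    """Minimal element: a tag plus a list of child nodes."""

    def __init__(self, tag, children=()):
        self.tag = tag
        self.children = list(children)


class BodySketch(Node, metaclass=MetaSketch):
    p = ZeroOrMoreSketch("w:p")
    tbl = ZeroOrMoreSketch("w:tbl")


body = BodySketch("w:body", [Node("w:p"), Node("w:tbl"), Node("w:p")])
paragraph_tags = [child.tag for child in body.p_lst]   # ["w:p", "w:p"]
has_declaration = hasattr(BodySketch, "p")             # False
```

Because the properties are created while the class is being built, searching the source for a definition of p_lst finds nothing, which is a large part of why the code is hard to trace.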

class Paragraph(Parented):
    holds self._p, whose type is class CT_P(BaseOxmlElement)

Why does "for child in self" work on CT_P?
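A likely answer: BaseOxmlElement is built on etree.ElementBase (see the MetaOxmlElement call above), and lxml elements follow the ElementTree API, in which iterating over an element yields its direct children. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for lxml, with a made-up paragraph that has no namespaces, just to show that behavior.

```python
import xml.etree.ElementTree as ET

# Made-up, namespace-free stand-in for a paragraph element.
p = ET.fromstring("<p><pPr/><r>one</r><hyperlink><r>two</r></hyperlink></p>")

# Iterating an element yields its direct children, in document order.
child_tags = [child.tag for child in p]

# Grandchildren are only reached by iterating a child in turn.
nested_tags = [c.tag for c in p.find("hyperlink")]
```

This is why a loop over the paragraph sees the hyperlink element itself but not the runs inside it, unless the loop descends one more level.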

Temporary solution


class Paragraph(Parented):
    @property
    def text(self):
        """
        String formed by concatenating the text of each run in the paragraph.
        Tabs and line breaks in the XML are mapped to ``\\t`` and ``\\n``
        characters respectively.

        Assigning text to this property causes all existing paragraph content
        to be replaced with a single run containing the assigned text.
        A ``\\t`` character in the text is mapped to a ``<w:tab/>`` element
        and each ``\\n`` or ``\\r`` character is mapped to a line break.
        Paragraph-level formatting, such as style, is preserved. All
        run-level formatting, such as bold or italic, is removed.
        """
        text = ''
        for run in self.runs:
            text += run.text
        return text

    @property
    def text_2(self):
        """
        Read-only variant of ``text`` that delegates to ``CT_P.text_2``, so
        that text inside ``<w:hyperlink>`` children is included as well.
        """
        text = self._p.text_2
        return text

from docx.oxml.ns import qn


class CT_P(BaseOxmlElement):
    """
    ``<w:p>`` element, containing the properties and text for a paragraph.
    """
    pPr = ZeroOrOne('w:pPr')
    r = ZeroOrMore('w:r')
    hyperlink = ZeroOrMore('w:hyperlink')

    @property
    def text_2(self):
        """
        A string representing the textual content of this paragraph,
        including the runs nested inside ``<w:hyperlink>`` children.
        """
        text = ''
        for child in self:
            if child.tag == qn('w:t'):
                t_text = child.text
                text += t_text if t_text is not None else ''
            elif child.tag == qn('w:hyperlink'):
                # Descend one level: the runs of a hyperlink are its children.
                for c in child:
                    t_text = c.text
                    text += t_text if t_text is not None else ''
            elif child.tag == qn('w:r'):
                # A run that is a direct child of the paragraph.
                t_text = child.text
                text += t_text if t_text is not None else ''
            elif child.tag == qn('w:tab'):
                text += '\t'
            elif child.tag in (qn('w:br'), qn('w:cr')):
                text += '\n'
        return text
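For comparison, the same idea can be tried outside the library, without patching CT_P. The sketch below is an assumption-laden stand-in: it uses the standard library's ElementTree instead of lxml, qn_sketch is a hypothetical replacement for python-docx's qn() that only knows the w: prefix, and the sample paragraph is hand-written. It handles tabs and breaks at the run level and descends into hyperlink children at the paragraph level.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def qn_sketch(tag):
    """Expand 'w:t' to Clark notation; stand-in for python-docx's qn()."""
    prefix, local = tag.split(":")
    if prefix != "w":
        raise ValueError("only the w: prefix is supported in this sketch")
    return "{%s}%s" % (W, local)


def run_text(r):
    """Text of one w:r element, mapping tabs and breaks to characters."""
    parts = []
    for child in r:
        if child.tag == qn_sketch("w:t"):
            parts.append(child.text or "")
        elif child.tag == qn_sketch("w:tab"):
            parts.append("\t")
        elif child.tag in (qn_sketch("w:br"), qn_sketch("w:cr")):
            parts.append("\n")
    return "".join(parts)


def paragraph_text(p):
    """Text of a w:p element, including runs wrapped in w:hyperlink."""
    parts = []
    for child in p:
        if child.tag == qn_sketch("w:r"):
            parts.append(run_text(child))
        elif child.tag == qn_sketch("w:hyperlink"):
            for r in child:
                if r.tag == qn_sketch("w:r"):
                    parts.append(run_text(r))
    return "".join(parts)


# Hand-written sample paragraph for illustration only.
sample = (
    '<w:p xmlns:w="%s">'
    '<w:r><w:t>a</w:t><w:tab/><w:t>b</w:t></w:r>'
    '<w:hyperlink><w:r><w:t>link</w:t><w:br/></w:r></w:hyperlink>'
    '<w:r><w:t>c</w:t></w:r>'
    '</w:p>' % W
)
result = paragraph_text(ET.fromstring(sample))
```

Splitting the work into a run-level and a paragraph-level function keeps the tab and break handling in the place where those elements actually occur.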

