Python docx
Problem
issue
https://github.com/python-openxml/python-docx/issues/85
Analysis
SebasSBM commented on Aug 16, 2014
Thanks for your reply, scanny. I’ve just read it right now and started researching. The command you posted revealed an inner file structure, so I used Ubuntu’s tool for compressed files and noticed that all of them are XML files. I’m used to XML because I made apps for Android and used some SOAP web services, so XML is not new to me. Still, you have a point that it may become a steep learning curve, as the XML structure seems to be quite complex.
Anyway, analyzing it I figured out some things in less than half an hour: it seems that styles are defined in “styles.xml”. There is also a file for the fonts. I just don’t get why there seem to be 4 fonts in a test.docx file I created with LibreOffice, in which I only used the default font and a hyperlink (which seems to have its own style defined), but I don’t think the extra fonts are relevant for now.
I’ve taken a look at the “document.xml” file and noticed a difference between a normal paragraph and a hyperlink paragraph. This would be a normal paragraph structure: Prueba jajajjajajajaja
On the other hand, this would be a hyperlink paragraph: http://www.google.com/
In other words, it seems that the tags
SebasSBM commented on Aug 17, 2014
I think here’s the problem: check the class CT_P at master/docx/oxml/text.py . If you take a look at the initial variables (lines 36 and 37) it seems this class (which I suppose it handles
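The internals of python-docx rely heavily on metaprogramming, which makes the code hard to trace.
Take the following call stack as an example (innermost frame first):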
_add_list_getter (d:\Users\cutep\AppData\Local\Programs\Python\Python38-32\Lib\site-packages\docx\oxml\xmlchemy.py:325)
populate_class_members (d:\Users\cutep\AppData\Local\Programs\Python\Python38-32\Lib\site-packages\docx\oxml\xmlchemy.py:557)
__init__ (d:\Users\cutep\AppData\Local\Programs\Python\Python38-32\Lib\site-packages\docx\oxml\xmlchemy.py:105)
<module> (d:\Users\cutep\AppData\Local\Programs\Python\Python38-32\Lib\site-packages\docx\oxml\document.py:26)
<module> (d:\Users\cutep\AppData\Local\Programs\Python\Python38-32\Lib\site-packages\docx\oxml\__init__.py:75)
<module> (d:\Users\cutep\AppData\Local\Programs\Python\Python38-32\Lib\site-packages\docx\opc\part.py:13)
<module> (d:\Users\cutep\AppData\Local\Programs\Python\Python38-32\Lib\site-packages\docx\opc\package.py:9)
<module> (d:\Users\cutep\AppData\Local\Programs\Python\Python38-32\Lib\site-packages\docx\package.py:9)
<module> (d:\Users\cutep\AppData\Local\Programs\Python\Python38-32\Lib\site-packages\docx\api.py:14)
<module> (d:\Users\cutep\AppData\Local\Programs\Python\Python38-32\Lib\site-packages\docx\__init__.py:3)
<module> (g:\sw\Python36\test_docx.py:1)
_run_code (d:\Users\cutep\AppData\Local\Programs\Python\Python38-32\Lib\runpy.py:86)
_run_module_code (d:\Users\cutep\AppData\Local\Programs\Python\Python38-32\Lib\runpy.py:96)
run_path (d:\Users\cutep\AppData\Local\Programs\Python\Python38-32\Lib\runpy.py:263)
_run_code (d:\Users\cutep\AppData\Local\Programs\Python\Python38-32\Lib\runpy.py:86)
_run_module_as_main (d:\Users\cutep\AppData\Local\Programs\Python\Python38-32\Lib\runpy.py:193)
Specifically, this happens whenever a class inherits from BaseOxmlElement:
class CT_Body(BaseOxmlElement):
    """
    ``<w:body>``, the container element for the main document story in
    ``document.xml``.
    """
    p = ZeroOrMore('w:p', successors=('w:sectPr',))
    tbl = ZeroOrMore('w:tbl', successors=('w:sectPr',))
    sectPr = ZeroOrOne('w:sectPr', successors=())
BaseOxmlElement itself is created by calling the metaclass MetaOxmlElement directly:
BaseOxmlElement = MetaOxmlElement(
    'BaseOxmlElement', (etree.ElementBase,), dict(_OxmlElementBase.__dict__)
)
so defining a subclass ends up running the metaclass __init__:
class MetaOxmlElement(type):
    """
    Metaclass for BaseOxmlElement
    """
    def __init__(cls, clsname, bases, clsdict):
        dispatchable = (
            OneAndOnlyOne, OneOrMore, OptionalAttribute, RequiredAttribute,
            ZeroOrMore, ZeroOrOne, ZeroOrOneChoice
        )
        for key, value in clsdict.items():
            if isinstance(value, dispatchable):
                value.populate_class_members(cls, key)
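The mechanism can be reproduced in a few lines without python-docx. The sketch below is only an illustration: Declared, Meta, Base and Body are hypothetical stand-ins, not library classes. It shows that a class created by calling the metaclass directly passes the metaclass on to its subclasses, and that the metaclass __init__ receives each class body as clsdict.

```python
class Declared:
    """Hypothetical stand-in for declarations such as ZeroOrMore."""
    def __init__(self, tag):
        self.tag = tag


class Meta(type):
    """Records which class attributes are Declared instances."""
    def __init__(cls, clsname, bases, clsdict):
        super().__init__(clsname, bases, clsdict)
        # clsdict is the namespace of the class body being defined.
        cls.declared = sorted(
            key for key, value in clsdict.items()
            if isinstance(value, Declared)
        )


# Calling the metaclass directly creates a class, in the same way the
# BaseOxmlElement assignment does.
Base = Meta('Base', (object,), {})


class Body(Base):
    # Subclasses inherit the metaclass, so Meta.__init__ runs again here.
    p = Declared('w:p')
    tbl = Declared('w:tbl')


print(type(Body) is Meta)   # True
print(Base.declared)        # []
print(Body.declared)        # ['p', 'tbl']
```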
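Here clsdict is the namespace of the class body being defined; for CT_Body it includes p, tbl and sectPr.
In other words, for each variable declared in CT_Body (p, ...), the following method is called: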
class ZeroOrMore(_BaseChildElement):
    """
    Defines an optional repeating child element for MetaOxmlElement.
    """
    def populate_class_members(self, element_cls, prop_name):
        """
        Add the appropriate methods to *element_cls*.
        """
        super(ZeroOrMore, self).populate_class_members(
            element_cls, prop_name
        )
        self._add_list_getter()
        self._add_creator()
        self._add_inserter()
        self._add_adder()
        self._add_public_adder()
        delattr(element_cls, prop_name)
The code of _add_list_getter is shown below; it generates the corresponding property, such as p_lst, ...
def _add_list_getter(self):
    """
    Add a read-only ``{prop_name}_lst`` property to the element class to
    retrieve a list of child elements matching this type.
    """
    prop_name = '%s_lst' % self._prop_name
    property_ = property(self._list_getter, None, None)
    setattr(self._element_cls, prop_name, property_)
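To make the effect of populate_class_members and _add_list_getter concrete, here is a simplified toy version. It is a sketch under stated assumptions, not the library's implementation: ZeroOrMoreSketch, MetaSketch and Node are hypothetical names, and the toy element stores its children in a plain list instead of an lxml tree.

```python
class ZeroOrMoreSketch:
    """Hypothetical, simplified stand-in for ZeroOrMore."""
    def __init__(self, tag):
        self._tag = tag

    def populate_class_members(self, element_cls, prop_name):
        tag = self._tag

        def list_getter(obj):
            # Return the direct children whose tag matches the declaration.
            return [c for c in obj.children if c.tag == tag]

        # Add a read-only ``{prop_name}_lst`` property to the class ...
        setattr(element_cls, '%s_lst' % prop_name, property(list_getter))
        # ... and remove the declaration itself, as ZeroOrMore does.
        delattr(element_cls, prop_name)


class MetaSketch(type):
    def __init__(cls, clsname, bases, clsdict):
        super().__init__(clsname, bases, clsdict)
        for key, value in clsdict.items():
            if isinstance(value, ZeroOrMoreSketch):
                value.populate_class_members(cls, key)


class Node(metaclass=MetaSketch):
    """Toy element holding a tag and a list of child nodes."""
    def __init__(self, tag, children=()):
        self.tag = tag
        self.children = list(children)


class Body(Node):
    p = ZeroOrMoreSketch('w:p')


body = Body('w:body', [Node('w:p'), Node('w:sectPr'), Node('w:p')])
print([c.tag for c in body.p_lst])   # ['w:p', 'w:p']
print(hasattr(Body, 'p'))            # False
```

Deleting the attribute while looping over clsdict is safe here because the class keeps its own copy of the namespace; clsdict itself is not modified.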
class Paragraph(Parented) holds self._p, whose type is class CT_P(BaseOxmlElement).
Why does CT_P support iteration, as in "for child in self"?
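A likely answer follows from the bases shown earlier: BaseOxmlElement is built on etree.ElementBase from lxml, and lxml elements implement the ElementTree API, in which iterating over an element yields its direct children. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in (an assumption, made so the snippet runs without lxml installed), and the XML fragment is hand-written for illustration.

```python
import xml.etree.ElementTree as ET

W = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'

# Hand-written paragraph: properties, one plain run, one hyperlink.
xml = (
    '<w:p xmlns:w="%s">'
    '<w:pPr/>'
    '<w:r><w:t>plain </w:t></w:r>'
    '<w:hyperlink><w:r><w:t>link</w:t></w:r></w:hyperlink>'
    '</w:p>' % W
)
p = ET.fromstring(xml)

# Iterating an element yields its direct children only, in document order.
# Tags are stored as '{namespace}local', so keep the local part.
child_tags = [child.tag.split('}')[1] for child in p]
print(child_tags)   # ['pPr', 'r', 'hyperlink']
```

Note that the run inside the hyperlink is not a direct child of the paragraph, which is why code that only looks at direct w:r children misses hyperlink text.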
Temporary solution
Add a text_2 property:
class Paragraph(Parented):

    @property
    def text(self):
        """
        String formed by concatenating the text of each run in the paragraph.
        Tabs and line breaks in the XML are mapped to ``\\t`` and ``\\n``
        characters respectively.
        Assigning text to this property causes all existing paragraph content
        to be replaced with a single run containing the assigned text.
        A ``\\t`` character in the text is mapped to a ``<w:tab/>`` element
        and each ``\\n`` or ``\\r`` character is mapped to a line break.
        Paragraph-level formatting, such as style, is preserved. All
        run-level formatting, such as bold or italic, is removed.
        """
        text = ''
        for run in self.runs:
            text += run.text
        return text

    @property
    def text_2(self):
        """
        Read-only string formed by concatenating the text of each run in the
        paragraph, including runs nested inside ``<w:hyperlink>`` elements.
        Tabs and line breaks in the XML are mapped to ``\\t`` and ``\\n``
        characters respectively.
        """
        # Delegate to the new CT_P.text_2 property added for this workaround.
        text = self._p.text_2
        return text
class CT_P(BaseOxmlElement):
    """
    ``<w:p>`` element, containing the properties and text for a paragraph.
    """
    pPr = ZeroOrOne('w:pPr')
    r = ZeroOrMore('w:r')
    hyperlink = ZeroOrMore('w:hyperlink')

    @property
    def text_2(self):
        """
        A string representing the textual content of this paragraph,
        including the runs inside ``<w:hyperlink>`` children, with content
        child elements like ``<w:tab/>`` translated to their Python
        equivalent.
        """
        # Requires qn (from docx.oxml.ns) to be imported in this module.
        text = ''
        for child in self:
            if child.tag == qn('w:t'):
                t_text = child.text
                text += t_text if t_text is not None else ''
            elif child.tag == qn('w:hyperlink'):
                # Children of a hyperlink are runs; this relies on CT_R.text.
                for c in child:
                    t_text = c.text
                    text += t_text if t_text is not None else ''
            elif child.tag == qn('w:r'):
                # For a run, .text is the CT_R.text property.
                t_text = child.text
                text += t_text if t_text is not None else ''
            elif child.tag == qn('w:tab'):
                text += '\t'
            elif child.tag in (qn('w:br'), qn('w:cr')):
                text += '\n'
        return text
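The same traversal can be tried outside python-docx. The following is a minimal sketch using only the standard library, under these assumptions: the XML fragment is hand-written (its hyperlink text is the URL from the issue comment), and qn, run_text and paragraph_text are local helpers defined here, not library functions. It walks the direct children of a paragraph and descends one level into hyperlinks to collect their runs.

```python
import xml.etree.ElementTree as ET

W = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'


def qn(tag):
    """Local helper: turn 'w:t' into the '{namespace}t' form used in tags."""
    prefix, local = tag.split(':')
    assert prefix == 'w'
    return '{%s}%s' % (W, local)


def run_text(r):
    """Text of one <w:r>: <w:t> content plus tabs and line breaks."""
    text = ''
    for child in r:
        if child.tag == qn('w:t'):
            text += child.text or ''
        elif child.tag == qn('w:tab'):
            text += '\t'
        elif child.tag in (qn('w:br'), qn('w:cr')):
            text += '\n'
    return text


def paragraph_text(p):
    """Text of one <w:p>, including runs nested in <w:hyperlink>."""
    text = ''
    for child in p:
        if child.tag == qn('w:r'):
            text += run_text(child)
        elif child.tag == qn('w:hyperlink'):
            text += ''.join(
                run_text(r) for r in child if r.tag == qn('w:r')
            )
    return text


xml = (
    '<w:p xmlns:w="%s">'
    '<w:r><w:t>See </w:t></w:r>'
    '<w:hyperlink><w:r><w:t>http://www.google.com/</w:t></w:r></w:hyperlink>'
    '<w:r><w:tab/><w:t>end</w:t></w:r>'
    '</w:p>' % W
)
p = ET.fromstring(xml)
print(repr(paragraph_text(p)))   # 'See http://www.google.com/\tend'
```

Keeping the run-level logic in one helper avoids repeating the w:t, w:tab and w:br handling at paragraph level, where those elements do not normally appear as direct children.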