Tuesday, August 20, 2013

Parsing Stackoverflow Posts.xml data dump file crashes the program, gives ascii encoding error

I have downloaded the Stack Overflow June 2013 data dump and am now parsing
the XML files and storing them in a MySQL database. I am using Python's
ElementTree for this, and it keeps crashing with encoding errors.
Snippet of the parsing code:
import xml.etree.ElementTree as xml

# Parse the XML directly from the file path
tree = xml.parse(("post.xml").encode('utf-8').strip())
# Collect every <row> element under the root node
row = tree.findall("row")
It gives the following error:
'ascii' codec can't encode character u'\u2019' in position 248: ordinal
not in range(128)
I also tried the following, but the problem persists:
.encode('ascii', 'ignore')
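
For reference, this message is what Python raises when text containing a non-ASCII character, here U+2019 (the right single quotation mark), is encoded with the ASCII codec. The following is only a minimal sketch, assuming the failure happens when a parsed attribute value is turned into bytes (for example while printing it or building the SQL statement) rather than inside the parse call itself. The one-row sample document and the variable names are made up for illustration, and the sketch is written for Python 3.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-row document; \u2019 is the curly apostrophe from the error.
SAMPLE = '<posts><row Id="1" Body="the system\u2019s timer" /></posts>'

root = ET.fromstring(SAMPLE)
body = root.find("row").get("Body")  # parsed attribute values are text (str)

# Encoding to ASCII fails on the curly apostrophe.
try:
    body.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# 'ignore' avoids the exception but silently drops the character.
lossy = body.encode("ascii", "ignore")

# UTF-8 can represent every character, so nothing is lost.
utf8_bytes = body.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")
```

If that assumption holds, encoding to UTF-8 at the point where the text leaves the program (file, console, or database connection) keeps the character intact, whereas 'ignore' only hides the problem by discarding it.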
Any advice on fixing the problem would be appreciated. A link to a clean
copy of the data would also help.
My final goal is to convert the data into RDF, so if anyone has the
Stack Overflow data dump in RDF format, I'd be grateful.
Thanks in advance!
P.S. This is the XML row that causes the problem and crashes the program:
<row Id="99" PostTypeId="2" ParentId="88"
CreationDate="2008-08-01T14:55:08.477" Score="2"
Body="&lt;blockquote&gt;&#xD;&#xA; &lt;p&gt;The actual resolution of
gettimeofday() depends on the hardware architecture. Intel processors as
well as SPARC machines offer high resolution timers that measure
microseconds. Other hardware architectures fall back to the system's
timer, which is typically set to 100 Hz. In such cases, the time
resolution will be less accurate.
&lt;/p&gt;&#xD;&#xA;&lt;/blockquote&gt;&#xD;&#xA;&#xD;&#xA;&lt;p&gt;I
obtained this answer from &lt;a
href=&quot;http://www.informit.com/guides/content.aspx?g=cplusplus&amp;amp;seqNum=272&quot;
rel=&quot;nofollow&quot;&gt;High Resolution Time Measurement and Timers,
Part I&lt;/a&gt;&lt;/p&gt;" OwnerUserId="25"
LastActivityDate="2008-08-01T14:55:08.477" />
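
A row of this shape can be read attribute by attribute. The sketch below uses a shortened, hypothetical row (the Body text is cut down and some attributes are omitted) to show that the parser already decodes the escaped markup in Body, so the value arrives as ordinary text. It is an illustration under those assumptions, not the code from the question.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical row in the same shape as the one above.
ROW = ('<row Id="99" PostTypeId="2" ParentId="88" Score="2" '
       'Body="&lt;p&gt;timer&lt;/p&gt;&#xD;&#xA;" OwnerUserId="25" />')

row = ET.fromstring(ROW)

# Attribute values come back as text; numeric fields need explicit conversion.
post_id = int(row.get("Id"))
parent_id = int(row.get("ParentId"))

# Entities such as &lt; and &gt; are decoded by the parser.
body = row.get("Body")
```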
